We have seen an increasing number of DDoS attacks launched against high-profile services lately, with some succumbing to the attacks and suffering momentary service disruptions. To architect a highly available service today, one must be aware of the different types of DDoS attacks and put all the necessary controls and solutions in place.
If a client cannot resolve the DNS of our system, our system is as good as unavailable. Most DNS servers cache resolution results based on the Time-To-Live (TTL) specified by the authoritative DNS servers. The longer the TTL, the slower DNS changes propagate. There have been DoS attacks on authoritative DNS servers, the most recent being the attack on Dyn, a DNS service provider. To protect against this effectively, employ a cloud-based DNS proxy (e.g. Cloudflare DNS Firewall) that has a huge pipe and drops malicious packets before forwarding legitimate requests to the origin DNS servers. More on the huge pipe and scrubbing in later sections.
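To see why TTL governs how fast a DNS change propagates, here is a minimal sketch of a caching resolver (names and addresses are illustrative, not real infrastructure):

```python
import time

class CachingResolver:
    """Minimal sketch of a caching DNS resolver that honours TTL."""

    def __init__(self, upstream):
        self.upstream = upstream          # callable: name -> (ip, ttl_seconds)
        self.cache = {}                   # name -> (ip, expires_at)

    def resolve(self, name, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(name)
        if entry and now < entry[1]:      # cached and unexpired: upstream never asked
            return entry[0]
        ip, ttl = self.upstream(name)     # cache miss or expired: ask authoritative server
        self.cache[name] = (ip, now + ttl)
        return ip

# With a long TTL, a record change upstream stays invisible until the cache expires.
records = {"example.test": ("192.0.2.1", 3600)}
resolver = CachingResolver(lambda name: records[name])
print(resolver.resolve("example.test", now=0))     # 192.0.2.1, cached for an hour
records["example.test"] = ("192.0.2.2", 3600)      # we repoint the record...
print(resolver.resolve("example.test", now=1800))  # ...but clients still see 192.0.2.1
print(resolver.resolve("example.test", now=4000))  # TTL expired: now 192.0.2.2
```

This is why, when planning to swing DNS towards a scrubbing provider during an attack, a shorter TTL means clients pick up the change sooner.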
Push the static content onto a CDN. The CDN distributes the content to multiple locations around the world and, based on the geographic location of the client, serves the content from the node nearest to the requestor. It deflects DDoS traffic away from our origin servers and spreads the attack across multiple locations, providing better availability for the static content as well.
Do not just stop at static content. If we analyse our data and determine that it is OK for dynamic content to be cached for a few hours, push it out to the CDN as well. Facebook has a very interesting architecture where user feeds are generated and cached on multiple nodes around the world, following the concept of eventual consistency. And hey, no one complained about seeing outdated content for a few hours (P.S. they probably did not know there was updated content either!)
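In practice, telling the CDN what it may cache usually comes down to `Cache-Control` response headers. A rough sketch, with illustrative max-age values (the "few hours stale is fine" decision from the analysis above maps to the dynamic policy):

```python
def cache_headers(content_kind):
    """Illustrative Cache-Control choices for CDN caching.

    'static'  : versioned assets, safe to cache for a long time
    'dynamic' : content we have decided can be a few hours stale
    'private' : per-user data the CDN must never cache
    """
    policies = {
        "static": "public, max-age=31536000, immutable",
        "dynamic": "public, max-age=14400, stale-while-revalidate=600",
        "private": "private, no-store",
    }
    return {"Cache-Control": policies[content_kind]}

print(cache_headers("dynamic"))
# {'Cache-Control': 'public, max-age=14400, stale-while-revalidate=600'}
```

`stale-while-revalidate` additionally lets the CDN keep serving the stale copy while it refreshes from the origin, which is exactly the eventual-consistency trade-off described above.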
Traditionally, we block attacks at reverse proxies and firewalls before they reach our application servers, using the usual controls.
In reverse proxies, a request must wait for an available backend connection to be freed up before it can get a response. Similarly in applications, a request must wait for an available database connection before it can execute its query or update. This might slow things down, but it keeps the system alive.
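The same idea can be sketched as a bounded pool with a queue timeout: a flood of requests waits for a free slot instead of exhausting the backend, and callers that wait too long are shed with an error rather than piling up (sizes and timeouts here are arbitrary illustration):

```python
import threading

class BoundedPool:
    """Sketch: at most `size` concurrent backend connections.

    Callers queue for a free slot; those that cannot get one within
    `timeout` seconds are shed, so overload degrades gracefully instead
    of exhausting the backend."""

    def __init__(self, size, timeout):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = timeout

    def execute(self, work):
        if not self._slots.acquire(timeout=self._timeout):
            return "503 queue timeout"   # shed load rather than pile up
        try:
            return work()
        finally:
            self._slots.release()

pool = BoundedPool(size=2, timeout=0.01)
print(pool.execute(lambda: "200 OK"))    # 200 OK
```

This is the behaviour behind reverse-proxy settings such as connection limits and request queues: the hard cap protects the backend, and the timeout bounds how long a legitimate user can be stuck behind attack traffic.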
If we have a service that serves both internal and external users with different service-level agreements (the external-facing service could be unavailable for a period of time because a Business Continuity Plan with manual workarounds is in place, but the internal-facing service must remain highly available), we could set up different reverse proxies (or even different systems) for the two distinct groups of users, so that DDoS attacks on the external reverse proxies do not impact the service to internal users. To provide a seamless experience, we could also set up the same DNS name to resolve to different IP addresses for internal and external users.
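The DNS trick at the end is usually called split-horizon DNS. A toy sketch of the routing decision (the network ranges and proxy addresses below are purely illustrative):

```python
import ipaddress

# Assumed internal corporate ranges (illustrative)
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"),
                 ipaddress.ip_network("192.168.0.0/16")]

def resolve_view(client_ip):
    """Split-horizon sketch: the same DNS name resolves to the internal
    reverse proxy for internal clients, and to the external one otherwise."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in INTERNAL_NETS):
        return "10.1.2.3"        # internal reverse proxy (illustrative)
    return "203.0.113.10"        # external reverse proxy (illustrative)

print(resolve_view("10.5.6.7"))      # internal client -> 10.1.2.3
print(resolve_view("198.51.100.9"))  # external client -> 203.0.113.10
```

An attack saturating 203.0.113.10 never touches the internal path, yet both groups of users keep typing the same hostname.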
The earlier we can detect attacks and steer them off, the better. There are professional-grade cloud-based solution providers that act as a proxy for our web applications; they do all the standard protections mentioned above and more. Most of these providers have their roots as CDN providers, and they have leveraged the same technologies to mitigate DDoS.
To my knowledge, only Amazon Web Services (AWS) supports this, but there could be other services out there providing the same functionality. Essentially, you deploy the same application to multiple regions and availability zones, keeping them replicated and in sync. Next, you set up Amazon Route 53 DNS to route user requests to different regions based on either latency or geographic location. The service becomes highly available and distributed, and any DDoS attack again becomes distributed across multiple locations, improving our system's availability and resiliency.
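Latency-based routing in Route 53 is configured as one record set per region sharing a name, each tagged with a `SetIdentifier` and `Region`. A sketch of the change batch (the shape follows the Route 53 `ChangeResourceRecordSets` API; the hostname, regions, and addresses are made up for illustration):

```python
def latency_change_batch(name, region_ips):
    """Sketch of a Route 53 change batch for latency-based routing.

    One A record per region; Route 53 answers queries with the record
    for the region offering the lowest latency to the client."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "SetIdentifier": f"app-{region}",
                    "Region": region,    # marks this as a latency record
                    "TTL": 60,           # short TTL so rerouting propagates fast
                    "ResourceRecords": [{"Value": ip}],
                },
            }
            for region, ip in region_ips.items()
        ]
    }

batch = latency_change_batch("app.example.test.",
                             {"us-east-1": "203.0.113.10",
                              "ap-southeast-1": "203.0.113.20"})
print(len(batch["Changes"]))  # 2
# Would be submitted via boto3:
#   route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)
```

Health checks can be attached to each record set so that a region knocked out by an attack is automatically dropped from DNS answers.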
Do we even know if we are under DDoS attack? We need constant monitoring of our systems to establish baselines of 'legitimate workload'. This could differ from time to time (e.g. peak periods of the day, week, month, or year, or promotional periods). Have we architected such monitoring metrics into our solution design? Are they collected into a centralised repository for monitoring and analysis? Is this monitoring service itself highly available even in the event of a DDoS? Was it tested with a high volume of data? The last thing we want is complete reliance on the tool, only to be completely blind when the tool fails us during an actual DDoS attack.
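At its simplest, "establishing a baseline" means flagging traffic that deviates far from the recent norm. A deliberately naive sketch (a real baseline would also model the daily, weekly, and seasonal peaks mentioned above):

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Flag `current` (e.g. requests per minute) if it exceeds the
    baseline by more than k standard deviations. A naive sketch:
    real baselines must account for daily/weekly seasonality."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return current > mean + k * max(stdev, 1e-9)

# requests/minute over the last hour vs. a sudden spike
baseline = [980, 1010, 1005, 995, 1020, 990] * 10
print(is_anomalous(baseline, 1030))   # False: within normal variation
print(is_anomalous(baseline, 25000))  # True: likely a flood
```

Whatever the detection logic, the questions above still apply to it: the metric pipeline feeding `history` has to stay up, and stay fast, while the attack is in progress.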
Prevention can only get us so far. As many cybersecurity experts would say, it is not a matter of if we get attacked, but when we get attacked. Assuming an attack happens, what do we do?
Probably not. Technology is moving at an extremely fast pace, and DDoS attacks are evolving quickly. We need to constantly keep ourselves up to date on any new form of attack and re-evaluate whether our solutions can resist it. Unlike the regular IT Disaster Recovery exercises that some organisations conduct, we seldom launch DDoS attacks on our own systems to validate our design (short of incident role-play). It would be interesting to schedule such regular DDoS attacks, much like the Netflix Simian Army, just so we can continuously learn and improve the system's resiliency to DDoS. Until then, we should accept that a DDoS could knock out partial or complete service, and design our solution (technically, operationally, and business-wise) to withstand such failures.