Architecting for DDoS

We have seen an increasing number of DDoS attacks launched against many high-profile services lately, with some succumbing to the attacks and suffering momentary service disruptions. To architect a highly available service today, one must be aware of the different types of DDoS attacks and put the necessary controls and solutions in place.

DNS Resolution via DNS Proxy

If a client cannot resolve the DNS name of our system, our system is as good as unavailable. Most DNS resolvers cache resolution results based on the Time-To-Live (TTL) specified by the authoritative DNS servers; the longer the TTL, the slower DNS changes propagate. There have been DoS attacks on authoritative DNS servers, the most recent being the attack on Dyn, a DNS service provider. To protect against this effectively, employ a cloud-based DNS proxy (e.g. Cloudflare DNS Firewall) that has a huge pipe and drops malicious packets before forwarding legitimate requests to the origin DNS servers. More on the huge pipe and scrubbing in later sections.
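As a quick way to see what TTL our authoritative servers are advertising (and therefore how long resolvers will cache our answers and how slowly a change would propagate), here is a minimal sketch using the third-party dnspython library; the domain name is a placeholder:

```python
# Check the TTL our authoritative servers advertise for a record.
import dns.resolver  # third-party: pip install dnspython

answer = dns.resolver.resolve("www.example.com", "A")  # placeholder domain
print("records:", [r.address for r in answer])
print("TTL (seconds):", answer.rrset.ttl)
```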

Employ Content Delivery Network (CDN) for Static Content

Push static content onto a CDN. The CDN distributes the content to multiple locations around the world and, based on the geographic location of the client, serves the content from the edge nearest to the requestor. This deflects DDoS traffic away from our origin servers and spreads the attack across multiple locations, providing better availability for the static content as well.
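Most CDNs honour standard Cache-Control headers when deciding what to cache at the edge. Here is a minimal sketch of the origin side, assuming a Flask application; the framework choice, route, and cache lifetime are illustrative assumptions:

```python
# Hint to the CDN (and browsers) that this asset is safe to cache at the
# edge for a day; most CDNs honour standard Cache-Control headers.
from flask import Flask, send_from_directory  # assumed origin framework

app = Flask(__name__)

@app.route("/static/<path:filename>")
def static_asset(filename):
    response = send_from_directory("static", filename)
    response.headers["Cache-Control"] = "public, max-age=86400"
    return response
```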

Cache Dynamic Content on CDN as well

Do not stop at static content. If we analyse our data and determine that it is acceptable for dynamic content to be cached for a few hours, push it out to the CDN as well. Facebook has a very interesting architecture where user feeds are generated and cached on multiple nodes around the world, following the concept of eventual consistency. And hey, no one complained about seeing outdated content for a few hours (P.S. they probably did not know there was updated content either!)
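A hedged sketch of what that could look like at the origin, again assuming Flask; the route, cache lifetime, and build_feed() helper are illustrative placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def build_feed():
    # Placeholder for whatever actually renders the dynamic content.
    return {"items": ["..."]}

@app.route("/feed")
def feed():
    response = jsonify(build_feed())
    # Let shared caches (the CDN) keep this for 4 hours, and serve a stale
    # copy while revalidating -- eventual consistency for the feed.
    response.headers["Cache-Control"] = (
        "public, s-maxage=14400, stale-while-revalidate=3600"
    )
    return response
```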

Standard Protection and Tunings still apply

Traditionally we block attacks at reverse proxies and firewalls before they reach our application servers. These are the usual measures:

  • rate limiting or throttling (a minimal sketch follows below)
  • opening only the required ports. Most of the time, probably only port 80 (HTTP) or 443 (HTTPS) is required. There is no need to respond to ping/ICMP requests, or to answer DNS and NTP queries (which are abused in DNS and NTP amplification attacks)
  • dropping malformed packets via network firewall devices or Intrusion Prevention Systems (IPS)
  • lowering ICMP/SYN/UDP timeout thresholds (mitigating these and similar protocol attacks)
  • faster timeouts for half-open connections or incomplete requests (Slowloris)
  • faster timeouts for slow connections
  • lower payload limits (e.g. 20 MB uploads max instead of gigabytes or unlimited)

But these protections and tunings are not limited to reverse proxies and firewalls. They can be employed on our application servers too!
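Take rate limiting as an example. It is usually configured in the reverse proxy or firewall itself, but the idea is easy to sketch in a few lines. Below is a minimal token-bucket limiter keyed by client IP; the rate and burst values are illustrative assumptions, not recommendations:

```python
# Minimal token-bucket rate limiter: refuse requests from a client that
# exceeds its allowance instead of letting it exhaust the backend.
import time
from collections import defaultdict

RATE = 10.0   # tokens added per second (illustrative value)
BURST = 20.0  # maximum bucket size (illustrative value)

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    bucket = _buckets[client_ip]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, capped at BURST.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # caller should respond with HTTP 429 Too Many Requests
```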

Connection Pooling to Systems and Databases

With a bounded connection pool in the reverse proxy, a request must wait for a backend connection to be freed up before it can get a response. Similarly, in the application, a request must wait for an available database connection before it can execute its query or update. This might slow things down, but it stops a flood of requests from overwhelming the backend and keeps the system alive.
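As a hedged illustration of the idea (not a recommendation of any particular driver), here is a minimal bounded pool built on Python's standard queue, using sqlite3 purely as a stand-in for a real database client; the pool size and timeout are illustrative:

```python
# Bounded connection pool sketch: at most POOL_SIZE queries run concurrently;
# extra requests wait (or fail fast) instead of overwhelming the database.
import queue
import sqlite3  # stand-in for a real database driver

POOL_SIZE = 10  # illustrative value

_pool: "queue.Queue[sqlite3.Connection]" = queue.Queue(maxsize=POOL_SIZE)
for _ in range(POOL_SIZE):
    _pool.put(sqlite3.connect("app.db", check_same_thread=False))

def run_query(sql: str, params=()):
    conn = _pool.get(timeout=5)  # wait up to 5s for a free connection, else raise
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        _pool.put(conn)  # always return the connection to the pool
```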

Different Proxies or Systems for Different ‘Zones’

If we have a service that serves both internal and external users with different service level agreements (the external-facing service could be unavailable for a period of time because we have a Business Continuity Plan with manual workarounds, but the internal-facing service must remain highly available), we could set up different reverse proxies (or even different systems) for the two distinct groups of users, such that DDoS attacks on the external reverse proxies do not impact service to the internal users. To provide a seamless experience, we could also set up the same DNS name to resolve to different IP addresses for internal and external users (split-horizon DNS).
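A minimal sketch of the routing decision, assuming we can classify clients by source address; the network ranges and upstream hostnames are placeholders:

```python
# Route internal clients to a dedicated upstream so an attack on the
# external proxy does not take the internal service down with it.
import ipaddress

INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8"),
                     ipaddress.ip_network("192.168.0.0/16")]  # illustrative ranges

def pick_upstream(client_ip: str) -> str:
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal-proxy.example.internal"  # placeholder hostname
    return "external-proxy.example.com"           # placeholder hostname
```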

Scrub Attacks before they reach us

The earlier we can detect attacks and steer them away, the better. There are professional-grade cloud-based providers that act as a proxy for our web applications and apply all the standard protections mentioned above, and more. Most of these providers have their roots as CDN providers, and they have leveraged the same technologies to mitigate DDoS.

  • Using BGP anycast techniques, they are able to route traffic destined for the same IP address to different data centers across the world.
  • Each data center has a huge pipe, so they can absorb a large aggregate volume of DDoS traffic, scrub off the malicious portion, and pass only clean traffic to our systems.
  • On top of that, since these data centers are geographically distributed, in the off chance that an attack overwhelms one data center, our service remains unaffected in the others. Most DDoS attacks today are volumetric, and the one with the bigger pipe wins.

Deploy to Multiple Locations and Employ Latency-Based or Geographic DNS Routing

To my knowledge, Amazon Web Services (AWS) supports this, but there could be other services out there providing the same functionality. Essentially, you deploy the same application to multiple regions and availability zones, keeping them replicated and in sync. Next, you set up Amazon Route 53 DNS to route user requests to different regions based on either latency or geographic location. The service becomes highly available and distributed, and any DDoS attack again becomes distributed across multiple locations, improving our system's availability and resiliency.
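As a hedged sketch of the Route 53 side (the hosted zone ID, domain name, and IP addresses below are placeholders), latency-based records can be created through the API, for example with boto3:

```python
# Two latency-based A records for the same name: Route 53 answers each client
# with the region that currently gives it the lowest latency.
import boto3  # assumes AWS credentials are already configured

route53 = boto3.client("route53")

def latency_record(region: str, ip: str) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.example.com.",   # placeholder domain
            "Type": "A",
            "SetIdentifier": region,      # must be unique per record set
            "Region": region,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z_PLACEHOLDER",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [latency_record("us-east-1", "203.0.113.10"),
                             latency_record("ap-southeast-1", "203.0.113.20")]},
)
```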

Monitoring for Abnormalities

Do we even know if we are under a DDoS attack? We need constant monitoring of our systems to establish baselines of 'legitimate workload'. These baselines can differ over time (e.g. peak periods of the day, week, month, or year, or during promotions). Have we architected such monitoring metrics into our solution design? Are they collected into a centralised repository for monitoring and analysis? Is the monitoring service itself highly available even in the event of a DDoS? Was it tested with high volumes of data? The last thing we want is complete reliance on the tool, only to be completely blind when it fails us during an actual DDoS attack.
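A baseline does not need to be sophisticated to be useful. The toy sketch below keeps a rolling window of hourly request counts and flags an hour that sits several standard deviations above the rolling average; the window size and threshold are illustrative assumptions:

```python
# Toy baseline check: flag the current hourly request count if it is far
# above the rolling average of recent hours.
from collections import deque
from statistics import mean, stdev

WINDOW = 24 * 7              # keep roughly one week of hourly samples
history = deque(maxlen=WINDOW)

def is_abnormal(requests_this_hour: float, sigmas: float = 4.0) -> bool:
    if len(history) >= 24:   # need some baseline before judging
        baseline, spread = mean(history), stdev(history)
        if requests_this_hour > baseline + sigmas * max(spread, 1.0):
            return True      # candidate DDoS: alert a human for analysis
    history.append(requests_this_hour)
    return False
```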

DDoS Response Plan

Prevention can only get us so far. As many cyber security experts would say, it is not a matter of if we get attacked, but when. Assuming an attack happens, what do we do?

  • If the cloud-based scrubbing provider gets DDoSed and looks to be crumbling under the load, what should they do? Can they react fast enough to blacklist the attackers? Increase their bandwidth? Drop more malicious traffic quickly? Do we have a Business Continuity Plan in place to continue without their service?
  • If the attacks are on our published IP addresses, can we quickly update our DNS and switch to another set of IP addresses?
  • If the traffic looks legitimate and we cannot drop any more of it, could we throttle requests further? Or throttle particular request types? Would that result in degraded or unavailable service for some users?
  • If there is only partial system failure due to DDoS, can the service continue to operate in a degraded mode?
  • Can we quickly increase our resources and capacity to deal with the increased volume? This is a particularly interesting point to consider if you have provisioned just enough capacity in your hardware network devices. It is always safer to over-provision hardware for emergency needs.

Are we safe now?

Probably not. Technology is moving at an extremely fast pace, and DDoS attacks are evolving quickly. We need to constantly keep ourselves up to date on new forms of attack and re-evaluate whether our solutions can resist them. Unlike the regular IT Disaster Recovery exercises that some organisations conduct, we seldom launch DDoS attacks on our own systems to validate our design (short of incident role-play). It would be interesting to schedule such regular DDoS exercises, in the spirit of the Netflix Simian Army, just so we can continuously learn and improve the system's resiliency to DDoS. Until then, we should accept that DDoS could knock out the service partially or completely, and design our solution (technically, operationally, business-wise) to withstand such failure.