DDoS Resilient Reference Architecture on AWS

22.03.2020 — AWS, Terraform, Terragrunt, infrastructure-as-code — 8 min read

Prelude

Realizing that you are the target of a DDoS attack is, to put it mildly, a very unpleasant feeling, especially because it can be difficult to determine whether you are in fact under attack or just seeing ghosts. In the cloud things can get even more entangled, because very often you share network layer components with other tenants. When, for example, a neighbor that shares network switches with you is attacked, you might only notice degraded network performance while no anomalies or traffic peaks show up in your monitoring, unless you monitor packet loss within your own network.

I ran into a situation just like this at the most inappropriate time of the year. Two months after I joined www.juniqe.com as CTO in 2014, a weekend that was still somewhat mysterious to us back then was due: Black Friday. Of course we knew how much fuss Americans make about Black Friday and that the madness would likely spill over the Atlantic at some point, so we prepared and load tested the platform anticipating a 5x increase in traffic at peak times. But when the Black Friday traffic came rolling in, to our surprise our “over-provisioned” platform was close to being overwhelmed on Friday evening, above all by the CPU-hungry checkout traffic. But it held up well, and the conversion rate and overall revenue were almost too good to be true …

… until Saturday morning, when a steady decline in internal network speed caused the whole e-commerce platform to slow down. We were running on multiple Hetzner servers back then: 2 DB servers in a master/slave setup and the rest of the components like nginx, php-fpm, haproxy, memcached and redis spread out over multiple large server instances. All we could see at first was an overall decline in responsiveness and speed for end users, although none of the servers were even close to maxing out on CPU, the traffic on our website was not especially high yet and none of the background processes were overloading any component. Most mysteriously, the end-user speed would decline steadily for a while and then, all of a sudden and without any other noticeable change, everything went back to normal again.

Since the system was still quite new to me and we had made a lot of improvements and changes to the overall setup in preparation for the Black Friday weekend, I suspected that we had misconfigured some component. But double and triple checking every component did not give any hint of a mistake or bug on our side. So I started to investigate the network itself. And bingo, an mtr traceroute reported a crazy amount of packet loss, in the beginning only between the app servers and the master DB server, then more or less between all servers.

After hours on the phone with Hetzner support they finally admitted that one of their clients, unluckily hosted on the same rack as we were, was the target of a DDoS attack and that they were trying their best to block it. Because we could not port the whole platform to another hosting provider in a matter of hours, the only thing we could do was hope that Hetzner would manage to block the attacker’s requests on their outer firewalls, and make plans to move to a hosting provider that invests a little more energy into keeping its network layer healthy and functioning. Luckily, either Hetzner got hold of the problem by the end of the day or the attacker ran out of breath. Either way we could go back to normal operations and smoothly harvest the Black Friday weekend traffic on Sunday.

Obviously some hosting providers have more capacity to keep their networking and hardware in shape than others. Hetzner provides exceptionally great value for money when it comes to pure CPU and memory. And to be fair, Hetzner really tried to help and fend off the attacker, but in the end we decided to move to a hosting provider that comes with a higher level of cloud security and support. So we moved to AWS, and since then network performance problems have been a thing of the past.

AWS

AWS publishes a large number of whitepapers describing best practices for getting the most out of their services. One of these whitepapers describes in great detail a DDoS resilient reference architecture that mitigates the risk of both infrastructure layer and application layer DDoS attacks. Since some of the mitigation strategies against infrastructure attacks like SYN flood or UDP reflection attacks seemed somewhat subtle to me, I decided to compile the architecture into an ejectable kickstarter terraform/terragrunt codebase that I can use in my consulting work as an independent AWS Cloud Solutions Architect. In the following I will explain AWS’s recommended best practices and services for a secure infrastructure that is ready to deal with DDoS attacks. To see the best practices in action, please check out my terragrunt/terraform project and follow the instructions in the README.md.

DDoS Attacks

One way of classifying DDoS attacks is by looking at which OSI Layer they are targeting. As a refresher here are the 7 layers of the OSI model.

Layer 7 – Application Layer: Web Browser … closest to the end user.
Layer 6 – Presentation Layer: SSL/TLS, compression … establishes context between application-layer entities … transforms data into the form that the application accepts.
Layer 5 – Session Layer: setup, negotiation, teardown … establishes, manages and terminates the connections between the local and remote application.
Layer 4 – Transport Layer: TCP, UDP … transferring variable-length data sequences from a source to a destination host.
Layer 3 – Network Layer: IP … transferring variable length data sequences (packets) from one node to another
Layer 2 – Data Link Layer: MAC … a link between two directly connected nodes.
Layer 1 – Physical Layer: fiber cables … transmission and reception of unstructured raw data between a device and a physical transmission medium.

As one would expect, a DDoS application layer attack targets OSI layer 7, the application layer, whereas most DDoS infrastructure layer attacks target OSI layers 3 and 4.

Infrastructure Layer Attacks

The two most common infrastructure DDoS attacks are UDP reflection and SYN flood attacks.

UDP reflection attacks exploit the fact that UDP is a stateless protocol: there is no handshake that would verify the sender’s address. In a UDP reflection attack the attacker crafts a packet that carries the target’s IP address as its source address. The attacker sends the packet to a server, which in turn sends its response to the target’s IP address. The trick is to pick a service that responds with a much larger packet than the request, so that the attack is amplified. DNS and NTP are popular for UDP reflection attacks.
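To make the amplification effect concrete, here is a small back-of-the-envelope calculation. The packet sizes and the attacker bandwidth are purely illustrative assumptions, not measurements of any real DNS or NTP server.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """Ratio of bytes arriving at the target to bytes the attacker sends."""
    if request_bytes <= 0:
        raise ValueError("request_bytes must be positive")
    return response_bytes / request_bytes


def reflected_bandwidth_mbps(attacker_mbps: float, factor: float) -> float:
    """Traffic arriving at the target for a given attacker uplink."""
    return attacker_mbps * factor


# Hypothetical numbers: a 60-byte query that triggers a 3,000-byte response.
factor = amplification_factor(request_bytes=60, response_bytes=3000)
print(f"amplification factor: {factor:.0f}x")
print(f"100 Mbit/s of spoofed queries -> "
      f"{reflected_bandwidth_mbps(100, factor):.0f} Mbit/s at the target")
```

Under these assumed sizes a modest uplink on the attacker's side turns into fifty times as much traffic at the target, which is why reflection attacks are so cheap to launch.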

SYN flood attacks exploit the TCP handshake. A TCP connection is established by the client sending a SYN (synchronize) message. The server responds with SYN-ACK to acknowledge the request. The client should then respond with ACK to complete the handshake. In a SYN flood attack the attacker sends a lot of SYN requests and never answers the server’s SYN-ACK messages with an ACK, which keeps many half-open connections waiting on the server side until its capacity for new connections is exhausted.
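The effect of half-open connections can be sketched with a toy model of the server's backlog. This is a simplified simulation for illustration only: the class, the capacity of 128 and the timeout of 30 ticks are assumptions, and real TCP stacks add defenses such as SYN cookies that are not modeled here.

```python
from collections import deque


class SynBacklog:
    """Toy model of a server's queue of half-open TCP connections."""

    def __init__(self, capacity: int, timeout: int):
        self.capacity = capacity      # max half-open connections held
        self.timeout = timeout        # ticks before a half-open entry expires
        self.pending = deque()        # (client_id, expiry_tick), oldest first

    def expire(self, now: int) -> None:
        # Drop entries whose SYN-ACK was never answered in time.
        while self.pending and self.pending[0][1] <= now:
            self.pending.popleft()

    def on_syn(self, client_id: str, now: int) -> bool:
        """Return True if the SYN is accepted, False if the backlog is full."""
        self.expire(now)
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((client_id, now + self.timeout))
        return True

    def on_ack(self, client_id: str) -> None:
        # A completed handshake frees its backlog slot.
        self.pending = deque(p for p in self.pending if p[0] != client_id)


backlog = SynBacklog(capacity=128, timeout=30)
# The attacker sends 200 SYNs at tick 0 and never completes a handshake.
accepted = sum(backlog.on_syn(f"spoofed-{i}", now=0) for i in range(200))
print("attacker SYNs accepted:", accepted)
print("legitimate SYN at tick 1 accepted:", backlog.on_syn("real-user", now=1))
print("legitimate SYN at tick 31 accepted:", backlog.on_syn("real-user", now=31))
```

In this model the legitimate client is locked out until the attacker's entries time out, and a real attacker would simply keep sending SYNs to keep the queue full.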

Application Layer Attacks

Application layer attacks target layer 7 of the OSI model. Examples are HTTP flood attacks, cache-busting attacks, WordPress XML-RPC flood attacks and attacks on the TLS negotiation process. In an HTTP flood attack the attacker tries to generate a huge load on the backend by emulating human interaction with the application. In a cache-busting attack the attacker randomly varies the query string to bypass content-delivery caching. WordPress XML-RPC flood attacks are a bit like UDP reflection attacks, except that they leverage a WordPress-specific API that can establish a connection to another WordPress site. In TLS negotiation attacks the attacker perpetually renegotiates the encryption method, causing a lot of expensive operations on the server.
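One common countermeasure against cache busting is to build the cache key only from query parameters that actually change the response. The following is a minimal sketch of that idea; the allow-list and the URLs are hypothetical, and in CloudFront the equivalent is configured on the distribution rather than written in application code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: the only query parameters that change the response.
ALLOWED_PARAMS = {"page", "lang"}


def cache_key(url: str) -> str:
    """Build a cache key that ignores unknown query parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS
    )
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


# Two cache-busting requests with random junk collapse to the same key.
print(cache_key("https://example.com/products?page=2&x=8f3a1c"))
print(cache_key("https://example.com/products?zz=51b0&page=2"))
```

Because the random parameters never reach the cache key, both requests can be answered from the same cached object instead of hitting the origin twice.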

DDoS resilient reference architecture

To kick things off and without further ado this is how AWS’s recommended reference architecture looks on paper.

Unsurprisingly, we find the usual suspects that form the foundation of a secure and peer-able (maybe more about VPC peering in another post) AWS account.

a custom VPC with a private and public subnet
Elastic Load Balancer
Auto Scaling group
AWS Route53 to manage DNS zone files

The more interesting part begins with the following components and how they fit into the big picture.

AWS WAF (Web Application Firewall), which comes somewhat integrated with AWS Shield
Amazon API Gateway
Amazon CloudFront

Mitigation Techniques

Infrastructure Layer Defense

AWS Shield is a managed DDoS protection service that comes in two flavors: AWS Shield Standard and AWS Shield Advanced.

AWS Shield Standard is integrated free of charge into CloudFront and Route53. According to AWS it protects against the most common infrastructure layer attacks like UDP reflection and SYN flood attacks by monitoring the network flow and dropping packets that look fishy.

AWS Shield Advanced comes with a hefty price tag of $3,000 per month, but it also comes with a great deal of extra features like automated application layer (layer 7) traffic monitoring and something like a bat phone to AWS’s DDoS response team. Whenever you feel hard-pressed by “the internet” you can pick up the phone and call AWS’s Commissioner Gordon team for help. For a full list of features check out the AWS Shield product page.

Application Layer Defense

To protect against application layer attacks like SQL injection, application-specific attacks like WordPress XML-RPC floods, or simply to lock out specific IP addresses that are bugging you, you can make use of AWS Web Application Firewall, which integrates nicely with CloudFront, AWS Application Load Balancer and Amazon API Gateway.
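AWS WAF offers rate-based rules that block an IP address once it sends too many requests in a time window. The sketch below illustrates the general idea with a sliding window counter. It is not how AWS implements the feature, and the class name, limit and window are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Block a client IP once it exceeds `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        recent = self.hits[ip]
        # Forget requests that have fallen out of the sliding window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = RateLimiter(limit=100, window=300.0)
# One noisy client sends 150 requests in 15 seconds; a quiet one sends one.
noisy = [limiter.allow("203.0.113.7", now=i * 0.1) for i in range(150)]
print("noisy client allowed:", sum(noisy), "blocked:", noisy.count(False))
print("quiet client allowed:", limiter.allow("198.51.100.4", now=15.0))
```

The noisy client is cut off after its first 100 requests, while other clients are unaffected because the counters are kept per IP address.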

So when you are developing, for example, an API for end users, you can improve resiliency against DDoS attacks by putting API Gateway behind a CloudFront distribution that is associated with your application-specific AWS Web Application Firewall rules.

Even for 100% dynamic, non-cacheable content served from your web application or API, routing the traffic through a CloudFront distribution is a good idea in order to become more resilient against DDoS attacks. Since CloudFront only accepts well-formed HTTP requests, it helps to reduce the number of attacks reaching your origin. You can further use CloudFront’s configurable time-to-live (TTL) settings to offload traffic from your origin.
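Why even a very short TTL takes load off the origin can be seen in a minimal cache model. This is a sketch under simplifying assumptions (a single cache node, a single URL, integer millisecond timestamps); the class and the numbers are hypothetical and do not describe CloudFront internals.

```python
class TtlCache:
    """Minimal cache that serves a stored response until its TTL expires."""

    def __init__(self, ttl_ms: int, fetch):
        self.ttl_ms = ttl_ms
        self.fetch = fetch          # callable that hits the origin
        self.store = {}             # key -> (expires_at_ms, response)
        self.origin_hits = 0

    def get(self, key: str, now_ms: int):
        entry = self.store.get(key)
        if entry and entry[0] > now_ms:
            return entry[1]         # still fresh: origin is not contacted
        self.origin_hits += 1
        response = self.fetch(key)
        self.store[key] = (now_ms + self.ttl_ms, response)
        return response


cache = TtlCache(ttl_ms=1000, fetch=lambda key: f"rendered {key}")
# 1,000 requests for the same page, one every 10 ms over 10 seconds.
for i in range(1000):
    cache.get("/home", now_ms=i * 10)
print("origin requests:", cache.origin_hits)
```

With a TTL of just one second the origin renders the page 10 times instead of 1,000 times, and that ratio improves further as the request rate grows.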

Attack Surface Reduction

Another generally good idea is to show users only what they really need to see of your infrastructure in order to use your service, and to lock down every other way into your platform. In AWS you can reduce your attack surface by using a Virtual Private Cloud (VPC). Inside your VPC you ideally configure a private subnet in which your instances communicate only via private IP addresses. You configure a public subnet, routed through an Internet Gateway, only for resources that need public IP addresses like a bastion host, NAT Gateway or a public-facing Load Balancer. Inside your VPC you should use Security Groups and Network Access Control Lists to give every instance and service only the least privileges it needs to fulfill its task.
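A least-privilege review of ingress rules can be automated with a few lines of code. The sketch below works on a simplified, hypothetical rule model rather than on the real AWS API, and the set of ports considered acceptable for public access is an assumption you would adapt to your own setup.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class IngressRule:
    port: int
    cidr: str


# Hypothetical policy: only these ports may be open to the whole internet.
PUBLIC_PORTS = {80, 443}


def overly_permissive(rules: list[IngressRule]) -> list[IngressRule]:
    """Return rules that expose a non-public port to every IPv4 address."""
    anywhere = ip_network("0.0.0.0/0")
    return [
        r for r in rules
        if ip_network(r.cidr) == anywhere and r.port not in PUBLIC_PORTS
    ]


rules = [
    IngressRule(port=443, cidr="0.0.0.0/0"),     # public HTTPS on the load balancer
    IngressRule(port=22, cidr="0.0.0.0/0"),      # SSH open to the world: flagged
    IngressRule(port=5432, cidr="10.0.1.0/24"),  # database reachable from one subnet only
]
for rule in overly_permissive(rules):
    print(f"too open: port {rule.port} from {rule.cidr}")
```

Only the SSH rule is reported here, since HTTPS is meant to be public and the database port is restricted to a private address range.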

Operate at Scale

And last but not least, in order to defend against DDoS attacks it is important to be ready to operate at scale. In AWS you can use Load Balancers to distribute traffic and Auto Scaling groups to scale out instances horizontally based on demand in a cost-efficient way.
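The basic arithmetic behind demand-based scaling can be sketched as a proportional rule: grow or shrink the group so that the average load per instance returns to a target value. This is a simplification for illustration and not the exact algorithm AWS Auto Scaling uses; the function name and all numbers are assumptions.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Scale the instance count so the per-instance metric returns to `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current * metric / target)
    # Never leave the configured bounds of the group.
    return max(min_size, min(max_size, wanted))


# Hypothetical group: 4 instances at 90% average CPU, aiming for 50%.
print(desired_capacity(current=4, metric=90, target=50, min_size=2, max_size=20))
# The same group during a quiet night at 10% CPU never drops below min_size.
print(desired_capacity(current=4, metric=10, target=50, min_size=2, max_size=20))
```

The upper bound matters during an attack as much as the lower one: it caps how far the group, and with it your bill, can grow while the other defenses do their work.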

Conclusion

AWS offers an amazing number of services to operate your workloads at scale. In order to keep your infrastructure as resilient as possible against any kind of DDoS attack you should

use CloudFront even for dynamic content and integrate AWS WAF (web application firewall)
use regional API Gateway endpoints in order to gain full control over the CloudFront distribution (and integrate AWS WAF)
apply the principle of least privilege, reduce your attack surface as much as possible and run everything that does not need to be publicly available in a private subnet of your VPC
operate at scale by configuring your AutoScaling Groups appropriately