CERT Advisory on DNS Amplification Offers Little Hope

CERT released an advisory today on DNS Amplification Attacks.  These attacks are nothing new; in fact, dealing with this kind of load is business as usual for the Tier1/2 providers. But I was surprised by how little CERT apparently has to offer in the way of advice to thwart the attacks.

Amplification attacks work by sending small requests to a server, with a spoofed source address so that the [much larger] responses are sent to the target of the attack. DNS servers are particularly well suited for this since a 64 byte UDP request can result in up to 4,000 bytes being sent from the DNS server to the intended target (60x amplification).
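To make the arithmetic concrete, here is a minimal Python sketch of the amplification factor using the two figures quoted above. The function name is only an illustration.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """Bytes delivered to the victim for every byte the attacker sends."""
    return response_bytes / request_bytes


# Figures from the paragraph above: a 64-byte query and a 4,000-byte response.
factor = amplification_factor(64, 4000)
print(f"{factor:.1f}x")  # 62.5x, which is the "60x" quoted above, rounded
```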

While the attacks are difficult to prevent, network operators can implement several possible mitigation strategies. The primary element in the attack that is the focus of an effective long-term solution is the detection and elimination of open recursive DNS resolvers. … According to the Open DNS Resolver Project, of the 27 million known DNS resolvers on the Internet, approximately “25 million pose a significant threat” of being used in an attack [1].

Let me see if I got that right… the effective long-term solution is that we need to re-configure 93% of the world’s DNS servers. Hmmm…. Well, the next piece of advice is directed at ISPs, which I’m sure are ready to tackle this problem with extreme prejudice:

Because the DNS queries being sent by the attacker-controlled clients must have a source address spoofed to appear as the victim’s system, the first step to reducing the effectiveness of DNS amplification is for Internet Service Providers to deny any DNS traffic with spoofed addresses. The Network Working Group of the Internet Engineering Task Force released a Best Current Practice document in May 2000 that describes … configuration change would considerably reduce the potential for most current types of DDoS attacks.

So about 13 years ago the IETF released best practices to cut down on spoofed UDP traffic. How’s that working out again? Maybe re-configuring 25 million DNS servers isn’t such a bad idea. How exactly can we do that?
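For reference, the idea behind that best practice (source address validation at the network edge) fits in a few lines. The sketch below is only an illustration in Python, not router configuration; the port name and the prefix are made-up placeholders.

```python
import ipaddress

# Hypothetical customer-facing port and the prefix the ISP assigned to it.
ASSIGNED_PREFIXES = {
    "customer-port-1": ipaddress.ip_network("198.51.100.0/24"),
}


def permit_ingress(port: str, source_ip: str) -> bool:
    """Forward a packet only if its source address lies in the sending port's prefix."""
    prefix = ASSIGNED_PREFIXES.get(port)
    if prefix is None:
        return False  # unknown port: drop
    return ipaddress.ip_address(source_ip) in prefix
```

A packet arriving on `customer-port-1` that claims to come from the victim's address (anything outside 198.51.100.0/24) would be dropped before it ever reached a DNS server.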

CERT provides configuration instructions for BIND and Microsoft DNS Server, since they are “two widely deployed servers.” Their first suggestion is to disable ‘recursion’ entirely:

Many of the DNS servers currently deployed on the Internet are exclusively intended to provide name resolution for a single domain. These systems do not need to support resolution of other domains on behalf of a client, and therefore should be configured with recursion disabled.

I think this is misleadingly vague at the very least, but it’s hard to know what ‘many’ means in this context. To be clear, you cannot disable recursion on a DNS server if it is used by a client at all. You can only disable recursion if the DNS server is used exclusively by other DNS servers looking for an authoritative source of IPs for your domains. In order for disabling recursion entirely to even be an option, what CERT would have to be claiming is that “many of the DNS servers currently deployed on the Internet are exclusively intended to provide name resolution for other DNS servers, and never have clients connect to them,” and I find that hard to believe.

Picture just about any “office LAN” where the email server and office wikis are hosted on local machines with DNS names–we’re not going back to NetBIOS and WINS, are we?  No, your IT department is going to point clients to a local DNS server with DHCP, and that DNS server has got to enable recursion, since clients will be sending requests for mail.localdomain right alongside www.google.com. Now, when you want to allow Outlook to connect to Exchange using SSL/TLS while your employees are at home, you expose your DNS server to the world, and the trouble begins. But you can’t just turn off recursion entirely.

CERT’s next suggestion is a little better, “Limiting Recursion to Authorized Clients”:

“For DNS servers that are deployed within an organization or ISP to support name queries on behalf of a client, the resolver should be configured to only allow queries on behalf of authorized clients. These requests should typically only come from clients within the organization’s network address range.”

The title seemed to be going in the right direction, but then they started mixing up terminology: limiting ‘recursion’ and limiting ‘queries’ altogether are two different things.  In the example configuration that follows for BIND, it appears that they restrict both queries and recursion to the local subnet, rather than recursion alone. I’m actually not sure how that’s different from turning off external access entirely. Then to make matters worse, Microsoft DNS Server doesn’t even support turning off recursion for external networks only, and they propose a workaround using a combination of DNS forwarding and firewall rules that I’m sure 0.1% of the 93% of servers out there will be implementing any day now.
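The difference between limiting recursion and limiting queries is easier to see as a decision function. This is a rough Python sketch of the policy an office server would probably want, not anything taken from CERT's configuration; the subnet and zone name are placeholders.

```python
import ipaddress

# Hypothetical values: the office subnet and the zone this server is authoritative for.
INTERNAL_NET = ipaddress.ip_network("192.168.0.0/16")
AUTHORITATIVE_ZONE = "example.com"


def decide(client_ip: str, qname: str) -> str:
    """Return 'authoritative', 'recurse', or 'refuse' for an incoming query."""
    name = qname.rstrip(".").lower()
    if name == AUTHORITATIVE_ZONE or name.endswith("." + AUTHORITATIVE_ZONE):
        return "authoritative"  # anyone may ask about our own zone
    if ipaddress.ip_address(client_ip) in INTERNAL_NET:
        return "recurse"  # internal clients may resolve anything
    return "refuse"  # outsiders get no recursion
```

Limiting queries altogether would collapse the first branch into the third for outside clients, which is why it looks so much like turning off external access.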

The last nail in the coffin, and their last piece of “advice,” is to rate limit the responses from the DNS server. I’m all for rate limiting–when it looks like a core router on the London Internet Exchange is suddenly interested in pulling 100Mbit/sec in the form of 3,000 requests per second for the same 4,000 byte TXT record from your DNS server, maybe you might ought to stop responding.

Lucky for you… there is an “experimental” feature in BIND9 called the RRL (Response Rate Limiting) patch.  The example configuration CERT gives is this:

On BIND9 implementation running the RRL patches, add the following lines to the options block of the authoritative views [13]:

rate-limit {
    responses-per-second 5;
    window 5;
};

So to keep my server from contributing to global internet meltdown (yeah, right) I should just rate limit my server to 5 responses per second.  Well, there’s more to it than that: Paul Vixie, who developed the RRL module, has a nice write-up of all the details.  What’s that you ask about Microsoft DNS Server? Oh yeah, “this option is currently not available for Microsoft DNS Server.” I’m detecting a trend…
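To give a feel for what those two numbers mean, here is a deliberately simplified Python sketch of a per-client limiter. This is not how the BIND patch is implemented; it only illustrates counting responses per client over a window, using the 5 and 5 from CERT's example. The class and method names are invented for the illustration.

```python
from collections import defaultdict, deque

RESPONSES_PER_SECOND = 5
WINDOW_SECONDS = 5


class ResponseRateLimiter:
    """Allow at most RESPONSES_PER_SECOND * WINDOW_SECONDS responses per client per window."""

    def __init__(self) -> None:
        self._sent = defaultdict(deque)  # client IP -> timestamps of recent responses

    def allow(self, client_ip: str, now: float) -> bool:
        recent = self._sent[client_ip]
        # Forget responses that have fallen out of the window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= RESPONSES_PER_SECOND * WINDOW_SECONDS:
            return False  # over budget: drop this response
        recent.append(now)
        return True
```

In this toy version the limit is per client rather than for the whole server, so one noisy (or spoofed) address does not starve everyone else.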

At this point, I’m thinking… what’s with the kid gloves? Is this really the best we can do to stop 300Gbps attacks on critical infrastructure?  OK, to tone down the rhetoric, can we at least show these DDoS clowns that we’re not going to be victimized by bullshit UDP spoofing, without resorting to suggestions that amount to little more than “just turn off your servers, please and thank you!”  A few minutes of Googling turned up RFC 5966.  Have I mentioned I just love RFCs? Nothing is more pleasing to the eye than the perfectly formatted ASCII with well-defined and capitalized KEY WORDS which are themselves recursively defined in their own RFC 2119.  Ahem, back to the topic at hand.

So 5966 is a standards-track RFC about how DNS servers “SHOULD” support TCP as well as UDP, upgrading that keyword to a “MUST” because SHOULD is basically MUST anyway. Heh. Historically, the DNS spec called for any response larger than 512 bytes to be truncated, setting the “TC” flag in a short UDP response to tell the client to try again on TCP. Of course the attacker can’t spoof a TCP connection because of the 3-way handshake, so that’s a TCP ACK that will be a long time coming.
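For the curious, the TC flag is a single bit in the 12-byte DNS header. The Python sketch below builds an answerless, truncated reply from a raw query. It assumes exactly one uncompressed question in the query, and the function names are made up for the illustration.

```python
import struct

QR_BIT = 0x8000  # this message is a response
TC_BIT = 0x0200  # response was truncated; retry over TCP
KEEP_MASK = 0x7900  # opcode and RD bits copied from the query


def question_end(msg: bytes) -> int:
    """Offset just past the first question (assumes one uncompressed question)."""
    pos = 12  # the DNS header is always 12 bytes
    while msg[pos] != 0:
        pos += 1 + msg[pos]  # skip one length-prefixed label
    return pos + 1 + 4  # zero-length root label, then QTYPE and QCLASS


def empty_tc_response(query: bytes) -> bytes:
    """Build an answerless reply with TC set, echoing the query's ID and question."""
    msg_id, flags = struct.unpack("!HH", query[:4])
    reply_flags = (flags & KEEP_MASK) | QR_BIT | TC_BIT
    # One question; answer, authority and additional counts are all zero.
    header = struct.pack("!HHHHHH", msg_id, reply_flags, 1, 0, 0, 0)
    return header + query[12:question_end(query)]
```

The reply is never larger than the query that triggered it, so whoever receives it gets no amplification at all.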

With the advent of DNSSEC, the average response length has gotten a lot larger, but rather than switch to TCP which is slower to set up, we have something called “EDNS0” (Extension Mechanisms for DNS, version 0) itself defined in RFC 2671 (I really could read these all day…) which lets the “client” kindly inform us of how large a UDP response it’s prepared to receive from the kind server. As you can imagine, in this case the attacker happily requests “Doom BFG 9000 sized responses, if you please!”  So, really what we’ve done here is gotten on our knees and straight up begged for this to be exploited.
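Here is roughly what that polite request looks like on the wire. In EDNS0 the sender appends an OPT pseudo-record to the query and reuses its CLASS field to carry the UDP payload size it claims to accept. This is a minimal Python sketch; the helper names and the fixed query ID are arbitrary choices for the example.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_query(name: str, qtype: int, udp_payload_size: int) -> bytes:
    """Build a DNS query with an EDNS0 OPT record advertising udp_payload_size."""
    # Header: ID, flags (RD set), 1 question, 0 answers, 0 authority, 1 additional.
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, TYPE 41, CLASS = payload size, TTL 0, no RDATA.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_payload_size, 0, 0)
    return header + question + opt


query = build_query("example.com", 255, 4096)
print(len(query))  # 40
```

Eleven extra bytes in the query are all it takes to ask for responses far beyond the old 512-byte limit.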

Can We Just Fix This Please?

But WTF.  All we need is one little indication that the address that’s inundating our server, that we’re pounding out 4,000 byte responses to, the address which is sucking every last bit of bandwidth from our office’s tired T1, or 100Mbit uplink, whatever it may be…  we just want one piece of proof that the address even exists, that it’s even the one making the requests! And there you have the funny thing about software… hundreds of thousands of lines of complex code performing admirably to implement the plumbing of the internet, all reduced to a traffic generator.

So the answer, at least in my opinion, is staring us in the face. We got ourselves in this mess by allowing the client to request massive responses over UDP and never once making the client prove they are who they say they are.  I don’t think you solve this problem by rate limiting the entire server, or white-listing your internal network, or unplugging your DNS server, or even telling ISPs to properly police their own network. The kindest way I can think of to describe these efforts would be to call them “not serious.”

We already have the “TC” flag which tells the client it should come back on TCP, where we can prove the request is legit. Every DNS client in the world already has to support this flag. Here I go quoting RFCs again… “When a DNS client receives a reply with TC set, it should ignore that response, and query again, using a mechanism, such as a TCP connection, that will permit larger replies.”

So without further ado, here’s the solution to the global internet DDoS scourge of DNS Amplification Attacks:

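if (response.protocol == UDP && response.length > 512 && rand() % 10000 == 0)
   client_must_tcp.add(response.destination);  // reconstructed line; add() and destination are assumed names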

Before sending out a largish response, one that would typically have triggered the TC flag anyway, I’m going to give you a 0.01% chance of being put on a “Must TCP” list.  Once you are on the list, any request from that IP received over UDP will automatically get an empty TC  response, and nothing else, until I see you come back at me over TCP. If I see you on TCP, I’ll remove you from the list.  But, if you keep hitting me with UDP, despite my TC flag, you’re going on a blacklist.

Remember, the destination of the response is the target of the attack, not one of a million IPs belonging to a major botnet. So we’re not trying to solve the much harder problem of defending ourselves against a DoS. We’re merely trying to prevent ourselves from DoS’ing the person we think we’re serving!

You might ask, why the rand()?  Well, some really smart people spent a lot of time on EDNS0, so I imagine there are compelling performance reasons for wanting to keep DNS over UDP as much as possible. I’ve spent about 30 minutes reading RFCs I found on Google.  I guess that makes me qualified to write about it on the internet, right? Anyway, I have two reasons:

  • First, I just think it makes sense for a DNS server to say HELO over TCP every once in a while to make sure there’s someone on the other end, but we don’t need or want to do that every time, or even all that frequently.
  • Second, we don’t want to have to remember the list of IPs who we have talked to over TCP, so it’s easier to just make it probabilistic, and then we just need to keep a short list of people on our ‘watchlist’.  We can limit the length of ‘client_must_tcp’ to something like a maximum of 512 entries. That’s a grand total of 2560 bytes for client_must_tcp, assuming a 4 byte address and 1 byte counter.
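The 2560-byte figure is easy to check. The Python sketch below packs each entry as a 4-byte IPv4 address plus a 1-byte counter, which is the layout assumed above. The function and constant names are invented for the illustration.

```python
import socket
import struct

MAX_ENTRIES = 512
ENTRY = struct.Struct("!4sB")  # 4-byte IPv4 address + 1-byte warning counter


def pack_table(entries: dict) -> bytes:
    """Serialize {ip_string: warning_count} into fixed 5-byte records."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("table is capped at MAX_ENTRIES")
    return b"".join(
        ENTRY.pack(socket.inet_aton(ip), min(count, 255))  # counter saturates at 255
        for ip, count in entries.items()
    )


print(ENTRY.size * MAX_ENTRIES)  # 2560
```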

So if those two lines I put above are inside the ‘response’ handler, we have just one more block of code inside the ‘request’ handler which would run first:

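if (client_must_tcp.contains(request.source)) {
   if (request.protocol == UDP) {
      if (client_must_tcp.increment_warning_count(request.source) > MAX_WARNING)
         blacklist.add(request.source);       // kept ignoring the TC flag (assumed name)
      send_empty_tc_response(request);        // empty TC reply and nothing else (assumed name)
      return;
   } else {
      client_must_tcp.remove(request.source); // came back over TCP, so the address is real
   }
}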
Note: I would never use underscores like that in my own code, but with this blog’s font camelCase seemed too hard to read.
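For anyone who wants to play with the idea, here is one way the two handlers could look as runnable Python. It is a sketch of the scheme described above and not a patch for any real DNS server. The threshold of 3 warnings is a placeholder, since no number is chosen above, and the handlers return a string rather than sending packets.

```python
import random

MAX_WARNING = 3         # placeholder threshold
CHALLENGE_ODDS = 10000  # 1-in-10000, matching rand() % 10000 == 0
UDP_LIMIT = 512

client_must_tcp = {}    # source IP -> warning count
blacklist = set()


def on_request(source: str, protocol: str) -> str:
    """Return 'answer' or 'empty_tc' for an incoming request."""
    if source in client_must_tcp:
        if protocol == "udp":
            client_must_tcp[source] += 1
            if client_must_tcp[source] > MAX_WARNING:
                blacklist.add(source)  # kept ignoring the TC flag
            return "empty_tc"
        del client_must_tcp[source]  # showed up on TCP: the address is real
    return "answer"


def on_response(destination: str, protocol: str, length: int,
                odds: int = CHALLENGE_ODDS) -> None:
    """Occasionally put the recipient of a large UDP response on the must-TCP list."""
    if protocol == "udp" and length > UDP_LIMIT and random.randrange(odds) == 0:
        client_must_tcp.setdefault(destination, 0)
```

Passing `odds=1` forces the challenge every time, which is handy for testing the flow end to end.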

Endnote: There is one problem this introduces, which is that you can use this algorithm as a way to turn off someone’s access to a DNS server. Just spoof a bunch of UDP requests from their address, and get them added to the blacklist. So, to clarify, a safe mode that would still achieve the goal of preventing DNS Amplification would be to continue to return empty TC responses even to blacklisted IP addresses.

Empty TC responses would actually result in deflating the attacker’s bandwidth. A “real” client would still be able to instantly regain functional access by properly establishing a TCP connection as soon as it saw the TC response. The only difference then between the ‘blacklist’ and the ‘client_must_tcp’ is to decide how long IPs should be kept in their respective lists before aging out. Maybe you don’t even need the separate concept of a ‘blacklist’ or even the warning counter at all.
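If the two lists really differ only in how long entries live, a single small structure with a time-to-live would cover both. This is a hedged Python sketch; the class name and the idea of passing the clock in as `now` are choices made for the example, and no TTL values are suggested above.

```python
class AgingSet:
    """Set of IPs whose entries expire `ttl` seconds after they were added."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._added = {}  # ip -> time the entry was added

    def add(self, ip: str, now: float) -> None:
        self._added[ip] = now

    def discard(self, ip: str) -> None:
        self._added.pop(ip, None)

    def contains(self, ip: str, now: float) -> bool:
        added = self._added.get(ip)
        if added is None:
            return False
        if now - added >= self.ttl:
            del self._added[ip]  # aged out
            return False
        return True
```

A short TTL would behave like the must-TCP list and a longer one like the blacklist, with the same empty TC response served from both.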

Problem solved? Well, I guess all you have to do next is convince Microsoft to push it out as a Hotfix! 🙂