1) Run BIND on a server dedicated to DNS only.
Reasons include:
Minimized risk of impact to DNS services as a result of other applications consuming server resources (perhaps due to an attack on those services, or due to application error).
Conversely, minimized risk to other applications as a result of BIND consuming all system or network resources.
Reduced likelihood of unauthorized access to the DNS server (e.g. via a code defect and root access exploit made possible via another application).
Improved ability to monitor DNS server performance (since the server is dedicated to one service).
Improved ability to troubleshoot problems.
2) Run separate authoritative and recursive DNS servers
Do not combine authoritative and recursive nameserver functions -- have each function performed by separate server sets. This advice primarily concerns separation of public-facing authoritative services from internal client-facing recursive services - administrators may, for convenience, choose to serve some internal-only zones authoritatively from their recursive servers, having determined that the benefit outweighs any risks associated with this policy.
If you combine recursive and authoritative functions on one server, then a problem that impacts only authoritative servers (for example, one that causes all of your authoritative servers to fail) will break your recursive service too.
Run multiple, distributed authoritative servers, avoiding single points of failure in critical resource paths. A variety of strategies are available (including anycast and load-balancing) to ensure robust geographic and network diversity in your deployment.
3) Choose appropriate software and hardware
Run currently-supported version(s) of BIND in your environment.
Subscribe to bind-announce@lists.isc.org to get notified of BIND 9 software updates and security issues.
Run currently-supported version(s) of your chosen operating system.
Ensure that system outbound network buffers are large enough to handle your rates of outbound query traffic. Some OS implementations (particularly some versions of Linux) assume low rates of outbound network traffic by default, but an authoritative server will often be responding with significantly larger packets than the queries it received, particularly for signed zones.
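As a starting point for checking buffer sizes, the sketch below reads the send-buffer size that the operating system gives a freshly created UDP socket. It only reports the OS default for a new socket, not what a running named process is using, and the 1 MiB minimum is a placeholder assumption rather than a recommended value.

```python
import socket


def udp_send_buffer_bytes() -> int:
    """Return the send-buffer size the OS assigns to a new UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def buffer_looks_small(buffer_bytes: int, min_bytes: int = 1_048_576) -> bool:
    """Flag buffers below a site-chosen minimum (1 MiB is only a placeholder)."""
    return buffer_bytes < min_bytes
```

Comparing the reported value against your measured outbound response rate and typical response size gives a first indication of whether the OS default needs raising.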
Run a multi-threaded BIND build and launch named with an appropriate number of task threads tuned for the hardware and CPU architecture.
Ensure (and confirm through testing) that your infrastructure supports EDNS0 and large UDP packet sizes.
4) Prevent external access to internal data by design
Originally, DNS was designed so that all servers provided the same data to all clients. Later, the concept of a split namespace was introduced and implemented in BIND as “views”. Views are often used to separate internal devices from external devices. For example, lab01printer.example.com might be visible from inside, and possibly via a VPN connection, but is probably not something that you would want to have visible from the outside.
Problems usually occur in two places with views:
- an accidental “leak” of data that is internal-only
- external clients that expect the internal view but receive the external view
The first problem is usually caused by an incorrect access control list (ACL) that allows the internal zone to be transferred to one or more external-facing servers. This failure may not be obvious without specific testing: for example, create a canary entry that resolves in only one configuration, and then test for it from locations that should NOT be able to resolve it.
The second problem occurs when a VPN fails to connect or, when connected, is still seen by the DNS server as “outside” the list of internal networks. Both of these boil down to network infrastructure issues.
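The canary test described above can be sketched as a small comparison of expected against observed visibility. This is only an illustration: the function name, the vantage-point labels, and the resolves_from callback are all hypothetical, and the callback must be supplied by you (for example, by running a lookup from each test location).

```python
from typing import Callable, Dict, List


def find_view_problems(
    canary: str,
    expected_visible: Dict[str, bool],
    resolves_from: Callable[[str, str], bool],
) -> List[str]:
    """Compare where a canary name resolves with where it should resolve.

    expected_visible maps a vantage-point label to True (should resolve)
    or False (must not resolve). resolves_from(vantage, name) is a
    caller-supplied probe that performs the actual lookup.
    """
    problems = []
    for vantage, should_resolve in expected_visible.items():
        did_resolve = resolves_from(vantage, canary)
        if did_resolve and not should_resolve:
            # Internal-only data is visible where it should not be.
            problems.append(f"LEAK: {canary} resolves from {vantage}")
        elif should_resolve and not did_resolve:
            # An internal client (for example, over VPN) got the wrong view.
            problems.append(f"WRONG VIEW: {canary} does not resolve from {vantage}")
    return problems
```

Running a check like this on a schedule, rather than once, helps catch an ACL or VPN change that silently alters which view a client receives.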
While internal vs. external DNS names using views are nearly ubiquitous these days, splitting your DNS into separate internal and external zones solves the problem in a very obvious and safe way. Internal names, for example, would live only in int.example.com and sub-zones of int, and external name servers would be configured without any knowledge of the int zone.
5) Take basic security measures
Run BIND as an unprivileged user.
To open low-numbered UDP and TCP ports BIND must be launched as root, but an alternate uid can be specified using the -u command line argument; after opening needed resources named will change its runtime uid to an unprivileged account.
If following the preceding advice (running BIND as an unprivileged user on a dedicated server) chrooting is "de-emphasized." Our operations experts feel that chrooting does not substantially improve security under those conditions and do not affirmatively recommend it, but they do not explicitly discourage it.
Make use of BIND access control mechanisms such as address match lists to restrict recursive query service to known and authorized clients. Ideally your Internet-facing authoritative servers should not perform recursion for any clients at all.
Consider DNSSEC-signing your public authoritative zones. (Recursive servers will then be able to use DNSSEC-validation to authenticate your records). DNSSEC signing does imply additional on-going maintenance. However, if you operate a service where there is an increased risk of impersonation - such as a financial service, or any public service where the user needs to be sure the resource is really your resource - the effort of signing may be well worth it.
6) Prepare for abuse of any external-facing servers
There are a number of tuning parameters for Response Rate Limiting (RRL), but for the most part, the default settings are good.
RRL works by sorting responses into different buckets. Each bucket is the IP address (or a collection of addresses) to which the response is being sent. When a given number of identical responses are seen within a certain length of time in a single bucket, responses to the hosts in that bucket are limited. The tunable parameters include the number of identical responses before limiting is triggered, the length of time a response stays in the bucket, and the size of the network that each bucket contains.
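To make the bucket idea concrete, here is a deliberately simplified toy model of counting identical responses per bucket within a time window. It is not BIND's actual RRL algorithm; the class name, the fixed-window counting, and the string keys are assumptions chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Tuple


class ToyResponseLimiter:
    """Toy fixed-window limiter keyed by (bucket, identical response)."""

    def __init__(self, responses_per_window: int, window_seconds: int):
        self.limit = responses_per_window
        self.window = window_seconds
        # (bucket, response, window index) -> responses seen so far
        self.counts: Dict[Tuple[str, str, int], int] = defaultdict(int)

    def allow(self, bucket: str, response: str, now: float) -> bool:
        """Return True if this response is still under the limit."""
        key = (bucket, response, int(now // self.window))
        self.counts[key] += 1
        return self.counts[key] <= self.limit
```

In this toy, a different response to the same bucket is counted separately, and the count starts again in the next window. A real implementation would also expire old entries, which this sketch omits.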
It is impractical to create one bucket per IP address. The default bucket size is a /24 network (256 IP addresses) for IPv4 and a /56 network (256 networks of 18,446,744,073,709,551,616 addresses each) for IPv6. These bucket sizes represent common subnet sizes for each of the address families.
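The bucket arithmetic above can be checked with Python's standard ipaddress module. The helper name rrl_bucket is hypothetical; it simply masks an address down to the default prefix lengths quoted in the text.

```python
import ipaddress


def rrl_bucket(address: str, v4_prefix: int = 24, v6_prefix: int = 56):
    """Return the network (bucket) that an address falls into."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)
```

For example, rrl_bucket("192.0.2.77") gives 192.0.2.0/24, whose num_addresses is 256, and an IPv6 /56 bucket holds 2**72 addresses, which equals 256 networks of 18,446,744,073,709,551,616 addresses each.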
There are some circumstances under which these bucket sizes may be too small, most revolving around the use of NAT on IPv4 networks. If you discover that you are rate limiting hosts that are innocent because they live with a large number of other hosts behind a single NAT’d IP address, you can either change the bucket size or “white-list” the network(s) by adding them to the exempt-clients list…
The example above provides all of the tunable parameters, but as noted, the most useful for initial tuning are the “responses-per-second”, “ipv4-prefix-length” and “exempt-clients”.
There is the “log-only” option that can be set to “yes” to test configurations without actually changing the network performance.
A newer feature worth considering for mitigating amplification attacks is 'minimal-any'. Like RRL, this feature is not enabled by default.
Provision sufficient capacity to handle burst traffic up to 20x normal level. This overcapacity will help your system to withstand spikes in both legitimate and abuse traffic.
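The 20x guideline translates into simple arithmetic. The sketch below uses hypothetical function names and ignores IP and UDP header overhead, so treat its bandwidth figure as a lower bound rather than a sizing answer.

```python
def burst_capacity_qps(normal_peak_qps: float, burst_factor: float = 20.0) -> float:
    """Query rate to provision for, given the normal peak rate."""
    return normal_peak_qps * burst_factor


def outbound_bits_per_second(qps: float, avg_response_bytes: float) -> float:
    """Approximate outbound DNS payload bandwidth at a given query rate."""
    return qps * avg_response_bytes * 8
```

As an illustration with made-up numbers, a normal peak of 5,000 queries per second implies provisioning for 100,000, and at an average response of 500 bytes that is about 400 Mbit/s of outbound payload.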
Excess capacity must take into account not only server CPU and memory resources but also send and receive capacity along the entire network path.
Consider the length of the TTLs on the delegation records that you manage within your zones, as well as those that are provided by the parent zones that delegate authority to your nameservers. Longer TTLs protect the visibility of a zone, but shorter ones allow for a faster change of nameservers. Long TTLs can also help protect the visibility of a zone when the parent zone's nameservers are under attack. See https://www.dns-oarc.net/index.php/oarc/mitigating-dns-denial-of-service-attacks for more information.
In most instances we would not recommend the use of inbound packet filtering for authoritative nameservers; Response Rate Limiting is the recommended solution. However, there are some circumstances where filtering at very high inbound packet rates can be helpful - please contact ISC if you think you might benefit from our operational experience in this area.
7) Monitor the service
Put in place monitoring scripts to continually check the health of servers and alert if conditions change substantially.
Conditions to monitor include:
- process presence
- CPU utilization
- memory usage
- network throughput and buffering (inbound/outbound)
- filesystem utilization (on the log filesystem and also the filesystem containing the named working directory)
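One of the conditions listed above, filesystem utilization, can be sketched with the Python standard library as follows. The function names and the 90% threshold are assumptions; the paths to check (the log filesystem and the named working directory) depend on your installation.

```python
import shutil


def filesystem_percent_used(path: str) -> float:
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_alert(percent_used: float, threshold: float = 90.0) -> bool:
    """Decide whether utilization has crossed a site-chosen threshold."""
    return percent_used >= threshold
```

A monitoring script might call filesystem_percent_used on each path of interest at a regular interval and raise an alert when needs_alert returns True.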
Logs should be examined periodically for error and warning messages which may provide a tip-off for incipient problems before they become critical.
Review the logging configuration to ensure it meets your requirements. BIND's logging defaults are generally sane (passing most of the work to syslog), but may not line up with organizational policy and/or desired data collection/retention standards.
When using size-limited files for logging, plan the size of the files and number to retain so that an increased level of logging due to a problem is unlikely to cause the logs from the start of the problem to become unavailable. The exact settings will depend on how quickly problems can be detected and the details of the baseline retention policy.
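The sizing reasoning above can be turned into a rough calculation. The function below is a hypothetical helper and every input is something you would have to measure or estimate for your own environment.

```python
import math


def log_files_to_retain(
    baseline_bytes_per_second: float,
    surge_factor: float,
    detection_hours: float,
    file_size_bytes: int,
) -> int:
    """Number of size-limited log files needed to keep the start of a problem.

    surge_factor is how much logging volume grows during a problem, and
    detection_hours is how long a problem may run before someone looks.
    """
    surge_bytes = baseline_bytes_per_second * surge_factor * detection_hours * 3600
    return math.ceil(surge_bytes / file_size_bytes)
```

With illustrative figures of 2,000 bytes per second of baseline logging, a tenfold surge, six hours to detection, and 50,000,000-byte files, the result is 9 files.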
Query logging adds substantial overhead (on the order of 10x) and so should not be turned on without careful consideration.
8) Consider a nanny
By design, and for security purposes, the most common mode of failure for BIND is intentional process termination when it encounters an inconsistent state. An automated minder process capable of restarting BIND intelligently is recommended if you do not have 24-hour operations support (and possibly even if you do.) It is especially helpful if any such script can checkpoint and archive the logs when this happens.
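A minimal sketch of such a minder is shown below, assuming the supervised command stays in the foreground (named has a -f option for this). The function name, the restart policy, and DEMO_COMMAND (a stand-in that simply exits with status 3) are illustrative assumptions; in practice the service manager provided by your operating system may already offer this behaviour.

```python
import subprocess
import sys
import time
from typing import List, Sequence

# Stand-in for the real server command; it exits immediately with status 3.
DEMO_COMMAND = [sys.executable, "-c", "raise SystemExit(3)"]


def supervise(command: Sequence[str], max_runs: int, delay_seconds: float) -> List[int]:
    """Run command, restarting it after abnormal exits, up to max_runs times."""
    exit_codes: List[int] = []
    for attempt in range(max_runs):
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean exit: assume an operator stopped the server on purpose
        # A real minder would checkpoint and archive the logs at this point.
        if attempt < max_runs - 1:
            time.sleep(delay_seconds)
    return exit_codes
```

Capping the number of restarts and pausing between them avoids a tight restart loop when the underlying fault is persistent.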
9) Prepare for troubleshooting
Prior to any trouble, ensure that a strategy is in place for collecting post-mortem information if a server does encounter a problem. This includes:
- Building named with debug symbols enabled
- Enabling the BIND XML statistics channel for easy data collection.
- Designing an appropriate logging strategy and reserving sufficient space on the log filesystem for information to be collected for a significant context period before an event (several hours at least, 24 hours+ preferred.)
- Ensuring that the uid under which named is running has write permission sufficient to write a core image to its working directory if it segmentation faults and to write named.dump or named.run files if requested by operator.
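The last item above can be checked ahead of time with a short script run as the same uid as named. This is a sketch only: the function name is hypothetical, and it tests directory permissions alone, not other limits (such as a core-size limit) that may also prevent a core file from being written.

```python
import os


def can_write_in(directory: str) -> bool:
    """Return True if the current uid can create files in directory."""
    # Creating a file needs both write and search (execute) permission.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

Because os.access tests the permissions of the calling user, the result is only meaningful when the script runs under the account that named uses.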
See "What to do with a misbehaving BIND server" and "What to do if your BIND or DHCP server has crashed" for guidance on troubleshooting problems and the type of information that is useful to collect in those circumstances.
Observe query loads periodically to establish baseline expectations. This will enable you to monitor for anything unusual - as defined by the range of 'normal' for your specific operational environment.
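One simple way to express a range of 'normal' is the mean of past observations plus or minus a few standard deviations. The sketch below assumes that approach; the function name and the three-sigma default are illustrative choices, and real traffic with strong daily cycles may need a per-hour baseline instead.

```python
import statistics
from typing import Sequence


def is_unusual(sample_qps: float, baseline_qps: Sequence[float], sigmas: float = 3.0) -> bool:
    """Return True if sample_qps lies outside the baseline's normal range."""
    mean = statistics.mean(baseline_qps)
    spread = statistics.stdev(baseline_qps)  # needs at least two observations
    return abs(sample_qps - mean) > sigmas * spread
```

With a baseline of 100, 110, 90, 105 and 95 queries per second, a reading of 102 is within range while a reading of 300 is flagged.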
You should have a strategy that includes both a planned upgrade path, to ensure that you can take advantage of improved features and functionality, and a plan for how you will respond if a security advisory is released that has the potential to impact your servers and services. See "Which version of BIND do I want to download and install?" for more information.
10) Additional measures for high availability
Our general advice for security practices is included in the list above. However many large production environments with mission-critical DNS needs may opt to run servers on multiple hardware and/or OS platforms to increase the "eco-diversity" of their DNS infrastructure. This also includes running different versions of BIND for resilience to potential defects that may not impact all currently supported versions.
- Many service providers offer "DNS secondary service" in which they also publish your zones. In this situation, you continue to manage your own zones, but you keep copies updated at the service provider. This option is worth considering for added resilience and extra capacity.
- We don't recommend anycasting except in very large deployments or if you already have experience with anycasting.
The concept of anycast is easy to grasp: A netblock is announced on your network in such a way that it appears at more than one location - someone trying to reach an address in that netblock is routed to the service at the closest (network topology wise) location to them.
This is very useful in complex networks where there may be tens, hundreds or even thousands of networks, each with their own name server - put your name servers into the anycast network, configure your network correctly, provide your clients with the single anycast nameserver address and magically all of your name servers become a single network address and all of your clients use the closest one!
The initial configuration of the anycast DNS instances must take into account a number of additional issues. These issues include the ability to quickly withdraw the anycast route from your network in the case of a DNS server malfunction, the ability to correctly transfer zone data between anycast server instances, and the ability of your support team to debug issues that stem from different clients using different servers.
If a DNS server malfunctions - hung, crashed, unable to provide correct data - the route advertisement to that specific DNS server must be quickly (and automatically) removed. Clients attempting to resolve DNS names using the affected server should be routed to other instances.
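The automatic withdrawal described above can be sketched as a small state machine. Everything here is hypothetical: probe stands for whatever health check you use against the local DNS instance, and withdraw and announce stand for whatever mechanism your routing setup provides for removing and restoring the anycast route.

```python
from typing import Callable


class RouteGuard:
    """Withdraw the anycast route after repeated failed health checks."""

    def __init__(
        self,
        probe: Callable[[], bool],
        withdraw: Callable[[], None],
        announce: Callable[[], None],
        failures_to_withdraw: int = 3,
    ):
        self.probe = probe
        self.withdraw = withdraw
        self.announce = announce
        self.threshold = failures_to_withdraw
        self.failures = 0
        self.announced = True  # assume the route starts out announced

    def tick(self) -> bool:
        """Run one health check; return True while the route is announced."""
        if self.probe():
            self.failures = 0
            if not self.announced:
                self.announce()
                self.announced = True
        else:
            self.failures += 1
            if self.announced and self.failures >= self.threshold:
                self.withdraw()
                self.announced = False
        return self.announced
```

Requiring several consecutive failures avoids withdrawing the route on a single lost probe. This sketch re-announces on the first healthy probe, which a production setup might want to delay to avoid route flapping.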
Debugging a client DNS issue in an anycast network is much more complex and involves many more hands and eyes than does debugging an issue on a traditional network. The most vital concern: When debugging in an anycast environment, be absolutely sure that you and the client with the issue are looking at the same server.