Migrating from BIND to ShadowDNS

This document provides the DNS Ops team with operational guidance for replacing a BIND master with ShadowDNS, covering BIND drop-in compatibility, environment prerequisites, the four-phase cutover steps, rollback strategy, monitoring checklist, frequently asked questions, and Day 2 operations after the cutover is complete (monitoring alerts and routine SOPs).

BIND drop-in compatibility

ShadowDNS reads an existing BIND named.conf directly — no format conversion, no rewrite. Before planning a cutover, understand exactly what ShadowDNS does with your current configuration on load.

Pointing ShadowDNS at an existing named.conf

Point --named-conf straight at your live BIND configuration:

./shadowdns \
    --named-conf /etc/bind/named.conf \
    --config     /etc/bind/shadowdns.yaml

include directives are resolved relative to the including file, so the Debian-idiomatic split (named.conf pulling in named.conf.options and named.conf.local) loads unchanged. ShadowDNS only needs the --config file added for the alias map (the aliases: section); nothing in named.conf has to be edited to get started.

What is tolerated or ignored on load

ShadowDNS classifies every construct it encounters into one of four tiers: silent, INFO, WARN, or fail-closed (fatal). A BIND config carries directives ShadowDNS does not act on (recursion settings, DNSSEC options, type slave / type forward zones, key / controls / acl blocks, view- and zone-scope access control). Rather than refusing to start, ShadowDNS tolerates these: unsupported zone types and recursion-family directives are dropped and logged at INFO, view-scope access-control directives are dropped and logged at WARN (zone-scope occurrences are skipped silently; see Access-control model differences), and most other unrecognized blocks are consumed silently. Only genuine syntax errors and a handful of structural conflicts (a malformed geoip asnum, mixing view blocks with top-level zones, geoip-directory unset while a view uses geoip rules) are fatal.

The full tier-by-tier classification and the per-directive summary live on the named.conf Compatibility page. Run --dry-run against your production config to see exactly which directives ShadowDNS skips and at what level (see Phase 0 below).
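
One quick way to read that dry-run output is to tally log lines per level. The sketch below assumes the console encoder layout described later in this guide (tab-separated time, level, message, with the level in the second field); summarize_levels is a hypothetical helper name, and 2>&1 is used because this guide does not state which stream the log is written to.

```shell
# Tally log lines per level from console-encoder output on stdin.
# Assumption: fields are tab-separated and the level is field 2.
summarize_levels() {
    awk -F'\t' 'NF >= 3 { count[$2]++ }
                END { for (lvl in count) printf "%s %d\n", lvl, count[lvl] }' | sort
}

# Usage (paths are placeholders):
#   ./shadowdns --dry-run --named-conf /path/to/named.conf \
#       --config /path/to/shadowdns.yaml 2>&1 | summarize_levels
```

A non-zero WARN count is the cue to read those lines individually, since they mark access-control directives that will not be enforced.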

Access-control model differences

ShadowDNS's access-control model comes down to three rules. The first two match BIND; the third is the difference operators must internalize before cutover:

View selection is by match-clients. ShadowDNS routes a query to a view by evaluating each view's match-clients address-match-list in first-match order — exactly as BIND does. This is the only client-classification mechanism ShadowDNS honors for answering queries.
Options-scope allow-transfer IS enforced — it is the AXFR ACL. The allow-transfer { ... }; declared in the global options block is honored as the zone-transfer ACL: only source IPs listed there receive AXFR; everyone else gets REFUSED, and an empty list denies all. This is the existing zone-transfer behavior relied on throughout this guide (the Prerequisites slave-IP list, the Phase 2 AXFR checks, and the troubleshooting FAQ all assume it).
View- and zone-scope allow-query / allow-recursion / allow-transfer are NOT enforced. Such directives are dropped on load — a view-scope occurrence is logged at WARN with a "does not enforce" message, while a zone-scope occurrence is skipped silently. Either way the ACL has no effect: ShadowDNS does not restrict query answering by client ACL at all, so if a client matches a view via match-clients, that view's zones are served. If your BIND deployment relies on a view-scope allow-query to hide zones from certain clients, replicate that boundary with match-clients (so the client lands in a view that does not contain those zones) rather than expecting allow-query to be honored.

The fail-closed doctrine still applies to view selection: an unevaluable match-clients element (unknown token, undefined acl) is dropped and treated as never-matching, so a misconfigured view serves nothing rather than matching every client.
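
As a sketch of replacing a view-scope allow-query with match-clients, the named.conf fragment below shows the same boundary in both forms. The view names, the 10.0.0.0/8 network, and the zone and file names are hypothetical, and the fragment assumes ShadowDNS accepts CIDR prefixes and any in match-clients the way BIND does.

```
// Before (BIND only): every client matches the view and allow-query filters.
// ShadowDNS would drop the allow-query line with a WARN and serve the zone to all.
view "internal" {
    match-clients { any; };
    allow-query   { 10.0.0.0/8; };
    zone "corp.example.com" { type master; file "db.corp.example.com"; };
};

// After (works for both): only 10.0.0.0/8 lands in "internal" ...
view "internal" {
    match-clients { 10.0.0.0/8; };
    zone "corp.example.com" { type master; file "db.corp.example.com"; };
};

// ... and everyone else falls through to a view without that zone.
view "external" {
    match-clients { any; };
    zone "example.com" { type master; file "db.example.com"; };
};
```

Because views are evaluated in first-match order, the restricted view has to appear before the catch-all one.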

Prerequisites

Before starting any cutover work, confirm each of the following environment conditions:

Item	Description	Confirmed
BIND master running stably	The existing BIND master has no ongoing alerts and no unexpected restarts recently	☐
Full backup of zone data	All zone files and `named.conf` are backed up to a recoverable location	☐
Slave IP list known	List all BIND slave IPs, used for the `allow-transfer` ACL configuration	☐
MaxMind mmdb available	`GeoLite2-Country.mmdb` and `GeoLite2-ASN.mmdb` are downloadable or already in place	☐
mmdb version matches BIND	ShadowDNS uses the same mmdb files as BIND's `geoip-directory`, avoiding GeoIP classification discrepancies	☐
`shadowdns.yaml` generation mechanism confirmed	The management system can automatically generate `shadowdns.yaml` (including the `aliases:` section), or the cost of manual maintenance has been assessed	☐
Monitoring covers both sides	The monitoring system can observe query QPS, error rate, and memory for both BIND and ShadowDNS simultaneously	☐
Rollback procedure rehearsed	The team is familiar with the rollback flow for each Phase (see Rollback Strategy below)	☐

Four-Phase Cutover Steps

Phase 0: Development and Testing (within the scope of this change)

Goal: Confirm that ShadowDNS can correctly handle production-scale configuration files and zone data in a controlled environment, and that memory usage matches expectations.

Steps:

Copy a production named.conf and its Debian-idiomatic includes (named.conf.options holding the options{} block, named.conf.local holding the views and zone definitions) plus the zone file directory to the test environment.
Prepare shadowdns.yaml (a single YAML file covering aliases and the optional ephemeral_api section; it can be assembled manually at first, or generated by the management system in the test environment).
Build the ShadowDNS binary:

go build -o shadowdns ./cmd/shadowdns

Run a startup smoke test with --dry-run to confirm configuration parsing has no errors:

./shadowdns --dry-run \
    --named-conf /path/to/named.conf \
    --config     /path/to/shadowdns.yaml

Observe the startup log and confirm:
All views and zones load successfully
No fatal startup errors (skipped-directive INFO/WARN logs are expected on a BIND drop-in config — see the FAQ on skipped directives)
GeoIP mmdb loads successfully
Use dig to send representative queries to the ShadowDNS test instance, verifying that root zone and backup zone responses are correct:

# Query the root zone A record
dig @<shadowdns-ip> example.com A

# Query the backup zone (should return the same IP as the root, with owner example.net)
dig @<shadowdns-ip> example.net A

# Query SOA (the backup zone's serial should match the root)
dig @<shadowdns-ip> example.net SOA

Measure memory usage (ps or /proc/<pid>/status) and confirm it is below the expected ceiling (about 50 MB).
Run unit tests and integration tests:

go test ./...

Acceptance criteria:

All tests pass
Memory usage matches expectations
No fatal messages in the startup log
Representative dig query results match BIND

Estimated duration: Medium (includes multiple rounds of iterative testing and bug fixes)

Phase 1: Parallel Validation

Goal: Deploy ShadowDNS on a non-production IP, running in parallel with the BIND master, comparing response consistency between the two for identical queries, with continuous observation showing no anomalies.

Steps:

On the BIND master host (or a standby host in the same network segment), start ShadowDNS on a different IP or different port:

./shadowdns \
    --named-conf /etc/bind/named.conf \
    --config     /etc/bind/shadowdns.yaml \
    --listen     192.0.2.20:53

Confirm ShadowDNS starts successfully with no errors in the log.
Design and run a parallel comparison query script, querying both BIND and ShadowDNS simultaneously for each view's representative domains (root zone, backup zone):

# Compare responses from both sides (RDATA should match; SOA serial may differ slightly due to timing)
diff \
  <(dig @<bind-ip>      example.com A +short) \
  <(dig @<shadowdns-ip> example.com A +short)

diff \
  <(dig @<bind-ip>      example.net A +short) \
  <(dig @<shadowdns-ip> example.net A +short)

Expand the comparison scope to multiple record types (A, AAAA, CNAME, NS, MX, TXT, SOA) and multiple views (simulating different source IPs).
Have the monitoring system send probe queries to both sides simultaneously, continuously comparing for differences, observing for at least 7 consecutive days.
If inconsistencies are found, record the specific domain / query type / view, report it, and fix ShadowDNS.
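
The comparison steps above could be scripted along these lines. This is a minimal sketch, not a finished tool: the function names are hypothetical, SOA is left out of the loop because the serial may legitimately differ, and the optional source address relies on dig's -b option, which requires that address to be configured on the host running the script.

```shell
# Print the sorted +short answer for one name/type from one server.
# With a source address as $4, dig -b sends the query from that IP so the
# server evaluates match-clients against it.
short_answer() {   # usage: short_answer <server> <name> <type> [source-ip]
    if [ -n "${4:-}" ]; then
        dig -b "$4" @"$1" "$2" "$3" +short | sort
    else
        dig @"$1" "$2" "$3" +short | sort
    fi
}

# Compare two answer sets, ignoring record order.
# Prints MATCH or DIFF and returns non-zero on DIFF.
compare_answers() {
    a="$(printf '%s\n' "$1" | sort)"
    b="$(printf '%s\n' "$2" | sort)"
    if [ "$a" = "$b" ]; then
        echo "MATCH"
    else
        echo "DIFF"
        return 1
    fi
}

# Walk the record types for one name and report one line per type.
compare_name() {   # usage: compare_name <bind-ip> <shadowdns-ip> <name> [source-ip]
    for rtype in A AAAA CNAME NS MX TXT; do
        result="$(compare_answers \
            "$(short_answer "$1" "$3" "$rtype" "${4:-}")" \
            "$(short_answer "$2" "$3" "$rtype" "${4:-}")")"
        echo "$3 $rtype $result"
    done
}
```

Running compare_name once per view, each time with a source address that the view's match-clients would select, covers the multi-view case; any DIFF line gives the domain, type, and view to record.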

Acceptance criteria:

7 consecutive days of comparison queries with no inconsistencies
No SERVFAIL alerts
No unexpected restarts or panics of the ShadowDNS process

Estimated duration: Medium (includes the 7-day observation window)

Phase 2: Slave Cutover

Goal: Have the management system officially generate the aliases: section of shadowdns.yaml, point the BIND slaves' master to ShadowDNS one host at a time, and verify AXFR synchronization is correct.

Steps:

The management system starts officially generating the aliases: section of shadowdns.yaml and synchronizing it to the ShadowDNS configuration directory.
Pick one staging BIND slave, change its masters { } setting from the BIND master IP to the ShadowDNS IP, then reload:

# On the staging slave, edit the masters block in named.conf, e.g.:
# masters { 192.0.2.20; };  ← change to the ShadowDNS IP
rndc reload

Confirm the staging slave completes AXFR successfully:

# Watch the slave's named log; you should see transfer successful for every zone
journalctl -u named -f | grep "transfer of"

Run query comparisons against the staging slave to confirm resolution results match the BIND master.
Confirm that in the backup zone's AXFR content both owner names and RDATA have been correctly rewritten (spot-check with dig AXFR):

dig @<staging-slave-ip> example.net AXFR

After the staging slave passes validation, perform the same cutover flow on the production slaves one at a time (after each cutover, observe for at least 24 hours with no anomalies before continuing).
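
The owner-name half of the AXFR spot check above can be automated with a small filter. The sketch below is an assumption-laden helper (axfr_foreign_owners is a hypothetical name): it reads dig AXFR output and prints any owner name that is neither the zone apex nor beneath it. It does not inspect RDATA, which still needs a manual look.

```shell
# Read `dig AXFR` output on stdin and print every owner name that is not the
# zone apex or a name beneath it. No output means all owners were rewritten.
axfr_foreign_owners() {   # usage: dig @<slave> <zone> AXFR | axfr_foreign_owners <zone>
    awk -v zone="$1" '
        BEGIN {
            z = tolower(zone)
            sub(/\.$/, "", z)              # accept the zone with or without a trailing dot
            apex = z "."
            suffix = "." apex
        }
        /^;/ || NF == 0 { next }           # skip dig comments and blank lines
        {
            owner = tolower($1)
            if (owner == apex) next
            if (length(owner) > length(suffix) &&
                substr(owner, length(owner) - length(suffix) + 1) == suffix) next
            print $1
        }'
}
```

For example, piping the output of the dig command above through axfr_foreign_owners example.net should print nothing when the backup zone was rewritten completely.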

Acceptance criteria:

All slaves complete AXFR successfully, with no transfer failures
Query results on the slaves match the Phase 1 baseline
The management workflow for the aliases: section of shadowdns.yaml is stable, with no missing domains

Estimated duration: Medium to long (depends on the number of slaves and the host-by-host validation pace)

Phase 3: BIND Master Retirement

Goal: Once all slaves are confirmed to be pulling stably from ShadowDNS, demote the old BIND master to hot standby, and eventually decommission it.

Steps:

Confirm all production slaves have been cut over to ShadowDNS and have been running stably for more than 7 days.
Put the BIND master into "hot standby" mode: keep the process running, but stop accepting zone updates from the management system; it serves only as a standby for emergency rollback.
Watch whether any slaves are still accessing the BIND master (if so, the cutover is incomplete):

# Watch the AXFR request log on the BIND master
journalctl -u named | grep "AXFR"

During the hot standby period (1–2 weeks recommended), continuously monitor ShadowDNS QPS, error rate, and memory.
After the hot standby period ends with no anomalies, decommission the BIND master:

systemctl stop named
systemctl disable named

Update documentation to record that ShadowDNS is now the new sole master.
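
To turn the AXFR log check in the steps above into a list of hosts, the client addresses can be extracted from the matching lines. This sketch assumes named logs the requesting client as <ip>#<port> on AXFR lines, which is the usual form but can vary between BIND versions; axfr_clients is a hypothetical helper and handles IPv4 clients only.

```shell
# Read named log lines on stdin and print the distinct IPv4 client addresses
# found on lines that mention AXFR.
# Assumption: the client is logged as <ip>#<port>.
axfr_clients() {
    grep 'AXFR' \
      | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}#[0-9]+' \
      | cut -d'#' -f1 \
      | sort -u
}

# Usage:
#   journalctl -u named --since "7 days ago" | axfr_clients
```

Any address printed is a slave that still pulls from the BIND master and has not been cut over.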

Acceptance criteria:

No ShadowDNS anomalies during the hot standby period
No slave access in the BIND master log (confirming the cutover is complete)
No impact on DNS resolution after decommissioning

Estimated duration: Short to medium (1–2 weeks of hot standby observation)

Rollback Strategy

Each Phase has a corresponding rollback path. The design principle is "add a ShadowDNS instance, never remove BIND", ensuring any stage can be rolled back safely.

Problems in Phase 1

Rollback method: Simply stop the ShadowDNS process. At this point all slaves still point to the original BIND master, with no traffic impact whatsoever.

# Stop ShadowDNS (adjust according to how it was actually started)
kill <shadowdns-pid>
# or
systemctl stop shadowdns

Impact scope: Only the parallel ShadowDNS instance stops; production is unaffected.

Problems in Phase 2 (some slaves already cut over)

Rollback method: Change the masters { } setting of the already-cutover slaves back to the BIND master IP, then reload:

# On the affected slave, edit named.conf and change masters back to the BIND master IP
# masters { 192.0.2.1; };  ← change back to the original BIND master IP
rndc reload

The BIND master still holds the complete zone data; the slave recovers once it re-pulls AXFR from BIND.

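Caveats: If zone data was updated during Phase 2, confirm the BIND master's zone data is up to date before pointing the slave back at it, and re-run the query comparison once the slave completes AXFR. A slave only transfers when the master's SOA serial is higher than the one it already holds, so if BIND's serial is behind, force the transfer on the slave with rndc retransfer <zone>.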

Problems in Phase 3 (during hot standby)

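Rollback method: Return the slaves to the old BIND master. During hot standby the BIND process is normally still running, so the start command below matters only if it was stopped early. This path is available only while the hot standby period has not expired and the BIND master still holds complete, up-to-date zone data: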

systemctl start named

Then change all slaves' masters back to the BIND master IP and reload.

Precondition: The BIND master must keep its zone data consistent throughout the hot standby period (1–2 weeks). Once the management system updates only ShadowDNS without synchronizing to BIND, the hot standby's zone data will gradually go stale, and rollback at that point may require re-synchronizing zone data.

Recommendation: During the hot standby period, have the management system update both sides (BIND master + ShadowDNS) until everything is confirmed problem-free, and only then stop updating BIND.

Monitoring Checklist

The following metrics should be observed continuously before and after the cutover. It is recommended to establish baseline values during the Phase 1 parallel validation period.

DNS Query Metrics

Metric	How to observe	Expected behavior
Query QPS (per view)	Monitoring system / query log statistics	Post-cutover QPS distribution matches the BIND baseline
NOERROR ratio	DNS server log / monitoring	Should stay at the pre-cutover level
NXDOMAIN ratio	DNS server log / monitoring	Should not rise abnormally after the cutover
SERVFAIL ratio	DNS server log / monitoring	Should be 0 or extremely low; any SERVFAIL warrants immediate investigation
REFUSED ratio	DNS server log / monitoring	Should only appear in legitimate scenarios (out-of-zone queries, CHAOS queries)

Zone Transfer Metrics

Metric	How to observe	Expected behavior
AXFR failure rate (per slave)	Slave BIND log (`transfer of ... failed`)	Should be 0 after the cutover
AXFR completion time	Slave log / monitoring	Close to the Phase 1 baseline; a significant increase warrants investigating ShadowDNS performance
NOTIFY sent	ShadowDNS log	NOTIFY sent entries should appear after zone updates

ShadowDNS Process Metrics

Metric	How to observe	Expected behavior
Process memory usage	`ps`, `/proc/<pid>/status`, or monitoring	Should be below about 20% of the BIND master (target ~50 MB)
Process liveness	Monitoring / systemd	No unexpected restarts
Startup log errors	ShadowDNS log (ERROR level; the log uses the console encoder's tab-separated format, not logfmt)	No error logs during normal operation

GeoIP Metrics

Metric	How to observe	Expected behavior
Query distribution per view	Query log sampling statistics	Close to the BIND baseline's view distribution (small differences allowed)
No-view-match (REFUSED) ratio	DNS server log	Should not rise abnormally; an increase indicates a problem with GeoIP data or rules

GeoIP sampling method: Sample 1000 entries from the ShadowDNS query log, count queries per view, and compare against BIND's view distribution for the same time window. A difference above 5% warrants further investigation of whether the mmdb versions match.
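
The sampling comparison could be sketched as follows. Several things here are assumptions: the query log format is not specified in this guide, so the helpers take view names that have already been extracted, one per line; the 5% threshold is read as percentage points of total queries; and both function names are hypothetical.

```shell
# Count view names (one per line on stdin) and print "<view> <percent>".
view_share() {
    awk 'NF { n[$1]++; total++ }
         END { for (v in n) printf "%s %.1f\n", v, 100 * n[v] / total }' | sort
}

# Compare two "<view> <percent>" files and print every view whose share
# differs by more than <threshold> percentage points.
# The first file must not be empty.
distribution_drift() {   # usage: distribution_drift <bind-file> <shadowdns-file> <threshold>
    awk -v limit="$3" '
        NR == FNR { base[$1] = $2; seen[$1] = 1; next }
        { cur[$1] = $2; seen[$1] = 1 }
        END {
            for (v in seen) {
                d = cur[v] - base[v]
                if (d < 0) d = -d
                if (d > limit)
                    printf "%s bind=%.1f shadowdns=%.1f\n", v, base[v], cur[v]
            }
        }' "$1" "$2" | sort
}

# Usage (views.txt holds one view name per logged query):
#   shuf -n 1000 shadowdns-views.txt | view_share > shadowdns.share
#   shuf -n 1000 bind-views.txt      | view_share > bind.share
#   distribution_drift bind.share shadowdns.share 5
```

No output means every view is within the threshold; any line printed names a view worth checking against the mmdb versions.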

For Prometheus metrics alerting rules (reload failures, latency, GeoIP freshness), see the "Day 2 Operations" section below.

Frequently Asked Questions

Q: After the cutover, a backup domain resolves to the wrong IP

Check whether the mapping for that backup domain in the aliases: section of shadowdns.yaml is correct (i.e., points to the correct root domain). Confirm the corresponding root domain has been loaded correctly in ShadowDNS:

dig @<shadowdns-ip> <root-domain> A
dig @<shadowdns-ip> <backup-domain> A

If the RDATA of the two does not match, either the mapping between the root zone data and the aliases: section of shadowdns.yaml is wrong, or the root zone data itself has a problem.

Q: A slave keeps initiating AXFR and never completes

Possible causes:

SOA serial not synchronized correctly: If after a ShadowDNS reload (restart) the backup zone's SOA serial did not update along with the root zone, the slave may fall into a repeated AXFR loop. Confirm that after ShadowDNS startup the backup zone's SOA serial matches the corresponding root zone:

dig @<shadowdns-ip> <root-domain>   SOA +short
dig @<shadowdns-ip> <backup-domain> SOA +short

allow-transfer ACL misconfiguration: If the slave's IP is not in the allow-transfer list, AXFR will receive REFUSED. Check the allow-transfer setting in named.conf.
TCP connection problems: AXFR runs over TCP; confirm the ShadowDNS host's firewall allows TCP/53 connections from the slave's IP.

Q: GeoIP classification results differ from BIND

Confirm the mmdb files used by ShadowDNS are the same version as those in BIND's geoip-directory:

ls -la /usr/local/share/GeoIP/GeoLite2-Country.mmdb
ls -la /usr/local/share/GeoIP/GeoLite2-ASN.mmdb

Confirm the AS number format in geoip asnum rules is correct (it should be "AS<number> <description>"; ShadowDNS takes only the numeric part). If the format does not match, ShadowDNS exits with a fatal error at startup.
If the mmdb versions are identical but results still differ, the cause may be a discrepancy between BIND's GeoIP module version and the newer MaxMind mmdb schema. Sample ShadowDNS's query log to find IPs classified into a different view, and verify with the mmdblookup tool:

mmdblookup --file /usr/local/share/GeoIP/GeoLite2-Country.mmdb \
           --ip <client-ip> country iso_code

Q: The startup log shows directives being skipped — is that a problem?

Usually not. ShadowDNS tolerates BIND directives it does not act on rather than failing. Directives such as type slave / type forward zones (dropped, logged at INFO), dnssec-enable (silent), and view-scope access control like allow-query / allow-update (logged at WARN as not enforced; zone-scope occurrences are skipped silently) are skipped, and loading continues. This is expected on a BIND drop-in config; see What is tolerated or ignored on load and the tiered tolerance contract for the full classification.

A WARN about a skipped allow-query / allow-recursion / view-scope allow-transfer is the one to read closely: ShadowDNS does not enforce client query ACLs (see Access-control model differences). If you were relying on that directive to hide zones from certain clients, replicate the boundary with match-clients instead.

ShadowDNS only fails to start on genuine syntax errors (unbalanced brace, missing ;) or a few structural conflicts (a malformed geoip asnum, mixing view blocks with top-level zones, geoip-directory unset while a view uses geoip rules). Those errors name the specific file path and line number; fix the cited location.

Q: Memory usage is higher than expected

Expected memory usage is about 20% of the BIND master (~50 MB for the 25,200-zone scenario). If actual usage is high:

Confirm the aliases: section of shadowdns.yaml lists all backup domains completely, with no backup domain mistakenly loaded in full as a root.
Use ps aux or grep VmRSS /proc/<pid>/status to observe RSS (resident memory), being careful not to confuse it with VSZ (virtual memory).
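
A small check against the ceiling might look like the sketch below. The helper names are hypothetical, and 51200 kB is simply the ~50 MB target from this guide expressed in the unit that /proc reports.

```shell
# Extract VmRSS in kB from /proc/<pid>/status content on stdin.
rss_kb() {
    awk '$1 == "VmRSS:" { print $2 }'
}

# Print OK or HIGH for an RSS value against a ceiling, both in kB.
# Returns non-zero when the ceiling is exceeded.
check_rss() {   # usage: check_rss <rss-kb> <ceiling-kb>
    if [ "$1" -le "$2" ]; then
        echo "OK ${1} kB"
    else
        echo "HIGH ${1} kB (ceiling ${2} kB)"
        return 1
    fi
}

# Usage, with the PID of the running shadowdns process in $pid:
#   check_rss "$(rss_kb < "/proc/$pid/status")" 51200
```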

Listen Address Behavior

ShadowDNS reads the listen-on (IPv4) and listen-on-v6 (IPv6) directives from named.conf to decide which IPs to bind to. The behavior is BIND9-compatible: one socket is opened per address, and when a single address fails to bind (for example, systemd-resolved already holds 127.0.0.53:53), it only logs a WARN and continues — it does not prevent the whole server from starting.

Address Source Precedence

Scenario	`--listen`	`listen-on`	`listen-on-v6`	Actual binding
Default	`:53`	Not specified	Not specified	All IPv4 interface addresses (implicit `any`), no IPv6
Default + listen-on specified	`:53`	`{ 10.0.0.1; 10.0.0.2; }`	Not specified	`10.0.0.1:53`, `10.0.0.2:53`
Port hint + dual-family combined	`:53`	`{ 10.0.0.1; }`	`{ 2001:db8::1; }`	`10.0.0.1:53`, `[2001:db8::1]:53` (v4 first)
IPv6-only	`:53`	Not specified	`{ ::1; }`	`[::1]:53`
Override (IPv4)	`127.0.0.1:5353`	Any	Any	`127.0.0.1:5353` (both blocks ignored)
Override (IPv6 bracket)	`[::1]:5353`	Any	Any	`[::1]:5353` (both blocks ignored)
Port hint	`:5353`	`{ 10.0.0.1; }`	Not specified	`10.0.0.1:5353` (port inherited from `--listen`)

Key rule: --listen acts as an override only when it has a host component (for example 127.0.0.1:5353 or the bracketed IPv6 literal [::1]:5353). The :PORT form supplies only a port: addresses come from listen-on (IPv4) and listen-on-v6 (IPv6), with v4 listed first and v6 in bracket form [addr]:port. listen-on-v6 defaults to an empty set (opt-in), unlike listen-on, whose default expands to all IPv4 interfaces; if listen-on-v6 is not configured, no IPv6 listener is started. The server exits with a fatal error only when the merged v4 and v6 results are both empty; if one family is empty and the other is not, startup proceeds normally.
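
The precedence rule can be modeled in a few lines of shell. This is an illustration of the rule as stated above, not ShadowDNS's implementation: resolve_listen is a hypothetical name, and the caller passes address lists that are already resolved (so an unset listen-on would be passed as the host's IPv4 interface addresses).

```shell
# Model of the precedence rule: print the resulting bind set, one per line.
#   $1 = --listen value
#   $2 = resolved IPv4 addresses from listen-on (space-separated, may be empty)
#   $3 = resolved IPv6 addresses from listen-on-v6 (space-separated, may be empty)
resolve_listen() {
    case "$1" in
        :*) port="${1#:}" ;;          # ":PORT" form: port hint only
        *)  echo "$1"; return 0 ;;    # host component present: override
    esac
    if [ -z "$2" ] && [ -z "$3" ]; then
        echo "fatal: no listen addresses resolved" >&2
        return 1
    fi
    for a in $2; do echo "${a}:${port}"; done      # v4 first
    for a in $3; do echo "[${a}]:${port}"; done    # v6 in bracket form
}

# Usage:
#   resolve_listen ':5353' '10.0.0.1' ''
```

The usage line corresponds to the "Port hint" row of the table: the address comes from listen-on and the port from --listen.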

Unsupported listen-on / listen-on-v6 Syntax

The following BIND syntax is not supported in either listen-on or listen-on-v6. It will be logged as WARN and skipped, without affecting parsing:

Exclusion syntax: listen-on { !10.0.0.1; any; }; (!addr exclusion)
ACL references: listen-on { trusted-net; };
Port override: listen-on port 5353 { ... }; (use --listen :5353 instead)
The interface keyword

IPv6 literal addresses are now supported in listen-on-v6 (e.g., listen-on-v6 { 2001:db8::1; ::1; };). If an IPv6 literal is placed in listen-on, or an IPv4 literal in listen-on-v6, that entry is logged as WARN and skipped (address family mismatch).

Interaction with systemd-resolved

On hosts where systemd-resolved runs with its stub listener enabled (the default on Ubuntu 24.04; on Debian 12 only if the separate systemd-resolved package is installed), 127.0.0.53:53 and 127.0.0.54:53 are already held by the stub listener. When ShadowDNS expands listen-on { any; };, it will try to bind those addresses, receive EADDRINUSE, and log a WARN with a hint:

level=WARN msg="listener bind failed; skipping address"
  addr=127.0.0.53:53
  err="bind UDP 127.0.0.53:53: ... address already in use"
  hint="likely systemd-resolved stub on loopback; set DNSStubListener=no
         in /etc/systemd/resolved.conf if this address is expected"

This is expected behavior, not an error. External-facing interfaces (10.x.x.x, 192.168.x.x, etc.) still bind successfully, and external DNS service operates normally. If you actually need ShadowDNS to listen on 127.0.0.53, disable the systemd-resolved stub:

sudo sed -i -E 's/^#?DNSStubListener=.*/DNSStubListener=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved

SIGHUP Reload Does Not Rebind Listeners

If a SIGHUP reload picks up a changed listen-on or listen-on-v6, ShadowDNS does not reopen sockets. This is deliberate, to avoid a brief port-takeover gap during reload. Reload drift detection covers the union of v4 and v6; when the address set of either family has changed, it logs:

level=INFO msg="reload: listen-address set differs from bound set; restart to apply
                (cause: listen-on/listen-on-v6 change and/or interface change since startup)"
  current_bound=[10.0.0.1:53, 127.0.0.1:53]
  new_resolved=[10.0.0.2:53]

Only running systemctl restart shadowdns applies the new listen addresses.

BREAKING Behavior Differences (compared to before v0.3.0)

The default binding changed from "a single 0.0.0.0:53 wildcard socket" to "per-address bind". The visible difference: the startup log goes from 1 listener bound entry to N entries.
Newly added NICs / IP aliases are not picked up automatically; BIND's interface-interval dynamic scanning is not supported in this version. Use systemctl restart shadowdns to bring new addresses into the listening set.
The semantics of --listen changed from "bind target" to "override hint + port hint". If you previously wrote --listen :53 expecting 0.0.0.0 wildcard behavior, it is now treated as "port hint, with addresses taken from listen-on (or expanded from any)" — the behavior is identical in most cases, but the explicit logs will differ.

Day 2 Operations

This section covers steady-state operations after the cutover is complete (ShadowDNS is the sole master): alerting configuration centered on Prometheus metrics, plus routine SOPs. The metric baselines from the "Monitoring Checklist" above still apply on Day 2.

Detecting Silent Reload Failures

When a SIGHUP reload fails (e.g., zone file syntax error, config parse error), ShadowDNS does not crash and does not interrupt service — the process keeps answering with the previous configuration, and the only externally visible symptom is the SOA serial staying at the old value. Without proactive detection, stale data could be served for hours unnoticed.

Primary detection method: reload metrics alerting

ShadowDNS exposes the shadowdns_reload_total{result="success"|"failure"} counter and the shadowdns_config_last_reload_success_timestamp_seconds gauge via --metrics-addr (default :9153); see the shadowdns.yaml configuration reference for the semantics. Both result label values are pre-initialized at startup, so alert expressions do not need to handle metric absence. Recommended alerting rules:

# Alert immediately on any reload failure
increase(shadowdns_reload_total{result="failure"}[15m]) > 0

# Staleness alert: if the zone push cadence is fixed (e.g., at least once daily),
# this detects the "pushed, but no successful reload ever happened" case
# (note: a process restart also resets this gauge, so it alone cannot cover all scenarios)
time() - shadowdns_config_last_reload_success_timestamp_seconds > 86400

Verification step after every push: serial probe

After pushing a zone change and sending SIGHUP, compare the serial on disk against the serial being served live:

# 1. Read the serial declared in the zone file on disk. In the multi-line SOA style,
#    the serial is the first numeric value after the parenthesis (conventionally
#    annotated "; serial"); note that if the SOA line carries an explicit TTL,
#    the first number on the line is the TTL, not the serial — don't grab the wrong field
grep -m1 -A1 'SOA' /etc/bind/db.example.com-th

# 2. Get the serial being served live (3rd field of dig SOA +short output)
dig @127.0.0.1 example.com SOA +short | awk '{print $3}'

Comparison logic:

Both match → reload succeeded, push complete.
Live serial is older than disk → the reload failed silently and stale configuration is being served. Alert immediately, roll the zone file back to the last known-good version (so the next reload at least restores a consistent state), then investigate the cause of the error in the new zone file.
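
The comparison logic above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, serials are compared as plain integers without RFC 1982 wraparound handling, and the third outcome (live serial ahead of disk) is not covered by the logic above and is reported only so that it is not mistaken for a match.

```shell
# Classify the disk serial against the live serial.
# Plain integer comparison; serial wraparound is not handled.
classify_serials() {   # usage: classify_serials <disk-serial> <live-serial>
    if [ "$1" -eq "$2" ]; then
        echo "match"
    elif [ "$2" -lt "$1" ]; then
        echo "stale"          # reload failed silently; old data is being served
        return 1
    else
        echo "live-ahead"     # disk is older than what is served; investigate
        return 2
    fi
}

# Pull the serial (3rd field) out of `dig SOA +short` output on stdin.
live_serial() {
    awk 'NR == 1 { print $3 }'
}

# Usage (disk serial read manually from the zone file):
#   classify_serials 2024010102 "$(dig @127.0.0.1 example.com SOA +short | live_serial)"
```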

For the other half of post-push verification — response content comparison — see "Ongoing Answer Consistency Regression Verification" in this section.

Supplementary method: log inspection

In environments without Prometheus, watch the application-level log for ERROR-level reload failed messages:

# Use tail to bound the scope, avoiding repeatedly re-scanning historical events
# within the rotation window on every check
sudo tail -n 5000 /var/log/shadowdns/shadowdns.log | grep -E 'ERROR\s+reload failed'

Note: ShadowDNS logs use the console encoder format (tab-separated time level message fields), not logfmt's level=ERROR msg="..." form; when configuring log-based alerts, the pattern must match the actual format (the log excerpts in earlier sections of this document are illustrative).

Latency Monitoring

ShadowDNS records each query's processing time with the shadowdns_dns_request_duration_seconds histogram (label: view), with bucket boundaries covering 0.1 ms to 100 ms, giving sufficient resolution for both the sub-millisecond norm of authoritative DNS and tens-of-milliseconds anomalies.

Derive quantile latencies with histogram_quantile:

# Global p99 (aggregated across all views); use sum by (le, view) to see per-view,
# and replace 0.99 with 0.5 / 0.95 to get p50 / p95
histogram_quantile(0.99,
  sum by (le) (rate(shadowdns_dns_request_duration_seconds_bucket[5m])))

Alerting suggestion: Add > 0.01 (10 ms, which happens to be one of the bucket boundaries, where the quantile estimate is most accurate) to the global p99 query above as the alert condition. Adjust the actual threshold per your environment's SLA; it is recommended to watch both p50 (steady-state level drift) and p99 (tail degradation) — divergence between the two often points to GC, disk I/O (query log), or a hot spot in a single view.

GeoIP DB Staleness Monitoring and Monthly Rotation

MaxMind updates the GeoLite2 databases monthly. A stale mmdb causes GeoIP view classification to gradually drift from reality (IP ranges change hands, ASNs get reassigned); the symptom is queries from specific sources being routed to the wrong view — this kind of drift triggers no error alerts and can only be caught by proactively monitoring DB freshness.

Staleness monitoring (alert threshold: 35 days)

ShadowDNS exposes the build time of the loaded mmdb as the shadowdns_geoip_db_info gauge (value always 1, with the build_time label as an RFC3339 string):

shadowdns_geoip_db_info{build_time="2026-05-13T00:00:00Z",database="country"} 1
shadowdns_geoip_db_info{build_time="2026-05-13T00:00:00Z",database="asn"} 1

Because build_time is a string label, pure PromQL cannot compute the age directly; the recommendation is a scheduled script that fetches /metrics and computes it, alerting beyond 35 days (MaxMind's monthly update cycle is about 30 days, so 35 days means one update round has already been missed):

# cron check: print STALE if any database's build_time exceeds 35 days (no output = pass)
curl -s http://127.0.0.1:9153/metrics \
  | awk -F'"' '/^shadowdns_geoip_db_info/{print $2}' \
  | while read -r ts; do
      # if/fi keeps the pipeline's exit status at 0 when nothing is stale
      if [ "$(date -d "$ts" +%s)" -lt "$(date -d '35 days ago' +%s)" ]; then echo "STALE: $ts"; fi
    done

If your monitoring stack supports extracting numeric values from labels (e.g., VictoriaMetrics' MetricsQL), you can also alert directly with a query expression.

Monthly routine maintenance SOP

mmdb files are reopened on every SIGHUP reload (see the GeoIP database reference), so GeoIP updates do not require a process restart — place the new files at the original path and reload host by host:

Download the new Country and ASN database tar.gz packages, verify the checksum before extracting (the SHA256 files MaxMind provides correspond to the tar.gz archives, not the extracted .mmdb), and once confirmed, place the .mmdb files at the production path:

sha256sum -c GeoLite2-Country_<date>.tar.gz.sha256
sha256sum -c GeoLite2-ASN_<date>.tar.gz.sha256
tar -xzf GeoLite2-Country_<date>.tar.gz
tar -xzf GeoLite2-ASN_<date>.tar.gz

Trigger a reload one host at a time (the systemd unit already defines ExecReload to send SIGHUP):

sudo systemctl reload shadowdns

Confirm that host's shadowdns_geoip_db_info{build_time} reflects the new build date; if it has not updated, use the reload metrics and the application-level log (see "Detecting Silent Reload Failures") to find the cause of the reload failure:

curl -s http://<instance-ip>:9153/metrics | grep '^shadowdns_geoip_db_info'

After that host passes verification, repeat the reload and verification steps on the next host, until all instances are done.

Ephemeral DNS-01 Record Volatility

ACME DNS-01 challenge TXT records written via the ephemeral API (PUT /v1/txt/{fqdn}) are held only in memory: any process restart (manual restart, upgrade, host reboot) and any successful SIGHUP reload clears all ephemeral records. The reload behavior is deliberate (see internal/ephemeral/store.go: "ephemeral state does not survive a config reload"), so that the post-reload service state is derived entirely from the configuration files. For the full lifecycle behavior, see docs/ephemeral-api.md.

If the wipe happens after the ACME client writes the TXT record but before the CA sends its validation query, that challenge fails validation and certificate renewal is interrupted.

Pre-operation checklist

Before performing a restart or sending SIGHUP:

Confirm no DNS-01 challenge is currently in progress. The ephemeral API has only PUT / DELETE endpoints and cannot enumerate records, so use either of the following instead:

# Directly query whether the challenge TXT record exists (NXDOMAIN / empty response = no challenge in progress)
dig @127.0.0.1 _acme-challenge.example.com TXT +short

Or check the ACME client's (certbot / lego, etc.) logs and scheduler state, confirming no renewal flow is currently running.

If a challenge is in progress, wait for it to finish (success or failure either way) before restarting / reloading.
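
Because the API cannot enumerate records, the dig check has to be repeated for every domain that might be renewing. A minimal sketch of that loop follows; the function name challenge_pending is hypothetical, and the list of domains is something the operator must supply (for example, from the ACME client's certificate inventory).

```shell
# challenge_pending SERVER DOMAIN...
# Returns 0 (and prints the names) if any _acme-challenge TXT record is
# currently served for the given domains; returns 1 if none is.
challenge_pending() {
    local server="$1"
    shift
    local found=1 domain answer
    for domain in "$@"; do
        # Empty +short output covers both NXDOMAIN and an empty answer.
        answer=$(dig "@$server" "_acme-challenge.$domain" TXT +short)
        if [ -n "$answer" ]; then
            echo "challenge in progress: _acme-challenge.$domain"
            found=0
        fi
    done
    return "$found"
}

# Usage sketch: postpone the reload while a challenge record is live.
#   if challenge_pending 127.0.0.1 example.com example.net; then
#       echo "postponing reload" >&2
#   else
#       sudo systemctl reload shadowdns
#   fi
```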

Scheduling recommendation: Schedule shadowdns restarts / reloads (including SIGHUPs triggered by zone pushes) outside the ACME certificate renewal window. ACME clients' renewal schedules can usually be pinned to a fixed time slot (e.g., certbot's systemd timer), which rules out a reload colliding with a challenge by design rather than by per-operation checking. Once the fixed window is established, routine zone-push SIGHUPs do not require per-domain dig confirmation; the checklist above is mainly for unplanned restarts / reloads.

Restart Cost and Rolling Restart SOP¶

Which changes go through SIGHUP and which require a restart

In the current version, the vast majority of configuration changes are applied via SIGHUP reload; only three categories still require a full restart:

Change type	How it's applied
Zone data (zone files, `named.conf.local`)	SIGHUP reload
GeoIP mmdb updates	SIGHUP reload
RRL (rate-limit) settings	SIGHUP reload
Query log path / options (`logging{}`)	SIGHUP reload
`aliases:` section of `shadowdns.yaml`	SIGHUP reload
`ephemeral_api:` section of `shadowdns.yaml`	full restart (the API server is created exactly once at startup based on the configuration at that moment; reload does not re-read listen / allow / token, nor does it start or stop the API)
Any CLI flag (e.g., `--log-file`, `--listen`, `--metrics-addr`)	full restart (flags are fixed for the lifetime of the process; after modifying the systemd unit, run `daemon-reload` and then restart)
`listen-on` / `listen-on-v6` address changes	full restart (reload detects the drift and logs a hint, but deliberately does not rebind sockets; see "SIGHUP Reload Does Not Rebind Listeners")
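
The table can be encoded as a small decision helper for change-management scripts. This is only a sketch: the function name required_action and the change keywords (zone, geoip, cli-flag, and so on) are hypothetical labels chosen for illustration, not identifiers ShadowDNS defines.

```shell
# required_action CHANGE...
# Prints "restart" if any listed change needs a full restart according to the
# table above, otherwise "reload". Unknown keywords are rejected with status 2.
required_action() {
    local action=reload change
    for change in "$@"; do
        case "$change" in
            zone|geoip|rrl|logging|aliases)
                ;;                      # applied by SIGHUP reload
            ephemeral_api|cli-flag|listen-on|listen-on-v6)
                action=restart ;;       # requires a full restart
            *)
                echo "unknown change type: $change" >&2
                return 2 ;;
        esac
    done
    echo "$action"
}

# Usage sketch:
#   required_action zone aliases      # prints reload
#   required_action zone cli-flag     # prints restart
```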

The performance cost of restarting

A freshly restarted ShadowDNS is in a cold-start state: in dnspyre benchmarks, the first benchmark round after a restart shows about 30% lower QPS, recovering afterward. This is a benchmark observation (Go runtime warm-up, OS page cache, and similar factors), not a service capacity guarantee, but capacity planning and restart scheduling should still assume reduced peak handling capacity for a short period after a restart.

Rolling Restart SOP

Prerequisite: at least 2 production instances deployed. A single-instance deployment leaves no room for rolling; any restart is a service interruption.

1. Batch restart-requiring configuration changes: accumulate them and apply them once in a maintenance window, rather than running a restart round for every single flag change.
2. After completing the pre-operation checklist in "Ephemeral DNS-01 Record Volatility", start with the first host: drain traffic (withdraw from the LB / anycast announcement if fronted by one), then restart:

sudo systemctl restart shadowdns

3. Confirm the host is healthy (this check sequence is also referenced by the upgrade / rollback SOP):

# process alive
systemctl is-active shadowdns

# no ERROR logs (no output = pass). The log file accumulates across restarts and
# tail -n 200 does not filter by time, so disregard any matching lines that are
# timestamped before the restart.
sudo tail -n 200 /var/log/shadowdns/shadowdns.log | grep ERROR

# answering normally
dig @127.0.0.1 example.com SOA +short

4. Wait for the host's QPS to return to the pre-restart baseline (watch the QPS curve in monitoring, or compare rate(shadowdns_dns_requests_total[1m]) against the pre-restart level), then run steps 2–3 on the next host.
5. Once all hosts are done, use monitoring to confirm overall QPS and error rate have returned to the pre-change baseline.
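
Where no dashboard is at hand, the QPS recovery check can be approximated from two samples of the shadowdns_dns_requests_total counter taken a known interval apart. The sketch below makes several assumptions: the helper names are hypothetical, the metrics output carries no trailing timestamps (so the last field of each sample line is the value), and the 90% recovery threshold is an arbitrary default to tune.

```shell
# sum_counter NAME
# Sum every series of counter NAME from Prometheus text exposition on stdin.
sum_counter() {
    awk -v name="$1" '
        $1 == name || index($1, name "{") == 1 { total += $NF }
        END { printf "%.0f\n", total }'
}

# qps_recovered BASELINE_QPS CURRENT_QPS [PERCENT]
# Succeeds when CURRENT_QPS is at least PERCENT (default 90) of BASELINE_QPS.
# Integer arithmetic only.
qps_recovered() {
    local baseline="$1" current="$2" percent="${3:-90}"
    [ $(( current * 100 )) -ge $(( baseline * percent )) ]
}

# Usage sketch (both samples taken after the restart, 60 seconds apart):
#   c1=$(curl -s http://<instance-ip>:9153/metrics | sum_counter shadowdns_dns_requests_total)
#   sleep 60
#   c2=$(curl -s http://<instance-ip>:9153/metrics | sum_counter shadowdns_dns_requests_total)
#   qps_recovered "$baseline_qps" $(( (c2 - c1) / 60 )) && echo "recovered"
```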

Upgrade and Rollback SOP¶

v0.x.x releases are experimental: assume every version bump may contain breaking CLI / config changes (flag renames, config schema adjustments, default value changes), so --dry-run validation is a mandatory step of every upgrade.

Standard upgrade flow (one host at a time)

1. Download the new .deb package, and keep the current version's .deb for rollback (record the current version first):

shadowdns --version   # record the rollback baseline

2. Unpack the new binary into a temporary directory (nothing is installed, and the running service is unaffected), then run --dry-run validation with the new binary against the configuration paths the running service actually uses (take the paths from the ExecStart flags shown by systemctl cat shadowdns; the package default is /etc/shadowdns/):

dpkg-deb -x shadowdns_<new-version>_amd64.deb /tmp/shadowdns-new
/tmp/shadowdns-new/usr/bin/shadowdns --dry-run \
    --named-conf /etc/shadowdns/named.conf \
    --config     /etc/shadowdns/shadowdns.yaml

For --dry-run semantics see docs/benchmark.md. Any parse error, unsupported flag, or schema incompatibility surfaces at this step. If the dry run fails, stop the upgrade. Nothing has been installed yet, so no rollback is needed; just investigate the compatibility issue.

3. After the dry run passes, install the new package and apply it host by host per the Rolling Restart SOP (steps 2–4: ephemeral checklist, restart, health checks, wait for QPS to stabilize):

sudo dpkg -i shadowdns_<new-version>_amd64.deb
sudo systemctl restart shadowdns

Rollback (when any host fails to start or behaves abnormally):

sudo dpkg -i shadowdns_<previous-version>_amd64.deb
sudo systemctl restart shadowdns

After restarting, run the health checks from Rolling Restart SOP step 3; once verified, go back and investigate the cause of the new version's failure. Other instances that upgraded successfully can temporarily stay on the new version (after confirming new and old versions can serve side by side), or be rolled back together to keep versions consistent — judge by the nature of the failure.
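
The dry-run gate in the upgrade flow can be made explicit in a wrapper so that the install step cannot run after a failed validation. This is a sketch under one important assumption that this document does not state: that shadowdns --dry-run exits non-zero when validation fails. Verify that against the release in use before relying on it. The function name dry_run_gate is hypothetical.

```shell
# dry_run_gate BINARY NAMED_CONF CONFIG
# Run BINARY in --dry-run mode against the given configuration files.
# ASSUMPTION: a failed dry run is reported through a non-zero exit status.
dry_run_gate() {
    local bin="$1" named_conf="$2" config="$3"
    if "$bin" --dry-run --named-conf "$named_conf" --config "$config"; then
        echo "dry run passed: safe to install"
    else
        echo "dry run FAILED: stop here, nothing has been installed" >&2
        return 1
    fi
}

# Usage sketch:
#   dry_run_gate /tmp/shadowdns-new/usr/bin/shadowdns \
#       /etc/shadowdns/named.conf /etc/shadowdns/shadowdns.yaml \
#     && sudo dpkg -i shadowdns_<new-version>_amd64.deb
```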

Ongoing Answer Consistency Regression Verification¶

The answer-diff comparison from Phase 1 step 3 should not be shelved after the cutover completes. It is a standing operational tool that should run after every zone change push, comparing the responses of two instances to the same queries. During the hot standby period the pair is BIND vs ShadowDNS; afterwards it is old vs new ShadowDNS versions, or any two instances that should be identical:

# After pushing a zone change and confirming both hosts completed the reload,
# compare responses for the affected domains
diff \
  <(dig @<instance-a-ip> example.com A +short | sort) \
  <(dig @<instance-b-ip> example.com A +short | sort)

The comparison scope should cover the zones changed in this push plus their backup / alias domains, with record types matching Phase 1 step 4 (A, AAAA, CNAME, NS, MX, TXT, SOA); when comparing SOA, you can first strip the serial field with awk '{$3=""; print}' to avoid false differences from reload timing skew.
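
Extending the single-type diff to the full record-type list might look like the sketch below. The function names normalized_answer and answer_diff are hypothetical; the SOA handling reuses the awk serial-stripping expression quoted in the paragraph above, and all other types are compared as sorted +short output.

```shell
# normalized_answer SERVER NAME TYPE
# Print the +short answer in a comparable form: sorted, and with the SOA
# serial (third field) blanked so reload timing skew does not register.
normalized_answer() {
    local server="$1" name="$2" type="$3"
    dig "@$server" "$name" "$type" +short |
        awk -v type="$type" 'type == "SOA" { $3 = "" } { print }' |
        sort
}

# answer_diff SERVER_A SERVER_B NAME
# Compare both servers across all record types; print one line per mismatch
# and return 1 if anything differed.
answer_diff() {
    local a="$1" b="$2" name="$3"
    local type ra rb rc=0
    for type in A AAAA CNAME NS MX TXT SOA; do
        ra=$(normalized_answer "$a" "$name" "$type")
        rb=$(normalized_answer "$b" "$name" "$type")
        if [ "$ra" != "$rb" ]; then
            echo "DIFF $name $type"
            rc=1
        fi
    done
    return "$rc"
}

# Usage sketch:
#   answer_diff <instance-a-ip> <instance-b-ip> example.com
```

Run it once per changed zone and once per backup / alias domain; any DIFF line is a prompt to rerun the plain dig commands and inspect the two answers side by side.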

Pay special attention to alias / CNAME flattening: ShadowDNS's backup domain rewriting logic (owner name and RDATA rewrite) is where behavior diverges most from BIND; edge cases (deep CNAME chains, cross-zone targets, wildcard records interacting with aliases) may produce responses that differ from expectations. No answer-diff difference should be waved through as noise — first confirm it is a known acceptable difference (such as the SOA serial timing skew above); otherwise always investigate proactively.

Duplicate records collapse at load (matching BIND). ShadowDNS discards duplicate resource records within an RRset when it loads a zone. Two records count as duplicates when owner, type, and RDATA are all byte-identical; TTL is excluded from the comparison. The first occurrence is kept and later copies are dropped. This mirrors BIND's RFC 2181 §5 RRset semantics (servers should suppress records whose label, class, type, and data are all equal), so a name whose record is declared twice (a common pattern: the same record inline in a per-view file and again inside a shared $INCLUDEd fragment) is served exactly once by both servers, and an answer-diff for such a name should show no duplicate on either side. Each zone that dropped at least one duplicate logs a single WARN summary (zone origin, total count, by-type histogram) to the application log; per-duplicate detail is available at DEBUG.
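
The collapse rule can be illustrated outside the server. The sketch below is not ShadowDNS code; it assumes a simplified input of fully expanded records, one per line, in the form "owner TTL class type rdata", and the function name dedupe_rrs is hypothetical. It keeps the first record for each owner, type, and RDATA combination and ignores the TTL.

```shell
# dedupe_rrs
# Read "owner TTL class type rdata..." lines on stdin and print them with
# later duplicates removed. The key is owner + type + RDATA; the TTL (field 2)
# is deliberately left out, so the first occurrence wins regardless of TTL.
dedupe_rrs() {
    awk '{
        key = $1 " " $4
        for (i = 5; i <= NF; i++) key = key " " $i
        if (!(key in seen)) { seen[key] = 1; print }
    }'
}

# Usage sketch:
#   printf '%s\n' \
#       'www.example.com. 300 IN A 192.0.2.1' \
#       'www.example.com. 600 IN A 192.0.2.1' | dedupe_rrs
#   # prints only the first line
```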

Query Log Disk Management¶

An authoritative DNS server's query log writes one line per query — in production environments with thousands of QPS, a single day's log can reach several GB. An uncontrolled query log is a common cause of full disks and, in turn, service impact.

Audit the logrotate configuration

The logrotate configuration installed by the .deb package lives at /etc/logrotate.d/shadowdns (the content's source of truth is packaging/logrotate.shadowdns in the repo; defaults are daily rotation, 14 copies retained, compression, and notifying ShadowDNS via SIGUSR1 after rotation to reopen the log file). Validate in dry-run mode:

sudo logrotate -d /etc/logrotate.d/shadowdns

The output lists the matched log files and the actions that would be taken without actually rotating; confirm /var/log/shadowdns/*.log is matched and the rotation policy is as expected.

Tune the rotation cadence to actual query volume

The default "daily × 14 copies" is a generic starting point, not a capacity commitment. Calibrate against actual volume after going live: measure how much one day of logging occupies (du -sh /var/log/shadowdns/ reports the directory total, including rotated copies), then project whether the 14-day retention fits the disk budget. At high volume, switch to hourly rotation or reduce the retention count; at low volume, extend retention to gain a longer query lookback window.
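
The retention projection is simple arithmetic and can be scripted for capacity reviews. The sketch below assumes a steady state of one live uncompressed log plus the retained compressed copies; the function name projected_log_mb is hypothetical, and the compression ratio is an input you must measure on your own logs (the 10% in the example is an assumption, not a measured figure).

```shell
# projected_log_mb DAILY_MB COPIES RATIO_PERCENT
# Estimate steady-state disk use in MB: one live uncompressed daily log plus
# COPIES rotated logs, each compressed to RATIO_PERCENT of its original size.
projected_log_mb() {
    local daily_mb="$1" copies="$2" ratio_percent="$3"
    echo $(( daily_mb + copies * daily_mb * ratio_percent / 100 ))
}

# Example: 4000 MB/day, 14 copies, compressing to an assumed 10% of original:
#   projected_log_mb 4000 14 10    # prints 9600
```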

The query log and the application-level log are two separate streams

Query log: a record of every DNS query, with the path determined by the logging{} configuration, managed by the logrotate setup above.
Application-level log: program events such as startup, reload, and errors, written to /var/log/shadowdns/shadowdns.log (reading requires sudo). When troubleshooting service anomalies (reload failures, bind failures, panics), look here — do not fish through the query log.

Emergency Contact¶

For DNS Ops on-call contact details, see the team's internal wiki.

When a production issue is found, prioritize executing the Rollback Strategy for the corresponding Phase; perform root-cause analysis only after service has been restored.