Monitoring with Prometheus and Grafana
ShadowDNS exposes Prometheus metrics over HTTP and ships a ready-to-import
Grafana dashboard. This page covers the metrics endpoint, a Prometheus scrape
configuration, the metric families you can graph, how to load the bundled
dashboard, and how to capture CPU/memory flame graphs with the built-in pprof
profiler.
The metrics endpoint
ShadowDNS serves metrics on a dedicated HTTP listener controlled by
--metrics-addr. The default is :9153; set it to an
empty string to disable the endpoint (no metrics — including the Go runtime and
process collectors — are registered in that case).
Metrics live on their own registry, so the response contains the shadowdns_*
families plus the standard go_* (Go runtime) and process_* families.
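To check the endpoint by hand, the exposition text can be fetched and filtered by family prefix. The sketch below assumes the conventional /metrics path (the scrape job on this page does not override Prometheus's default metrics_path) and a server reachable at localhost:9153. count_family is a hypothetical helper name, not part of ShadowDNS.

```shell
# Assumption: metrics are exposed at the conventional /metrics path.
METRICS_URL="${METRICS_URL:-http://localhost:9153/metrics}"

# count_family PREFIX: read exposition text on stdin and print how many
# sample lines start with PREFIX. "# HELP" and "# TYPE" comment lines are
# not counted because they start with "#".
count_family() {
  grep -c "^$1" || true
}

# Usage against a live server:
#   curl -fsS "$METRICS_URL" | count_family shadowdns_
#   curl -fsS "$METRICS_URL" | count_family process_
```

A non-zero count for shadowdns_ suggests the listener is serving; a zero count for process_ on a non-Linux host is expected.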
process_* is Linux-only
The process_* family (resident memory, CPU seconds, file descriptors,
process start time) is produced by the process collector, which only reports
data on Linux. On other platforms those series are simply absent — this is
expected, not an error. The go_* family is present on every platform.
Prometheus scrape configuration
Add a scrape job that points at the metrics endpoint. The 9153 port is the
ShadowDNS default.
scrape_configs:
  - job_name: shadowdns
    static_configs:
      - targets:
          - "ns1.example.com:9153"
          - "ns2.example.com:9153"
Confirm the target is up on the Prometheus Status → Targets page before
moving on to Grafana.
Metric reference
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| shadowdns_build_info | gauge | version, goversion | Always 1; build identification |
| shadowdns_dns_requests_total | counter | proto, family, type, view | DNS requests received |
| shadowdns_dns_responses_total | counter | rcode, view | DNS responses sent |
| shadowdns_dns_request_duration_seconds | histogram | view | Request processing latency (buckets 100µs–100ms) |
| shadowdns_dns_ecs_queries_total | counter | family, status | ECS option classifications (only with --ecs-enable) |
| shadowdns_dns_view_selected_total | counter | view, ecs_geo | Successful view resolutions on the main query path |
| shadowdns_dns_rate_limit_total | counter | category, action | RRL decisions |
| shadowdns_zones_loaded | gauge | view | Root zones loaded per view |
| shadowdns_zones_backup | gauge | view | Backup-override zones loaded per view |
| shadowdns_geoip_db_info | gauge | database, build_time | Loaded GeoIP database build time |
| shadowdns_reload_total | counter | result | SIGHUP reload attempts |
| shadowdns_config_last_reload_success_timestamp_seconds | gauge | - | Unix time of the last successful load |
| shadowdns_panics_total | counter | - | Panics recovered by handlers |
| go_* | various | - | Go runtime (goroutines, heap, GC) |
| process_* | various | - | Process resource usage (Linux-only) |
ECS classification metrics
shadowdns_dns_ecs_queries_total is incremented once per query that carries an
EDNS Client Subnet option while ECS handling is enabled (see the
ECS guide). When --ecs-enable is off, or a query carries
no ECS option, this counter is not touched.
- status is one of valid, opt_out, or malformed, matching the option's classification. A malformed option is still answered with FORMERR exactly as before; recording the metric does not change the response.
- family is derived from the ECS option's own address-family field: ipv4 for family 1, ipv6 for family 2, unknown otherwise.
The ECS carry rate is sum(rate(shadowdns_dns_ecs_queries_total[5m])) /
sum(rate(shadowdns_dns_requests_total[5m])).
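To evaluate this ratio outside Grafana, the expression can be sent to the Prometheus HTTP API as an instant query. This is a sketch: the Prometheus address (localhost:9090) and the prom_query helper name are assumptions, while /api/v1/query is the standard instant-query endpoint.

```shell
# Assumption: Prometheus is reachable at this URL; adjust for your setup.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# The carry-rate expression from the text, kept in one variable.
ECS_CARRY_RATE='sum(rate(shadowdns_dns_ecs_queries_total[5m])) / sum(rate(shadowdns_dns_requests_total[5m]))'

# prom_query EXPR: run an instant query through the Prometheus HTTP API.
# -G sends the URL-encoded query as a GET parameter.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage against a live Prometheus:
#   prom_query "$ECS_CARRY_RATE"
```

An empty result vector means at least one of the two counters has no samples in the 5-minute window, for example when --ecs-enable is off.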
View selection metrics
shadowdns_dns_view_selected_total is incremented once for each query whose
view resolves on the main query path. Queries refused before a view is resolved
(no view matched, unparseable client IP) do not increment it, and zone transfers
(AXFR/IXFR) are out of scope.
What ecs_geo means
ecs_geo="true" means an ECS-derived geo address was available to the
matcher for that query — not that ECS decided the final view. The view may
still have been chosen by an IP/CIDR ACL rule, which always evaluates the
real source IP. Read this label as "ECS geo participation", not "ECS-driven
view selection".
Importing the Grafana dashboard
The repository ships a dashboard at
grafana/shadowdns-overview.json.
It is not packaged into the .deb; fetch it from the repository.
- In Grafana, go to Dashboards → New → Import.
- Upload grafana/shadowdns-overview.json (or paste its contents).
- When prompted, select your Prometheus data source for the DS_PROMETHEUS input.
- Click Import.
The dashboard provides Job and Instance template variables at the top so you
can scope every panel to a single ShadowDNS process or view the fleet in
aggregate.
Panel groups
- Overview: build info, process uptime, total QPS.
- Traffic: QPS by protocol/family/query type, responses by rcode, and SERVFAIL/REFUSED/NXDOMAIN ratios (ratios fall back to 0 on zero-traffic windows).
- Latency: p50/p90/p99 from the request-duration histogram, overall and per view.
- ECS & Views: per-view selection rate, ECS-geo participation ratio, ECS classification by status/family, and the ECS carry rate.
- Rate Limiting: RRL decisions by category and action.
- Config & Zones: reload attempts, time since last successful reload, the GeoIP database table, and zones loaded per view.
- Runtime: process CPU, memory (RSS and Go heap), goroutines, file descriptors, and GC pause quantiles.
- Panics: panic total and rate.
Empty panels before traffic
The ECS and per-view panels stay empty until matching traffic arrives, and
the process_*-based panels stay empty on non-Linux hosts. Neither is an
error.
Profiling with go tool pprof
ShadowDNS embeds Go's pprof profiler. With --pprof-enable
set, the profiling endpoints are mounted under /debug/pprof/ on the same HTTP
listener as the metrics endpoint (--metrics-addr, default :9153). They let
you capture CPU and memory profiles from a running server and render them as
flame graphs to find hot spots — no restart, no rebuild, no separate agent.
Trusted networks only — no authentication
The /debug/pprof/ handlers have no access control. Anyone who can reach
the endpoint can capture profiles and read cmdline. Never expose
--metrics-addr to an untrusted network while pprof is enabled. Reach the
endpoints over a private network or an SSH tunnel (shown below), and leave
--pprof-enable off unless you are actively profiling.
Enabling the endpoints
--pprof-enable requires --metrics-addr to be non-empty, because the handlers
mount on the metrics server:
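A minimal invocation might look like this, assuming the binary is installed as shadowdns (an assumed name) and that your other flags are passed through unchanged. Both flags are the ones described on this page; start_with_pprof is a hypothetical wrapper.

```shell
# Assumption: the server binary is named "shadowdns".
# pprof mounts on the metrics listener, so --metrics-addr must stay
# non-empty whenever --pprof-enable is set.
start_with_pprof() {
  shadowdns --metrics-addr :9153 --pprof-enable "$@"
}

# Usage: start_with_pprof <your usual flags>
```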
Confirm the profile index is reachable:
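One way to do that is to request the index page with curl. The sketch assumes the default listener on localhost:9153; pprof_index is a hypothetical helper name.

```shell
# pprof_index [HOST:PORT]: fetch the pprof index page.
# -f makes curl exit non-zero on an HTTP error status.
pprof_index() {
  curl -fsS "http://${1:-localhost:9153}/debug/pprof/"
}

# Usage: pprof_index            (default listener)
#        pprof_index ns1.example.com:9153
```

A failure here suggests that pprof is not enabled or that the listener address differs from the one you passed.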
Capturing a CPU profile under load
A profile is only useful while the server is actually handling queries — capture it during a benchmark or a real traffic peak, never on an idle process. CPU is almost always what caps query throughput, so the CPU profile is the primary tool for QPS bottlenecks.
Because the endpoint is trusted-network-only, the typical workflow tunnels the metrics port to your workstation first:
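A tunnel along these lines would do it. The host name is a placeholder borrowed from the scrape example, and the sketch assumes the metrics listener is reachable as localhost:9153 on the server itself; tunnel_metrics is a hypothetical helper name.

```shell
# tunnel_metrics HOST: forward local port 9153 to the metrics listener on HOST.
# -N: do not run a remote command; -L: local port forward.
# The command stays in the foreground until interrupted.
tunnel_metrics() {
  ssh -N -L 9153:localhost:9153 "$1"
}

# Usage: tunnel_metrics ns1.example.com
```

Leave it running in one terminal and run the profiling commands in another.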
Then point go tool pprof at the tunnelled port. The -http flag opens an
interactive web UI with a built-in flame graph (open View → Flame Graph):
# 30-second CPU profile — the primary tool for QPS / throughput bottlenecks
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/profile?seconds=30'
Reading a CPU flame graph
- Each frame's width is cumulative CPU time — the wider a frame, the more CPU it and its callees consumed. Height is just call depth.
- Read top-down along the call stack and look for the widest leaf frames on the request hot path: query parsing, view matching (GeoIP/ACL), zone lookup, alias rewriting, CNAME collapsing, and response serialization.
- A wide runtime.mallocgc / GC subtree means the bottleneck is allocation pressure rather than logic; switch to the allocation profiles below to see which call sites allocate the most.
Memory and other profiles
The same -http flame-graph UI works for every profile type; just change the URL
path:
# Heap — live (in-use) memory, for leak / footprint investigation
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/heap'
# Allocs — cumulative allocations since start, for GC-pressure hot spots
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/allocs'
| Endpoint | What it answers |
|---|---|
| profile?seconds=N | Where CPU time goes (QPS ceiling) |
| heap | Live memory by allocation site (leaks, footprint) |
| allocs | Cumulative allocations (GC pressure) |
| goroutine | Goroutine count and stacks (leaks, blocking) |
| block / mutex | Contention (off by default; a sampling rate must be set in code before these produce data) |
Saving a profile for later
To archive or share a profile instead of opening it live, download the file first, then point the UI at it:
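For example, with curl and the same tunnelled port. The file name and the save_cpu_profile helper are illustrative, not part of ShadowDNS.

```shell
# save_cpu_profile FILE [SECONDS]: download a CPU profile to FILE.
# SECONDS defaults to 30, matching the live capture shown earlier.
save_cpu_profile() {
  curl -fsS -o "$1" "http://localhost:9153/debug/pprof/profile?seconds=${2:-30}"
}

# Usage:
#   save_cpu_profile cpu.pprof
#   go tool pprof -http=:8080 cpu.pprof
```

The second command opens the saved file in the same web UI, so the profile can be inspected later or on another machine with a Go toolchain.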
Requirements and continuous profiling
The interactive UI needs a local Go toolchain (go tool pprof); the flame
graph view renders without Graphviz, while the call-graph view needs it
installed. go tool pprof captures a point-in-time snapshot — for always-on
flame graphs correlated with the dashboard timeline, a continuous-profiling
backend such as Grafana Pyroscope can scrape these same /debug/pprof/
endpoints.