Monitoring with Prometheus and Grafana¶

ShadowDNS exposes Prometheus metrics over HTTP and ships a ready-to-import Grafana dashboard. This page covers the metrics endpoint, a Prometheus scrape configuration, the metric families you can graph, how to load the bundled dashboard, and how to capture CPU/memory flame graphs with the built-in pprof profiler.

The metrics endpoint¶

ShadowDNS serves metrics on a dedicated HTTP listener controlled by --metrics-addr. The default is :9153. Setting it to an empty string disables the endpoint; in that case no metrics are registered at all, including the Go runtime and process collectors.

curl -s http://127.0.0.1:9153/metrics | head

ShadowDNS registers its metrics on its own registry, so the response contains exactly the families it registers: shadowdns_*, plus the standard go_* (Go runtime) and process_* families.
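To check quickly which families an instance actually exposes, the `# TYPE` lines of the text exposition format can be grouped by prefix. The helper below is a minimal sketch; `list_families` is a hypothetical name, not something ShadowDNS ships.

```shell
# list_families (hypothetical helper): read Prometheus text exposition on
# stdin and print how many metric families each prefix contributes.
list_families() {
  awk '$1 == "#" && $2 == "TYPE" {
         split($3, part, "_")      # "shadowdns_build_info" -> part[1] = "shadowdns"
         count[part[1]]++
       }
       END { for (p in count) print p, count[p] }' | sort
}
```

Pipe the endpoint into it with `curl -s http://127.0.0.1:9153/metrics | list_families`. On a non-Linux host the `process` line should be missing from the output.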

process_* is Linux-only

The process_* family (resident memory, CPU seconds, file descriptors, process start time) is produced by the process collector, which only reports data on Linux. On other platforms those series are simply absent — this is expected, not an error. The go_* family is present on every platform.

Prometheus scrape configuration¶

Add a scrape job that points at the metrics endpoint. The 9153 port is the ShadowDNS default.

scrape_configs:
  - job_name: shadowdns
    static_configs:
      - targets:
          - "ns1.example.com:9153"
          - "ns2.example.com:9153"

Confirm the target is up on the Prometheus Status → Targets page before moving on to Grafana.

Metric reference¶

Metric	Type	Labels	Meaning
`shadowdns_build_info`	gauge	`version`, `goversion`	Always 1; build identification
`shadowdns_dns_requests_total`	counter	`proto`, `family`, `type`, `view`	DNS requests received
`shadowdns_dns_responses_total`	counter	`rcode`, `view`	DNS responses sent
`shadowdns_dns_request_duration_seconds`	histogram	`view`	Request processing latency (buckets 100µs–100ms)
`shadowdns_dns_ecs_queries_total`	counter	`family`, `status`	ECS option classifications (only with `--ecs-enable`)
`shadowdns_dns_view_selected_total`	counter	`view`, `ecs_geo`	Successful view resolutions on the main query path
`shadowdns_dns_rate_limit_total`	counter	`category`, `action`	RRL decisions
`shadowdns_zones_loaded`	gauge	`view`	Root zones loaded per view
`shadowdns_zones_backup`	gauge	`view`	Backup-override zones loaded per view
`shadowdns_geoip_db_info`	gauge	`database`, `build_time`	Loaded GeoIP database build time
`shadowdns_reload_total`	counter	`result`	SIGHUP reload attempts
`shadowdns_config_last_reload_success_timestamp_seconds`	gauge	—	Unix time of the last successful load
`shadowdns_panics_total`	counter	—	Panics recovered by handlers
`go_*`	various	—	Go runtime (goroutines, heap, GC)
`process_*`	various	—	Process resource usage (Linux-only)
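The latency histogram is easiest to understand by working a quantile out by hand. The sketch below reads one scrape, sums the cumulative `_bucket` counts across all `view` labels, and interpolates linearly inside the bucket that contains the target rank, which is the same idea PromQL's `histogram_quantile` uses. `quantile` is a hypothetical helper name. It works on lifetime counts rather than a rate window, and it assumes samples carry no trailing timestamp.

```shell
# quantile Q (hypothetical helper): estimate the Q-quantile (0 < Q < 1) of
# shadowdns_dns_request_duration_seconds from one scrape on stdin.
quantile() {
  awk -v q="$1" '
    index($0, "shadowdns_dns_request_duration_seconds_bucket{") == 1 &&
    match($0, /[{,]le="[^"]*"/) {
      le = substr($0, RSTART + 5, RLENGTH - 6)   # text between the quotes
      cum[le] += $NF                             # sum cumulative counts across views
    }
    END {
      total = cum["+Inf"]
      if (total <= 0) { print "n/a"; exit }
      # Collect the finite upper bounds and sort them ascending.
      n = 0
      for (le in cum) if (le != "+Inf") bound[++n] = le
      for (i = 2; i <= n; i++) {
        v = bound[i]
        for (j = i - 1; j >= 1 && bound[j] + 0 > v + 0; j--) bound[j + 1] = bound[j]
        bound[j + 1] = v
      }
      # Find the first bucket whose cumulative count reaches the target rank,
      # then interpolate between its lower and upper bound.
      rank = q * total
      lower = 0; below = 0
      for (i = 1; i <= n; i++) {
        if (cum[bound[i]] >= rank) {
          printf "%.6f\n", lower + (bound[i] - lower) * (rank - below) / (cum[bound[i]] - below)
          exit
        }
        lower = bound[i] + 0; below = cum[bound[i]]
      }
      print bound[n] + 0   # rank is in the +Inf bucket: report the highest finite bound
    }'
}
```

For example, `curl -s http://127.0.0.1:9153/metrics | quantile 0.99` prints an estimate in seconds. For dashboards and alerts, prefer `histogram_quantile` over a `rate()` of the buckets in Prometheus, since that reflects recent traffic instead of everything since startup.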

ECS classification metrics¶

shadowdns_dns_ecs_queries_total is incremented once per query that carries an EDNS Client Subnet option while ECS handling is enabled (see the ECS guide). When --ecs-enable is off, or a query carries no ECS option, this counter is not touched.

status is one of valid, opt_out, or malformed, matching the option's classification. A malformed option is still answered with FORMERR; recording the metric does not change the response.
family is derived from the ECS option's own address-family field: ipv4 for family 1, ipv6 for family 2, unknown otherwise.

The ECS carry rate is sum(rate(shadowdns_dns_ecs_queries_total[5m])) / sum(rate(shadowdns_dns_requests_total[5m])).
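When no Prometheus server is at hand, a rough equivalent can be computed from a single scrape. The sketch below sums both counters across all label sets and divides them. Note the difference from the PromQL expression: this is the ratio since process start, not over a 5-minute window. `sum_family` and `ecs_carry_ratio` are hypothetical helper names, and the parsing assumes samples carry no trailing timestamp.

```shell
# sum_family NAME (hypothetical helper): sum every sample of counter NAME,
# across all label sets, from text exposition on stdin.
sum_family() {
  awk -v name="$1" '
    # Match "name{...} value" or "name value"; comment lines start with "#".
    index($0, name "{") == 1 || index($0, name " ") == 1 { total += $NF }
    END { printf "%.0f\n", total }'
}

# ecs_carry_ratio (hypothetical helper): share of requests that carried an
# ECS option since the process started, from one scrape on stdin.
ecs_carry_ratio() {
  scrape=$(cat)   # buffer the scrape so both families can be summed
  ecs=$(printf '%s\n' "$scrape" | sum_family shadowdns_dns_ecs_queries_total)
  req=$(printf '%s\n' "$scrape" | sum_family shadowdns_dns_requests_total)
  awk -v e="$ecs" -v r="$req" \
    'BEGIN { if (r > 0) printf "%.4f\n", e / r; else print "n/a" }'
}
```

Usage: `curl -s http://127.0.0.1:9153/metrics | ecs_carry_ratio`.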

View selection metrics¶

shadowdns_dns_view_selected_total is incremented once for each query whose view resolves on the main query path. Queries refused before a view is resolved (no view matched, unparseable client IP) do not increment it, and zone transfers (AXFR/IXFR) are not counted.

What ecs_geo means

ecs_geo="true" means an ECS-derived geo address was available to the matcher for that query — not that ECS decided the final view. The view may still have been chosen by an IP/CIDR ACL rule, which always evaluates the real source IP. Read this label as "ECS geo participation", not "ECS-driven view selection".
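To see how much ECS geo participation each view gets, the share of `ecs_geo="true"` selections can be computed per view from one scrape. This is a minimal sketch with a hypothetical helper name (`ecs_geo_share`). It reports lifetime shares since process start and assumes samples carry no trailing timestamp.

```shell
# ecs_geo_share (hypothetical helper): for each view, print the fraction of
# view selections for which an ECS-derived geo address was available.
ecs_geo_share() {
  awk '
    index($0, "shadowdns_dns_view_selected_total{") == 1 {
      if (!match($0, /view="[^"]*"/)) next
      view = substr($0, RSTART + 6, RLENGTH - 7)   # text between the quotes
      total[view] += $NF
      if (index($0, "ecs_geo=\"true\"")) geo[view] += $NF
    }
    END {
      for (v in total)
        if (total[v] > 0) printf "%s %.4f\n", v, geo[v] / total[v]
    }' | sort
}
```

Usage: `curl -s http://127.0.0.1:9153/metrics | ecs_geo_share`. A high share only says ECS geo data was available to the matcher; it does not say ECS picked the view.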

Importing the Grafana dashboard¶

The repository ships a dashboard at grafana/shadowdns-overview.json. It is not packaged into the .deb; fetch it from the repository.

In Grafana, go to Dashboards → New → Import.
Upload grafana/shadowdns-overview.json (or paste its contents).
When prompted, select your Prometheus data source for the DS_PROMETHEUS input.
Click Import.

The dashboard provides Job and Instance template variables at the top so you can scope every panel to a single ShadowDNS process or view the fleet in aggregate.

Panel groups¶

Overview — build info, process uptime, total QPS.
Traffic — QPS by protocol/family/query type, responses by rcode, and SERVFAIL/REFUSED/NXDOMAIN ratios (ratios fall back to 0 on zero-traffic windows).
Latency — p50/p90/p99 from the request-duration histogram, overall and per view.
ECS & Views — per-view selection rate, ECS-geo participation ratio, ECS classification by status/family, and the ECS carry rate.
Rate Limiting — RRL decisions by category and action.
Config & Zones — reload attempts, time since last successful reload, the GeoIP database table, and zones loaded per view.
Runtime — process CPU, memory (RSS and Go heap), goroutines, file descriptors, and GC pause quantiles.
Panics — panic total and rate.
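The "time since last successful reload" figure in the Config & Zones group can be reproduced from the command line, which is handy right after sending a reload signal. The sketch below subtracts the gauge value from the current Unix time. `reload_age` is a hypothetical helper name, and the optional argument exists only so the calculation can be checked against a fixed clock.

```shell
# reload_age [NOW_EPOCH] (hypothetical helper): seconds since the last
# successful load, from one scrape on stdin.
reload_age() {
  now=${1:-$(date +%s)}
  awk -v now="$now" '
    $1 == "shadowdns_config_last_reload_success_timestamp_seconds" {
      printf "%d\n", now - $2   # gauge value is a Unix timestamp in seconds
    }'
}
```

Usage: `curl -s http://127.0.0.1:9153/metrics | reload_age`. If the number keeps growing after a reload, check shadowdns_reload_total for failed attempts.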

Empty panels before traffic

The ECS and per-view panels stay empty until matching traffic arrives, and the process_*-based panels stay empty on non-Linux hosts. Neither is an error.

Profiling with `go tool pprof`¶

ShadowDNS embeds Go's pprof profiler. With --pprof-enable set, the profiling endpoints are mounted under /debug/pprof/ on the same HTTP listener as the metrics endpoint (--metrics-addr, default :9153). They let you capture CPU and memory profiles from a running server and render them as flame graphs to find hot spots — no restart, no rebuild, no separate agent.

Trusted networks only — no authentication

The /debug/pprof/ handlers have no access control. Anyone who can reach the endpoint can capture profiles and read the process command line (/debug/pprof/cmdline). Never expose --metrics-addr to an untrusted network while pprof is enabled. Reach the endpoints over a private network or an SSH tunnel (shown below), and leave --pprof-enable off unless you are actively profiling.

Enabling the endpoints¶

--pprof-enable requires --metrics-addr to be non-empty, because the handlers mount on the metrics server:

shadowdns --metrics-addr :9153 --pprof-enable ...

Confirm the profile index is reachable:

curl -s http://127.0.0.1:9153/debug/pprof/

Capturing a CPU profile under load¶

A profile is only useful while the server is actually handling queries — capture it during a benchmark or a real traffic peak, never on an idle process. CPU is almost always what caps query throughput, so the CPU profile is the primary tool for QPS bottlenecks.

Because the endpoint is trusted-network-only, the typical workflow tunnels the metrics port to your workstation first:

ssh -L 9153:localhost:9153 ns1.example.com

Then point go tool pprof at the tunnelled port. The -http flag opens an interactive web UI with a built-in flame graph (open View → Flame Graph):

# 30-second CPU profile — the primary tool for QPS / throughput bottlenecks
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/profile?seconds=30'

Reading a CPU flame graph¶

Each frame's width is cumulative CPU time — the wider a frame, the more CPU it and its callees consumed. Height is just call depth.
Read top-down along the call stack and look for the widest leaf frames on the request hot path: query parsing, view matching (GeoIP/ACL), zone lookup, alias rewriting, CNAME collapsing, and response serialization.
A wide runtime.mallocgc / GC subtree means the bottleneck is allocation pressure rather than logic — switch to the allocation profiles below to see which call sites allocate the most.

Memory and other profiles¶

The same -http flame-graph UI works for every profile type; just change the URL path:

# Heap — live (in-use) memory, for leak / footprint investigation
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/heap'

# Allocs — cumulative allocations since start, for GC-pressure hot spots
go tool pprof -http=:8080 'http://localhost:9153/debug/pprof/allocs'

Endpoint	What it answers
`profile?seconds=N`	Where CPU time goes (QPS ceiling)
`heap`	Live memory by allocation site (leaks, footprint)
`allocs`	Cumulative allocations (GC pressure)
`goroutine`	Goroutine count and stacks (leaks, blocking)
`block` / `mutex`	Contention; off by default. A sampling rate must be set in code (Go's `runtime.SetBlockProfileRate` / `runtime.SetMutexProfileFraction`) before these produce data
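For a quick goroutine-leak check without opening the UI, the goroutine endpoint also has a plain-text form (`?debug=1`) whose first line reports the total. The helper below is a small sketch that extracts that number; `goroutine_total` is a hypothetical name.

```shell
# goroutine_total (hypothetical helper): print the goroutine count from the
# text form of the goroutine profile ("goroutine profile: total N") on stdin.
goroutine_total() {
  awk 'NR == 1 && $1 == "goroutine" && $2 == "profile:" { print $4; exit }'
}
```

Usage: `curl -s 'http://localhost:9153/debug/pprof/goroutine?debug=1' | goroutine_total`. Sampling it a few times under steady load shows whether the count keeps climbing.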

Saving a profile for later

To archive or share a profile instead of opening it live, download the file first, then point the UI at it:

curl -o cpu.pprof 'http://localhost:9153/debug/pprof/profile?seconds=30'
go tool pprof -http=:8080 cpu.pprof
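When archiving profiles regularly, a small wrapper that builds the endpoint URL and timestamps the output file keeps captures from overwriting each other. This is a sketch only: `pprof_url` and `capture_profile` are hypothetical helper names, and the file naming scheme is an arbitrary choice.

```shell
# pprof_url HOST:PORT PROFILE [SECONDS] (hypothetical helper): print the
# endpoint URL for a profile type. SECONDS applies to the CPU profile only.
pprof_url() {
  case "$2" in
    profile) printf 'http://%s/debug/pprof/profile?seconds=%s\n' "$1" "${3:-30}" ;;
    heap|allocs|goroutine) printf 'http://%s/debug/pprof/%s\n' "$1" "$2" ;;
    *) echo "unknown profile: $2" >&2; return 1 ;;
  esac
}

# capture_profile HOST:PORT PROFILE [SECONDS] (hypothetical helper): download
# the profile to a timestamped file and print the file name.
capture_profile() {
  url=$(pprof_url "$1" "$2" "$3") || return 1
  out="$2-$(date +%Y%m%dT%H%M%S).pprof"
  curl -fsS -o "$out" "$url" && echo "$out"
}
```

A saved capture can then be opened directly, for example `go tool pprof -http=:8080 "$(capture_profile localhost:9153 heap)"`.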

Requirements and continuous profiling

The interactive UI needs a local Go toolchain (go tool pprof). The flame graph view renders without Graphviz; the call-graph view requires Graphviz to be installed. go tool pprof captures a point-in-time snapshot. For always-on flame graphs correlated with the dashboard timeline, a continuous-profiling backend such as Grafana Pyroscope can scrape these same /debug/pprof/ endpoints.