How to Monitor Windows Servers with Prometheus and Grafana


Don’t trust the green checkmark on a Windows server dashboard until you know what is scraping it. A dashboard can look calm while the exporter is dead, the firewall is wrong, or Prometheus is scraping the wrong target under the right-looking label. That’s how you get the worst kind of outage: one that looks healthy until someone opens the wrong panel.

If you want the stack to tell the truth instead of drawing a green lie, start at the exporter, not the dashboard. Install windows_exporter and lock down 9182 first so you know the metrics are real and private. Then let Prometheus, Grafana, and Alertmanager do their jobs in that order, with each layer proving the one before it.

Production Architecture for Windows Server Monitoring

Before you touch a service account or open a firewall rule, get the shape of the system straight. On a production setup, the monitored Windows servers should expose metrics; Prometheus should scrape them according to its scrape configuration and store the results; Grafana should read from that same store; and Alertmanager should decide whether a problem deserves a page.

[Image: images/windows-monitoring-architecture.svg]

Production topology

Component	Job	Production note
`windows_exporter`	Exposes host metrics over HTTP	Keep the collector set tight. Only enable role-specific collectors on servers that actually need them.
Prometheus	Scrapes, stores, and evaluates rules	Treat it as the source of truth for alert math, not as a dashboard cache.
Grafana	Visualizes metrics and trends	Provision data sources and dashboards so you can rebuild the stack without clicking around.
Alertmanager	Routes and deduplicates alerts	Let it group noise before it reaches the pager.

If you already run Prometheus and Grafana on a Windows management server, keep them there. If they live elsewhere, the monitoring path does not change; the service wrapper does.

Quick Win: keep one small diagram in your runbook. When the outage starts, you want the topology in front of you before you start guessing.

Prerequisites

If you want to follow along, you’ll need:

Windows Server 2019 or later on the monitored hosts, because windows_exporter supports current Windows Server releases and the service install path below assumes a modern MSI workflow.
Prometheus running somewhere your Windows servers can reach, because it is the component that scrapes and evaluates the metrics.
Grafana with admin access or provisioning access, because you need data source provisioning to keep dashboards repeatable.
Network access from Prometheus to TCP 9182 on each Windows host, because scraping dies fast when the firewall pretends to be security.
Alertmanager if you want pages, grouping, and silences. If you only need dashboards, you can skip it for now.

Install `windows_exporter` on the Windows Servers

windows_exporter is the piece that turns Windows internals into Prometheus metrics. Its MSI install docs cover service creation and firewall exceptions, which is one less excuse to leave the box half-configured.

Use a small collector set first. The default host collectors cover the basics; add role-specific collectors only when a server actually runs that role.

Collector set	Why you want it
`cpu`, `memory`, `logical_disk`, `net`, `os`, `physical_disk`, `service`, `system`	Baseline host health for most Windows servers
`iis`, `mssql`, `dhcp`, `dns`, `hyperv`	Only on servers that run those services
`process`	Use when you need process-level views and can tolerate extra cardinality

Run the installer on each server:

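$msi = "C:\Temp\windows_exporter.msi"
# Baseline collectors only; append role collectors (iis, mssql, and so on) per server.
$collectors = "cpu,memory,logical_disk,net,os,physical_disk,service,system"

# /qn runs the install silently; ADDLOCAL=FirewallException also creates the firewall rule.
# -PassThru returns the process object so the exit code can be checked afterwards.
$install = Start-Process msiexec.exe -Wait -PassThru -ArgumentList @(
  "/i", $msi,
  "/qn",
  "ENABLED_COLLECTORS=$collectors",
  "LISTEN_PORT=9182",
  "ADDLOCAL=FirewallException"
)

# 0 means success; 3010 means success with a reboot pending.
if ($install.ExitCode -notin 0, 3010) {
  throw "windows_exporter install failed with exit code $($install.ExitCode)"
}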

That gets you the default metrics path on http://localhost:9182/metrics. If your server has a role collector to add, put it in ENABLED_COLLECTORS and rerun the installer. Keep it deterministic. Your future self will not enjoy spelunking through an exporter that was configured by three different hand edits.
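One way to keep the installer call deterministic is to generate its arguments from a single role-to-collector map instead of editing the string by hand. The sketch below is only an illustration: the role names (`web`, `sql`, and so on) and the `build_msiexec_args` helper are hypothetical, while the collector names and MSI properties are the ones used in this article.

```python
# Baseline collectors from the table in this article.
BASELINE = ["cpu", "memory", "logical_disk", "net", "os",
            "physical_disk", "service", "system"]

# Hypothetical role names mapped to the role-specific collectors named above.
ROLE_COLLECTORS = {
    "web": ["iis"],
    "sql": ["mssql"],
    "dhcp": ["dhcp"],
    "dns": ["dns"],
    "hyperv": ["hyperv"],
}


def build_msiexec_args(msi_path, roles=(), port=9182):
    """Return the msiexec argument list for a silent windows_exporter install."""
    collectors = list(BASELINE)
    # Sort the roles so the same input set always yields the same string.
    for role in sorted(set(roles)):
        if role not in ROLE_COLLECTORS:
            raise ValueError(f"unknown role: {role}")
        for name in ROLE_COLLECTORS[role]:
            if name not in collectors:
                collectors.append(name)
    return [
        "/i", msi_path,
        "/qn",
        "ENABLED_COLLECTORS=" + ",".join(collectors),
        f"LISTEN_PORT={port}",
        "ADDLOCAL=FirewallException",
    ]


if __name__ == "__main__":
    print(" ".join(build_msiexec_args(r"C:\Temp\windows_exporter.msi", roles=["web"])))
```

Because the output depends only on the role list, two engineers installing the same role should end up with the same collector string, which is the property the paragraph above asks for.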

Warning: do not expose 9182 to the world. Scope the firewall exception to your Prometheus scrapers or put the exporter behind a trusted network boundary.

Open and Secure the Metrics Endpoint

Production monitoring fails in a very boring way when the firewall is loose: the endpoint is reachable, but it is reachable by everyone. windows_exporter supports a firewall exception during install, and the MSI also accepts a remote-address allow list through REMOTE_ADDR when you want to keep the scrape path narrow.

If you already know the Prometheus scraper addresses, bake them into the install:

Start-Process msiexec.exe -Wait -ArgumentList @(
  "/i", "C:\Temp\windows_exporter.msi",
  "/qn",
  "ENABLED_COLLECTORS=cpu,memory,logical_disk,net,os,physical_disk,service,system",
  "LISTEN_PORT=9182",
  "REMOTE_ADDR=10.20.0.15",
  "ADDLOCAL=FirewallException"
)
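If the scraper list lives in inventory rather than in someone's head, it helps to validate it before it reaches the installer. This is a minimal sketch, assuming the allow list is a plain list of IP addresses; `remote_addr_property` is a hypothetical helper and `10.20.0.16` is a made-up second scraper.

```python
import ipaddress


def remote_addr_property(scrapers):
    """Build the REMOTE_ADDR=... MSI property from a list of scraper IPs."""
    if not scrapers:
        # An empty allow list would mean "any remote address", so refuse it.
        raise ValueError("refusing to build an empty allow list")
    cleaned = []
    for raw in scrapers:
        # ip_address() raises ValueError on anything that is not a valid IP.
        addr = str(ipaddress.ip_address(raw.strip()))
        if addr not in cleaned:
            cleaned.append(addr)
    return "REMOTE_ADDR=" + ",".join(cleaned)


if __name__ == "__main__":
    print(remote_addr_property(["10.20.0.15", "10.20.0.16"]))
```

Failing loudly on a typo here is cheaper than discovering later that the firewall rule allows nothing, or everything.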

From the Prometheus host, check the port with Test-NetConnection and then the payload with curl.exe before you blame the dashboard:

Test-NetConnection -ComputerName win01.contoso.local -Port 9182
curl.exe -s http://win01.contoso.local:9182/metrics | Select-String "windows_exporter_build_info"

The first command proves the port is reachable. The second proves you are actually talking to the exporter and not to some unrelated process that also happens to listen on 9182. That distinction matters more often than people admit.
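The same "is this really the exporter?" check can be scripted for a whole fleet. The sketch below assumes only what the commands above assume: the endpoint serves Prometheus text exposition and includes a `windows_exporter_build_info` sample. The helper names and the sample label values are hypothetical.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the raw exposition text from an exporter endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_build_info(metrics_text):
    """Return the windows_exporter_build_info sample line, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == "windows_exporter_build_info":
            return line
    return None


if __name__ == "__main__":
    # Illustrative payload; the label values are made up for this example.
    sample = "\n".join([
        "# HELP windows_exporter_build_info Build metadata.",
        "# TYPE windows_exporter_build_info gauge",
        'windows_exporter_build_info{version="0.0.0"} 1',
    ])
    print(find_build_info(sample))
    # Against a live host:
    # find_build_info(fetch_metrics("http://win01.contoso.local:9182/metrics"))
```

A `None` result on a host whose port is open is the case described above: something answers on 9182, but it is not the exporter.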

Configure Prometheus to Scrape Windows Targets

Prometheus does the boring but essential work: scrape, store, evaluate. If the scrape job is sloppy, every panel and every alert inherits the mess.

Use labels to make the targets useful later. environment, role, and site save you from building a second dashboard because the first one cannot tell production from test.

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: windows-server
    scrape_timeout: 10s
    static_configs:
      - targets:
          - win01.contoso.local:9182
          - win02.contoso.local:9182
        labels:
          environment: prod
          role: app
          site: denver

If you need to validate the config before a restart, use promtool first:

promtool check config C:\Prometheus\prometheus.yml

That check catches bad YAML and broken job syntax before Prometheus does it for you at startup. Prometheus is honest about errors, but it is not gentle.
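promtool validates the configuration itself, not your labeling conventions, so a target group that forgets `site` would still pass. A small check like the following can fill that gap. It is a sketch under assumptions: it expects the `static_configs` list already loaded into Python dictionaries (by whatever YAML parser you use), and `missing_labels` is a hypothetical helper.

```python
# The three labels this article recommends for every Windows target.
REQUIRED_LABELS = ("environment", "role", "site")


def missing_labels(target_groups):
    """Map each target to the required labels its group does not set."""
    problems = {}
    for group in target_groups:
        labels = group.get("labels") or {}
        missing = [name for name in REQUIRED_LABELS if not labels.get(name)]
        if missing:
            for target in group.get("targets", []):
                problems[target] = missing
    return problems


if __name__ == "__main__":
    groups = [
        {
            "targets": ["win01.contoso.local:9182", "win02.contoso.local:9182"],
            "labels": {"environment": "prod", "role": "app", "site": "denver"},
        },
        # Hypothetical third host whose group forgot the site label.
        {
            "targets": ["win03.contoso.local:9182"],
            "labels": {"environment": "prod", "role": "app"},
        },
    ]
    print(missing_labels(groups))
```

Running it in the same pipeline step as `promtool check config` keeps both kinds of mistake out of production.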

Add Grafana and Import a Dashboard

Grafana already knows how to talk to Prometheus, so do not turn data source setup into a manual ritual. Provision it from a file, as the Grafana provisioning docs describe, just like the rest of the stack.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.contoso.local:9090
    isDefault: true

That keeps the dashboard queries stable and gives you one place to change the Prometheus URL if the backend moves.

For dashboards, the cleanest production option is to store the JSON in source control and import it with the Grafana dashboard import flow. If you prefer to build the panels yourself, start with three views:

Host health: CPU, memory, disk, and network.
Service state: only the Windows services that matter to the application.
Fleet view: instance labels, site labels, and a quick scan of who is behind.

The point is not to fill a screen. The point is to make the operator's answer obvious when the page lands.

Reality Check: if Grafana is empty, the problem is usually not PromQL. Check the data source, the job labels, and the target status before you start rewriting panels.

[Image: images/windows-monitoring-alert-path.svg]

Alert routing path

Create Production Alert Rules

This is where a lot of “monitoring” setups become expensive wallpaper. If you only alert on exporter down, you will miss the slow failure that drags the host into the ditch. If you alert on everything without grouping, you will page yourself into resentment.

Prometheus alerting rules keep the math in one file, and Alertmanager routing turns the result into one actionable notification instead of a duplicate storm.

groups:
  - name: windows-server.rules
    rules:
      - alert: WindowsExporterDown
        expr: up{job="windows-server"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "windows_exporter is down on {{ $labels.instance }}"
          description: "Prometheus has failed to scrape {{ $labels.instance }} for five minutes."

That rule is intentionally blunt. If scrapes of a target fail for five minutes straight, it fires. If scrapes succeed, it stays quiet, even when the host behind the exporter is otherwise unhealthy.
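The `for: 5m` clause is easy to misread, so here is a simplified model of it. The sketch assumes a fixed evaluation interval and treats the rule as a plain pending-then-firing state machine; `first_firing_time` is a hypothetical helper, not part of Prometheus.

```python
def first_firing_time(evaluations, hold_seconds):
    """Return the timestamp at which the alert first fires, or None.

    evaluations: (timestamp_seconds, expr_is_true) pairs in time order,
    one per rule evaluation. hold_seconds models the rule's `for` value.
    """
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # The condition cleared, so any pending state is discarded.
            active_since = None
            continue
        if active_since is None:
            # First failing evaluation: the alert becomes pending.
            active_since = ts
        if ts - active_since >= hold_seconds:
            # The condition held for the full window: the alert fires.
            return ts
    return None


if __name__ == "__main__":
    # One successful scrape at t=180 resets the five-minute clock.
    evals = [(0, False), (60, True), (120, True), (180, False)]
    evals += [(t, True) for t in range(240, 601, 60)]
    print(first_firing_time(evals, hold_seconds=300))
```

The takeaway is that a flapping exporter can stay under the `for` window indefinitely, which is one reason this rule should not be the only alert you have.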

Route the alert in Alertmanager by severity:

route:
  receiver: operations
  group_by: ["alertname", "instance", "job"]
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
receivers:
  - name: operations
    email_configs:
      - to: ops@example.com  # placeholder; the original address was lost to email obfuscation
  - name: pager
    pagerduty_configs:
      - routing_key: REDACTED

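Prometheus only loads the rules file if prometheus.yml lists it under rule_files, and it only hands alerts to Alertmanager if alerting.alertmanagers points there. With that wiring in place, validate the rules before you reload Prometheus: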

promtool check rules C:\Prometheus\rules\windows-server.rules.yml

If that command fails, fix the YAML first. If it passes, fire a test alert in staging and confirm Alertmanager sends the critical route to the pager receiver, not the general inbox.
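Before firing the test alert, it can help to reason through what the routing tree should do with it. The following is a deliberately simplified model of the route above, assuming equality matchers only, a single level of child routes, and first-match-wins behavior; `pick_receiver` and `group_key` are hypothetical helpers, not Alertmanager APIs.

```python
# A Python mirror of the route shown above (equality matchers only).
ROUTE = {
    "receiver": "operations",
    "group_by": ["alertname", "instance", "job"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pager"},
    ],
}


def pick_receiver(labels, route=ROUTE):
    """Return the receiver for an alert: first matching child, else the root."""
    for child in route.get("routes", []):
        if all(labels.get(key) == value for key, value in child["match"].items()):
            return child["receiver"]
    return route["receiver"]


def group_key(labels, route=ROUTE):
    """Alerts that share this key are batched into one notification."""
    return tuple(labels.get(name, "") for name in route["group_by"])


if __name__ == "__main__":
    alert = {
        "alertname": "WindowsExporterDown",
        "severity": "critical",
        "instance": "win01.contoso.local:9182",
        "job": "windows-server",
    }
    print(pick_receiver(alert), group_key(alert))
```

Note that `instance` is part of the grouping key here, so two hosts going down at once produce two notifications. Whether that is the right trade-off depends on how many hosts tend to fail together.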

Validate the Stack and Fix Common Failures

Validation should happen in the same order every time: endpoint, target, query, panel, notification. If you do it in a different order each incident, you are just making future you do archaeology. The Prometheus targets page is where the scrape truth starts.

Check the exporter endpoint from the Prometheus host.
Check the target state in Prometheus.
Run a PromQL query for one known metric.
Confirm the Grafana panel renders the same value.
Fire a test alert and make sure Alertmanager routes it correctly.
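Steps two and three can be combined by asking the Prometheus HTTP API for `up{job="windows-server"}` and listing the instances that report 0. The sketch below only parses a response body, so it runs without a server; the sample payload is illustrative, and `down_instances` is a hypothetical helper. The live endpoint would be `/api/v1/query` on the Prometheus host used in this article.

```python
import json


def down_instances(response_text):
    """Return the sorted instances whose `up` value is 0 in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        # Instant vectors carry one [timestamp, "value"] pair per series.
        _timestamp, value = series["value"]
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


if __name__ == "__main__":
    # Illustrative response body for the query up{job="windows-server"}.
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"__name__": "up", "instance": "win01.contoso.local:9182",
                            "job": "windows-server"}, "value": [1700000000, "1"]},
                {"metric": {"__name__": "up", "instance": "win02.contoso.local:9182",
                            "job": "windows-server"}, "value": [1700000000, "0"]},
            ],
        },
    })
    print(down_instances(sample))
```

An empty list with fewer series than you have servers is its own finding: a host missing from the result was never in the target list to begin with.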

The most common failure modes are boring:

Symptom	Likely cause	Fix
Target is `down`	`9182` blocked or wrong host name	Fix firewall, DNS, or the target list
Dashboard is empty	Wrong Grafana data source or wrong job label	Recheck the datasource and the scrape labels
Metrics are missing	Collector not enabled on that server role	Reinstall with the correct collector set
Alerts are noisy	Thresholds too tight or no grouping	Loosen thresholds or lengthen `for`, then add inhibition and sane grouping keys

Production Operations Checklist

Once the stack is live, the work does not stop. It just changes shape.

Pin the windows_exporter release stream so you are not surprised by collector changes.
Review collectors whenever a server role changes.
Keep scrape labels consistent across servers, or your fleet view becomes a guessing game.
Watch retention and cardinality in Prometheus before the disk becomes your next outage.
Review alert noise monthly. If a rule never pages, it is probably a dashboard in disguise.
Keep dashboard JSON and scrape config in source control so rollback is a file restore, not a memory test.

That checklist is the difference between a monitoring stack you trust and a monitoring stack you only trust after a quiet week.
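For the cardinality item, a rough count of series per metric name on one host shows where the volume comes from before it shows up as disk usage. This sketch counts sample lines in a `/metrics` body; the sample payload and the `series_per_metric` helper are illustrative assumptions.

```python
from collections import Counter


def series_per_metric(metrics_text):
    """Count sample lines per metric name in Prometheus text exposition."""
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Illustrative payload; real hosts expose far more series than this.
    sample = "\n".join([
        "# TYPE windows_service_state gauge",
        'windows_service_state{name="spooler",state="running"} 1',
        'windows_service_state{name="spooler",state="stopped"} 0',
        'windows_service_state{name="w32time",state="running"} 1',
        'windows_os_info{product="Windows Server"} 1',
    ])
    for name, count in series_per_metric(sample).most_common(3):
        print(f"{count:>6}  {name}")
```

Comparing the top entries before and after enabling a collector such as `process` gives a concrete number to weigh against the extra visibility.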

Keep the Stack Honest in Production

Production Windows monitoring is not hard because the tools are weak. It is hard because every layer can lie to you in a slightly different way. windows_exporter can be reachable but misconfigured. Prometheus can be scraping the wrong target. Grafana can render a beautiful empty panel. Alertmanager can be so loud that nobody listens.

The fix is disciplined plumbing: install the exporter with a known collector set, lock down the metrics port, give Prometheus stable labels, provision Grafana instead of clicking it together, and route alerts so one real problem becomes one actionable notification. Keep an eye on Prometheus retention settings, too, or the disk will become the next thing that lies to you.
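To put a number on the retention warning, the Prometheus storage documentation gives a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes on average). The sketch below applies that rule. The fleet size and series-per-host figures are made-up inputs, and `estimate_disk_bytes` is a hypothetical helper.

```python
def estimate_disk_bytes(hosts, series_per_host, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    # Every series yields one sample per scrape interval.
    samples_per_second = hosts * series_per_host / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 50 hosts, 1500 series each, the 30s interval from
    # this article, and 15 days of retention (the Prometheus default).
    size = estimate_disk_bytes(hosts=50, series_per_host=1500,
                               scrape_interval_s=30, retention_days=15)
    print(f"{size / 1e9:.2f} GB")
```

Treat the result as an order-of-magnitude guide. Doubling the series count per host, for example by enabling `process` everywhere, doubles the estimate.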

Do that, and your Windows servers stop being mysterious black boxes. They become systems you can inspect, measure, and defend when production gets rude.
