How to Troubleshoot Active Directory Replication Errors

What broke first: replication, DNS, RPC, or the topology that told Active Directory Domain Services (AD DS) where replication should happen?

That is the question you need answered when repadmin and the Directory Service log start showing 1722, 2087, 1311, or the kind of “successful” replication output that still leaves users with stale passwords. Active Directory replication errors are rarely one neat problem. They are symptoms from different layers: name resolution, network reachability, Kerberos, site topology, database state, and sometimes a domain controller that should not be trusted anymore.

Start Where Replication Is Already Complaining

You do not start an Active Directory replication incident by clicking Replicate Now and hoping the directory forgives you. Start by collecting the failure shape. Microsoft calls out repadmin /showrepl and the Directory Service event log as primary ways to identify replication problems in its Active Directory replication troubleshooting guidance, and that order matters.

Run the broad check first from an elevated prompt on a healthy management workstation or domain controller:

repadmin /replsummary

That output tells you which destination domain controllers are failing inbound replication and how long it has been since the last success. The failed DC is not always the broken DC. In replication output, the destination is the DC trying to pull changes. The source is the partner it cannot pull from. Mixing those up is how simple DNS problems become all-afternoon conference calls.

Use this first pass to decide whether you have an isolated server problem or a site-wide fault.

Signal	What It Usually Means	First Owner
One destination fails from one source	Partner-specific DNS, RPC, or security issue	AD admin
One destination fails from every source	Destination DC service, network, or rollback issue	AD admin + server owner
One site fails to another site	Site link, bridgehead, firewall, or WAN issue	AD admin + network owner
Every DC reports stale deltas	Forest-wide DNS, time, or tooling issue	AD escalation

The triage flow below keeps those questions in order so you do not repair the loudest symptom first.

[Image: images/ad-replication-triage-flow.svg]
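
If you want the scope table as something you can run, here is a minimal sketch. It assumes rows shaped like the repadmin /showrepl * /csv export used in the next section. Get-ReplFailureScope is a hypothetical helper name, not a built-in cmdlet, and the three scope labels are illustrative.

```powershell
# Hypothetical helper: classify each destination DC by how many of its
# inbound partners are failing. Rows mimic 'repadmin /showrepl * /csv'.
function Get-ReplFailureScope {
    param([Parameter(Mandatory)] [object[]] $Rows)

    $Rows | Group-Object 'Destination DSA' | ForEach-Object {
        $group = $_

        # Distinct inbound partners, and the distinct partners with failures.
        $sources = @($group.Group |
            Select-Object -ExpandProperty 'Source DSA' -Unique)
        $failing = @($group.Group |
            Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
            Select-Object -ExpandProperty 'Source DSA' -Unique)

        if ($failing.Count -eq 0) {
            $scope = 'Healthy'
        } elseif ($failing.Count -lt $sources.Count) {
            $scope = 'Partner-specific'
        } else {
            $scope = 'All partners failing'
        }

        [pscustomobject]@{
            Destination     = $group.Name
            Partners        = $sources.Count
            FailingPartners = $failing.Count
            Scope           = $scope
        }
    }
}

# Usage once the CSV from the next section exists:
# Get-ReplFailureScope -Rows (Import-Csv C:\Temp\ADReplication\showrepl.csv)
```

A destination with a single inbound partner always lands in "All partners failing" when that partner fails, so read the Partners count before you assign an owner.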

Once you know the scope, pull detailed evidence before changing anything. The repair path for 1722 is not the repair path for 1311, and Update Sequence Number (USN) rollback has no in-place repair path at all. It calls for containment.

Reality Check: Replication errors are not a personality test for repadmin. They are evidence. Capture the evidence before you start forcing synchronization across a broken topology.

Capture Evidence Before You Repair Anything

You need a baseline that survives the meeting, the change window, and the person who says “it was working yesterday.” Microsoft documents a CSV workflow for diagnosing forest-wide replication failures with repadmin /showrepl, and CSV is still the fastest way to sort failures by source, destination, naming context, and last success.

Create a clean evidence folder and export the current state:

mkdir C:\Temp\ADReplication
repadmin /showrepl * /csv > C:\Temp\ADReplication\showrepl.csv
repadmin /replsummary > C:\Temp\ADReplication\replsummary.txt
dcdiag /e /test:replications /v > C:\Temp\ADReplication\dcdiag-replications.txt

If you prefer a sortable PowerShell view during triage, parse the CSV directly:

$repl = Import-Csv C:\Temp\ADReplication\showrepl.csv
$repl |
    Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
    Select-Object 'Destination DSA','Source DSA','Naming Context',
        'Last Failure Status','Last Failure Time','Last Success Time' |
    Sort-Object 'Last Failure Status','Destination DSA' |
    Format-Table -AutoSize

Read the output as a relationship, not a server list. If DC3 fails inbound replication from DC2 for CN=Configuration,DC=contoso,DC=com, you have a destination, source, and naming context. That tuple tells you which DNS record, RPC path, Knowledge Consistency Checker (KCC) connection, or permission boundary to test next.

Field	Why You Care
Destination DSA	The DC trying to receive changes
Source DSA	The DC it is trying to pull from
Naming Context	The partition affected: domain, configuration, schema, or DNS
Last Failure Status	The error family you fix first
Last Success Time	The outage window and tombstone-risk clue

Save the Directory Service event log next, especially from the destination DC and the Intersite Topology Generator (ISTG) in any affected site. The event log often names the same failure as repadmin, but with extra context such as an NTDS KCC event, DNS CNAME, or source server name. That extra context is what keeps you from treating a topology failure like a firewall ticket.
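
One way to save those events is a Get-WinEvent filter limited to the event IDs this article discusses. The sketch below only builds the filter. New-ReplEventFilter is a hypothetical helper name, the 24-hour default is an assumption, and the commented usage assumes a Windows host with rights to read the log on DC3.

```powershell
# Hypothetical helper: build a Get-WinEvent filter hashtable for the
# replication-related Directory Service events covered in this article.
function New-ReplEventFilter {
    param(
        [int]      $HoursBack = 24,          # assumed default window
        [datetime] $Now = (Get-Date)
    )
    @{
        LogName   = 'Directory Service'
        Id        = 1311, 1865, 1988, 2042, 2087, 2095
        StartTime = $Now.AddHours(-$HoursBack)
    }
}

# Usage (not run here; needs a reachable DC and read access to its log):
# Get-WinEvent -ComputerName DC3 -FilterHashtable (New-ReplEventFilter -HoursBack 48) |
#     Select-Object TimeCreated, Id, ProviderName, Message |
#     Export-Csv C:\Temp\ADReplication\dc3-ds-events.csv -NoTypeInformation
```

Keeping the filter separate from the query makes it easy to reuse the same ID list against the destination DC and the ISTG.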

Decode the Error Before You Pick a Fix

You should classify the error code before touching DNS zones, site links, or domain controller metadata. Microsoft publishes separate remediation guidance for common codes such as 1722, 2087, and 1311 because each code comes from a different failure boundary.

Use the matrix below as the dispatch table for the incident.

Error	Plain Meaning	Validate With	Fix Path
1722	RPC server unavailable	`Test-NetConnection`, `PortQry`, `repadmin /showrepl`	DNS, routing, firewall, RPC services
2087	DNS lookup failed	`dcdiag /test:dns`, `Resolve-DnsName`, `nltest`	CNAME, SRV, A records, Netlogon
1311	Knowledge Consistency Checker (KCC) cannot build topology	`repadmin /showism`, event 1865, site links	Site links, bridgeheads, ISTG, KCC
8453	Replication access denied	`dcdiag /test:CheckSecurityError`	Permissions, machine account, delegation flags
8606 or 2042	Lingering object or stale DC risk	`repadmin /showrepl`, event 1988/2042	Strict replication, lingering object cleanup
Event 2095	USN rollback detected	Directory Service log, `Dsa Not Writable` registry value	Isolate, demote, metadata cleanup, rebuild
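
If you keep triage notes in a script, the matrix can live there as a lookup. This is only a sketch: Get-ReplFixPath is a hypothetical helper, and its strings are copied from the table above rather than from any Microsoft API.

```powershell
# Hypothetical lookup that mirrors the Fix Path column of the matrix.
function Get-ReplFixPath {
    param([Parameter(Mandatory)] [string] $Code)

    $dispatch = @{
        '1722' = 'DNS, routing, firewall, RPC services'
        '2087' = 'CNAME, SRV, A records, Netlogon'
        '1311' = 'Site links, bridgeheads, ISTG, KCC'
        '8453' = 'Permissions, machine account, delegation flags'
        '8606' = 'Strict replication, lingering object cleanup'
        '2042' = 'Strict replication, lingering object cleanup'
        '2095' = 'Isolate, demote, metadata cleanup, rebuild'
    }

    if ($dispatch.ContainsKey($Code)) {
        $dispatch[$Code]
    } else {
        # Anything outside the matrix needs a lookup before any change.
        'Unclassified: capture evidence and look up the code first'
    }
}

# Example: Get-ReplFixPath -Code '1722'
```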

Do not skip straight to repadmin /syncall. Forced replication is useful after the root cause is gone. Before that, it only proves the same problem faster.

The next three sections handle the errors in the order you should test them in most enterprise incidents: RPC reachability, DNS identity, and KCC topology. That sequence works because RPC and DNS failures can make a healthy topology look broken.

Fix Error 1722: RPC Server Unavailable

You see 1722 when the destination DC cannot establish the RPC connection it needs to pull changes from the source. Microsoft describes 1722 as a lower-layer connectivity failure surfaced at the RPC layer, not proof that “the RPC service is bad.” That distinction saves time.

Test Name Resolution And Port 135

Start at the destination DC named in repadmin /showrepl and test the source DC named in the same row:

$SourceDC = "DC2.contoso.com"
Resolve-DnsName $SourceDC
Test-NetConnection $SourceDC -Port 135

Port 135 only proves the RPC Endpoint Mapper is reachable. Active Directory replication then uses dynamic RPC ports (TCP 49152-65535 by default on Windows Server 2008 and later) unless you have intentionally constrained the range. If Test-NetConnection fails on 135, stop there and fix routing, host firewall, or network ACLs. If 135 works but replication still fails, validate dynamic RPC with your network team instead of declaring victory because one port answered.

Check Dynamic RPC And Core Services

PortQry is still useful when you need endpoint detail:

portqry -n DC2.contoso.com -e 135
portqry -n DC2.contoso.com -p tcp -r 49152:65535

Also check the source DC’s core services. A server can ping and still fail replication if the AD DS stack is not answering.

Invoke-Command -ComputerName DC2 -ScriptBlock {
    Get-Service -Name NTDS,DNS,Netlogon,KDC,RPCSS |
        Select-Object PSComputerName,Name,Status
}

If You Find	Do This
DNS resolves to the wrong IP	Fix the A record or stale interface registration
Port 135 blocked	Open domain controller RPC requirements between the two DCs
Dynamic RPC blocked	Adjust firewall rules or constrain RPC intentionally
Netlogon or KDC stopped	Restore service health before forcing replication
Only one WAN path fails	Treat it as a routing or firewall path issue, not an AD issue

Re-Test The Same Pair

When the port and service tests pass, run the replication test again against the same pair. The goal is not “green everywhere” yet. The goal is to prove that this source and destination can talk before you move to DNS identity or KCC topology.
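
To keep the re-test focused on one relationship, you can filter a fresh CSV export down to the pair you just repaired. The sketch below assumes the same column names as the earlier export. Get-ReplPairStatus is a hypothetical helper, and the showrepl-retest.csv path is illustrative.

```powershell
# Hypothetical helper: return every naming context for one
# destination/source pair from 'repadmin /showrepl * /csv' rows.
function Get-ReplPairStatus {
    param(
        [Parameter(Mandatory)] [object[]] $Rows,
        [Parameter(Mandatory)] [string]   $Destination,
        [Parameter(Mandatory)] [string]   $Source
    )
    $Rows |
        Where-Object {
            $_.'Destination DSA' -eq $Destination -and
            $_.'Source DSA' -eq $Source
        } |
        Select-Object 'Naming Context',
            @{ Name = 'Failures'; Expression = { $_.'Number of Failures' -as [int] } },
            'Last Failure Status', 'Last Success Time'
}

# Usage after the port and service tests pass (paths are illustrative):
# repadmin /showrepl * /csv > C:\Temp\ADReplication\showrepl-retest.csv
# Get-ReplPairStatus -Rows (Import-Csv C:\Temp\ADReplication\showrepl-retest.csv) `
#     -Destination DC3 -Source DC2
```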

Fix Event ID 2087: DNS Lookup Failure

You get Event ID 2087 when a destination DC cannot resolve the source DC’s identity for replication. Microsoft explains that AD DS tries the source DC’s GUID-based CNAME in _msdcs, then other names, and replication stops when those lookups fail. This is why a normal host lookup can pass while replication still fails.
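
The lookup order is easier to reason about when you can see the candidate names side by side. This sketch assumes the fallback sequence is GUID-based CNAME, then fully qualified name, then NetBIOS name. Get-ReplLookupCandidate is a hypothetical helper, and using the first DNS label as the NetBIOS name is an assumption that will not hold for every server.

```powershell
# Hypothetical helper: list the names a destination DC is expected to try
# for a source DC, in order. The NetBIOS guess is an assumption.
function Get-ReplLookupCandidate {
    param(
        [Parameter(Mandatory)] [guid]   $DsaGuid,
        [Parameter(Mandatory)] [string] $ForestRoot,
        [Parameter(Mandatory)] [string] $SourceFqdn
    )
    [pscustomobject]@{
        Order = 1
        Kind  = 'GUID-based CNAME'
        Name  = ('{0}._msdcs.{1}' -f $DsaGuid, $ForestRoot)
    }
    [pscustomobject]@{
        Order = 2
        Kind  = 'FQDN'
        Name  = $SourceFqdn
    }
    [pscustomobject]@{
        Order = 3
        Kind  = 'NetBIOS name (first label, assumed)'
        Name  = $SourceFqdn.Split('.')[0]
    }
}

# Example:
# Get-ReplLookupCandidate -DsaGuid 'b0069e56-b19c-438a-8a1f-64866374dd6e' `
#     -ForestRoot 'contoso.com' -SourceFqdn 'DC2.contoso.com' | Format-Table
```

The [guid] parameter type rejects malformed GUIDs before you waste a DNS query on a typo.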

Test The GUID CNAME

Pull the failing source GUID from the event text or from repadmin /showrepl, then test the record directly:

$ForestRoot = "contoso.com"
$SourceGuid = "b0069e56-b19c-438a-8a1f-64866374dd6e"
Resolve-DnsName "$SourceGuid._msdcs.$ForestRoot" -Type CNAME
Resolve-DnsName "DC2.contoso.com" -Type A

Run the DNS-focused dcdiag test on both sides. The DCDiag command reference documents /test:DNS and the more specific DNS checks for basic connectivity, dynamic updates, and record registration.

dcdiag /s:DC2 /test:DNS /DnsAll /v /f:C:\Temp\ADReplication\dc2-dns.txt
dcdiag /s:DC3 /test:DNS /DnsAll /v /f:C:\Temp\ADReplication\dc3-dns.txt

Re-Register Records

If the source DC is online but missing records, restart Netlogon on the source DC to re-register its SRV and CNAME records, then re-register its host records:

Invoke-Command -ComputerName DC2 -ScriptBlock {
    Restart-Service -Name Netlogon
    ipconfig /registerdns
}

Then validate DC discovery:

nltest /dsgetdc:contoso.com /force

Clean Up A Dead Source

The dangerous case is Event ID 2087 pointing at a source DC that no longer exists. If the source domain controller was rebuilt, renamed, or forcefully removed without cleanup, DNS repair is the wrong move. You need AD DS metadata cleanup so the KCC stops trying to replicate with a dead identity. Microsoft documents both RSAT console cleanup and ntdsutil.exe in its AD DS metadata cleanup guidance.

Warning: Do not create random _msdcs records to make 2087 quiet. If the source DC identity is stale, fake DNS just points replication at a lie. Clean the metadata instead.

After DNS is correct, rerun repadmin /showrepl for the affected destination. If the same pair moves from 2087 to 1722, that is progress. Name resolution is fixed, and now the network path is telling you what it could not tell you before.

Fix Event ID 1311: KCC Topology Failure

You see Event ID 1311 when the Knowledge Consistency Checker (KCC) cannot build a replication topology that connects the sites and naming contexts it needs to connect. Microsoft lists common causes: site link bridging on a network that is not fully routed, sites missing from site links, disjoint site links, preferred bridgeheads, and broader replication failures. The Intersite Topology Generator (ISTG) matters here because it owns the intersite connection view for its site.

Check Site Links Before Forcing KCC

Start by checking the site link matrix from the console of the affected domain controller, typically the ISTG for that site:

repadmin /showism

That command exposes how the Intersite Messaging service sees site connectivity. You are looking for orphaned sites, disjoint site links, and link costs that do not match the actual WAN. If the network cannot route Site A to Site C, but AD site link bridging says everything is transitive, the KCC can build a connection object that the network will never honor.

Then identify the ISTG for the affected site and force KCC recalculation only after the site link configuration matches reality:

repadmin /kcc DC1
repadmin /showconn DC1

1311 Cause	What To Check	Repair
Site missing from links	Every site appears in at least one site link	Add the site to the correct link
Disjoint site links	Site links form a connected path	Connect the links or create the missing link
Bad site link bridging	Physical routing does not match transitive assumptions	Disable bridging or create explicit bridges
Preferred bridgehead offline	Server properties in Sites and Services	Remove preferred bridgehead assignments
One partition cannot replicate	Which naming context appears in event 1311	Fix the DC hosting that partition
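
The first two rows of that table are a graph problem: every site should be reachable from every other site through some chain of site links. Here is a minimal sketch that groups sites into connected components. Get-SiteLinkComponent is a hypothetical helper, the site names in the example are made up, and the check only covers the AD configuration, not whether the network actually routes between those sites.

```powershell
# Hypothetical helper: group sites into connected components based on
# site link membership. More than one component means disjoint links.
function Get-SiteLinkComponent {
    param(
        [Parameter(Mandatory)] [string[]]  $Sites,
        [Parameter(Mandatory)] [hashtable] $SiteLinks  # link name -> site names
    )

    # Undirected adjacency: sites that share a link are neighbours.
    $adjacent = @{}
    foreach ($site in $Sites) {
        $adjacent[$site] = [System.Collections.Generic.HashSet[string]]::new()
    }
    foreach ($members in $SiteLinks.Values) {
        foreach ($a in $members) {
            foreach ($b in $members) {
                if ($a -ne $b -and $adjacent.ContainsKey($a) -and $adjacent.ContainsKey($b)) {
                    [void]$adjacent[$a].Add($b)
                }
            }
        }
    }

    # Breadth-first search from each unvisited site yields one component.
    $seen = [System.Collections.Generic.HashSet[string]]::new()
    foreach ($start in $Sites) {
        if ($seen.Contains($start)) { continue }
        $component = @()
        $queue = [System.Collections.Generic.Queue[string]]::new()
        $queue.Enqueue($start)
        [void]$seen.Add($start)
        while ($queue.Count -gt 0) {
            $current = $queue.Dequeue()
            $component += $current
            foreach ($next in $adjacent[$current]) {
                if ($seen.Add($next)) { $queue.Enqueue($next) }
            }
        }
        , $component   # emit the component as a single array
    }
}

# Example with made-up site names: two components, so the links are disjoint.
# $links = @{ 'HQ-Branch1' = 'HQ', 'Branch1'; 'Branch2-DR' = 'Branch2', 'DR' }
# Get-SiteLinkComponent -Sites 'HQ', 'Branch1', 'Branch2', 'DR' -SiteLinks $links
```

In a real forest you could populate the inputs from Get-ADReplicationSite and Get-ADReplicationSiteLink in the ActiveDirectory module. A site that shows up alone in its own component is missing from every site link.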

Verify Bridgeheads And ISTG Ownership

Be careful with preferred bridgehead servers. They look tidy in documentation, but they remove the KCC’s ability to pick a working bridgehead when the preferred one is down or missing a naming context. If you have a small number of DCs per site, manual bridgehead selection often turns a routine outage into an avoidable topology failure.

Rebuild Convergence Only After The Graph Matches Reality

Once KCC can create valid connection objects, wait for the normal replication interval or use repadmin /syncall after validating DNS and RPC. For 1311, the win is not forcing one connection to work. The win is making the topology accurate enough that KCC can keep it working after you leave.

Recover from USN Rollback Without Making It Worse

You handle Update Sequence Number (USN) rollback differently because the affected domain controller’s database has become untrustworthy. Microsoft describes USN rollback as a silent replication failure that can happen when an older copy of an AD database is incorrectly restored, and its recovery guidance centers on detection, quarantine, and rebuild.

The core mechanism is simple and ugly: replication partners remember the highest USN they accepted from a DC. If that DC is rolled back to an older database state without a new invocation ID, it starts presenting old USN values under the same identity. Partners believe they already have those changes and ignore them.

That means repadmin /showrepl can look clean while users, groups, and passwords drift. When Event ID 2095 appears, do not try to “make replication start” by deleting quarantine signals. Preserve the evidence and remove the DC as a replication source.
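
The mechanism is easier to see in a toy model. The sketch below is not how AD DS is implemented: it keeps one high-watermark per source identity and drops any change whose USN is at or below it. Select-AcceptedChange, the inv-A and inv-B identifiers, and the USN values are all hypothetical.

```powershell
# Toy model, not AD DS internals: a partner remembers the highest USN it
# accepted per invocation ID and skips anything at or below that value.
function Select-AcceptedChange {
    param(
        [Parameter(Mandatory)] [hashtable] $HighWatermark,  # invocation ID -> USN
        [Parameter(Mandatory)] [string]    $InvocationId,
        [Parameter(Mandatory)] [object[]]  $Changes         # objects with Usn, Name
    )
    $known = 0
    if ($HighWatermark.ContainsKey($InvocationId)) {
        $known = $HighWatermark[$InvocationId]
    }
    foreach ($change in ($Changes | Sort-Object Usn)) {
        if ($change.Usn -gt $known) {
            $known = $change.Usn
            $change   # accepted: newer than anything seen from this identity
        }
    }
    $HighWatermark[$InvocationId] = $known
}

# The partner has already accepted USN 5000 from identity 'inv-A'.
$partner = @{ 'inv-A' = 5000 }

# After a bad restore the DC reissues USNs 4001-4003 for brand-new changes.
$newChanges = 4001..4003 | ForEach-Object {
    [pscustomobject]@{ Usn = $_; Name = "change$_" }
}

# Same identity: every change is silently skipped.
$sameId = @(Select-AcceptedChange -HighWatermark $partner -InvocationId 'inv-A' -Changes $newChanges)

# New identity (what a supported restore would present): changes are accepted.
$newId = @(Select-AcceptedChange -HighWatermark $partner -InvocationId 'inv-B' -Changes $newChanges)

"Accepted under old identity: $($sameId.Count); under new identity: $($newId.Count)"
```

The partner never reports an error in the first case, which is why the failure is silent.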

The containment path below is the boundary. After this point, you are protecting the forest from a compromised replica, not repairing the replica in place.

[Image: images/ad-replication-usn-rollback.svg]

Contain The Replica

Use this recovery sequence:

1. Isolate the affected DC from client authentication and administrative changes.
2. Confirm Event ID 2095 and capture the Directory Service log.
3. Forcefully remove AD DS from the affected server if graceful demotion fails.
4. Clean up the old DC metadata from a healthy DC.
5. Seize Flexible Single Master Operations (FSMO) roles if the affected DC held any.
6. Rebuild or re-promote the server as a new domain controller.

Any unique changes made only on the rolled-back DC can be lost. That is not pleasant, but keeping a bad replica alive is worse. The clean recovery point is a new DC identity with healthy inbound replication from known-good partners.
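
Before you seize anything, it helps to know exactly which roles the affected DC holds. The sketch below works on a plain hashtable of role holders so it can be reviewed without touching the directory. Get-FsmoRoleToSeize is a hypothetical helper, and the commented lines assume the ActiveDirectory module on a healthy DC.

```powershell
# Hypothetical helper: list the FSMO roles currently held by the DC you
# are about to remove. Input is a role -> holder map.
function Get-FsmoRoleToSeize {
    param(
        [Parameter(Mandatory)] [hashtable] $RoleHolders,
        [Parameter(Mandatory)] [string]    $AffectedDc
    )
    $RoleHolders.GetEnumerator() |
        Where-Object { $_.Value -eq $AffectedDc } |
        ForEach-Object { $_.Key } |
        Sort-Object
}

# Populating the map on a healthy DC (not run here):
# $d = Get-ADDomain
# $f = Get-ADForest
# $holders = @{
#     PDCEmulator          = $d.PDCEmulator
#     RIDMaster            = $d.RIDMaster
#     InfrastructureMaster = $d.InfrastructureMaster
#     SchemaMaster         = $f.SchemaMaster
#     DomainNamingMaster   = $f.DomainNamingMaster
# }
# Get-FsmoRoleToSeize -RoleHolders $holders -AffectedDc 'DC2.contoso.com'
```

An empty result means there is nothing to seize. If roles are listed, Move-ADDirectoryServerOperationMasterRole with -Force on a healthy DC is one documented way to seize them.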

Force Replication Only After the Root Cause Is Gone

You should force replication when you have fixed the reason replication failed and need to validate convergence. You should not force replication when DNS still resolves to stale records, RPC ports are blocked, KCC has invalid site links, or a DC is suspected of USN rollback.

Recalculate Topology If It Changed

After fixing the specific cause, run repadmin /kcc if topology changed:

repadmin /kcc DC1

Sync The Fixed Path

Then synchronize a known-good DC with repadmin /syncall. By default the named DC pulls from its partners. Use /A for all naming contexts, /e for enterprise-wide partners, /d for distinguished names in output, and /P when you intentionally want the named DC to push its changes outward.

repadmin /syncall DC1 /AdeP

Verify Convergence Before You Declare Victory

Validate with the same commands that caught the failure:

repadmin /replsummary
repadmin /showrepl DC1

If the failure count stops increasing but the last success time is still stale, wait for the partner to complete or inspect the replication queue:

repadmin /queue DC1

The important habit is symmetry. Capture before, repair one root cause, force only when safe, then validate with the same artifacts. That gives you a defensible incident record instead of a collection of commands you happened to run.

Build Daily Replication Monitoring

You do not need to wait for the help desk to report stale passwords before checking replication. A daily report built from repadmin /showrepl /csv gives you the same evidence you used during the incident, but before the incident becomes visible to users.

Capture The Daily Snapshot

Create a lightweight PowerShell report:

$Path = "C:\Reports\ADReplication"
New-Item -ItemType Directory -Path $Path -Force | Out-Null

$Csv = Join-Path $Path "showrepl-$(Get-Date -Format yyyyMMdd-HHmm).csv"
repadmin /showrepl * /csv > $Csv

$Failures = Import-Csv $Csv |
    Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
    Select-Object 'Destination DSA','Source DSA','Naming Context',
        'Last Failure Status','Last Failure Time','Last Success Time'

$Failures | Export-Csv (Join-Path $Path "failures-latest.csv") -NoTypeInformation

Alert On Failures

Turn that into a scheduled task running under a monitored service account, and alert when either condition is true:

A destination DC has any replication failure count greater than zero.
Any naming context has a last success time older than your acceptable recovery window.
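
Those two conditions can be expressed as one filter over the daily CSV. This is a minimal sketch: Get-ReplAlert is a hypothetical helper, the 24-hour threshold is a placeholder for your own recovery window, and it assumes Last Success Time parses as a date. Rows with a missing or unparseable timestamp are flagged as stale on purpose.

```powershell
# Hypothetical helper: flag rows that meet either alert condition.
function Get-ReplAlert {
    param(
        [Parameter(Mandatory)] [object[]] $Rows,
        [int]      $MaxAgeHours = 24,        # placeholder recovery window
        [datetime] $Now = (Get-Date)
    )
    foreach ($row in $Rows) {
        $reasons = @()

        # Condition 1: any failure count greater than zero.
        if (($row.'Number of Failures' -as [int]) -gt 0) {
            $reasons += 'FailureCount'
        }

        # Condition 2: last success older than the window (or unreadable).
        $last = $row.'Last Success Time' -as [datetime]
        if (-not $last -or ($Now - $last).TotalHours -gt $MaxAgeHours) {
            $reasons += 'StaleSuccess'
        }

        if ($reasons.Count -gt 0) {
            [pscustomobject]@{
                Destination   = $row.'Destination DSA'
                Source        = $row.'Source DSA'
                NamingContext = $row.'Naming Context'
                Reasons       = $reasons -join ','
            }
        }
    }
}

# Usage against the daily snapshot:
# Get-ReplAlert -Rows (Import-Csv $Csv) -MaxAgeHours 12
```

Passing -Now explicitly makes the function easy to replay against an old snapshot when you are reconstructing an incident timeline.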

For enterprise sites, add context fields your network and security owners need: source site, destination site, port test result, and whether the failing naming context is domain, configuration, schema, or DNS. That information prevents the familiar ticket loop where every owner asks for the same evidence in a different format.

Keep Enough History

Quick Win: Store daily reports for at least the tombstone lifetime window. When a DC comes back from a long outage, old replication history helps you decide whether it is safe to reconnect.

Monitoring will not prevent every bad snapshot, stale DNS record, or firewall change. It will shorten the time between first failure and first proof, which is the part you actually control.

Close the Incident With Proof

You are done when the forest can prove convergence, not when one command returns green once. Re-run the baseline commands and compare the results to your original evidence capture. repadmin /replsummary and the new CSV capture should both show zero failures for the pairs that were failing before the fix.

Re-Run The Baseline

repadmin /replsummary
repadmin /showrepl * /csv > C:\Temp\ADReplication\showrepl-after.csv
dcdiag /e /test:replications /q

For a 1722 incident, prove DNS resolution and RPC reachability from the destination to the source. For a 2087 incident, prove the GUID CNAME, host record, dynamic registration, and dcdiag /test:dns results. For a 1311 incident, prove the site link matrix and KCC connection objects match the routed network. For USN rollback, prove the old DC identity is gone and the rebuilt DC replicates under a new, healthy identity.
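
Comparing the two CSV captures by hand gets tedious in a large forest. The sketch below matches rows on the destination, source, and naming context tuple and reports what happened to every pair that was failing before the repair. Compare-ReplSnapshot is a hypothetical helper, and the three result labels are illustrative.

```powershell
# Hypothetical helper: for every pair failing in the "before" capture,
# report whether it is resolved, still failing, or missing afterwards.
function Compare-ReplSnapshot {
    param(
        [Parameter(Mandatory)] [object[]] $Before,
        [Parameter(Mandatory)] [object[]] $After
    )

    # Key each row by destination, source, and naming context.
    $keyOf = {
        param($r)
        '{0}|{1}|{2}' -f $r.'Destination DSA', $r.'Source DSA', $r.'Naming Context'
    }

    $afterByKey = @{}
    foreach ($row in $After) { $afterByKey[(& $keyOf $row)] = $row }

    foreach ($row in $Before) {
        if (($row.'Number of Failures' -as [int]) -le 0) { continue }

        $current = $afterByKey[(& $keyOf $row)]
        $statusAfter = $null
        if (-not $current) {
            $result = 'MissingAfter'
        } elseif (($current.'Number of Failures' -as [int]) -gt 0) {
            $result = 'StillFailing'
            $statusAfter = $current.'Last Failure Status'
        } else {
            $result = 'Resolved'
            $statusAfter = $current.'Last Failure Status'
        }

        [pscustomobject]@{
            Destination   = $row.'Destination DSA'
            Source        = $row.'Source DSA'
            NamingContext = $row.'Naming Context'
            StatusBefore  = $row.'Last Failure Status'
            StatusAfter   = $statusAfter
            Result        = $result
        }
    }
}

# Usage with the two captures from this article:
# Compare-ReplSnapshot -Before (Import-Csv C:\Temp\ADReplication\showrepl.csv) `
#     -After (Import-Csv C:\Temp\ADReplication\showrepl-after.csv)
```

A MissingAfter row is worth a second look: it can mean the connection was removed on purpose during metadata cleanup, or that the KCC dropped a partner you still expected.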

Write The Repair Record

Finish with a short repair record:

Record Item	Example
Failure scope	DC3 inbound from DC2, configuration partition
First failing status	2087 DNS lookup failure
Root cause	Missing `_msdcs` CNAME after failed DC rebuild
Repair	Metadata cleanup for retired DC2 identity
Validation	`repadmin /showrepl * /csv` shows zero failures
Prevention	Daily CSV report and DC decommission checklist updated

Decide Whether The Forest Is Healthy

Active Directory replication troubleshooting is not about memorizing every event ID. It is about reading the failed relationship, proving which layer failed, and refusing to force replication until the directory has a clean path to converge. Do that, and errors 1311, 1722, and 2087 become fixable signals instead of a pile of red events waiting for someone else to own them.
