What broke first: replication, DNS, RPC, or the topology that told Active Directory Domain Services (AD DS) where replication should happen?
That is the question you need answered when repadmin or the Directory Service event log starts showing 1311, 1722, 2087, or the kind of “successful” replication output that still leaves users with stale passwords. Active Directory replication errors are rarely one neat problem. They are symptoms from different layers: name resolution, network reachability, Kerberos, site topology, database state, and sometimes a domain controller that should not be trusted anymore.
Start Where Replication Is Already Complaining
You do not start an Active Directory replication incident by clicking Replicate Now and hoping the directory forgives you. Start by collecting the failure shape. Microsoft calls out repadmin /showrepl and the Directory Service event log as primary ways to identify replication problems in its Active Directory replication troubleshooting guidance, and that order matters.
Run the broad check first from an elevated prompt on a healthy management workstation or domain controller:
repadmin /replsummary
That output tells you which destination domain controllers are failing inbound replication and how long it has been since the last success. The failed DC is not always the broken DC. In replication output, the destination is the DC trying to pull changes. The source is the partner it cannot pull from. Mixing those up is how simple DNS problems become all-afternoon conference calls.
Use this first pass to decide whether you have an isolated server problem or a site-wide fault.
| Signal | What It Usually Means | First Owner |
|---|---|---|
| One destination fails from one source | Partner-specific DNS, RPC, or security issue | AD admin |
| One destination fails from every source | Destination DC service, network, or rollback issue | AD admin + server owner |
| One site fails to another site | Site link, bridgehead, firewall, or WAN issue | AD admin + network owner |
| Every DC reports stale deltas | Forest-wide DNS, time, or tooling issue | AD escalation |
The triage flow below keeps those questions in order so you do not repair the loudest symptom first.
[Image: images/ad-replication-triage-flow.svg]
Once you know the scope, pull detailed evidence before changing anything. The repair path for 1722 is not the repair path for 1311, and Update Sequence Number (USN) rollback is not a repair path at all. It is containment.
Reality Check: Replication errors are not a personality test for repadmin. They are evidence. Capture the evidence before you start forcing synchronization across a broken topology.
Capture Evidence Before You Repair Anything
You need a baseline that survives the meeting, the change window, and the person who says “it was working yesterday.” Microsoft documents a CSV workflow for diagnosing forest-wide replication failures with repadmin /showrepl, and CSV is still the fastest way to sort failures by source, destination, naming context, and last success.
Create a clean evidence folder and export the current state:
mkdir C:\Temp\ADReplication
repadmin /showrepl * /csv > C:\Temp\ADReplication\showrepl.csv
repadmin /replsummary > C:\Temp\ADReplication\replsummary.txt
dcdiag /e /test:replications /v > C:\Temp\ADReplication\dcdiag-replications.txt
If you prefer a sortable PowerShell view during triage, parse the CSV directly:
$repl = Import-Csv C:\Temp\ADReplication\showrepl.csv
$repl |
    Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
    Select-Object 'Destination DSA','Source DSA','Naming Context',
        'Last Failure Status','Last Failure Time','Last Success Time' |
    Sort-Object 'Last Failure Status','Destination DSA' |
    Format-Table -AutoSize
Read the output as a relationship, not a server list. If DC3 fails inbound replication from DC2 for CN=Configuration,DC=contoso,DC=com, you have a destination, source, and naming context. That tuple tells you which DNS record, RPC path, Knowledge Consistency Checker (KCC) connection, or permission boundary to test next.
| Field | Why You Care |
|---|---|
| Destination DSA | The DC trying to receive changes |
| Source DSA | The DC it is trying to pull from |
| Naming Context | The partition affected: domain, configuration, schema, or DNS |
| Last Failure Status | The error family you fix first |
| Last Success Time | The outage window and tombstone-risk clue |
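The scope signals from the first pass can be read straight off these fields. The sketch below is one way to do that, assuming the column names that repadmin /showrepl * /csv emits. The function name Get-ReplFailureScope, the scope labels, and the sample rows are hypothetical, and the result is a first guess at scope, not a diagnosis.

```powershell
# Hypothetical helper: label the failure scope for each destination DC.
function Get-ReplFailureScope {
    param([Parameter(Mandatory)][object[]]$Rows)

    # Group rows by destination so failing and healthy partners can be compared.
    $Rows | Group-Object 'Destination DSA' | ForEach-Object {
        $all     = @($_.Group | Select-Object -ExpandProperty 'Source DSA' -Unique)
        $failing = @($_.Group |
            Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
            Select-Object -ExpandProperty 'Source DSA' -Unique)

        $scope = if ($failing.Count -eq 0) {
            'Healthy'
        } elseif ($failing.Count -eq $all.Count) {
            'FailsFromEverySource'
        } else {
            'FailsFromSomeSources'
        }

        [pscustomobject]@{
            Destination    = $_.Name
            FailingSources = $failing -join ','
            Scope          = $scope
        }
    }
}

# Sample rows for illustration only; real input comes from Import-Csv on showrepl.csv.
$sample = @'
Destination DSA,Source DSA,Naming Context,Number of Failures
DC3,DC2,"DC=contoso,DC=com",4
DC3,DC1,"DC=contoso,DC=com",0
DC4,DC1,"DC=contoso,DC=com",7
DC4,DC2,"DC=contoso,DC=com",7
'@ | ConvertFrom-Csv

Get-ReplFailureScope -Rows $sample | Format-Table -AutoSize
```

With the sample rows, DC3 fails from only one of its partners, which points at that specific pair, while DC4 fails from every partner, which points at DC4 itself.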
Save the Directory Service event log next, especially from the destination DC and the Intersite Topology Generator (ISTG) in any affected site. The event log often names the same failure as repadmin, but with extra context such as an NTDS KCC event, DNS CNAME, or source server name. That extra context is what keeps you from treating a topology failure like a firewall ticket.
Decode the Error Before You Pick a Fix
You should classify the error code before touching DNS zones, site links, or domain controller metadata. Microsoft publishes separate remediation guidance for common codes such as 1722, 2087, and 1311 because each code comes from a different failure boundary.
Use the matrix below as the dispatch table for the incident.
| Error | Plain Meaning | Validate With | Fix Path |
|---|---|---|---|
| 1722 | RPC server unavailable | Test-NetConnection, PortQry, repadmin /showrepl | DNS, routing, firewall, RPC services |
| 2087 | DNS lookup failed | dcdiag /test:dns, Resolve-DnsName, nltest | CNAME, SRV, A records, Netlogon |
| 1311 | Knowledge Consistency Checker (KCC) cannot build topology | repadmin /showism, event 1865, site links | Site links, bridgeheads, ISTG, KCC |
| 8453 | Replication access denied | dcdiag /test:CheckSecurityError | Permissions, machine account, delegation flags |
| 8606 or 2042 | Lingering object or stale DC risk | repadmin /showrepl, event 1988/2042 | Strict replication, lingering object cleanup |
| Event 2095 | USN rollback detected | Directory Service log, DSA Not Writable | Isolate, demote, metadata cleanup, rebuild |
Do not skip straight to repadmin /syncall. Forced replication is useful after the root cause is gone. Before that, it only proves the same problem faster.
The next three sections handle the errors in the order you should test them in most enterprise incidents: RPC reachability, DNS identity, and KCC topology. That sequence works because RPC and DNS failures can make a healthy topology look broken.
Fix Error 1722: RPC Server Unavailable
You see 1722 when the destination DC cannot establish the RPC connection it needs to pull changes from the source. Microsoft describes 1722 as a lower-layer connectivity failure surfaced at the RPC layer, not proof that “the RPC service is bad.” That distinction saves time.
Test Name Resolution And Port 135
Start at the destination DC named in repadmin /showrepl and test the source DC named in the same row:
$SourceDC = "DC2.contoso.com"
Resolve-DnsName $SourceDC
Test-NetConnection $SourceDC -Port 135
Port 135 only proves the RPC Endpoint Mapper is reachable. Active Directory replication then uses dynamic RPC ports unless you have intentionally constrained the range. If Test-NetConnection fails on 135, stop there and fix routing, host firewall, or network ACLs. If 135 works but replication still fails, validate dynamic RPC with your network team instead of declaring victory because one port answered.
Check Dynamic RPC And Core Services
PortQry is still useful when you need endpoint detail:
portqry -n DC2.contoso.com -e 135
portqry -n DC2.contoso.com -p tcp -r 49152:65535
Also check the source DC’s core services. A server can ping and still fail replication if the AD DS stack is not answering.
Invoke-Command -ComputerName DC2 -ScriptBlock {
    # Query the services that AD DS replication depends on.
    Get-Service -Name NTDS,DNS,Netlogon,KDC,RPCSS
} | Select-Object PSComputerName,Name,Status
| If You Find | Do This |
|---|---|
| DNS resolves to the wrong IP | Fix the A record or stale interface registration |
| Port 135 blocked | Open domain controller RPC requirements between the two DCs |
| Dynamic RPC blocked | Adjust firewall rules or constrain RPC intentionally |
| Netlogon or KDC stopped | Restore service health before forcing replication |
| Only one WAN path fails | Treat it as a routing or firewall path issue, not an AD issue |
Re-Test The Same Pair
When the port and service tests pass, run the replication test again against the same pair. The goal is not “green everywhere” yet. The goal is to prove that this source and destination can talk before you move to DNS identity or KCC topology.
Fix Event ID 2087: DNS Lookup Failure
You get Event ID 2087 when a destination DC cannot resolve the source DC’s identity for replication. Microsoft explains that AD DS tries the source DC’s GUID-based CNAME in _msdcs, then other names, and replication stops when those lookups fail. This is why a normal host lookup can pass while replication still fails.
Test The GUID CNAME
Pull the failing source GUID from the event text or from repadmin /showrepl, then test the record directly:
$ForestRoot = "contoso.com"
$SourceGuid = "b0069e56-b19c-438a-8a1f-64866374dd6e"
Resolve-DnsName "$SourceGuid._msdcs.$ForestRoot" -Type CNAME
Resolve-DnsName "DC2.contoso.com" -Type A
Run the DNS-focused dcdiag test on both sides. The DCDiag command reference documents /test:DNS and the more specific DNS checks for basic connectivity, dynamic updates, and record registration.
dcdiag /s:DC2 /test:DNS /DnsAll /v /f:C:\Temp\ADReplication\dc2-dns.txt
dcdiag /s:DC3 /test:DNS /DnsAll /v /f:C:\Temp\ADReplication\dc3-dns.txt
Re-Register Records
If the source DC is online but missing records, restart Netlogon on the source DC to re-register its SRV and CNAME records, and run ipconfig /registerdns to refresh its host (A) record:
Invoke-Command -ComputerName DC2 -ScriptBlock {
    Restart-Service -Name Netlogon
    ipconfig /registerdns
}
Then validate DC discovery:
nltest /dsgetdc:contoso.com /force
Clean Up A Dead Source
The dangerous case is an Event ID 2087 source DC that no longer exists. If the source domain controller was rebuilt, renamed, or forcefully removed without cleanup, DNS repair is the wrong move. You need AD DS metadata cleanup so KCC stops trying to replicate with a dead identity. Microsoft documents both RSAT console cleanup and ntdsutil.exe in its AD DS metadata cleanup guidance.
Warning: Do not create random _msdcs records to make 2087 quiet. If the source DC identity is stale, fake DNS just points replication at a lie. Clean the metadata instead.
After DNS is correct, rerun repadmin /showrepl for the affected destination. If the same pair moves from 2087 to 1722, that is progress. Name resolution is fixed, and now the network path is telling you what it could not tell you before.
Fix Event ID 1311: KCC Topology Failure
You see Event ID 1311 when the Knowledge Consistency Checker (KCC) cannot build a replication topology that connects the sites and naming contexts it needs to connect. Microsoft lists common causes: site link bridging on a network that is not fully routed, sites missing from site links, disjoint site links, preferred bridgeheads, and broader replication failures. The Intersite Topology Generator (ISTG) matters here because it owns the intersite connection view for its site.
Check Site Links Before Forcing KCC
Start by checking the site link matrix from the console of the affected domain controller, typically the ISTG for that site:
repadmin /showism
That command exposes how the Intersite Messaging service sees site connectivity. You are looking for orphaned sites, disjointed links, and link costs that do not match the actual WAN. If the network cannot route Site A to Site C, but AD site link bridging says everything is transitive, the KCC can build a connection object that the network will never honor.
Then identify the ISTG for the affected site (repadmin /istg lists the ISTG for each site) and force KCC recalculation only after the site link configuration matches reality:
repadmin /kcc DC1
repadmin /showconn DC1
| 1311 Cause | What To Check | Repair |
|---|---|---|
| Site missing from links | Every site appears in at least one site link | Add the site to the correct link |
| Disjoint site links | Site links form a connected path | Connect the links or create the missing link |
| Bad site link bridging | Physical routing does not match transitive assumptions | Disable bridging or create explicit bridges |
| Preferred bridgehead offline | Server properties in Sites and Services | Remove preferred bridgehead assignments |
| One partition cannot replicate | Which naming context appears in event 1311 | Fix the DC hosting that partition |
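The first two rows of that table are graph problems: a site that appears in no link, or links that do not join into one connected path. The sketch below checks for both by grouping sites into connected components. The function name Get-SiteLinkComponent and the sample sites and links are hypothetical. In a real forest the inputs could be assembled from Get-ADReplicationSite and the SitesIncluded property of Get-ADReplicationSiteLink, which is an assumption about your tooling. It does not model bridging, costs, schedules, or which naming contexts each site hosts.

```powershell
# Hypothetical helper: group sites into connected components using site links.
function Get-SiteLinkComponent {
    param(
        [Parameter(Mandatory)][string[]]$Sites,
        [Parameter(Mandatory)][hashtable]$SiteLinks   # link name -> array of site names
    )

    # Build an adjacency list: sites that share a link are adjacent to each other.
    $adjacent = @{}
    foreach ($site in $Sites) {
        $adjacent[$site] = New-Object System.Collections.Generic.List[string]
    }
    foreach ($link in $SiteLinks.Keys) {
        $members = @($SiteLinks[$link] | Where-Object { $adjacent.ContainsKey($_) })
        foreach ($a in $members) {
            foreach ($b in $members) {
                if ($a -ne $b) { $adjacent[$a].Add($b) }
            }
        }
    }

    # Breadth-first search from each unvisited site yields one component per island.
    $seen = @{}
    foreach ($start in $Sites) {
        if ($seen.ContainsKey($start)) { continue }
        $queue = New-Object System.Collections.Generic.Queue[string]
        $queue.Enqueue($start)
        $seen[$start] = $true
        $component = @()
        while ($queue.Count -gt 0) {
            $current = $queue.Dequeue()
            $component += $current
            foreach ($next in $adjacent[$current]) {
                if (-not $seen.ContainsKey($next)) {
                    $seen[$next] = $true
                    $queue.Enqueue($next)
                }
            }
        }
        # Emit each component as one comma-joined string.
        ($component | Sort-Object) -join ','
    }
}

# Sample topology for illustration only: Lab is not a member of any site link.
$components = @(Get-SiteLinkComponent -Sites 'HQ','Branch1','Branch2','Lab' -SiteLinks @{
    'HQ-Branch1'      = @('HQ','Branch1')
    'Branch1-Branch2' = @('Branch1','Branch2')
})
$components
```

More than one component means the KCC has no path between those groups of sites. In the sample, Lab sits alone, which is the “site missing from links” case.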
Verify Bridgeheads And ISTG Ownership
Be careful with preferred bridgehead servers. They look tidy in documentation, but they remove the KCC’s ability to pick a working bridgehead when the preferred one is down or missing a naming context. If you have a small number of DCs per site, manual bridgehead selection often turns a routine outage into an avoidable topology failure.
Rebuild Convergence Only After The Graph Matches Reality
Once KCC can create valid connection objects, wait for the normal replication interval or use repadmin /syncall after validating DNS and RPC. For 1311, the win is not forcing one connection to work. The win is making the topology accurate enough that KCC can keep it working after you leave.
Recover from USN Rollback Without Making It Worse
You handle Update Sequence Number (USN) rollback differently because the affected domain controller’s database has become untrustworthy. Microsoft describes USN rollback as a silent replication failure that can happen when an older copy of an AD database is incorrectly restored, and its recovery guidance centers on detection, quarantine, and rebuild.
The core mechanism is simple and ugly: replication partners remember the highest USN they accepted from a DC. If that DC is rolled back to an older database state without a new invocation ID, it starts presenting old USN values under the same identity. Partners believe they already have those changes and ignore them.
That means repadmin /showrepl can look clean while users, groups, and passwords drift. When Event ID 2095 appears, do not try to “make replication start” by deleting quarantine signals. Preserve the evidence and remove the DC as a replication source.
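A toy model makes the watermark behavior concrete. The sketch below is not AD DS code and ignores most of the real replication metadata. It only assumes that a partner tracks one highest accepted USN per invocation ID. The function name Get-ChangesToPull, the invocation ID strings, and the USN values are all hypothetical.

```powershell
# Toy model of a replication partner's high-watermark check. Not real AD DS code.
function Get-ChangesToPull {
    param(
        [Parameter(Mandatory)][hashtable]$Watermarks,   # invocation ID -> highest USN already accepted
        [Parameter(Mandatory)][string]$InvocationId,
        [Parameter(Mandatory)][long[]]$SourceUsns       # USNs of the changes the source is offering
    )
    $highest = if ($Watermarks.ContainsKey($InvocationId)) { $Watermarks[$InvocationId] } else { 0 }
    # Only changes above the remembered watermark are requested.
    $SourceUsns | Where-Object { $_ -gt $highest }
}

# The partner has already accepted everything up to USN 5000 from this identity.
$watermarks = @{ 'invocation-A' = 5000 }

# The source is rolled back to an image at USN 4000, then writes 200 new changes (4001..4200).
$afterRollback = 4001..4200

# Same invocation ID: every new change sits below the watermark, so nothing is pulled.
$sameIdentity = @(Get-ChangesToPull -Watermarks $watermarks -InvocationId 'invocation-A' -SourceUsns $afterRollback)

# New invocation ID: the partner has no watermark for it, so every change is pulled.
$newIdentity = @(Get-ChangesToPull -Watermarks $watermarks -InvocationId 'invocation-B' -SourceUsns $afterRollback)

"Pulled with same invocation ID: $($sameIdentity.Count)"
"Pulled with new invocation ID:  $($newIdentity.Count)"
```

The first call returns nothing and raises no error, which is why a rolled-back DC can look healthy in replication output while its new changes never leave it.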
The containment path below is the boundary. After this point, you are protecting the forest from a compromised replica, not repairing the replica in place.
[Image: images/ad-replication-usn-rollback.svg]
Contain The Replica
Use this recovery sequence:
1. Isolate the affected DC from client authentication and administrative changes.
2. Confirm Event ID 2095 and capture the Directory Service log.
3. Forcefully remove AD DS from the affected server if graceful demotion fails.
4. Clean up the old DC metadata from a healthy DC.
5. Seize Flexible Single Master Operations (FSMO) roles if the affected DC held any.
6. Rebuild or re-promote the server as a new domain controller.
Any unique changes made only on the rolled-back DC can be lost. That is not pleasant, but keeping a bad replica alive is worse. The clean recovery point is a new DC identity with healthy inbound replication from known-good partners.
Force Replication Only After the Root Cause Is Gone
You should force replication when you have fixed the reason replication failed and need to validate convergence. You should not force replication when DNS still resolves to stale records, RPC ports are blocked, KCC has invalid site links, or a DC is suspected of USN rollback.
Recalculate Topology If It Changed
After fixing the specific cause, run repadmin /kcc if topology changed:
repadmin /kcc DC1
Sync The Fixed Path
Then synchronize a known-good destination with repadmin /syncall. Use /A for all naming contexts, /e for enterprise-wide partners, /d for distinguished names in output, and /P when you intentionally want push behavior from the specified DC.
repadmin /syncall DC1 /AdeP
Verify Convergence Before You Declare Victory
Validate with the same command that caught the failure:
repadmin /replsummary
repadmin /showrepl DC1
If the failure count stops increasing but the last success time is still stale, wait for the partner to complete or inspect the replication queue:
repadmin /queue DC1
The important habit is symmetry. Capture before, repair one root cause, force only when safe, then validate with the same artifacts. That gives you a defensible incident record instead of a collection of commands you happened to run.
Build Daily Replication Monitoring
You do not need to wait for the help desk to report stale passwords before checking replication. A daily report built from repadmin /showrepl /csv gives you the same evidence you used during the incident, but before the incident becomes visible to users.
Capture The Daily Snapshot
Create a lightweight PowerShell report:
$Path = "C:\Reports\ADReplication"
New-Item -ItemType Directory -Path $Path -Force | Out-Null
# Timestamped snapshot so each run keeps its own evidence file.
$Csv = Join-Path $Path "showrepl-$(Get-Date -Format yyyyMMdd-HHmm).csv"
repadmin /showrepl * /csv > $Csv
# Keep only replication links that report at least one failure.
$Failures = Import-Csv $Csv |
    Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
    Select-Object 'Destination DSA','Source DSA','Naming Context',
        'Last Failure Status','Last Failure Time','Last Success Time'
$Failures | Export-Csv (Join-Path $Path "failures-latest.csv") -NoTypeInformation
Alert On Failures
Turn that into a scheduled task running under a monitored service account, and alert when either condition is true:
-
A destination DC has any replication failure count greater than zero.
-
Any naming context has a last success time older than your acceptable recovery window.
For enterprise sites, add context fields your network and security owners need: source site, destination site, port test result, and whether the failing naming context is domain, configuration, schema, or DNS. That information prevents the familiar ticket loop where every owner asks for the same evidence in a different format.
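The two alert conditions can be evaluated against the same CSV rows. The sketch below is one possible shape. The function name Get-ReplAlert, the 24-hour window, and the sample rows are hypothetical, and it assumes the Last Success Time column parses as a date under the invariant culture. Any value that does not parse is treated as stale.

```powershell
# Hypothetical helper: return rows that breach either alert condition.
function Get-ReplAlert {
    param(
        [Parameter(Mandatory)][object[]]$Rows,
        [Parameter(Mandatory)][timespan]$MaxAge,
        [datetime]$Now = (Get-Date)
    )
    foreach ($row in $Rows) {
        $failures = ($row.'Number of Failures' -as [int])

        # Treat a missing or unparseable timestamp as "never succeeded".
        $last = [datetime]::MinValue
        $parsed = [datetime]::TryParse(
            [string]$row.'Last Success Time',
            [cultureinfo]::InvariantCulture,
            [System.Globalization.DateTimeStyles]::None,
            [ref]$last)
        $stale = (-not $parsed) -or (($Now - $last) -gt $MaxAge)

        if ($failures -gt 0 -or $stale) {
            [pscustomobject]@{
                Destination = $row.'Destination DSA'
                Source      = $row.'Source DSA'
                Context     = $row.'Naming Context'
                Failures    = $failures
                Stale       = $stale
            }
        }
    }
}

# Sample rows and a fixed clock, for illustration only.
$now = [datetime]'2024-06-02T08:00:00'
$sample = @'
Destination DSA,Source DSA,Naming Context,Number of Failures,Last Success Time
DC1,DC2,"DC=contoso,DC=com",0,2024-06-02 07:45:00
DC3,DC2,"DC=contoso,DC=com",3,2024-06-02 07:30:00
DC4,DC1,"CN=Configuration,DC=contoso,DC=com",0,2024-05-30 22:10:00
'@ | ConvertFrom-Csv

$alerts = @(Get-ReplAlert -Rows $sample -MaxAge (New-TimeSpan -Hours 24) -Now $now)
$alerts | Format-Table -AutoSize
```

DC3 alerts on its failure count, and DC4 alerts on age alone even though its failure count is zero. That second case is the one a failure-count filter by itself would miss.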
Keep Enough History
Quick Win: Store daily reports for at least the tombstone lifetime window. When a DC comes back from a long outage, old replication history helps you decide whether it is safe to reconnect.
Monitoring will not prevent every bad snapshot, stale DNS record, or firewall change. It will shorten the time between first failure and first proof, which is the part you actually control.
Close the Incident With Proof
You are done when the forest can prove convergence, not when one command returns green once. Re-run the baseline from the first section and compare it to your evidence capture. repadmin /replsummary and the CSV capture should agree on the same fix.
Re-Run The Baseline
repadmin /replsummary
repadmin /showrepl * /csv > C:\Temp\ADReplication\showrepl-after.csv
dcdiag /e /test:replications /q
For a 1722 incident, prove DNS resolution and RPC reachability from the destination to the source. For a 2087 incident, prove the GUID CNAME, host record, dynamic registration, and dcdiag /test:dns results. For a 1311 incident, prove the site link matrix and KCC connection objects match the routed network. For USN rollback, prove the old DC identity is gone and the rebuilt DC replicates under a new, healthy identity.
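Comparing the two captures can be done mechanically. The sketch below reduces each capture to its failing destination, source, and naming context tuples and reports what was fixed, what is still failing, and what started failing after the repair. The function names Get-FailingTuple and Compare-ReplCapture and the sample rows are hypothetical. Real input would come from Import-Csv on showrepl.csv and showrepl-after.csv.

```powershell
# Hypothetical helper: reduce a capture to its failing destination|source|context tuples.
function Get-FailingTuple {
    param([Parameter(Mandatory)][object[]]$Rows)
    $Rows |
        Where-Object { ($_.'Number of Failures' -as [int]) -gt 0 } |
        ForEach-Object { '{0}|{1}|{2}' -f $_.'Destination DSA', $_.'Source DSA', $_.'Naming Context' }
}

# Hypothetical helper: compare the failing tuples of two captures.
function Compare-ReplCapture {
    param(
        [Parameter(Mandatory)][object[]]$Before,
        [Parameter(Mandatory)][object[]]$After
    )
    $beforeSet = @(Get-FailingTuple -Rows $Before)
    $afterSet  = @(Get-FailingTuple -Rows $After)
    [pscustomobject]@{
        Fixed    = @($beforeSet | Where-Object { $afterSet  -notcontains $_ })
        StillBad = @($beforeSet | Where-Object { $afterSet  -contains    $_ })
        NewlyBad = @($afterSet  | Where-Object { $beforeSet -notcontains $_ })
    }
}

# Sample captures for illustration only.
$before = @'
Destination DSA,Source DSA,Naming Context,Number of Failures
DC3,DC2,"CN=Configuration,DC=contoso,DC=com",5
DC4,DC1,"DC=contoso,DC=com",2
DC1,DC2,"DC=contoso,DC=com",0
'@ | ConvertFrom-Csv

$after = @'
Destination DSA,Source DSA,Naming Context,Number of Failures
DC3,DC2,"CN=Configuration,DC=contoso,DC=com",0
DC4,DC1,"DC=contoso,DC=com",2
DC1,DC2,"DC=contoso,DC=com",0
'@ | ConvertFrom-Csv

$result = Compare-ReplCapture -Before $before -After $after
"Fixed:     $($result.Fixed -join '; ')"
"Still bad: $($result.StillBad -join '; ')"
"Newly bad: $($result.NewlyBad -join '; ')"
```

An empty NewlyBad list matters as much as a populated Fixed list, because it shows the repair did not move the failure somewhere else.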
Write The Repair Record
Finish with a short repair record:
| Record Item | Example |
|---|---|
| Failure scope | DC3 inbound from DC2, configuration partition |
| First failing status | 2087 DNS lookup failure |
| Root cause | Missing _msdcs CNAME after failed DC rebuild |
| Repair | Metadata cleanup for retired DC2 identity |
| Validation | repadmin /showrepl * /csv shows zero failures |
| Prevention | Daily CSV report and DC decommission checklist updated |
Decide Whether The Forest Is Healthy
Active Directory replication troubleshooting is not about memorizing every event ID. It is about reading the failed relationship, proving which layer failed, and refusing to force replication until the directory has a clean path to converge. Do that, and errors 1311, 1722, and 2087 become fixable signals instead of a pile of red events waiting for someone else to own them.