I was recently faced with a problem most IT pros can relate to and thought I'd share. I've been working on an obscure Active Directory secure channel problem on a production server that intermittently decides to lose its domain trust. It's one of those problems that come and go: when it breaks, all hell breaks loose, but it isn't broken all the time. No one knows when it's going to break, and the only temporary "fix" is a reboot. Of course, once you reboot, you no longer have a problem to troubleshoot.
I'm now in a constant battle with users. They won't let me leave it broken for any period of time, yet they want the problem fixed. Something that seems so obvious to me, accepting some short-term pain (leaving it broken) for long-term gain (permanently fixing the problem), just doesn't apply to this set of users. It's driving me crazy! Anyway, I always try to squeeze some good out of situations like this, so I decided to document a few steps that might help others track down intermittent problems.
Set appropriate expectations
One of the big problems in my case was that the users had been dealing with this for so long that they were fed up. They wanted the problem fixed, and fixed NOW. On the flip side, they also wanted 100% uptime, so whenever the problem occurred the server would be immediately rebooted, leaving no time for troubleshooting.
Attempt to define some kind of recurrence predictability
When this problem first started happening, the helpdesk would light up and technicians would simply reboot the server to "fix" it. No log was kept of when it happened. After the 6th occurrence, management got involved and a task force was put together to find the root cause and fix it. That meeting turned out to be helpful: from everyone's recollections I was able to work out a rough but semi-reliable estimate of how often the problem recurred.
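Even if all you have is a list of reboot times from helpdesk tickets, a few lines of scripting can turn them into a recurrence estimate. Here's a minimal Python sketch of that idea; the timestamps below are made up for illustration, not the real incident data:

```python
from datetime import datetime

# Hypothetical incident log: reboot times pulled from helpdesk tickets.
incidents = [
    "2015-03-02 09:14", "2015-03-09 11:40", "2015-03-17 08:55",
    "2015-03-24 10:05", "2015-03-31 09:30", "2015-04-07 13:20",
]

times = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in incidents]

# Gap between consecutive incidents, in days.
gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]

print(f"min gap:  {min(gaps):.1f} days")
print(f"max gap:  {max(gaps):.1f} days")
print(f"mean gap: {sum(gaps) / len(gaps):.1f} days")
```

For this sample data the gaps cluster around seven days, which is exactly the kind of rough recurrence window that makes it possible to plan for the next occurrence instead of being surprised by it.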
Document events during the reported timeframes
In this case, I noticed that Event ID 5719 and multiple group policy warnings would always show up in the event log around the same time the problem was reported to be occurring.
Immediately try to figure out a way to define "broke"
In this instance, the initial problem was that a custom line-of-business (LOB) application on the server was failing to connect to another domain member server. Users would call in when that application went down. That was all I had to go on. However, once I found that set of common events in the event log, I was able to build a pattern of what "broke" meant.
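Once the common events are identified, "broke" stops being a vague complaint and becomes a testable condition. Below is a Python sketch of that idea; the record layout and the Group Policy event IDs (1055, 1058) are illustrative assumptions on my part, while NETLOGON Event ID 5719 is the one actually observed in this case. In practice the records would come from an event log export rather than a hard-coded list:

```python
from datetime import datetime, timedelta

# Hypothetical parsed System-log records around one outage.
events = [
    {"id": 5719, "source": "NETLOGON",    "time": datetime(2015, 4, 7, 13, 2)},
    {"id": 1055, "source": "GroupPolicy", "time": datetime(2015, 4, 7, 13, 4)},
    {"id": 1058, "source": "GroupPolicy", "time": datetime(2015, 4, 7, 13, 6)},
]

def looks_broken(records, window=timedelta(minutes=15)):
    """Return True when a NETLOGON 5719 is followed by at least one
    Group Policy warning inside the window -- the event pattern that
    came to define "broke" in this case."""
    netlogon = [e["time"] for e in records if e["id"] == 5719]
    gp = [e["time"] for e in records if e["source"] == "GroupPolicy"]
    return any(
        0 <= (g - n).total_seconds() <= window.total_seconds()
        for n in netlogon for g in gp
    )

print(looks_broken(events))  # True for this sample
```

The point is not the exact IDs but the shape: a predicate you can run against the log on demand, instead of waiting for users to call in.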
Set up a monitor to proactively know when the problem occurs again
I used a simple tool my client already had called PA Monitor, which let me set up a monitor on the System event log watching for the common event ID I had been seeing. I also enabled Netlogon debug logging on the member server to capture additional detail.
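If you don't have a monitoring product handy, the core of what PA Monitor did for me can be approximated with a simple poll loop. This is a toy Python sketch, not PA Monitor's actual mechanism: `check` stands in for whatever query detects the "broke" pattern (on a real member server, a query against the System event log), and `alert` for whatever gets the on-call tech's attention:

```python
import time

def watch(check, alert, interval_s=60, max_polls=None):
    """Poll `check` and fire `alert` the first time it returns True.

    check  -- zero-argument callable, True when the problem is present
    alert  -- zero-argument callable, e.g. send an e-mail or a page
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if check():
            alert()
            return True   # problem detected, stop polling
        polls += 1
        time.sleep(interval_s)
    return False          # gave up without seeing the problem
```

The value of even a crude monitor like this is that the clock starts when the problem starts, not when users notice, which matters when the troubleshooting window is as short as it was here.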
Work as quickly as possible immediately following notification
There was typically a 30-45 minute lag between when the problem actually occurred and when the helpdesk would start getting lit up. That lag was the window I had to work in.
Don't lose your shit
It can get extremely frustrating knowing that if you only had a little bit more time to troubleshoot you might be able to fix the root cause but can't because you're not allowed to keep it down. Push through, have patience and your hard work will eventually pay off!