Monday morning – all domain services are down. No email, no logons – nothing. A crisis like this is not the time to learn about Active Directory recovery. The key to preventing an Active Directory Domain Services (AD DS) disaster is to understand and mitigate common risks beforehand. The first section of this article covers those risks and how to prevent them. If a disruption does occur, familiarity with the recovery tools and a rehearsed recovery plan will make the process much smoother; the second section covers those tools.
Only you can prevent AD Domain Service Failures
There are many causes for AD DS failures. Some, like natural disasters, are hard to control for. Others, like database corruption, are unpredictable. Many, though, are the result of human error.
The number one thing you can do to prevent AD DS failures is to eliminate, or at least tightly restrict, the use of privileged AD accounts, specifically members of the Domain Admins group or higher.
Your privileged accounts, whether accidentally or maliciously, can break AD by doing any of the following:
- Deleting AD objects
- Making direct OS changes to DCs. This can make directory service files unavailable to the DC.
- Making untested software changes (updates, new software, uninstalls, etc.)
The first problem can be solved (or at least reduced) by enabling the AD Recycle Bin and enabling the Protect from Accidental Deletion ACE on sensitive objects. For extra safety, you can also hide sensitive objects from users who have no need to see them. The AD Recycle Bin requires the Windows Server 2008 R2 forest functional level or higher. Protect from Accidental Deletion is available on any domain level. If you don’t have the fancy check box under the Object tab, add a Deny ACE for Everyone covering Delete and Delete subtree to every object that you want to protect.
The Protect from Accidental Deletion permission on the Domain Controllers OU
Our second problem can be lessened by using Server Core installations and restricting access to the domain controllers. The vast majority of your interaction should be done from a privileged workstation through RSAT. If you are using Remote Desktop to log on to a DC just to launch AD Users and Computers or the AD Administrative Center, you are doing it wrong.
The third problem can be controlled as well. Do not install any extra software on your domain controllers unless absolutely required. Run your DCs under a Server Core installation. Finally, use update rings (staged updates) to test changes against a less critical DC group before expanding the deployment.
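As a rough illustration of update rings, here is a minimal Python sketch. The ring layout and DC names are hypothetical, and the health status is something you would feed in from your own checks; the only point is the gating logic: a ring is patched only after every earlier ring has been patched and verified.

```python
from typing import Dict, List, Optional

# Hypothetical ring layout, least critical DCs first.
RINGS: List[List[str]] = [
    ["DC-LAB01"],
    ["DC-BRANCH01", "DC-BRANCH02"],
    ["DC-HQ01", "DC-HQ02"],
]


def next_ring(rings: List[List[str]],
              patched_ok: Dict[str, bool]) -> Optional[List[str]]:
    """Return the next ring to patch.

    patched_ok maps a DC name to True (patched and healthy) or
    False (patched but unhealthy). DCs not in the map are unpatched.
    Returns None when the rollout is finished or must halt.
    """
    for ring in rings:
        states = [patched_ok.get(dc) for dc in ring]
        if all(state is True for state in states):
            continue          # ring fully patched and healthy, move on
        if any(state is False for state in states):
            return None       # a DC in this ring is unhealthy: halt
        return ring           # ring still has unpatched DCs: do it next
    return None               # every ring is done


if __name__ == "__main__":
    print(next_ring(RINGS, {}))                    # first ring
    print(next_ring(RINGS, {"DC-LAB01": True}))    # second ring
    print(next_ring(RINGS, {"DC-LAB01": False}))   # halt
```

In practice the same gating is usually expressed through your patching tool's deployment groups and deadlines rather than a script; the sketch only shows why the least critical DCs go first.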
The three problems above do not apply only to the other administrators in your domain; they apply to you as well! I have often seen organizations go through the process of removing domain administrator level privileges from everyone but the person who started the project. Getting to a least privileged administration setup takes a decent amount of work, but the rewards are worth it. The specific steps are outside the scope of this article, but you can use this overview guide on TechNet to learn more.
With the advent of domain controller virtualization, your virtualization administrators (or even backup administrators) can wreak havoc as well, simply by rolling a domain controller's virtual machine back to a snapshot on the virtual host. The resulting problem is known as update sequence number (USN) rollback, and it can stop AD replication completely. In fact, Microsoft lists three key causes for directory service disruptions and all three can be caused by USN rollback.
For the curious, those three causes are:
- Rollback to a known good point in time
- Corruption that is localized to a domain controller
- Corruption that has replicated (the worst-case scenario)
The safest way to prevent USN rollback is to run your domain controllers on Windows Server 2012 or above. Windows Server 2012+ is cloning, virtualization, and snapshot aware. This safety is achieved through the introduction of the msDS-GenerationId attribute (which is local to each DC and is not replicated). Your hypervisor also needs to expose a VM-GenerationID so that the domain controller can store the value in the msDS-GenerationId attribute and notice when it changes. VMware and Hyper-V both support this (and have for years now).
The msDS-GenerationId attribute on a Windows Server 2012+ domain controller
If your domain controllers are not running Windows Server 2012 or above, it is best not to use snapshots at all. This applies to snapshots created through your virtualization platform and to snapshots that might be created by an automated backup.
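To make the generation ID safeguard a little more concrete, here is a toy Python model of the comparison. It is not how AD DS is implemented; the class and field names are made up for illustration. It only shows the idea: the DC compares the ID supplied by the hypervisor with the one it stored, and on a mismatch it takes a new invocation ID and discards its RID pool before accepting further changes.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ToyDomainController:
    """Toy stand-in for a virtualized DC (names are illustrative only)."""
    stored_generation_id: str   # plays the role of msDS-GenerationId
    invocation_id: str          # identifies this copy of the AD database
    rid_pool_valid: bool = True

    def check_generation(self, hypervisor_generation_id: str) -> bool:
        """Return True if a snapshot rollback or clone was detected."""
        if hypervisor_generation_id == self.stored_generation_id:
            return False  # nothing changed underneath us

        # The VM was rolled back or copied: present a new database
        # identity to replication partners and stop using the old RID pool.
        self.invocation_id = str(uuid.uuid4())
        self.rid_pool_valid = False
        self.stored_generation_id = hypervisor_generation_id
        return True


if __name__ == "__main__":
    dc = ToyDomainController(stored_generation_id="gen-1",
                             invocation_id="inv-1")
    print(dc.check_generation("gen-1"))  # normal boot
    print(dc.check_generation("gen-2"))  # boot after a snapshot revert
```

Because the invocation ID changes, partners treat the reverted DC as a new replication source instead of trusting its stale USNs, which is what prevents the rollback from silently blocking replication.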
We have covered several common reasons for AD domain service failures (both simple and complex). Keeping those in mind, let’s look at the built-in tools that you will use in an actual failure.
Tools to Test and Recover from AD Domain Service Failures
When discussing domain service failures, problems can generally be grouped into replication issues and problems specific to a single DC. Troubleshooting replication will always involve the Repadmin and DCDiag tools. Both become available once RSAT is installed. Go ahead and launch a command prompt as a standard user. Note that you will see a few errors in the DCDiag output, because some tests require a domain administrator.
Working with Repadmin will give you a picture of replication in the environment. The three most common command parameters that you will use are:
- Repadmin /replsummary: This summarizes replication health for each source and destination DC, including the largest sync delta and the number of recent failures against total attempts.
- Repadmin /showrepl: This lists each inbound replication partner per directory partition, along with the partner's DSA object GUID, the time of the last attempt, and the result of that attempt. If a single partner or partition is failing to replicate, you will know exactly which one it is.
- Repadmin /replicate: Manually starts replication of a directory partition from a source DC to a destination DC.
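If you want to report on replication from a script, one option is to have Repadmin emit CSV (repadmin /showrepl * /csv) and filter it. The Python sketch below assumes the column names used by English-language builds of Repadmin ("Number of Failures", "Source DSA", and so on); check them against your own output before relying on it. The sample data is made up.

```python
import csv
import io
from typing import Dict, List


def failing_links(showrepl_csv: str) -> List[Dict[str, str]]:
    """Return the rows of 'repadmin /showrepl * /csv' that report failures."""
    reader = csv.DictReader(io.StringIO(showrepl_csv))
    failures = []
    for row in reader:
        count = (row.get("Number of Failures") or "0").strip()
        if count.isdigit() and int(count) > 0:
            failures.append(row)
    return failures


# Made-up sample in the assumed column layout.
SAMPLE = """showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,DC01,"DC=example,DC=com",HQ,DC02,RPC,0,0,2024-01-08 08:00:00,0
showrepl_INFO,HQ,DC01,"DC=example,DC=com",Branch,DC03,RPC,5,2024-01-08 07:45:00,2024-01-07 22:00:00,1722
"""

if __name__ == "__main__":
    for link in failing_links(SAMPLE):
        print(link["Source DSA"], "->", link["Destination DSA"],
              "failures:", link["Number of Failures"],
              "status:", link["Last Failure Status"])
```

On a machine with RSAT installed, you could feed it live data with subprocess.run(["repadmin", "/showrepl", "*", "/csv"], capture_output=True, text=True).stdout instead of the sample.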
DCDiag can also be run from your administrative machine. To specify a domain controller to run it against, use the /S: parameter. When faced with a disaster, you can find the troublesome DC by looping through the available DCs. Tools like DCDiag and Repadmin can even be integrated into PowerShell for quicker reporting. Before you are forced to learn Repadmin or DCDiag in a crisis, spend some time familiarizing yourself with both tools and their switches.
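Looping through DCs can be scripted in whatever language you are comfortable with. Here is a small Python sketch; the DC names are hypothetical, and the parsing assumes the English "passed test" / "failed test" result lines that DCDiag prints, so treat it as a starting point and not a finished tool.

```python
import re
import subprocess
from typing import Dict, List

# Matches result lines such as:
#   ......................... DC01 failed test Replications
RESULT_LINE = re.compile(r"\.+\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)")


def dcdiag_command(dc: str) -> List[str]:
    """Build the DCDiag command line for one domain controller."""
    return ["dcdiag", f"/s:{dc}"]


def failed_tests(dcdiag_output: str) -> List[str]:
    """Return the names of the tests that DCDiag reported as failed."""
    return [m.group(3) for m in RESULT_LINE.finditer(dcdiag_output)
            if m.group(2) == "failed"]


def sweep(dcs: List[str]) -> Dict[str, List[str]]:
    """Run DCDiag against each DC (needs Windows with RSAT installed)."""
    results = {}
    for dc in dcs:
        proc = subprocess.run(dcdiag_command(dc),
                              capture_output=True, text=True)
        results[dc] = failed_tests(proc.stdout)
    return results


if __name__ == "__main__":
    sample = (
        "   ......................... DC01 passed test Connectivity\n"
        "   ......................... DC01 failed test Replications\n"
    )
    print(failed_tests(sample))
```

Calling sweep(["DC01", "DC02"]) returns a dictionary of failed test names per DC, so the DC with the longest list is usually the first one to look at.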
In a worst-case scenario, you may have to initiate an Active Directory restore. AD restores are either authoritative or nonauthoritative. An authoritative restore means that the restored data on the DC is treated as the desired, most current copy. For example, after an authoritative restore of an OU on a specific DC, the restored OU replicates out to all other DCs and overwrites their copies. That DC is literally the authority for that object.
Nonauthoritative restores are normally used for hardware or critical OS issues. Problems specific to one DC are easier to resolve if you think of DCs as disposable (it really helps if they are virtual). Once the DC is restored, all changes made since the backup replicate in from its partners, and the DC is brought back in line with the rest of the domain. With either restore mode, you will have to boot the DC into Directory Services Restore Mode (DSRM). Make sure that you know the DSRM password for the DC now instead of when you need it!
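For an authoritative restore, the step that marks objects as authoritative is done with ntdsutil while the DC is in DSRM, after the backup itself has been restored. The Python sketch below only builds the lines you would type at the ntdsutil prompt; it does not run anything. The OU name is hypothetical, and you should confirm the exact steps against Microsoft's documentation for your OS version before using them.

```python
from typing import List


def authoritative_restore_script(subtree_dn: str) -> List[str]:
    """Return the ntdsutil prompt commands that mark a subtree authoritative."""
    if '"' in subtree_dn:
        raise ValueError("distinguished name must not contain double quotes")
    return [
        "activate instance ntds",
        "authoritative restore",
        f'restore subtree "{subtree_dn}"',
        "quit",   # leave the authoritative restore menu
        "quit",   # leave ntdsutil
    ]


if __name__ == "__main__":
    # Hypothetical OU used purely as an example.
    for line in authoritative_restore_script("OU=Sales,DC=example,DC=com"):
        print(line)
```

Keeping these lines in your recovery runbook, with your real distinguished names filled in ahead of time, saves you from looking up syntax during an outage.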
How prepared are you?
A successful Active Directory recovery depends on what you do now more than the specific situation later. Most common AD disruptions can be prevented or reduced. Those that can’t are much easier to handle if you use the steps listed in this guide and prepare in advance.