Introduction
In this blog post, I’ll walk you through an interesting case we encountered where a financial transaction processing application, Postilion, failed after the customer applied security patches to their domain controllers. The problem was critical, affecting their entire environment, and the resolution required a deep understanding of authentication protocols and timely action.
Application and Impact
The customer reported that after patching their domain controllers, Postilion, a transactional processing software used by financial institutions, started failing. Postilion relies on Windows authentication to access its database, and the failures caused a major disruption in their environment.
The system administrators initially suspected authentication issues, and they had already spent significant time rolling back and reapplying patches to troubleshoot the issue. Compounding the problem was the fact that the domain controllers hadn’t been patched for several months, leaving the system vulnerable to security threats.
Evidence and Root Cause Analysis
A root cause analysis revealed several key clues. The security event log showed multiple 4625 events, which indicate failed logon attempts. These failures occurred when the system attempted to use the NTLM authentication protocol. The error pointed to a wrong password as the reason for the logon failure, which seemed unusual given that the password hadn’t been changed.
- Event ID: 4625
- Failure Reason: Unknown username or bad password
- Status: 0xC000006D
- Sub Status: 0xC000006A
- Authentication Package: NTLM
At the same time, the database logs showed a similar authentication error with event 17806, stating that the connection handshake failed due to an operating system error.
- Database Event: 17806
- Error: 0x8009030c (SEC_E_LOGON_DENIED)
- Message: The logon attempt failed.
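The codes in both logs are standard Windows status values, and a small lookup makes them easier to read side by side. The sketch below covers only the three codes from this case; the helper name `describe_status` is made up for illustration. NTSTATUS values are 32 bits wide, so the logon-failure status is written in full as 0xC000006D.

```python
# Symbolic names for the status codes that appear in this case.
STATUS_CODES = {
    0xC000006D: "STATUS_LOGON_FAILURE",   # event 4625, Status
    0xC000006A: "STATUS_WRONG_PASSWORD",  # event 4625, Sub Status
    0x8009030C: "SEC_E_LOGON_DENIED",     # SQL Server event 17806
}


def describe_status(code_text: str) -> str:
    """Turn a hex string from a log entry into 'code (NAME)'."""
    code = int(code_text, 16)
    name = STATUS_CODES.get(code, "unknown code")
    return f"0x{code:08X} ({name})"


for field, raw in [
    ("Status", "0xC000006D"),
    ("Sub Status", "0xC000006A"),
    ("SQL error", "0x8009030c"),
]:
    print(f"{field}: {describe_status(raw)}")
```

All three point at the same thing from different layers: the domain controller rejected the credentials it was asked to verify.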
Authentication Flow
To better understand the problem, let’s look at the authentication flow in this setup.
- The servicespostilion service account runs on two application servers, which communicate with two dedicated SQL database servers.
- The authentication method between Postilion and the database servers is Windows authentication using NTLM.
- The service account sends jobs to the database server, which works smoothly until domain controllers are patched.
The typical NTLM authentication flow is as follows:
- Step A: The servicespostilion service sends the username to the database server.
- Step B: The database server sends back a challenge (nonce).
- Step C: The service responds with a challenge-response.
- Step D: The database server sends the username, challenge, and challenge-response to the domain controller for verification.
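Steps A to D can be sketched in a few lines of Python. This is a simplified illustration, not a working NTLM implementation: it uses the NTLMv2 proof calculation (two rounds of HMAC-MD5) as I understand it from the protocol specification. The account name `svc_example`, the domain `EXAMPLE`, and the NT hash are all hypothetical. The random `blob` stands in for the real structure, which carries a timestamp, a client challenge, and target information.

```python
import hashlib
import hmac
import os


def ntlmv2_proof(nt_hash: bytes, user: str, domain: str,
                 server_challenge: bytes, blob: bytes) -> bytes:
    """Compute the 16-byte NTLMv2 proof from the NT hash and the challenge."""
    identity = (user.upper() + domain).encode("utf-16-le")
    response_key = hmac.new(nt_hash, identity, hashlib.md5).digest()
    return hmac.new(response_key, server_challenge + blob, hashlib.md5).digest()


# Hypothetical stand-in for the NT hash the domain controller stores.
NT_HASH = bytes.fromhex("00112233445566778899aabbccddeeff")
DC_ACCOUNTS = {("svc_example", "EXAMPLE"): NT_HASH}


def dc_verify(user: str, domain: str, challenge: bytes,
              blob: bytes, proof: bytes) -> bool:
    """Step D: the domain controller recomputes the proof and compares."""
    stored = DC_ACCOUNTS.get((user, domain))
    if stored is None:
        return False
    expected = ntlmv2_proof(stored, user, domain, challenge, blob)
    return hmac.compare_digest(expected, proof)


# Step A: the service announces who it is.
user, domain = "svc_example", "EXAMPLE"
# Step B: the database server issues an 8-byte challenge.
challenge = os.urandom(8)
# Step C: the service answers using its own copy of the NT hash.
blob = os.urandom(16)  # simplified placeholder for the real blob
proof = ntlmv2_proof(NT_HASH, user, domain, challenge, blob)
# Step D: the database server forwards everything to the domain controller.
good_result = dc_verify(user, domain, challenge, blob, proof)
# A response built from a different hash is rejected.
bad_proof = ntlmv2_proof(bytes(16), user, domain, challenge, blob)
bad_result = dc_verify(user, domain, challenge, blob, bad_proof)
print(good_result, bad_result)
```

The password never crosses the wire, and the domain controller only learns whether the response matches. That is why a response in a format the domain controller does not accept can surface as a "bad password" failure even when the password is correct.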
Action Plan
Given the impact on a production financial system, our priority was to resolve the issue without rolling back the patches and to address the security risks posed by unpatched domain controllers.
The customer scheduled their monthly maintenance window, and we prepared for a night of troubleshooting. I had a theory that the issue was related to NTLM compatibility level mismatches between the domain controllers and the application servers. If that was indeed the case, adjusting the settings should resolve the problem.
However, I also had a backup plan to switch to Kerberos if NTLM continued to fail.
In case you are not familiar with the NTLM protocol and its versions
NTLM (NT LAN Manager) is an authentication protocol used in Microsoft environments. While newer systems favor Kerberos, NTLM is still required in legacy scenarios. NTLM works via a challenge-response mechanism: a client sends a username, receives a challenge (nonce) from the server, and responds with a value computed from that challenge and the hash of its password. The password itself is never sent. The domain controller then verifies the response.
LMCompatibilityLevel and NTLM Versions
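NTLM has two main versions: NTLMv1 (less secure) and NTLMv2 (more secure). The LMCompatibilityLevel setting determines which response types a client sends and which ones a domain controller accepts. Higher values enforce stronger security. For example:
- 0: Clients send LM and NTLM responses; domain controllers accept LM, NTLM, and NTLMv2.
- 3: Clients send NTLMv2 responses only; domain controllers still accept LM, NTLM, and NTLMv2.
- 5: Clients send NTLMv2 responses only; domain controllers refuse LM and NTLM.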
Configuring NTLM via Group Policy
You can manage NTLM through Group Policy:
- Open Group Policy Management.
- Go to Computer Configuration > Windows Settings > Security Settings > Local Policies > Security Options.
- Set Network security: LAN Manager authentication level to "Send NTLMv2 response only. Refuse LM & NTLM", which corresponds to level 5.
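Group Policy shows this setting as a list of named options rather than numbers, so it helps to keep the mapping to the numeric level at hand. The table below is a small reference sketch; the names `LAN_MANAGER_AUTH_LEVELS` and `policy_label` are made up for this example.

```python
# Numeric LMCompatibilityLevel -> option text shown in the Group Policy editor.
LAN_MANAGER_AUTH_LEVELS = {
    0: "Send LM & NTLM responses",
    1: "Send LM & NTLM - use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only. Refuse LM",
    5: "Send NTLMv2 response only. Refuse LM & NTLM",
}


def policy_label(level: int) -> str:
    """Return the Group Policy option text for a numeric level."""
    if level not in LAN_MANAGER_AUTH_LEVELS:
        raise ValueError(f"LMCompatibilityLevel must be 0-5, got {level}")
    return LAN_MANAGER_AUTH_LEVELS[level]


for level in sorted(LAN_MANAGER_AUTH_LEVELS):
    print(level, "->", policy_label(level))
```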
Configuring NTLM via Registry
For smaller setups, use the Registry:
- Open regedit and navigate to:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
- Modify LMCompatibilityLevel (DWORD) and set it to 5.
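Before changing the value, it is worth checking what a server is currently set to. The sketch below reads the value with Python's standard `winreg` module. It assumes it runs locally on the Windows server with enough rights to read the key, and it returns `None` on other platforms or when the value has not been set explicitly. In that case Windows falls back to its built-in default, and Group Policy may override the value.

```python
import sys

LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "LmCompatibilityLevel"


def read_lm_compatibility_level():
    """Return the configured level as an int, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # winreg only exists on Windows
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        # Key or value missing, or access denied.
        return None
    return int(value)


level = read_lm_compatibility_level()
if level is None:
    print("LmCompatibilityLevel: not set or not readable")
else:
    print("LmCompatibilityLevel:", level)
```

Running this on each application server and domain controller gives a quick inventory to compare.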
NTLM version compatibility
NTLM versions have varying degrees of compatibility, which can lead to authentication failures if misconfigured:
- NTLMv1: This older version uses a less secure challenge-response mechanism. It is compatible with servers accepting NTLMv1 or LM responses (lower security).
- NTLMv2: A more secure version, required by higher security levels. It can authenticate with servers set to accept NTLMv2, but is incompatible with servers or clients set to use only NTLMv1 or LM.
- LM (LAN Manager): This is the oldest and least secure method. Modern systems reject it if they are set to enforce NTLMv2 only.
For example, if a domain controller is configured to accept NTLMv2 only (LMCompatibilityLevel 5) but a client sends only LM or NTLMv1 responses (LMCompatibilityLevel 0, 1, or 2), authentication will fail. Both client and server need to be on compatible NTLM versions for successful authentication.
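The compatibility rule can be written as a small check: authentication can only succeed if the client sends at least one response type that the domain controller accepts. The sketch below encodes the documented behavior of the six levels. The function names are illustrative, and the model ignores other factors such as NTLM restriction policies or session security.

```python
def responses_sent(client_level: int) -> set:
    """Response types a client sends at a given LMCompatibilityLevel."""
    if client_level in (0, 1):
        return {"LM", "NTLM"}
    if client_level == 2:
        return {"NTLM"}
    if client_level in (3, 4, 5):
        return {"NTLMv2"}
    raise ValueError(f"invalid level: {client_level}")


def responses_accepted(dc_level: int) -> set:
    """Response types a domain controller accepts at a given level."""
    if dc_level in (0, 1, 2, 3):
        return {"LM", "NTLM", "NTLMv2"}
    if dc_level == 4:
        return {"NTLM", "NTLMv2"}  # refuses LM
    if dc_level == 5:
        return {"NTLMv2"}          # refuses LM and NTLM
    raise ValueError(f"invalid level: {dc_level}")


def can_authenticate(client_level: int, dc_level: int) -> bool:
    """True if the client sends a response type the DC accepts."""
    return bool(responses_sent(client_level) & responses_accepted(dc_level))


for client, dc in [(1, 5), (2, 5), (3, 5), (0, 4)]:
    verdict = "ok" if can_authenticate(client, dc) else "fails"
    print(f"client level {client}, DC level {dc}: {verdict}")
```

Note that the levels do not have to be equal. A client at level 3 works against a domain controller at level 5 because both sides agree on NTLMv2.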
Resolution
Once the servers were patched, I joined the customer’s troubleshooting session and was able to reproduce the failure immediately. The NTLM compatibility level was the first thing I checked, and my suspicions were confirmed. The LMCompatibilityLevel setting was mismatched:
- Application servers (Postilion): NTLM compatibility level set to 1.
- Domain controllers: NTLM compatibility level set to 5.
This mismatch was causing the challenge-response mechanism to fail, resulting in the “bad password” error. Interestingly, the application worked fine before patching, but this discrepancy caused issues after the patches were applied.
We changed the compatibility level on the application servers to 3, so that they send NTLMv2 responses, which the domain controllers at level 5 accept. The system started functioning again right away, and the authentication failures were resolved within 30 minutes. This not only restored functionality but also enhanced the security of the application by using a more secure NTLM version.
Lessons Learned and Final Thoughts
While we always strive to understand every detail of why something works or doesn’t, some nuances remain unexplained. In this case, the fact that the application worked with unpatched domain controllers but failed after patching is still somewhat mysterious. However, in production environments, the immediate priorities are security and making the system work.
For this case, we succeeded in both: we resolved the problem without rolling back patches and strengthened security by using a higher NTLM version. That said, there is always room for improvement. In the future, switching to Kerberos, the industry-standard authentication protocol, would be a better long-term solution.
Conclusion
This case highlights the importance of proper authentication configurations and the potential impact of security patching on enterprise applications. By aligning the NTLM settings across systems, we were able to quickly resolve the issue and ensure the continued operation of a critical financial application. It serves as a reminder to review and understand the authentication mechanisms in your environment, especially when applying patches to domain controllers.
If you’re running applications that rely on NTLM authentication, now might be a good time to review your LMCompatibilityLevel settings across your environment or even start thinking about moving those applications to Kerberos. Have you experienced similar issues post-patching? Drop a comment below or share your story.