That Time our Devices had the Wrong Time

Photo by Kevin Ku / Unsplash

Some years ago I inherited a network that did not adhere to best practices. With a mix of Windows Server machines and Windows Enterprise devices, one would expect the network to use a proper domain setup (Group Policy, centralized management, remote software installation). Active Directory was configured and functioning, but my predecessors had decided that instead of joining devices to the domain, they would add wildcard credentials to Windows Credential Manager to access shared resources. To handle printers and shared drives, they used a .bat script. Printer management was just as dodgy: user accounts were manually created in the web interface of the MFPs to match domain credentials.

This setup was far from ideal and gave me anxiety. I overhauled the environment, making sure all shared devices were updated to the latest Windows version (they were four major releases behind and physically deteriorating). Most importantly, all new devices were domain-joined from the start. While this transition came with the expected teething issues, one glitch caught me by surprise: some devices’ clocks would randomly jump forward by 45 minutes and then correct themselves a few minutes later.

I reviewed configurations, logs, and settings to pinpoint the issue and was stumped until I checked the hosts running the guest servers. They were running ESXi in a failover cluster with a shared SAN. Despite everything else, this part of the setup was impressive. However, the ESXi hosts had no proper time server configured, and their own clocks were wrong. As a result, the hosts adjusted the Domain Controller's clock at the virtualized hardware level, and the Domain Controller then propagated the incorrect time to the domain-joined client devices. Shortly after, the Domain Controller would synchronize with Microsoft’s NTP servers, which corrected the time on the clients, only for the cycle to eventually repeat.

Once I configured the ESXi hosts to use a time server and synchronized them properly, the issue was resolved. The network ran smoothly until those servers were eventually retired.
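
For anyone chasing a similar symptom, a quick way to confirm that a machine's clock has wandered is to compare it against an NTP server. The following is a minimal Python sketch of a one-shot SNTP query, not the tooling I used at the time. The function names are hypothetical, and `pool.ntp.org` is only an assumed reference server; substitute whatever time source your network trusts.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (two 32-bit words) to Unix seconds."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(packet, t_sent, t_received):
    """Estimate (server time - local time) in seconds from a 48-byte SNTP reply.

    t_sent and t_received are local Unix timestamps taken just before
    sending the request and just after receiving the reply.
    """
    if len(packet) < 48:
        raise ValueError("NTP reply is shorter than 48 bytes")
    words = struct.unpack("!12I", packet[:48])
    t_server_rx = ntp_to_unix(words[8], words[9])    # receive timestamp
    t_server_tx = ntp_to_unix(words[10], words[11])  # transmit timestamp
    # Standard NTP offset estimate; symmetric network delay cancels out.
    return ((t_server_rx - t_sent) + (t_server_tx - t_received)) / 2


def query_offset(server="pool.ntp.org", timeout=2.0):
    """Send one SNTP request and return the estimated clock offset in seconds."""
    request = b"\x23" + 47 * b"\x00"  # LI=0, version=4, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t_sent = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t_received = time.time()
    return clock_offset(packet, t_sent, t_received)
```

Calling `query_offset()` on an affected client and on the Domain Controller during one of the jumps should report an offset of roughly 2700 seconds (45 minutes) if they share the problem described above. Running it from a scheduled task and logging the result would show when the jumps occur.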

A few years and several failed disks later, along with random purple screens of death on one host (it's an ESXi thing), we transitioned to a single server with RAID for redundancy and a good enough support plan. The new setup used Microsoft technologies, including Hyper-V, and wasn't as mission-critical given our increasing reliance on cloud services.