It was one of those days that started as usual. I logged in to my machine to start working. A few hours later I got a message from my colleague about a user complaint. "The system is not working" was the initial symptom, but that's not unusual. At the scale the system was operating, we got similar complaints a few times a week from different users, and they tended to end up being either a misuse of the software or some network issue on the user's end. I stayed calm and reached out to the user to learn more about the issue. I ran some quick tests to evaluate the system state, and it all seemed fine. Additionally, we had background services running similar regression tests periodically to alert us in case of system instability. I did some quick troubleshooting with the user to eliminate common factors, but the issue kept occurring. I asked for more details and dug a bit deeper.
Something doesn’t feel right…
When I checked the "faulty" parts of the system, I noticed something unusual: the system was reporting a last modification date that was supposedly the 11th of November 2024, but it was early February 2024 when this event happened. I sensed some sort of data corruption. One of my latest changes was updating the underlying protocol to handle different data types more efficiently. I was afraid the change was somehow producing invalid data; however, the timestamps were stored separately from the changed parts. That did not seem to be the case, but then what really went wrong? And more importantly, why was the system not functioning as expected for some users while it was working for others?
While all this was going on, I received a bunch of emails and texts from other users, and it seemed the issue was escalating. Something was really off with the system. After talking to my colleague, I decided to raise an incident alert to our users to make them aware of the issue. Comms were done, but accepting the failure doesn't magically fix the issue; I still had to figure out what went wrong.
I decided that it must be related to the "11-11-2024" timestamps. I got curious and checked all the available records that were created "in the future". Querying our database revealed hundreds of records with invalid timestamps, but surprisingly they all seemed to carry the same "11-11-2024" date with the same hour and closely clustered minutes:seconds values. If it had been a magical bit flip or an encoding error, it should have affected either a few records with small, random deviations or a wide range of records with wildly scattered values. However, the corruption seemed to be systematic. What could have gone wrong?
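For illustration, a query along these lines was enough to surface the future-dated records and show how tightly they clustered. This is just a sketch with hypothetical table and column names (records, last_modified_at), not our actual schema:

```sql
-- Hypothetical schema: a "records" table with a "last_modified_at" datetime column.
-- Find rows whose last-modified timestamp lies in the future and see how they cluster.
SELECT
    CAST(last_modified_at AS date)      AS modified_date,
    DATEPART(hour,   last_modified_at)  AS modified_hour,
    DATEPART(minute, last_modified_at)  AS modified_minute,
    COUNT(*)                            AS record_count
FROM records
WHERE last_modified_at > GETUTCDATE()   -- created "in the future"
GROUP BY
    CAST(last_modified_at AS date),
    DATEPART(hour,   last_modified_at),
    DATEPART(minute, last_modified_at)
ORDER BY record_count DESC;
```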
It was not my fault
We were running our service on Windows Servers (unfortunately; I love UNIX systems). I took a look at Event Viewer for any suspicious activity, and indeed I noticed the same pattern of events tagged with "11-11-2024" timestamps in the event log as well. This time it was a relief, at least, because now I knew for sure that it wasn't our service that was at fault; something had gone wrong with the server itself. And indeed, filtering the event log for Event ID 4616 ("The system time was changed") revealed that the system time had indeed changed to the 11th of November around 10am and back to a proper timestamp a few minutes later.
After identifying the issue I sent another email to users informing them about the findings, and let them know that I would have to patch the invalid records manually and might have to take the system down for a few minutes to proceed with the patch. The restart was necessary because, even though we had access to the records through SQL Server, the system used the database only as backing storage and relied on its in-memory data store for primary bookkeeping. I prepared SQL scripts for patching the data and almost managed to invalidate all the valuable metadata, but caught it at the last minute; it's always worth triple-checking your WHERE clause, just in case.
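The patch itself boiled down to an UPDATE scoped by a very deliberate WHERE clause. Here is a rough sketch of the shape of it, again with hypothetical names and a placeholder for the corrected value (the real scripts restored the timestamps differently):

```sql
-- Sketch only: hypothetical table/column names, and the real fix-up logic differed.
BEGIN TRANSACTION;

UPDATE records
SET last_modified_at = SYSUTCDATETIME()          -- placeholder for the corrected value
WHERE last_modified_at >= '2024-11-11'
  AND last_modified_at <  '2024-11-12';          -- only the future-dated rows

-- Triple-check the affected row count (and that WHERE clause) before committing.
-- COMMIT TRANSACTION;   -- or ROLLBACK TRANSACTION if anything looks off
```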
Within around 20 minutes I patched all the corrupted records and restarted the servers. For context, we had 6 instances synchronizing with each other, so all the changes had to be done while the instances were turned off; otherwise corrupted data might have ended up back on the patched instances via data synchronization. I decided to shut down 3 of those servers first and patch them, then shut down the other 3 while switching the first 3 back online, and patch the rest. That way I kept the total downtime to a minimum and managed to patch all the instances.
A few more manual checks to ensure stability, and I sent another email to users letting them know that the issue had been resolved!
Aftermath of the Incident
It was now apparent that the system clock had been changed and caused the incident, but it was still not clear why. My first idea was that the NTP servers had advertised an invalid wall clock value to us. However, if that had been the case, it should have affected the other instances as well; upon checking their event logs, I noticed that only one of the instances had its system clock changed. I was lucky to have checked that machine when I was first trying to find the root cause, because if I had picked another machine, I might not have found the system time change event at all. Nevertheless, it seemed like the NTP server was not the cause, but then why would a completely normal instance suddenly decide to take a trip in time? It was getting late that day and my research hadn't led anywhere, so I decided to take a break and start fresh the next day. This time I decided to just try my luck with a very vague search instead of trying to be clever. I googled "EC2 spontaneous time change", and voila. I found this reddit thread which itself didn't contain much discussion, but it linked to another thread that explained what was going on. It turned out that Windows Server has a feature that adjusts the clock if it decides the system clock could be malfunctioning.
Apparently Windows Server 2016 introduced time accuracy improvements that ironically caused this inaccuracy. With the same set of changes, Secure Time Seeding was also introduced. In theory, if you happen to have a faulty hardware clock and a connection issue to your time server, whether due to the network or due to failing to establish a secure connection (certificate validation itself depends on having a roughly valid system clock), your device may never be able to correct its clock. So Windows tries to be clever: it starts picking up TLS connections it establishes with random endpoints and uses metadata from them to correct its own clock. The exact details are not clear to me, but it seems to collect a bunch of data points, aggregate them, compare the result with the system clock to check whether it has drifted away, and try to fix it. However, in our case, some of the connections to systems with invalid clocks apparently dragged our "correct" system clock away and caused all the mess down the line.
To be completely honest, I am still not sure how to dig even deeper than this and find out exactly which TLS connections caused the issue, or even whether this was indeed what happened to us. However, at this point I am keen to just disable that feature and cross my fingers that it fixes the issue. I literally don't have any more data from the incident to work with; to inspect the TLS connections I would have needed packet capture enabled during the incident, but collecting every packet on the server to catch a one-in-10^N (where N is some big number) event is a lot of wasted storage and wasted performance. On top of that, the services running on the instance are network intensive and use quite a lot of bandwidth, so storing all of that would pile up very quickly, let alone analyzing it later on.
This was a rather unusual experience for me at work that I wanted to share with you all. If you happen to know more about the said issue, I would love to hear from you (drop me a mail: me@themisir.com). I hope this writing didn't bore you (: