The United States Digital Service is building an application called Caseflow, which is intended to make the process for managing veterans’ appeals more efficient by increasing productivity, preventing errors, and automating or eliminating manual steps.
The digital services team deploys changes to the production application daily, according to Alan Ning, a reliability engineer with the Digital Service at Veterans Affairs team.
However, before the team was able to make updates this quickly, an elusive bug was preventing them from performing rolling deployments into the Amazon Web Services AutoScaling group. The team would make an improvement, the application would work for exactly five minutes, and then the server would crash.
“This meant that our systems were down and preventing Veterans’ Appeals from being certified by Region Offices around the entire nation,” Ning wrote in a blog post. “We were only causing further frustration as a result of our daily application downtime. Despite wanting to deploy daily to production, we could only stand to deploy once a week during off hours. We panicked more and more with each deployment failure, and with little time to think things through, we had to reboot until the problem went away.”
Ning said the team knew that the problem was related to the connection to Veterans Affairs’ Oracle database. The team drew up a list of possible causes and worked through them one by one. They checked that the Oracle database was not silently dropping their session, that the problem was not specific to their version of Linux, and that the site-to-site Virtual Private Network was not dropping the connection.
“The most interesting aspect of this bug was the magical five-minute death mark,” Ning said.
With the help of an “ancient” feature in the Linux kernel, the team traced the root cause to the Cisco routers and fixed the problem by upgrading the firmware.
“While you may encounter problems like this elsewhere, USDS teams solve them every day knowing that the problem affects not just a nebulous release date, but potentially the lives of hundreds of thousands of people,” Ning said.