Some Background (Windows Workflow)
At company x we have been using Microsoft's Windows Workflow Foundation, hosted and scaled across a number of Azure Web Instances. In other words, when there is a problem with the workflows it adversely affects the end users. It is a pretty flawed design, but it is what we had been working with for some time.
In general, Windows Workflow is a highly effective way to manage complex processes in software. It offers a clear overview of highly complex systems and a highly scalable runtime engine out of the box. This guy (Blake Helms) is a major fan, and so is the CTO at my company.
Of course, Windows Workflow does have some peculiarities. See these excellent posts on the dreaded tight loop and managing the workflow persistence store.
A Case Study
On Thursday afternoon our web instances started bouncing wildly between 50% and 100% CPU usage, our users were complaining that the site was "slow", and restarting one web instance at a time only solved the problem temporarily. Looking through the logs, it seemed that this had been going on for a few days, to a greater or lesser extent depending on load. The same pattern was occurring each day, getting worse and more noticeable: the overall CPU usage would ramp up and then ramp back down again, temporarily disabling web instances.
After spending some time on this with our normal analysis techniques, we could find no problem with the workflows. Looking at the workflows coming in every couple of minutes showed that they were being processed as expected. I suggested that we see what happened when the users logged off at the end of the day, hoping that we would get a clearer picture out of hours.
Accept The Problem
Dev-ops problems tend to be unpopular in organisations. Most people dislike being disturbed, of course, but it is also the stressful and fraught nature of such incidents that causes developers to shy away.
I personally believe that there are huge incentives to take on real live problems, but to reap the benefits they must be seen through from start to finish. Like many things in computer science, when a problem is hard, do more of it! This, I guess, is why we are seeing more developers become dev-ops specialists.
The Danger
The danger with all dev-ops work is that we spot the problem and find a work-around, because that gets the situation off our backs.
This behaviour can create the "dev-ops infinite loop" because the real source of the problem isn't fed back into the development process.
As I write this now, it sounds obvious: find the issue and make sure there is a pull request in the next release which fixes it. What's the problem?
Well, it is quite possible that you have not considered where or how the situation came about in the first place. In larger organisations a particular release may have involved many people, which could mean that vital information is lost or unavailable to you. It could also be that information has been forgotten completely because the release cycle is too long or too complex.
Understanding Why - Continuing The Case Study
By Friday the team was starting to get desperate: the problem was not going away and we still could not understand why it was happening. There seemed to be no difference between out-of-hours and in-hours behaviour. A few theories were put forward, along with minor bug fixes in the code. One of them concerned a bug which had been present in the system since almost the first release; let's call this "Theory 1":
- Email templates are being compiled through the Razor engine every time an email is sent. This causes the CPU usage on the web instance to ramp up because it is such a resource-heavy operation.
If you have worked with Razor in a highly scaled system, you may have also come across this problem. A fix was developed and released over the weekend: the email Razor templates would now be pre-compiled on app startup, thus reducing the load and solving the problem.
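To illustrate the shape of that fix, here is a minimal sketch of compiling templates once at startup and reusing them for every email. It is not our production code: the `ICompiledTemplate` interface and the compile delegate are placeholders for whichever Razor hosting library you happen to use.

```csharp
using System;
using System.Collections.Concurrent;

// Placeholder for a compiled template from your Razor hosting library.
public interface ICompiledTemplate
{
    string Run(object model);
}

public class EmailTemplateCache
{
    private readonly ConcurrentDictionary<string, ICompiledTemplate> _cache =
        new ConcurrentDictionary<string, ICompiledTemplate>();
    private readonly Func<string, ICompiledTemplate> _compile;

    public EmailTemplateCache(Func<string, ICompiledTemplate> compile)
    {
        _compile = compile;
    }

    // Called once on app startup so the expensive compile step is paid up front.
    public void Warm(params string[] templateNames)
    {
        foreach (var name in templateNames)
        {
            _cache.TryAdd(name, _compile(name));
        }
    }

    // Sending an email now only runs an already-compiled template.
    public string Render(string templateName, object model)
    {
        var template = _cache.GetOrAdd(templateName, _compile);
        return template.Run(model);
    }
}
```

The point is simply that the per-email cost drops to a dictionary lookup plus a template run, instead of a full Razor compilation.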
However, come Monday morning this did not calm the server instances down. Every time we spun up a new server instance the load would be OK for a while, but then it would start bouncing up and down again, giving our users a poor experience on the site.
A Moment Of Monday Clarity
"Theory 2":
- There is a workflow running which is blocking the web instances and causing the increase in CPU usage.
We had to accept on Monday that it was back to the drawing board. The workflows had been disregarded as a potential cause because the monitoring tools we had written to check the database showed that they were being processed very efficiently. However, there seemed to be no other plausible explanation.
Further analysis of the "workflow instance table" in the database showed that a growing number of workflows of one particular type were being fired at very regular intervals. We were able to identify this by changing our workflow monitoring tools to look at the number of queued workflows of each type, rather than just at how many workflows were suspended or overdue.
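The monitoring change amounted to grouping the instance table by workflow type. The sketch below shows the idea; the table name, column names and status value are placeholders, since every workflow persistence schema names these differently.

```csharp
using System;
using System.Data.SqlClient;

// Rough sketch: count queued workflows per type instead of only looking at
// suspended or overdue instances. Table/column names are placeholders.
public static class WorkflowQueueMonitor
{
    public static void PrintQueueDepthByType(string connectionString)
    {
        const string sql = @"
            SELECT WorkflowType, COUNT(*) AS Queued
            FROM WorkflowInstances            -- placeholder table name
            WHERE Status = 'Queued'           -- placeholder status value
            GROUP BY WorkflowType
            ORDER BY Queued DESC;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader.GetString(0), reader.GetInt32(1));
                }
            }
        }
    }
}
```

A view like this makes a backlog of one workflow type stand out immediately, even when nothing is technically suspended or overdue.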
The Temporary Fix
On Monday evening we were able to push a small fix which restricted the workflow in question so that it only ran outside office hours. We hoped this would get the users and the CEO off our backs (yes, it got pretty serious!), but we were still not entirely clear on what the root cause of the problem was.
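A simplified sketch of that stop-gap is below. The office-hours window and the scheduling delegate are illustrative only; the real fix lived inside our workflow scheduling code.

```csharp
using System;

public static class OffPeakGate
{
    // Assumed office hours for illustration: 08:00 to 18:00 local time.
    private static readonly TimeSpan OfficeStart = TimeSpan.FromHours(8);
    private static readonly TimeSpan OfficeEnd = TimeSpan.FromHours(18);

    public static bool IsOutOfHours(DateTime now)
    {
        return now.TimeOfDay < OfficeStart || now.TimeOfDay >= OfficeEnd;
    }

    // Only kick off the heavy workflow when users are unlikely to be online.
    public static void MaybeSchedule(Action scheduleWorkflow)
    {
        if (IsOutOfHours(DateTime.Now))
        {
            scheduleWorkflow();
        }
        // Otherwise skip; the work is picked up by the next out-of-hours run.
    }
}
```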
Finding A Lasting Solution
The final solution to the problem was eventually deduced through some more thorough detective work. The kind of detective work that is hard to do properly when under pressure.
What went into the last release that could be very resource hungry? And what could cause this hunger to increase gradually over the days that followed?
The answer turned out to be a new, more generic workflow which had been written to replace a number of other workflow processes. There was actually nothing wrong with what the workflow was doing; the only error the developer had made was failing to consider its impact when deployed "en masse", if you will. The workflow had been written by a developer who had left the company some months before, and something like this would have been tricky to predict.
Areas For Improvement
I would say that this tale lays bare some quite common problems within software organisations.
- Before the software was released, there was no effort made to load test the solution.
- When the release was promoted to production, so many changes had been made by so many developers over so much time that no "scrum master" or technical leader knew enough about what was in the release.
- After the release was running, the tools for monitoring the solution were not sensitive or sophisticated enough to show a larger than expected increase in server resource usage.
- The problem went unnoticed for too long.
- When the problem was identified, developers hoped that it would just "go away" rather than pursuing it vigorously enough.
- Developers put too much faith in a single solution working, without developing alternative theories.
- Nobody in the team was calm enough to look through the new feature list and point out which features carried risk.
What has happened since this incident:
- Each release goes through a load test in UAT before being promoted to Production.
- Releases are kept much smaller and more frequent.
- The workflow database monitoring tools have been improved.
- We gave ourselves a kick and tried to learn as much as we could from the case.
- The incident resolution process has been reviewed.