The performance decline was dramatic and it didn't occur straight after a release. This suggested that something had changed elsewhere in the organisation.
The approach on our project has been to regularly release new modules side-by-side with the legacy system. Integration messages keep the data in the two systems synchronized. This significantly minimizes risk compared to a "big bang" approach.
But some users continue using the old system surreptitiously. Printing is a case in point. Print driver issues had been a frequent source of problems in the past. So some users had gone back to using the old system for printing.
Unbeknownst to us, the legacy team had removed the ability to print documents through the legacy system. So the printing load had suddenly increased by 50%.
This led to other problems in the system. For example, our performance logging queue started building up on one of the app servers. Messages are inserted in parallel, but the logging service is throttled to only process one message at a time. So when the logging service is starved of CPU cycles, it starts falling behind. This leads to a constant drain on CPU, making further "flat-lining" more likely.
We had been aware for some time that printing could cause performance problems. So fortunately I already had a design in place for moving most of the printing load onto the print servers, where CPU was under-utilized. But it would take a week or two to develop and test the code.
As an interim solution we commissioned a few more app servers.
To check whether printing was the sole source of problems, I took some random snapshots of resource monitor graphs on one of the application servers. At times printing was clearly causing the CPU to flatline at 100%. But at other times the app servers flatlined when no printing was taking place. So I suspected that printing was not the only cause of high CPU usage. Instead of app servers being able to recover from short-lived "distress states", the extra printing load had caused performance to reach a runaway situation.
To quantify this, I wrote a Powershell script to use WMI to sample CPU utilization for each process on each app server. I wrote the results to a csv file, opened it in Excel, and generated a pivot table off the data.
This highlighted some other issues that we hadn't been aware of before. For example, there was an SNMP monitoring process, SysEdge, which was consuming 25% CPU on 3 of the app servers for long periods each morning. I alerted the infrastructure team and they were able to fix this issue so that SysEdge was consuming under 1% CPU on all app servers.
You can download the script from this location... Get-CPUUtilizationByServerAndProcess.ps1:
param ( [string[]] $serverNames = $( throw 'The list of server names was not provided' ), [int] $coresPerServer = $( throw 'The number of cores per server was not provided' ), [int] $repetitions = 1, [int] $intervalInSeconds = 60 ) 1..$repetitions | foreach-object { $repetition = $_ [DateTime] $now = [DateTime]::Now Write-Host "Getting CPU utilizations at $now (repetition $repetition)" -foregroundColor Green foreach ($server in $serverNames) { Write-Host " Getting processes on $server" -foregroundColor Cyan $prc = gwmi Win32_PerfFormattedData_PerfProc_Process -computerName $server # To get around bug where it is zero the first time called $prc = gwmi Win32_PerfFormattedData_PerfProc_Process -computerName $server | ? { $_.Name -ne '_Total' -and $_.Name -ne 'Idle' } $recordedAt = [DateTime]::Now $summary = $prc | select-object IDProcess,Name,PercentProcessorTime,WorkingSet,@{n='PercentTime';e={$_.PercentProcessorTime/$coresPerServer}} foreach ($processSummary in $summary) { $processName = $processSummary.Name $processName = ($processName.Split('#'))[0] $percentTime = $processSummary.PercentTime $workingSet = $processSummary.WorkingSet $record = new-object -typeName 'PSObject' -property @{ Server = $server Process = $processName Repetition = $repetition AvgCPU = [Math]::Round( $percentTime, 2 ) WorkingSet = $workingSet RecordedAt = $recordedAt Year = $recordedAt.Year Month = $recordedAt.Month Day = $recordedAt.Day DayOfWeek = $recordedAt.DayOfWeek Hour = $recordedAt.Hour Minute = $recordedAt.Minute } $record } } $timeTillNextRepetition = $now.AddSeconds($intervalInSeconds) - [DateTime]::Now $secondsToWait = $timeTillNextRepetition.TotalSeconds if ($secondsToWait -gt 0) { Start-Sleep -Seconds $secondsToWait } }
A few notes on the Powershell script...
You will notice that I make the WMI call twice. This is because the first call for a particular app server will return all zeroes. I suspect the first call for each app server could be moved outside the loop (e.g. have a loop zero whose data is discarded). But it was good enough for my purposes.
I also kept the following piece of code to illustrate a point...
1..$repetitions | foreach-object { $repetition = $_ ... }I used to write code like this a lot, assigning the $_ iteration variable to another variable. This was for 2 reasons...
Firstly, for debugging purposes. I would assign the variable a value to test. Then copy the remainder of the inner script block into the Powershell REPL.
And secondly, so that I could nest loops but still access the outer iteration variable.
However there is an easier way to get the same effect...
foreach ($repetion in 1..$repetitions) { ... }
We have since gone live with the new printing solution. And so far things are looking very promising. CPU load has only decreased slightly, but its "spikiness" seems to have improved significantly. Average execution times of service calls appear to have decreased by up to 18% at the busiest time of day. And it certainly looks like the print processes are terminating faster than before, presumably because print spooling is now taking place locally.
This is also the first time that we have moved an asynchronous process to a different server, thus improving the performance of the synchronous processes which affect user perception of speed. I can see a lot of opportunity for doing more of this in future.
No comments:
Post a Comment