Sunday, December 16, 2012

Two potato (the basic equations)


All posts in this series:

  • One Potato: A description of the problem
  • Two potato: The basic equations
  • Three potato: An algebraic formula for the last person left
  • Four: A neat calculation method using binary numbers
  • Five potato: F# functions to check the various formulae
  • Six potato: Using a calculation graph to see the pattern
  • Seven potato: Generalizing to other intervals
  • More: An efficient algorithm for arbitrary intervals
  • Online research: In which I discover that it is called the Josephus problem

Introduction

In my previous blog entry I described a problem from the 2010 South African Mathematics Olympiad. In the next few posts I'm going to show you a few ways of solving the problem. But in this blog entry, I'm just going to set up the equations which the various solutions will all make use of.

Let's start with a recap of the problem...



Imagine n people sitting in a circle. Label them 1 to n in a clockwise direction. Imagine walking around the circle tapping every second person on the shoulder (person 2, 4, 6 and so on). People leave the circle when they are tapped. This continues until only one person is left. What is the label of that person?


Derivation


Firstly let's define f(n) to be the label of the last person remaining when there are n people in the circle.

Clearly f(1) = 1.

What we'd like to do is find expressions for f(n) in terms of f(m) where m is smaller than n.

Every second person is asked to leave, so roughly half the people remain after each trip around the circle. This suggests that we should consider the cases of n odd and n even separately.



So let's see what happens when there is an even number of people around the circle, say 2n people.


In our first trip around the circle, the n even-numbered people will leave the circle. This leaves the n odd-numbered people.


Let's re-label these odd-numbered people as 1, 2, ..., n. Notice that the second trip starts exactly like the first: the person with new label 1 is skipped and the person with new label 2 is tapped. So the process continues as if we had started afresh with n people, and by the definition of f(n), the (new) label of the last person remaining is f(n).

But we are interested in the original label of this last person. That is not f(n) but 2.f(n) - 1, because the person with new label k has original label 2k - 1 (the kth odd number).

Hence f(2n) = 2.f(n) - 1.
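
For example, take 10 people (so n = 5). The first trip removes 2, 4, 6, 8 and 10, leaving 1, 3, 5, 7 and 9. Relabelling these as 1 to 5, the survivor has new label f(5) and original label 2.f(5) - 1, i.e. f(10) = 2.f(5) - 1.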



Next we will find a similar formula for circles with an odd number of people.

We already know that f(1) = 1. All other positive odd numbers can be expressed as 2n+1 for some positive integer n.

Initially the situation looks like this for 2n+1 people:

[Diagram: 2n+1 people in a circle, labelled 1 to 2n+1]

We would like to express f(2n+1) in terms of f(n). That means still having n people in the circle after the first round of removals.

To do this we need to remove n+1 people. The first n people removed are the even-numbered ones, as before. But the first trip does not stop there: after person 2n is tapped, person 2n+1 is skipped, and so the (n+1)th person removed is person number 1. This leaves the following arrangement:

[Diagram: the remaining n people, labelled 3, 5, 7, ..., 2n+1]

Note that the kth person in the new circle has an original label of 2k+1.

So we can re-label the remaining people from 1 to n as before. Once again the eliminations continue exactly as if we were starting afresh (person 3, the new person 1, is skipped next), so f(n) is the new label of the last person remaining. In the original labelling scheme, this becomes 2.f(n) + 1.

So we have shown that f(2n+1) = 2.f(n) + 1.
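
As a check, take 11 people (so n = 5). The removals in the first trip are 2, 4, 6, 8, 10 and then 1, leaving 3, 5, 7, 9 and 11. Relabelling these as 1 to 5 gives f(11) = 2.f(5) + 1.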


The basic equations

If we put all of this together, we get the following set of equations (where n is a positive integer):

$$ \begin{align*} f(1) &= 1 \tag{1} \\ f(2n) &= 2 \cdot f(n) - 1 \tag{2} \\ f(2n+1) &= 2 \cdot f(n) + 1 \tag{3} \end{align*} $$
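
These equations are easy to turn into a recursive function. Here's a minimal PowerShell sketch to sanity-check them (the function name is my own; F# versions follow later in the series):

function Get-LastPerson {
    param ([int] $n)
    if ($n -eq 1) { return 1 }                         # equation (1)
    if ($n % 2 -eq 0) {
        return 2 * (Get-LastPerson ($n / 2)) - 1       # equation (2): f(2n) = 2.f(n) - 1
    }
    return 2 * (Get-LastPerson (($n - 1) / 2)) + 1     # equation (3): f(2n+1) = 2.f(n) + 1
}

1..10 | foreach-object { "f($_) = $(Get-LastPerson $_)" }

Running this prints f(10) = 5, which matches tracing the eliminations by hand.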

Next time

In my next few blog posts I will use these equations to provide a formula for f(n).



Sunday, December 02, 2012

One potato (the problem statement)


All posts in this series:

  • One Potato: A description of the problem
  • Two potato: The basic equations
  • Three potato: An algebraic formula for the last person left
  • Four: A neat calculation method using binary numbers
  • Five potato: F# functions to check the various formulae
  • Six potato: Using a calculation graph to see the pattern
  • Seven potato: Generalizing to other intervals
  • More: An efficient algorithm for arbitrary intervals
  • Online research: In which I discover that it is called the Josephus problem

Introduction

Some months back I discovered the web site of the SA Math Foundation, including past papers for the South African Mathematics Olympiad. Back in high school I used to love solving Mathematics problems. So I downloaded a bunch of question papers and started solving some of them.

This series of blog postings describes some interesting solutions I found to one of the problems.

[Edit: I've since discovered that it is known as the Josephus problem]


The problem statement

One of the first papers I glanced at was the 2010 senior paper, second round.

The first question described a situation in which there are 10 people in a circle, and every 2nd person is asked to leave the circle. This continues until only one person remains. If we number the people 1 to 10, and eliminate person 2 then 4 and so on, what is the number of the last person left?

So it works much like the children's game "One potato, two potato", except that every 2nd person is eliminated, not every 8th person.



The plan

Later I was thinking back to the problem and made a fortuitous mistake: I misremembered the problem as being about 2010 people, not 10 people (Math Olympiad problems often work the current year into the problem statement).

So I ended up solving the problem in general, not just for the number 10 (which can easily be solved by brute force, as the sketch below shows).
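
In fact, brute force takes only a few lines. Here is a rough PowerShell sketch (the function name is mine; proper F# checking functions come later in the series):

function Get-LastPersonByBruteForce {
    param ([int] $n)
    # Label the people 1..n, then repeatedly skip one person and remove
    # the next, wrapping around the circle, until one label remains
    $people = [System.Collections.Generic.List[int]](1..$n)
    $index = 0
    while ($people.Count -gt 1) {
        $index = ($index + 1) % $people.Count   # step past the person who is skipped...
        $people.RemoveAt($index)                # ...and remove the person who is tapped
    }
    $people[0]
}

Get-LastPersonByBruteForce 10   # outputs 5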

In the next few blog postings I intend to show you a few solutions to the problem. As I did with the Pyramid of Hexagons problem, I want to provide an algebraic proof as well as intuitive insight into why the result is true.

So my plan is as follows:
  • First I will generate the basic equations which all the solutions will use
  • Thereafter I will provide my first algebraic solution
  • Then I will try to give some insight into an interesting pattern behind the formula (hint: there is a link to the binary number system).
  • Then I will give a simpler proof inspired by this pattern.
  • And finally I will provide the code I wrote to generate the illustrations.


Disclaimer

While the visualization should be very helpful, it's not going to be as striking as my Pyramid of Hexagons visualization.

So I'm hoping that a reader of this blog will be able to find an even more elegant way of explaining the formula.


A PowerShell script to sample CPU utilizations by process

We recently had an issue at work where CPU utilizations on our app servers started "flat-lining" at 100%. This starved our WCF service hosts of CPU cycles, leading to complaints of slow performance.

The performance decline was dramatic and it didn't occur straight after a release. This suggested that something had changed elsewhere in the organisation.


The approach on our project has been to regularly release new modules side-by-side with the legacy system. Integration messages keep the data in the two systems synchronized. This significantly minimizes risk compared to a "big bang" approach.

But some users continued using the old system surreptitiously. Printing is a case in point: print driver issues had been a frequent source of problems in the past, so some users had gone back to using the old system for printing.

Unbeknownst to us, the legacy team had removed the ability to print documents through the legacy system. So the printing load had suddenly increased by 50%.

This led to other problems in the system. For example, our performance logging queue started building up on one of the app servers. Messages are inserted in parallel, but the logging service is throttled to process only one message at a time. So when the logging service is starved of CPU cycles, it starts falling behind. This leads to a constant drain on CPU, making further "flat-lining" more likely.


We had been aware for some time that printing could cause performance problems. So fortunately I already had a design in place for moving most of the printing load onto the print servers, where CPU was under-utilized. But it would take a week or two to develop and test the code.

As an interim solution we commissioned a few more app servers.


To check whether printing was the sole source of problems, I took some random snapshots of Resource Monitor graphs on one of the application servers. At times printing was clearly causing the CPU to flatline at 100%. But at other times the app servers flatlined when no printing was taking place. So I suspected that printing was not the only cause of high CPU usage: instead of the app servers being able to recover from short-lived "distress states", the extra printing load had pushed performance into a runaway situation.

To quantify this, I wrote a PowerShell script that uses WMI to sample CPU utilization for each process on each app server. I wrote the results to a CSV file, opened it in Excel, and generated a pivot table from the data.

This highlighted some other issues that we hadn't been aware of before. For example, there was an SNMP monitoring process, SysEdge, which was consuming 25% CPU on 3 of the app servers for long periods each morning. I alerted the infrastructure team and they were able to fix this issue so that SysEdge was consuming under 1% CPU on all app servers.


You can download the script from this location... Get-CPUUtilizationByServerAndProcess.ps1:

param (
    [string[]] $serverNames = $( throw 'The list of server names was not provided' ),
    [int] $coresPerServer = $( throw 'The number of cores per server was not provided' ),
    [int] $repetitions = 1,
    [int] $intervalInSeconds = 60
)

1..$repetitions | foreach-object {
    $repetition = $_
    [DateTime] $now = [DateTime]::Now
    Write-Host "Getting CPU utilizations at $now (repetition $repetition)" -foregroundColor Green

    foreach ($server in $serverNames)
    {
        Write-Host "    Getting processes on $server" -foregroundColor Cyan
        
        $prc = gwmi Win32_PerfFormattedData_PerfProc_Process -computerName $server  # To get around bug where it is zero the first time called
        $prc = gwmi Win32_PerfFormattedData_PerfProc_Process -computerName $server | ? { $_.Name -ne '_Total' -and $_.Name -ne 'Idle' }
        $recordedAt = [DateTime]::Now
        
        # PercentProcessorTime is measured per core (so it can total 100% per core);
        # divide by the core count to get each process's share of total server CPU
        $summary = $prc | select-object IDProcess,Name,PercentProcessorTime,WorkingSet,@{n='PercentTime';e={$_.PercentProcessorTime/$coresPerServer}}
        
        foreach ($processSummary in $summary)
        {
            # Strip the "#1", "#2" suffix that WMI appends when several processes share a name
            $processName = ($processSummary.Name.Split('#'))[0]
            $percentTime = $processSummary.PercentTime
            $workingSet = $processSummary.WorkingSet
            
            $record = new-object -typeName 'PSObject' -property @{
                Server = $server
                Process = $processName
                Repetition = $repetition
                AvgCPU = [Math]::Round( $percentTime, 2 )
                WorkingSet = $workingSet
                RecordedAt = $recordedAt
                Year = $recordedAt.Year
                Month = $recordedAt.Month
                Day = $recordedAt.Day
                DayOfWeek = $recordedAt.DayOfWeek
                Hour = $recordedAt.Hour
                Minute = $recordedAt.Minute
            }
            $record    # emit the record to the output pipeline
        }
    }
    
    # Sleep until the next sampling interval is due (if any time remains)
    $timeTillNextRepetition = $now.AddSeconds($intervalInSeconds) - [DateTime]::Now
    $secondsToWait = $timeTillNextRepetition.TotalSeconds
    if ($secondsToWait -gt 0)
    {
        Start-Sleep -Seconds $secondsToWait
    }
}
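
To capture the output, pipe the script into Export-Csv. For example (the server names and core count below are placeholders for your own environment):

.\Get-CPUUtilizationByServerAndProcess.ps1 -serverNames 'AppServer1','AppServer2' -coresPerServer 8 -repetitions 30 -intervalInSeconds 60 |
    export-csv -path CpuSamples.csv -noTypeInformation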

A few notes on the PowerShell script...

You will notice that I make the WMI call twice. This is because the first call for a particular app server returns all zeroes. I suspect the priming call for each app server could be moved outside the loop (e.g. a "loop zero" whose data is discarded), but the double call was good enough for my purposes.
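
Something like this priming pass might work, though I haven't tested it:

foreach ($server in $serverNames)
{
    # Prime the performance counters once per server, discarding the all-zero
    # results, so that the main loop only needs a single WMI call per server
    $null = gwmi Win32_PerfFormattedData_PerfProc_Process -computerName $server
}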


I also kept the following piece of code to illustrate a point...
1..$repetitions | foreach-object {
    $repetition = $_
    ...
}
I used to write code like this a lot, assigning the $_ iteration variable to another variable. This was for two reasons...

Firstly, for debugging purposes: I would assign the variable a test value, then copy the remainder of the inner script block into the PowerShell REPL.

And secondly, so that I could nest loops but still access the outer iteration variable.

However there is an easier way to get the same effect...

foreach ($repetition in 1..$repetitions) {
    ...
}


We have since gone live with the new printing solution. And so far things are looking very promising. CPU load has only decreased slightly, but its "spikiness" seems to have improved significantly. Average execution times of service calls appear to have decreased by up to 18% at the busiest time of day. And it certainly looks like the print processes are terminating faster than before, presumably because print spooling is now taking place locally.

This is also the first time that we have moved an asynchronous process to a different server, thus improving the performance of the synchronous processes which affect user perception of speed. I can see a lot of opportunity for doing more of this in future.


My toolkit for troubleshooting performance issues

For the past couple of years I've been involved in a long-running project to replace many of the software applications for a large hospital network. For much of that time I've been the team lead for the production support team (although more recently my role has changed to being the team's designer, using Enterprise Architect to create UML diagrams for various enhancements to deployed modules).

Over that period I've spent a lot of time troubleshooting and fixing performance issues. This has ranged from identifying and fixing problems caused by database blocking, to hunting down memory leaks with windbg and sosex, to investigating the causes of high CPU utilization on app servers.


My approach invariably starts with finding a way to gather the right data to allow me to pinpoint where the problem lies.

Even in a high-pressure severity 1 outage, this is still my most common starting point. The data allows me to rule out whole areas of the system. With just a few areas left to look at in detail, I can then delegate in-depth troubleshooting across team members.

In a crisis, the last thing you should do is panic and take a shotgun approach to looking for problems. First make sure you understand the problem; data is a very objective way of doing this. But you can't do that unless you have data readily at hand and a toolkit of scripts and habits for rapidly gathering and analyzing that data.


The general-purpose tools that I keep coming back to are PowerShell and Excel pivot tables.

These are complemented by specific sources of data for each layer of the application, such as:
  • Various SQL dynamic management views (DMVs) and query plans for understanding SQL performance problems
  • Our performance logs (saved to the database) for analyzing WCF service call durations
  • WMI for app server and print server performance
  • Memory dumps for memory leak issues at the client

In addition to this, there are some other tools that are very useful for getting a gut feel for a problem, such as SCOM (System Center Operations Manager) and Windows Resource Monitor.

SCOM graphs are great for getting a high level view of what's happening across a range of servers.

Resource Monitor is very useful for seeing what's happening on your servers at a point in time. The Process Explorer utility from the sysinternals suite is more powerful (we have it installed on all our app servers). But I've found that Resource Monitor is usually good enough for my purposes.


Interestingly enough, I haven't yet needed to use a code profiler, though this is probably the tool most developers think of first when you mention performance troubleshooting.

Even when the problem is with the code, there are usually other ways of pinpointing which code is at fault.

For example, one of the developers wrote code to hook into the NHibernate layer and write a debug message whenever a possible N+1 selects problem was detected. So those kinds of problems can be picked up by the "new development" teams before they get deployed into the production environment.

We also have a Castle DynamicProxy interceptor which hooks into the web service proxy and logs performance data for every WCF service call made.

We have a monthly release cycle, and after each release we run a query against the performance logs to look for service calls which are performing significantly worse than a month earlier (the previous month is the only reasonable baseline, since service call volumes differ significantly by day of the week and time of the month). So this also helps us to find poorly performing code without the need for a profiler.


Over the coming months I would like to share more details with you, including some of the PowerShell and SQL scripts I've written to gather and analyze the data.


UPDATE: