Sunday, December 02, 2012

My toolkit for troubleshooting performance issues

For the past couple of years I've been involved in a long-running project to replace many of the software applications for a large hospital network. For much of that time I've been the team lead for the production support team, although more recently my role has shifted to that of the team's designer, using Enterprise Architect to create UML diagrams for various enhancements to deployed modules.

In that time I've done a lot of troubleshooting and fixing of performance issues, ranging from identifying and fixing problems caused by database blocking, to hunting down memory leaks with WinDbg and SOSEX, to investigating the causes of high CPU utilization on app servers.


My approach invariably starts with gathering the right data to pinpoint where the problem lies.

Even in a high-pressure severity 1 outage, this is still my most common starting point. The data allows me to eliminate whole areas of the system from further investigation, and with just a few areas left to look at in detail, I can then delegate in-depth troubleshooting across team members.

In a crisis, the last thing you should do is panic and take a shotgun approach to looking for problems. First make sure you understand the problem; data is the most objective way of doing that. But you can't do that unless you have data readily at hand, along with a toolkit of scripts and habits for rapidly gathering and analyzing it.


The general-purpose tools I keep coming back to are PowerShell and Excel pivot tables.
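As a sketch of how those two fit together (the server names, counter paths and file paths here are placeholders, not our real ones): a few lines of PowerShell can pull performance counters from a set of servers and flatten them into a CSV, which then takes only a couple of minutes to slice by server, counter and time in an Excel pivot table.

    # Sample CPU and disk queue counters from two app servers and flatten
    # the results into a CSV that pivots nicely in Excel.
    $counters = '\Processor(_Total)\% Processor Time',
                '\PhysicalDisk(_Total)\Avg. Disk Queue Length'

    Get-Counter -ComputerName APPSRV01, APPSRV02 -Counter $counters `
                -SampleInterval 15 -MaxSamples 240 |
        ForEach-Object {
            $timestamp = $_.Timestamp
            $_.CounterSamples | ForEach-Object {
                New-Object PSObject -Property @{
                    Timestamp = $timestamp
                    Counter   = $_.Path
                    Value     = [math]::Round($_.CookedValue, 2)
                }
            }
        } |
        Export-Csv C:\PerfData\counters.csv -NoTypeInformation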

PowerShell and Excel are complemented by specific sources of data for each layer of the application, such as:
  • Various SQL Server dynamic management views (DMVs) and query plans for understanding SQL performance problems
  • Our performance logs (saved to the database) for analyzing WCF service call durations
  • WMI for app server and print server performance (there's a sketch of this just after the list)
  • Memory dumps for memory leak issues at the client

In addition to these, there are some other tools that are very useful for getting a gut feel for a problem, such as SCOM (System Center Operations Manager) and Windows Resource Monitor.

SCOM graphs are great for getting a high-level view of what's happening across a range of servers.

Resource Monitor is very useful for seeing what's happening on a server at a point in time. The Process Explorer utility from the Sysinternals suite is more powerful (we have it installed on all our app servers), but I've found that Resource Monitor is usually good enough for my purposes.
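And when there's no GUI handy (or you're working over a slow remote session), a PowerShell one-liner gives a similar point-in-time view; this is just the obvious sketch:

    # Top ten processes by working set, roughly what Resource Monitor's
    # memory view would show you.
    Get-Process | Sort-Object WS -Descending | Select-Object -First 10 `
        Name, Id, @{ n = 'WorkingSetMB'; e = { [math]::Round($_.WS / 1MB) } }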


Interestingly enough, I haven't yet needed to use a code profiler, though this is probably the tool most developers think of first when you mention performance troubleshooting.

Even when the problem is with the code, there are usually other ways of pinpointing which code is at fault.

For example, one of the developers wrote code to hook into the NHibernate layer and write a debug message whenever a possible N+1 selects problem was detected (one query to load a list of parent entities, followed by a separate query for each child). Those kinds of problems can then be picked up by the "new development" teams before they get deployed into the production environment.

We also have a Castle DynamicProxy interceptor which hooks into the web service proxy and logs performance data for every WCF service call made.

We have a monthly release cycle, and after each release we run a query against the performance logs to look for service calls that are performing significantly worse than a month earlier (a month is the only reasonable baseline, since service call volumes differ markedly by day of the week and time of the month). This also helps us to find poorly performing code without the need for a profiler.
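To give a flavour of that query, the comparison boils down to joining two aggregates over the performance log table. The table, column, server and database names below (dbo.WcfPerformanceLog, DurationMs and so on) are invented for this post, as are the date ranges, and running it this way assumes the Invoke-Sqlcmd cmdlet from the SQL Server snap-in:

    # Flag service calls whose average duration is 50%+ worse than the
    # equivalent week a month earlier. Names and dates are illustrative.
    $query = "
        SELECT cur.ServiceMethod,
               cur.AvgMs  AS CurrentAvgMs,
               prev.AvgMs AS PreviousAvgMs
        FROM (SELECT ServiceMethod, AVG(DurationMs) AS AvgMs
              FROM dbo.WcfPerformanceLog
              WHERE LogDate >= '20121126' AND LogDate < '20121203'
              GROUP BY ServiceMethod) cur
        JOIN (SELECT ServiceMethod, AVG(DurationMs) AS AvgMs
              FROM dbo.WcfPerformanceLog
              WHERE LogDate >= '20121029' AND LogDate < '20121105'
              GROUP BY ServiceMethod) prev
          ON prev.ServiceMethod = cur.ServiceMethod
        WHERE cur.AvgMs > prev.AvgMs * 1.5
        ORDER BY cur.AvgMs - prev.AvgMs DESC"

    Invoke-Sqlcmd -ServerInstance SQLSRV01 -Database PerfLogs -Query $query |
        Export-Csv C:\PerfData\release-regressions.csv -NoTypeInformation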


Over the coming months I'd like to share more details with you, including some of the PowerShell and SQL scripts I've written to gather and analyze the data.

