SQL Server Performance tuning approach – deduce information from wait statistics data

In our last post, we started a journey for SQL Server DBAs so that they can establish an approach for performance tuning issue. We went through some of the basic wait types are and shared links for downloading queries to gather those.

Now that you have gathered total wait statistics from a SQL Server instance, it is time to understand what they are trying to tell us.

Below are few of the examples of wait statistics gathered from some SQL instances. Let’s look into them and try to make that data into useful information.

Example 1 –

Here top wait is CXPACKET which is consuming 76% of total waits. Other wait types are contributing to rest of the waits.

What information can be deduced from this:

  • High CXPACKET points us that there may be an issue with parallelism setting. So we need to check MAXDOP and Cost threshold for parallelism.
  • BACKUPBUFFER and BACKUPIO wait are generated during backup operations and mostly are seen when the database size is big or you are having third party tool like commvault to take the backups.
  • PAGEIOLATCH_SH denotes that waits happens when SQL is waiting for pages to be fetched from disk. It may be due to disk latency and you may need to check that using PERFMON. But, more often then not, PAGEIOLATCH along with CXPACKET wait are due to bad index optimization.
  • ASYNC_NETWORK_IO normally points towards a network bandwidth issue or a client application server is slow to process the data sent by database server.

Key take away and pointer to next phase of troubleshooting:

  • Check MAXDOP and Cost Threshold for Parallelism values. Check best practices for MAXDOP. For Cost Threshold for Parallelism don’t use default value of 5. Increase the value to 40-50 and then monitor the performance. If needed, optimize this value according to your environment.
  • For PAGEIOLATCH_SH, check for any average disk latency. More often then not, do optimize your indexes. Check for unused indexes, missing indexes and duplicate indexes. Please note that older the SQL instance restart time, better is the data to bank upon.

We will talk about indexes more in later posts.

Example 2 –

In this case, highest wait type is BACKUPIO (~31%) followed by BACKUPBUFFER(~12%) and ASYNC_IO_COMPLETION (~11%).

  • For BACKUPIO and BACKUPBUFFER, we know from previous explanation what it means.
  • ASYNC_IO_COMPLETION wait, by definition, occurs when a task is waiting for I/Os to finish. This can be due to a slow application server processing the data slowly or there can be a disk related issue which needs to be taken care off.
  • SOS_SCHEDULER_YIELD happens when a CPU thread yield the scheduler after completing its quantum period of 4 ms. This may point towards high CPU utilization.
  • OLEDB waits normally happens when there are lots of queries using Linked Server.
  • WRITELOG wait simply signifies that it is taking too much of time to write logs into physical disk.
  • Other waits like PAGEIOLATCH_SH and ASYNC_NETWORK_IO are almost same as we discussed earlier.

Key take away and pointer to next phase of troubleshooting:

  • Here ASYNC_IO_COMPLETION and WRITELOG are directly pointing that I/Os are taking time to complete. The reason could be a disk latency or more often a bad performing log file (.ldf file). In my experience, many times, the main reason has been high VLF count which causes slow performance of log file and hence causing other waits.
  • SOS_SCHEDULER_YIELD also signifies that there have been too many quantum cycles for a thread. It may be due to pressurized CPU, which may be caused by bad indexes or may be due to slow performing log files.

Here in both cases, we found that individual wait type may not describe the underlying issue. You may have to read all the waits and deduce what they are pointing us towards.

There is a story in these wait statistics and we as a DBA must be able to summarize that. As a golden rule, more you will work on these, better your understanding will become.

Do share what wait statistics you are getting in your environment and share us in the comment section. We will do best to summarize those in a meaningful way so that you can optimize performance in your environment.