Wednesday, December 10, 2008

Oracle Enterprise SQL Tuning, going beyond poorly performing queries

Sooner or later every DBA is faced with tuning an very large application. After tuning the worst performing queries, the way the data is structured, and looking at indexes where do we go? Some opt for more hardware, others start manually looking at the code if possible (which is isn't most of the time). But most of the time when I have run into this problem the amount of time and number of resources assigned to this task is greatly lacking.
My response to this is to try and create some tools that will allow me to look at what the application is doing very quickly. In order to do this I need to have statistics that can be queried many different ways. I need to know which bind variables are used for which columns and how often. I want to be able to see where the problem areas are quickly within an application so I can show development hard numbers to convince them to work on the areas I need them to work on.
The first question I ask myself when looking at any application (in a performance viewpoint) that used to perform but doesn't now is "Is this application scalable?".

To answer that question there are two areas I look at. The first is whether or not the application uses SQL who's block gets will significantly grow as the data grows. A application should be able to keep the same number of users and maintain close to the same performance over time even if the data is growing. The things that cause that not to happen are poorly written SQL, poorly constructed tables and indexes, not having a good (or any) archiving mechanism, and/or not using partitions.

Poorly written SQL can be the easy to spot type where a query keeps coming to the top of the list as reading tons of blocks. Or it can be the not so easy to spot type where the application makes a lot of very small, quick, SQL calls to return all of the data it needs.

If you have the small, quick SQL call problem it's not easy to convince developers to do anything about it. Often this SQL is generated by object to relational mapping tools like Hibernate and changing its behavior is more work for the developer (they use this tool to save themselves work so why would they not use the part of it's functionality the saves them the effort of writing SQL or even thinking about the database?).

Now is a great time to introduce the concept of working with your QA department. Most QA departments have tools that script the testing of an applications functionality. These tools record each test case and the start and stop times. These tools can be used by a DBA to help provide a very detailed analysis of what the application is doing both inside and outside of the database.

You don't have to have automated test scripts, you can do what I suggest without them, but they make it a lot easier.

What we are after is being able to show what SQL is being executed, along with all of that SQL's stats, while the application is doing X for the user. Let's say the test cases are:
1. login to the application
2. search for a product
3. put the product into a cart
4. checkout screen 1
5. checkout screen 2
6. checkout screen 3
7. checkout screen 4

Then during each of those steps we want to be able to to tell what SQL was executed between the time the user clicked 'login' to when the search for a product screen was displayed. We want that for each step listed above. We not only want that but we want to know if there is any SQL going on in between steps.

Once we have that information we can create queries that show the number distinct SQL statements for each step and how many times it was executed. We can show what the elapsed time the user sees verses the elapsed time within the database for each step listed. We can also tell the elapsed time between each query executed within each step.

Actually we can now tell almost anything we want to about how the application is using the database.

If the application is showing the same query being executed hundreds of times with only milliseconds in between executions, then we can deduce this application is probably in a loop trying to pull a lot of data back from the database. This could probably be done in one query instead of many.

If the application is showing a lot of elapsed time between queries within a step. And that step starts when the users clicks a button and stops when a page is displayed, then we know the application is pulling a database record, processing it, and then pulling another one without waiting for user input. At this point I always check with development to see if the application is using threads or not and if the work being done in the database can be parallelized by using threads within the application.

In order to accomplish this we need a few things. We need to be able to upload 10046 trace files in a relational form so that we can query them in a lot of different ways. In one instance I had to work with an application that was performing badly. QTP scripts had already been written to test the application and provide timings...

So, To collect these numbers I modified the QTP scripts. Everytime the QTP scripts made a Start Transaction or End Transaction call within QTP I had the script record the current timestamp, along with the usecase and usecase step names, in a table within the database the application was connected to. 10046 level 12 tracings were enabled in the database for all traffic coming from the box QTP was running on as well as a speical mid-tier box that only the QTP box used.

Once the QTP scripts were finished executing the raw trace files were uploaded into the database in a relational form that provides the ability to link cursors,executions,fetches, parses, binds, and waits together. During the upload of the trace file the timestamp of each individual cursor, parse, fetch, execute, bind, wait, etc.. was calculated by taking the timestamp provided in the tracing file at the top and then calculating the timestamp of all statements by equating that time stamp with the first 'tim' value of an individual record with the first timestamp in the file. Following 'tim' values in records in the file provide the number of 1/1000000 of a second from the first timestamp by substracting the first 'tim' value. The timestamp can then be calculated for each bind, parse, wait, etc.. in the file by taking the difference betweent he first 'tim' value and the current records 'tim' value, dividing it by 1,000,000 and adding it to the timestamp recorded at the top of the raw trace file.

Once timestamps have been calculated the database elapsed time can be calculated by taking all of the records between the start and stopping timestamp values for the usecase or the usecase step for the parse,execute and fetch records and adding their elapsed time ('e' value in the raw trace file) and dividing it by 1,000,000 to get the number of seconds spent in each phase.
By creating this relational form and dividing records up so they can be queried by time we can generate a lot of stats quickly that tell us what the client is doing. We can tell if it's executing one query, processing the information and then executing more. We can tell if the queries are being executed all at once without much time for the client to process the rows being returned. Stats can be generated that tell us how many executions occur for each use case and if database activity is happening outside of the defined usecase steps in the QTP files.

I have written some tools to help do this and I will post them later. I will also show the queries and tables I created to support the relational trace file model.