Sunday, May 9, 2010

Optimizing MySQL: Queries and Indexes

You know the scene. The database is just too slow. Queries are queuing up, backlogs growing, users being refused connection. Management is ready to spend millions on "upgrading" to some other system, when the problem is really that MySQL is simply not being used properly. Badly defined or non-existent indexes are one of the primary reasons for poor performance, and fixing these can often lead to phenomenal improvements. Consider an extreme example:

CREATE TABLE employee (
   employee_number char(10) NOT NULL,
   firstname varchar(40),
   surname varchar(40),
   address text,
   tel_no varchar(25),
   salary int(11),
   overtime_rate int(10) NOT NULL
);
To find employee Fred Jone's salary(employee number 101832), you run: SELECT salary FROM employee WHERE employee_number = '101832'; MySQL has no clue where to find this record. It doesn't even know that if it does find one matching, that there will not be another matching one, so it has to look through the entire table, potentially thousands of records, to find Fred's details.
An index is a separate file that is sorted, and contains only the field/s you're interested in sorting on. If you create an index on employee_number, MySQL can find the corresponding record very quickly (Indexes work in very similar ways to an index in a book. Imagine paging through a technical book (or more often, an scrambled pile of notes!) looking for the topic "Optimizing MySQL". An index saves you an immense amount of time!
Before we repair the table structure above, let me tell you about a most important little secret for anyone serious about optimizing their queries: EXPLAIN. EXPLAIN shows (explains!) how your queries are being used. By putting it before a SELECT, you can see whether indexes are being used properly, and what kind of join is being performed...
For example:

EXPLAIN SELECT employee_number,firstname,surname FROM employee WHERE employee_number= '10875';
+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    2 | where used |
+----------+------+---------------+------+---------+------+------+------------+
So what are all these things?
  • table shows us which table the output is about (for when you join many tables in the query)
  • type is an important one - it tells us which type of join is being used. From best to worst the types are: system, const, eq_ref, ref, range, index, all
  • possible_keys Shows which possible indexes apply to this table
  • key And which one is actually used
  • key_len give us the length of the key used. The shorter that better.
  • ref Tells us which column, or a constant, is used
  • rows Number of rows mysql believes it must examine to get the data
  • extra Extra info - the bad ones to see here are "using temporary" and "using filesort"
Looks like our query is a shocker, the worst of the worst! There are no possible keys to use, so MySQL has to go through all the records (only 2 in this example, but imagine a really large table).
Now lets add the index we talked about earlier.
If we re-run the EXPLAIN, we get:
+----------+-------+---------------+---------+---------+-------+------+-------+
| table    | type  | possible_keys | key     | key_len | ref   | rows | Extra |
+----------+-------+---------------+---------+---------+-------+------+-------+
| employee | const | PRIMARY       | PRIMARY |      10 | const |    1 |       |
+----------+-------+---------------+---------+---------+-------+------+-------+
The query above is a good one (it almost falls into the category of "couldn't be better"). The type of "join" (not really a join in the case of this simple query) is "const", which means that the table has only one matching row. The primary key is being used to find this particular record, and the number of rows MySQL thinks it needs to examine to find this record is 1. All of which means MySQL could have run this query thousands of times in the time it took you to read this little explanation. 

Some knowledge of how indexes work allows you to use them more efficiently. Firstly, note that when you update a table with an index, you have to update the index as well, so there is a performance price to pay. But unless your system runs many more inserts than selects and the inserts need to be quick, and not the selects, this is a price worth paying.
What about if you want to select on more than one criteria? (As you can see, it only makes sense to index those fields you use in the WHERE clause.) The query:
SELECT firstname FROM employee;
makes no use of an index at all. An index on firstname is useless. But,
SELECT firstname FROM employee WHERE surname="Madida";
would benefit from an index on surname.
Let's look at some more complex examples where EXPLAIN can help us improve the query. We want to find all the employees where half their overtime rate is less than $20. Knowing what you do, you correctly decide to add an index on overtime_rate, seeing as that's the column in the where clause.
ALTER TABLE employee ADD INDEX(overtime_rate);
Now let's run the query.
EXPLAIN SELECT firstname FROM employee WHERE overtime_rate/2<20;

+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    2 | where used |
+----------+------+---------------+------+---------+------+------+------------+
Not good at all! Every single employee record is being read. Why is this? The answer lies in the "overtime_rate/2" part of the query. Every overtime_rate (and hence every record) has to be read in order to divide it by 2. So we should try and leave the indexed field alone, and not perform any calculations on it. How is this possible? This is where your school algebra comes to the rescue! You know that 'x/2 = y' is the same as 'x = y*2'.We can rewrite this query, by seeing if the overtime_rate is less than 20*2. Let's see what happens.
EXPLAIN SELECT firstname FROM employee WHERE overtime_rate<20*2;
+--------+-------+---------------+---------------+---------+------+------+----------+
|table   | type  | possible_keys | key           | key_len | ref  | rows |Extra     |
+--------+-------+---------------+---------------+---------+------+------+----------+
|employee| range | overtime_rate | overtime_rate |       4 | NULL |    1 |where used|
+--------+-------+---------------+---------------+---------+------+------+----------+
Much better! MySQL can perform the 20*2 calculation once, and then search the index for this constant. The principle here is to keep your indexed field standing alone in the comparison, so that MySQL can use it to search, and not have to perform calculations on it.
You may say that I was being unfair, and should have phrased the request as "where the overtime rate is less than 40", but users seem to have a knack of making a request in the worst way possible!


Ordering by surname is a common requirement, so it would make sense to create an index on surname. But in this example our employee table consists of thousands of people from Swaziland, and with the surname "Dlamini". So we need to index on firstname as well. The good news is that MySQL uses leftmost prefixing, which means that a multi-field index A,B,C will also be used to search not only for a,b,c combinations, but also A,B as well as just A.
In our example, this means that an index of the type
ALTER TABLE employee ADD INDEX(surname,firstname);
is used for a queries such as
EXPLAIN SELECT overtime_rate FROM employee WHERE surname='Madida';
as well as
EXPLAIN SELECT overtime_rate FROM employee WHERE surname='Madida' and firstname="Mpho";
which both result in

+--------+------+-------------------+---------+---------+-------+------+-----------+
|table   | type | possible_keys     | key     | key_len | ref   | rows |Extra      |
+--------+------+-------------------+---------+---------+-------+------+-----------+
|employee| ref  | surname,surname_2 | surname |      41 | const |    1 |where used |
+--------+------+-------------------+---------+---------+-------+------+-----------+
However, the query
EXPLAIN SELECT overtime_rate FROM employee WHERE firstname='Mpho';
does not use an index, as firstname is not available from the left of the index, as shown below.

+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    3 | where used |
+----------+------+---------------+------+---------+------+------+------------+
If you needed this kind of query, you would have to add a separate index on firstname.

The Query Optimizer, OPTIMIZE and ANALYZE

The magic inside MySQL that decides which keys, if any, to use to in the query, is called the query optimizer. It takes a quick glance at the index to see which indexes are the best to use. Compare it to searching for a CD by the artist "Savuka", called "Third World Child", where there are 2 indexes, one alphabetical by artist name, and the other by album name. At a glance, you see that there are 20000 unique artists, and 400000 unique albums, so you decide to search by artist. But if you knew that there were 50 Savuka albums, and that Third World child is the only album starting with "T", your search criteria would change. You can provide similar information for the Optimizer by running
ANALYZE TABLE tablename;
This stores the key distribution for the table (running ANALYZE is equivalent to running myisamchk -a or myismachk --analyze).
Many deletes and updates leave gaps in the table (especially when you're using varchar, or in particular text/blob fields). This means there are more unnecessary disk I/O's, as the head needs to skip over these gaps when reading. Running
OPTIMIZE TABLE tablename
solves this problem. Both of these statements should be run fairly frequently in any well looked after system.
Another factor that most people don't use when indexing is to take advantage of short indexes. You don't have to index on the entire field. Our surname and firstname fields are 40 characters each. That means the index we created above is 80 characters. Inserts to this table then also have to write an additional 80 characters, and selects have 80 character blocks to maneuvre around (disk I/O is the primary hardware bottleneck, but that's for another day!). Try reducing the size of your index - in the example above, rather use.
ALTER TABLE employee ADD INDEX(surname(20),firstname(20));
Now our updates write to an index half the size, and selects have a smaller index to search. Both will be faster (unless you make the indexes too short - imagine a book index, instead of giving the full word, only contained the first letter of the word!. You'd spend a lot of time looking up "semaphore" and "saxophone" when you actually wanted "SQL". Don't do the same to MySQL!
The same applies to the original field definitions. In these days of ample disk space, we don't often worry about space. But smaller usually means faster, so defining our surname and firstname fields as CHAR (255) would be a mistake if the biggest firstname is never more than 20 characters! You don't want to cut names off, but remember that you can ALTER the field later if conditions change, and you need to allow for more characters. I also suggest using VARCHAR rather than CHAR (variable length characters rather than fixed length characters), even though many don't recommend this as they are more subject to fragmentation. I overcome this by using OPTIMIZE often. 


Most systems need to be highly optimized for selects - take a news site which performs millions of queries per day, but where the data arrives in large batches of text files. So for parts of the day, inserts need to be optimal, without noticeably affecting the millions trying to access the data...
Assuming a nicely formatted '|' delimited text file that we want to insert into the table above, take this piece of PHP code:
if (!($fp = fopen("datafile.txt","r"))) {               // open the file for reading
  print "\nUnable to open datafile.txt for writing";    // display error
  exit();                                       // end the running of the program
}


while (!feof ($fp)) {                   // loop through the file line by line
  $sline = fgets($fp, 4096);            // put the data into the variable $sline
  $sline = chop($sline);                        // remove the newline
  list($eno,$fname,$sname,$telno,$salary) = split("|",$code);
                        // split the line on "|", populating the ind. variables
  $db->query("insert into employee(employee_number,firstname,surname,
tel_no, salary
   values($eno,'$fname','$sname','$tel_no', $salary)"); 
}                                               // end while loop
This would work, but would be very slow. The index buffer would be flushed after every insert. Until recently, MyISAM tables (The MySQL default) did not allow data to be inserted at the same time as being read. The new format does, but only if there are no deleted records present (highly unlikely in a heavily used system). So the entire table is locked for the duration of each insert. Fine on a low volume site, but when you're getting hundreds or thousands of queries per second, you'll soon notice the backlog!
There's a solution however - the best way to insert the data is to use MySQL's "LOAD DATA INFILE". This is much faster (20 times according to MySQL), and the only way some systems I've seen are still hanging in there!
The syntax is simple, and the code becomes a lot simpler too:
$db->query("LOAD DATA INFILE 'datafile.txt' INTO TABLE employee (employee_number,firstname,surname,tel_no,salary) FIELDS TERMINATED BY '|'");
LOAD DATA INFILE has defaults of:
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
if you don't specify any of these clauses. And, just as with an ordinary insert, you need to specify a field list if the order of the fields is different, or, as in the example above, you're not inserting data for every field. Always specifying a field list is good practice for all queries anyway - if someone adds a field to the table at a later stage, you don't want to go back and have to fix all your previous INSERT and SELECT * statements.
If you can't get this to work properly, have a look at the format of your text file - every problem I've seen with LOAD DATA has been because of a corrupted text file. Every field in every row must be delimited correctly!
You may not always be inserting from a text file - perhaps your application needs to do many unrelated inserts continually. There are ways to make sure the mass of users selecting are not badly affected... The first is to use INSERT LOW PRIORITY. This waits until there are no more reads waiting to happen, waiting for the gap, and not pushing in as it were. Of course, if your database is a rush hour special, there may never be a gap, and the client performing the INSERT LOW PRIORITY may start to grow cobwebs! An alternative here is INSERT DELAYED. The client is immediately freed, and the insert put into a queue (with all the other INSERT DELAYED's still waiting for the queue to end). This means that there can be no meaningful information passed back to the client, (such as the auto_increment value), as the INSERT has not been processed when the client is freed. Also, be aware that a catastrophe such as an unexpected power failure here will result in the queued INSERT's being lost. For neither of these methods do you have any idea when the data will be inserted, if at all, so I suggest you use with caution.

Conclusion

It's not only getting the data in that needs to be quick - sometimes you need to get it out quickly too. (Say you've accidentally loaded yesterday's classified ads, for example). Don't do a: DELETE FROM classifieds;
Rather, use:
TRUNCATE TABLE classifieds;
The difference here is that DELETE drops records one by one, and that can be 1 million one by one's too slow! Note that this does not apply before version 4.0 of MySQL. At time of writing, most of you will still be using 3.x versions (if you do a DELETE FROM tablename on a non-empty table, and get 0 records back as a result, you're running an earlier version. To fix this problem, MySQL made DELETE remove records one by one so as to return the number of records deleted, but TRUNCATE still did the quick delete. Also, earlier versions than 3.23.33 used TRUNCATE tablename, not TRUNCATE TABLE tablename)
This has only been a brief introduction to optimizing queries and indexes. Once you've mastered these tips, you'll want to look at your more complex joins that just never seem to perform quickly. The MySQL site has lots of useful information, and there are a few good books out there as well. But don't forget EXPLAIN! Often the best way is to try and rewrite the query in as many different ways as possible, and see which one runs more efficiently. You can learn a lot by trying to figure out why one alternative ran faster than the other. Source: http://www.databasejournal.com