Thursday, September 17, 2009

Creating a Code Search Engine with PHP and MySQL

I'm just a few days away from launching a comprehensive support website for my book, "Beginning PHP and MySQL 5, Second Edition", and among other features, have built a search engine for sifting through the more than 500 code snippets found throughout the book. This was an interesting exercise because it involves a number of storing a fairly significant amount of text within a MySQL database, using MySQL's full-text search facility, and devising an effective way to extract and display the code in the browser.
In this article I'll offer a simplified version of this search engine, introducing you to some compelling PHP and MySQL features along the way. You might adopt what you learn towards building your own search engine, or towards other applications.

The Database Schema

Just a single table is required for the engine's operation. The table, code, serves as the code repository. Each example is stored along with a suitable title and the chapter number in which it appears. Because the search engine should retrieve examples based on keywords found in the example title or in the code itself, a FULLTEXT index has been added for these columns. Because the table contents will rarely change beyond the occasional bug fix, its backed by the read-optimized MyISAM storage engine. The table follows:
CREATE TABLE code (
 id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 title VARCHAR(50) NOT NULL,
 chapter TINYINT UNSIGNED NOT NULL,
 code TEXT NOT NULL,
 FULLTEXT (title,code)
) TYPE = MYISAM;

Loading the Table

The downloadable zip file containing all of the book's code should be easily navigable so readers can easily retrieve the desired example. To meet this requirement, the zip file contains a number of directories labeled according to chapter number (1, 2, 3, ... 37), and each script is aptly named with a lowercase title and series of underscores, for example retrieving_array_keys.php. Therefore a script capable of dealing with these two organizational matters is required in order to automate the process of loading the scripts into the database.
You might recognize this task as one well suited for recursion, and indeed it is. The following script does the job nicely:
<?php

mysql_connect("localhost","gilmore","secret");

mysql_select_db("beginningphpandmysqlcom");

// Running on Windows or Linux/Unix?
$delimiter = strstr(PHP_OS, "WIN") ? "\" : "/";

function parseCodeTree($path) {

  global $delimiter;

  if ($dir = opendir($path)) {
 
    while ($item = readdir($dir)) {

      // If $item is a directory, recurse
      if (is_dir($path.$delimiter.$item) && $item != "." && $item != "..") {
   
        //printf("Directory: %s <br />", $item);
        parseCodeTree($path.$delimiter.$item);

      // $item is a file, so insert it into database
      } elseif ($item != "." && $item != "..") {

        // Retrieve the chapter number
        $directory = substr(strrchr($path, "$delimiter"), 1);

        //printf("File: %s <br />", $item);

        // Convert the file name to a readable title
        $scriptTitle = str_replace(".php", "", $item);
        $scriptTitle = str_replace("_", " ", $scriptTitle);
  
        // Retrieve the file contents
        $scriptContents = file_get_contents($path.$delimiter.$item);

        // Insert the file information into database
        $query = "INSERT INTO code VALUES('NULL', '$scriptTitle', '$directory', '$scriptContents')";
        $result = mysql_query($query);

      }
    }
    closedir($dir);
  }
  return 1;
}

parseCodeTree('code');

?>
I've purposefully left two printf() statements in the script so you can view the script's logic. Some sample output follows:
Directory: 4
File: array_key.php
File: is_array.php
Directory: 5
File: multidimensional_array.php
File: retrieving_array_keys.php
File: retrieving_array_values.php
File: slicing_an_array.php

Building the Search Engine

With the code and corresponding metadata inserted into the database, all that's left to do is build the search engine. Believe it or not, this is perhaps the easiest part of the project, thanks to MySQL's fulltext search capabilities. Although I've used the symfony framework to abstract the database interaction, for the purposes of this article I've used POPM (Plain Old PHP and MySQL) to build the search engine. The search form is exceedingly simple, and looks like this:
<form method="POST" action="search.php">
Search the code repository:<br />
<input type="text" id="keyword" name="keyword" /><br />
<input type="submit" value="Search!" />
</form>
The search script (search.php) looks something like this. Provided you've used PHP to interact with MySQL before, there shouldn't be any surprises, except for perhaps the query itself. This query takes advantage of MySQL's fulltext feature to compare the keyword against those columns that have been identified as searchable using MySQL's fulltext conditions. These conditions can produce unexpected results without doing some advance reading, so be sure to peruse the appropriate section of the MySQL documentation before building your own queries.
<?php

  mysql_connect("localhost","gilmore","secret");
  mysql_select_db("beginningphpandmysqlcom");

  $keyword = mysql_real_escape_string($_POST['keyword']);

  // Perform the fulltext search
  $query = "SELECT id, title, chapter, code 
            FROM code WHERE MATCH(title, code) AGAINST ('$keyword')";

  $result = mysql_query($query);

  // If results were found, output them
  if (mysql_num_rows($result) > 0) {

    printf("Results: <br />");

    while ($row = mysql_fetch_array($result)) {

      printf("Chapter %s: <a href='displaycode.php?id=%s'>%s</a>", 
       $row['chapter'], $row['id'], ucfirst($row['title']));

    }

  } else {
    printf("No results found");
  }

?>