Sphinx Full Text Search, What I have learned
For those of you that do not know what Sphinx is, check out their site here.
I started writing a post on Sphinx a few weeks ago, only to realise that the way I was approaching Sphinx was wrong and that I had to rethink my strategy. Anybody that follows my personal blog will see that I had to rewrite my Sphinx implementation several times because of my approach to the problem.
Let me just say that I was initially attracted to Sphinx a few years ago, when my site Skylines Australia kept hitting table-locks due to MyISAM. We needed MyISAM for Full Text searches, or at least we did, until we discovered Sphinx. Once our indexes were built, we were searching very quickly across 4.5 million posts with far less overhead than MySQL. It also meant we could change our table engine to InnoDB so we could get around the table-level-locks.
The issue is, that we have a lot of data here at work and we have lots of search inputs. Think ‘keyword’ and ‘category’ in ‘location’; We initially approached it like so;
- Search the location index, get a location id
- Search the Category index, get a category id
- Search the Main index using the location id and category id as filters.
Now, thats ok, but what if you have hierarchical categories and locations? Ie, Searching for ‘Mechanics’ needs to include ‘Motor Mechanics,’ ‘Bike Mechanics,’ ‘Boat Mechanics’ etc… We had one search that needed to search across 1800 categories! The other issue we ran into is we needed to have proximity searches. If there were less than n results at the Suburb Level, we would expand to the Area, then the Region, all the way to the State. This meant that we would need to run the searches 4 times (or 12 in total) so we could keep our results separate. Couple this with 1.9 million searchable rows, 25,000 locations and 2000 categories, searches were taking a __LONG__ time to complete with Sphinx.
So.. Back to the drawing board.
My boss mentioned that Sphinx is built to search documents and that maybe we need to make a ‘document’ structure for each of our records. I was a little dubious at first.. Here is the idea;
As Sphinx searches for matches and does not actually return the data it has found (only the data’s “document id” so you can query MySQL for the exact match) it means that our data in Sphinx does not need to be readable to humans, it can be a mashup of multiple data sources.
- We get all the items we want searchable and all their keywords, put them into a single column named ‘document’ and create a new table in MySQL.
- We get all the categories that this item must appear in, so not exact categories (like ‘Motor Mechanics’) but all the parents also, so we have fewer search terms, we get the category id’s and prefix them with ‘cat_’.
- We create a new column called ‘location’ and in that column, we get the id’s for the location the item is in, we prefix each ‘part’ of the location with its ‘type’, ie ’suburb_’, ‘area_’, ‘region_’, ’state_’, etc.
Once we are done, we have a row that looks like this;
Id: Actual id of the item row (relating to its original table)
Document: ‘Jims Mechanics quality mechanics we do good work cat_199 cat_432 cat_909 cat_2 cat_93′
Location: ‘Manly Sydney NSW Australia suburb_1022 area_300 region_23 state_3′
Once we build this table and index it, we can query the index using the PECL Sphinx library (much better than the supplied Sphinx PHP API) and Sphinx’s ‘EXTENDED2′ query type like so;
‘@document $keywords @document (cat_1|cat_2|cat_3) @location suburb_1′
AND is implied with Sphinx, so the above query is saying match the keywords, any of the 3 categories AND location. More about the cool EXTENDED2 query syntax can be found in the Sphinx Docs.
Basically, because we restructured our data to match Sphinx and didn’t try to make Sphinx work the way it wasn’t designed to, we saved lots and lots of queries, made it perform hundreds of times better (think thousandth’s of seconds to perform a search) and have alleviated MySQL of a lot of potentially damaging work. If you do anything with Full Text or require indexes for anything at all, I’d highly recommend Sphinx. It may seem a little odd at first and hard to get to do something, but persevere and try to rethink your situation, it will be worth it in the end.

