Monday 21 September 2009

Got Sphinx search working

A few days ago I was wondering how to indent/tab a whole block of text/code in gedit. Turns out it's pretty simple:
  1. In gedit go to Edit > Preferences
  2. In the gedit preferences dialog, click on the Plugins tab, and then make sure the 'Indent Lines' plugin is ticked.
  3. Close the preferences dialog. You can now indent lines or whole blocks of text in gedit using Ctrl+T, or un-indent (tab to the left) text using Shift+Ctrl+T. These two options will also now appear on gedit's Edit menu.


Also recently, I've been working on getting Sphinx configured for my website.

Sphinx doesn't come with any stopwords as standard. Stopwords are common words that are excluded from search queries, like 'the'. By excluding common words you make your search engine faster, as it doesn't need to bother searching for the stopwords. MySQL has a list of English stopwords that I could have used; however, I found a list of Google English stopwords, and used that instead.
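To show what stopword filtering amounts to, here is a minimal Python sketch. The word list is a tiny made-up sample (not the Google or MySQL list), and `strip_stopwords` is just a name I've chosen for the example; Sphinx does this internally from the stopwords file you give it.

```python
# Hypothetical sample list for illustration only; a real list is much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in"}

def strip_stopwords(query):
    """Return the query's words, lowercased, with stopwords removed."""
    return [word for word in query.lower().split() if word not in STOPWORDS]

print(strip_stopwords("The history of the camera"))
```

Only 'history' and 'camera' are left to be looked up in the index.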

Reading this article, Google Stopwords Patent, though, it seems that Google no longer uses a straight stopwords list, but instead uses phrase-based matching to determine how much weight should be given to a stopword.

After getting my stopwords list sorted out, I wanted to ensure that the search would work for foreign characters/phrases. Sphinx has a list of characters you need to add to the charset_table parameter in the sphinx.conf file to support other languages: Unicode Character Set Tables. After looking at that, I decided that there wasn't much point adding all the different character sets, and that I would be better off adding only the character sets I needed. After all, I could always add more character sets in the future.

I found a page that listed the charset_table entries you need to support Japanese in Sphinx, and I think the CJK (Chinese Japanese Korean) character set listed in the Unicode Character Set Tables should work okay as well.

I didn't enable ngram indexing, as I intend to put spaces between different Japanese/Korean words. If I change my mind, I can always enable ngram indexing later.
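As a rough illustration of the difference, here is a Python sketch of splitting text into one-character tokens for CJK characters while keeping space-separated words whole. This is my own approximation of the idea, not Sphinx's code, and `is_cjk` only covers the Hiragana, Katakana and CJK Unified Ideographs blocks as an assumption for the example.

```python
def is_cjk(ch):
    # Assumption: Hiragana/Katakana (U+3040-U+30FF) and CJK Unified Ideographs only.
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"

def tokenize(text):
    """Split on whitespace, but emit every CJK character as its own token."""
    tokens, word = [], ""
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if word:
                tokens.append(word)  # flush the pending non-CJK word
                word = ""
            if is_cjk(ch):
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

print(tokenize("Sphinx 検索"))
```

With spaces between the words, as I intend to do, plain whitespace splitting is enough and the per-character step isn't needed.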

I got Sphinx working, and then wanted to have a 'taster' of the page/image description in my search results. Looking into this, the recommendation seemed to be to use SUBSTRING(description, 1, 100) or LEFT(description, 100) in the MySQL query when fetching the description (string positions in MySQL start at 1, not 0), so you only get the first 100 characters of the description. You then use PHP to remove the rest of the string after the last space, so you get a string that ends on a word.
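The trimming step is simple enough to sketch. I've written it in Python here rather than PHP, and `teaser` is just a name made up for the example; it cuts to the limit and then drops whatever follows the last space.

```python
def teaser(description, limit=100):
    """Cut description to at most `limit` characters, ending on a whole word."""
    if len(description) <= limit:
        return description
    cut = description[:limit]
    last_space = cut.rfind(" ")
    # With no space at all, fall back to the hard cut rather than returning nothing.
    return cut if last_space == -1 else cut[:last_space]

print(teaser("A photo of the old stone bridge over the river at sunset", 30))
```

One weakness of this approach is that the taster always starts at the beginning of the description, whether or not the search words appear there.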

Then I found out that Sphinx actually already has the ability to create excerpts, where it will also highlight the word(s) that were searched for.
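To give an idea of what the highlighting part looks like, here is a small Python sketch. It is only an imitation of the effect, not how Sphinx builds its excerpts; `highlight` and the `<b>` tags are choices I've made for the example.

```python
import re

def highlight(text, words, before="<b>", after="</b>"):
    """Wrap each whole-word, case-insensitive match of `words` in markers."""
    if not words:
        return text
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: before + match.group(0) + after, text)

print(highlight("Sphinx search is fast", ["search"]))
```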

Next I needed to make Sphinx work okay when there was both an exact phrase and another search term, like
word "exact phrase"
The answer to this was to set the matching mode to SPH_MATCH_EXTENDED2. There are also lots of different parameters/options your users can use with the extended query syntax.
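To make it clear what a query like that contains, here is a Python sketch that separates the quoted phrases from the bare words. It is not Sphinx's parser, just an illustration, and `split_query` is a made-up name.

```python
import re

def split_query(query):
    """Return (phrases, words): quoted phrases and the remaining bare words."""
    phrases = re.findall(r'"([^"]*)"', query)
    # Blank out the quoted parts so only the bare words are left to split.
    remainder = re.sub(r'"[^"]*"', " ", query)
    return phrases, remainder.split()

print(split_query('word "exact phrase"'))
```

For the example query this gives one phrase, 'exact phrase', and one bare word, 'word'.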

In the evening I did some more work on my website, trying to get the Google search JavaScript to load dynamically (it wouldn't work), and also did another Japanese lesson with Moccle.

The weather was mostly overcast today, though it was sunny for a bit in the afternoon. Around sunset, it looked quite overcast, but then actually the bottom of the clouds got lit up orange quite nicely, while it was raining a little bit.

Food
Breakfast: Bowl of chocolate oat crunch cereal and fake Cheerios; cup o' tea.
Lunch: Mature cheddar cheese sandwich; banana; a few grapes; Cranberry and Muesli bar; cup o' tea.
Dinner: Eggy bread with bovril; baked beans; potatoes; bacon. Pudding was an apple pie with custard and cream.
