Fulltext Indexing Wikipedia with Sphinx

Posted by hank, Sat Sep 15 22:17:00 UTC 2007

Earlier this year, I decided it would be cool to mirror Wikipedia, so I set up a local copy on my system, and it’s been just sitting there ever since. Lately, though, I’ve been interested in the fulltext indexing offered by various indexing engines, and Sphinx has looked especially tasty, so I figured I’d sit down and try it today.

I pointed it at my 16GB of Wikipedia text in my MySQL database like so:

sphinx.conf


source src1
{
  type        = mysql
  strip_html      = 0
  index_html_attrs  =
  sql_host      = localhost
  sql_user      = wikipedia
  sql_pass      = wikipedia
  sql_db        = wikidb
  sql_query_pre   =
  sql_query     = \
    SELECT old_id, old_text \
    FROM text
  sql_query_post    =
  sql_query_info    = SELECT * FROM text WHERE old_id=$id
}

Next, I set up the indexing section.


index wikipedia
{
  source      = src1
  path      = /nexus/rofl/sphinx/wikipedia.sphinx
  docinfo     = extern
  morphology      = none
  stopwords     =
  min_word_len    = 1
  charset_type    = utf-8
  min_prefix_len    = 0
  min_infix_len   = 0
}
index wikipediastemmed : wikipedia
{ 
  path      = /var/data/wikipediastemmed
  morphology    = stem_en
}
indexer
{ 
  mem_limit     = 512M
}

I left all the other options at their defaults. Next, I ran the indexer and waited for about 2.5 hours. Bear in mind that 2.5 hours isn’t all that long to index this much data, especially given the results I’m about to show you.
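
To put that indexing time in perspective, here is a quick back-of-the-envelope calculation. It assumes only the two rounded figures from this post (16GB of text, roughly 2.5 hours) and treats 1GB as 1024MB, so take the result as a ballpark rather than a benchmark.

```ruby
# Rough indexing throughput from the rounded figures in this post.
# Assumptions: 16 GB of text, about 2.5 hours, 1 GB = 1024 MB.
data_mb = 16 * 1024          # total text size in MB
seconds = 2.5 * 3600         # indexing time in seconds

throughput = data_mb / seconds
puts format('%.1f MB/s', throughput)
```

That works out to a bit under 2 MB/s of raw text.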

Now it’s time to test this out!



hank@rofl:/usr/local/etc$ time search endothermic
[... earlier results omitted ...]
= Sterling D. | title = Cold Fire® is a Hot Fire Extinguisher | publisher = 
Company press release | date = Nov. 28, 2003 | url= http://www.greaterthings.com/News/ColdFire/pr031122.html | accessdate = August 21, 2006}}</ref>
==References==
<references/>
== External links ==
* [http://www.firefreeze.com Fire Freeze Worldwide Inc.]

[[Category:Firefighting]]
        old_flags=utf-8
20. document=112594001, weight=1
        old_id=112594001
        old_text=#REDIRECT[[Endothermic]]
        old_flags=utf-8

words:
1. 'endothermic': 173 documents, 293 hits

real    0m0.831s
user    0m0.004s
sys     0m0.080s

hank@rofl:/usr/local/etc$ time search "hello & world" >/dev/null

real    0m0.659s
user    0m0.032s
sys     0m0.052s

Look at that time! 0.8 seconds to search 16GB of text!

Sphinx is indeed the master of the fulltexting.

I’m very impressed. I’m sure I will find a use for this soon.

Update: It’s actually faster.

Following the comment from Sphinx’s author below, I ran a searchd instance, which gets rid of all the overhead of searching from the command line.

Here are some results I got using the Ruby API that’s included with Sphinx:


irb(main):010:0> t = Time.now; s.query('(Single & mother) & !father'); puts Time.now - t
0.016864
=> nil

It only took 0.017 seconds to find every document in Wikipedia’s database that contains both “single” and “mother” with no mention of “father”.
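
For anyone wondering what that query actually asks for, here is a toy sketch of the boolean logic using a tiny in-memory inverted index. This is not how Sphinx is implemented, and the sample documents, `build_index`, and `docs_with` are all made up for illustration. It only shows that `(Single & mother) & !father` means "documents containing both words, minus any that contain father".

```ruby
require 'set'

# Hypothetical sample documents standing in for Wikipedia rows (id => text).
DOCS = {
  1 => 'A single mother raising two children',
  2 => 'The single mother and the father shared custody',
  3 => 'Mother of all battles',
  4 => 'Single malt whisky'
}.freeze

# Build a word => Set of document ids map (a toy inverted index).
def build_index(docs)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  docs.each do |id, text|
    text.downcase.scan(/\w+/).each { |word| index[word] << id }
  end
  index
end

# Look up the ids of documents containing a word, case-insensitively.
def docs_with(index, word)
  index.fetch(word.downcase, Set.new)
end

index = build_index(DOCS)

# (Single & mother) & !father: intersect the first two sets, subtract the third.
matches = (docs_with(index, 'Single') & docs_with(index, 'mother')) -
          docs_with(index, 'father')

puts matches.to_a.sort.inspect
```

Only document 1 survives here: document 2 mentions the father, and documents 3 and 4 each lack one of the two required words.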

This is indeed impressive.

Comments

  • David Gerard
    September 16, 2007 @ 04:39 PM

    If you can beat this into a MediaWiki extension, please put details on http://mediawiki.org/ and the mediawiki-l mailing list. The default MediaWiki MySQL full-text search is literally worse than useless; Wikimedia sites use a Lucene variant; more options would be most welcomed.

  • Andrew Aksyonoff
    September 17, 2007 @ 04:26 PM

    The actual search should be faster than that - CLI search has a lot of preload overhead which is not there in production mode when using searchd (which preloads data only once at startup) - especially when it warms up.

  • Hank
    September 17, 2007 @ 04:57 PM

    Andrew:

    Thanks for responding so quickly. I updated the post with my latest test using searchd as per your suggestion. You were very right - I can’t believe the speed on this thing. Thanks for building it.

  • Paul Grinberg
    September 19, 2007 @ 08:50 PM

    Just to second an earlier post, a MediaWiki extension to enable full text searching with Sphinx would be excellent.

  • Paul Grinberg
    September 23, 2007 @ 01:36 PM

    Just wanted to point out that I started work on integrating the Sphinx Search Engine into MediaWiki. See more at http://www.mediawiki.org/wiki/Extension:SphinxSearch . I still have quite a ways to go, so keep your eyes on that page.

  • Hank
    September 23, 2007 @ 04:07 PM

    Paul:

    Excellent plan! Thanks for the credit. I am pretty busy so I haven’t had time to pursue this on my own, but I’m glad you’re up to the task. Good luck!
