Fulltext Indexing Wikipedia with Sphinx
Posted by hank, Sat Sep 15 22:17:00 UTC 2007
Earlier this year, I decided it would be cool to mirror Wikipedia, so I set up a local copy on my system, and it’s been just sitting there ever since. Lately, though, I’ve been interested in the fulltext indexing offered by various engines, and Sphinx has looked especially tasty, so I figured I’d sit down and try it today.
I pointed it at my 16GB of Wikipedia text in my MySQL database like so:
sphinx.conf
source src1
{
type = mysql
strip_html = 0
index_html_attrs =
sql_host = localhost
sql_user = wikipedia
sql_pass = wikipedia
sql_db = wikidb
sql_query_pre =
sql_query = \
SELECT old_id, old_text \
FROM text
sql_query_post =
sql_query_info = SELECT * FROM text WHERE old_id=$id
}
Next, I set up the indexing section.
index wikipedia
{
source = src1
path = /nexus/rofl/sphinx/wikipedia.sphinx
docinfo = extern
morphology = none
stopwords =
min_word_len = 1
charset_type = utf-8
min_prefix_len = 0
min_infix_len = 0
}
index wikipediastemmed : wikipedia
{
path = /var/data/wikipediastemmed
morphology = stem_en
}
indexer
{
mem_limit = 512M
}
I left all the other options as default. Next, I turned on the indexing and waited for about 2.5 hours. Now, bear in mind that 2.5 hours isn’t all that long to index this much data, especially given the results I’m about to show you.
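As a rough sanity check on that figure, the indexing throughput can be worked out from the two numbers above. This is only a back-of-the-envelope sketch in Ruby: it assumes "16GB" means 16 * 1024 MB and that the run took exactly 2.5 hours, and the helper name is made up for illustration.

```ruby
# Rough indexing throughput from the figures in this post.
# Assumptions: "16GB" is treated as 16 * 1024 MB, and the run is
# taken to be exactly 2.5 hours; both are approximations.
def indexing_throughput_mb_per_s(size_gb, hours)
  megabytes = size_gb * 1024.0
  seconds   = hours * 3600.0
  megabytes / seconds
end

rate = indexing_throughput_mb_per_s(16, 2.5)
puts format('%.2f MB/s', rate)
```

That works out to a bit under 2 MB/s of raw text, sustained for the whole run.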
Now it’s time to test this out!
hank@rofl:/usr/local/etc$ time search endothermic
[... earlier results omitted ...]
= Sterling D. | title = Cold Fire® is a Hot Fire Extinguisher | publisher =
Company press release | date = Nov. 28, 2003 | url= http://www.greaterthings.com/News/ColdFire/pr031122.html | accessdate = August 21, 2006}}</ref>
==References==
<references/>
== External links ==
* [http://www.firefreeze.com Fire Freeze Worldwide Inc.]
[[Category:Firefighting]]
old_flags=utf-8
20. document=112594001, weight=1
old_id=112594001
old_text=#REDIRECT[[Endothermic]]
old_flags=utf-8
words:
1. 'endothermic': 173 documents, 293 hits
real 0m0.831s
user 0m0.004s
sys 0m0.080s
hank@rofl:/usr/local/etc$ time search "hello & world" >/dev/null
real 0m0.659s
user 0m0.032s
sys 0m0.052s
Look at that time! 0.8 seconds to search 16GB of text!
Sphinx is indeed the master of the fulltexting.
I’m very impressed. I’m sure I will find a use for this soon.
Update: It’s actually faster.
Following the comment from Sphinx’s author below, I ran a searchd instance, which gets rid of all the overhead of searching from the command line.
Here are some results I got using the Ruby API that’s included with Sphinx:
irb(main):010:0> t = Time.now; s.query('(Single & mother) & !father'); puts Time.now - t
0.016864
=> nil
It only took 0.017 seconds to find all instances of single and mother without mention of father in Wikipedia’s database.
This is indeed impressive.
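A single Time.now difference can be noisy, so here is a minimal sketch of a slightly more careful measurement using Benchmark.realtime from Ruby’s standard library. The helper `time_query` is hypothetical and not part of Sphinx. It takes a block, so it makes no assumptions about the client API, and the stand-in workload is only there so the example runs without searchd.

```ruby
require 'benchmark'

# Hypothetical helper (not part of Sphinx): runs the given block
# `runs` times and returns the fastest and the average wall-clock
# time in seconds.
def time_query(runs = 5)
  raise ArgumentError, 'runs must be positive' unless runs.positive?

  timings = Array.new(runs) { Benchmark.realtime { yield } }
  { min: timings.min, avg: timings.sum / timings.size }
end

# Stand-in workload so the sketch runs without a searchd instance.
stats = time_query(3) { (1..10_000).reduce(:+) }
puts format('min %.6fs, avg %.6fs', stats[:min], stats[:avg])
```

With the client from the irb session above, the call would look something like `time_query { s.query('(Single & mother) & !father') }`. The first run will usually be the slowest, since searchd is still warming up.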

Blog Posts
September 16, 2007 @ 04:39 PM
If you can beat this into a MediaWiki extension, please put details on http://mediawiki.org/ and the mediawiki-l mailing list. The default MediaWiki MySQL full-text search is literally worse than useless; Wikimedia sites use a Lucene variant; more options would be most welcomed.
September 17, 2007 @ 04:26 PM
The actual search should be faster than that - CLI search has a lot of preload overhead which is not there in production mode when using searchd (which preloads data only once at startup) - especially when it warms up.
September 17, 2007 @ 04:57 PM
Andrew:
Thanks for responding so quickly. I updated the post with my latest test using searchd as per your suggestion. You were very right - I can’t believe the speed on this thing. Thanks for building it.
September 19, 2007 @ 08:50 PM
Just to second an earlier post, a MediaWiki extension to enable full text searching with Sphinx would be excellent.
September 23, 2007 @ 01:36 PM
Just wanted to point out that I started work on integrating the Sphinx Search Engine into MediaWiki. See more at http://www.mediawiki.org/wiki/Extension:SphinxSearch . I still have quite a ways to go, so keep your eyes on that page.
September 23, 2007 @ 04:07 PM
Paul:
Excellent plan! Thanks for the credit. I am pretty busy so I haven’t had time to pursue this on my own, but I’m glad you’re up to the task. Good luck!