Zend_Search_Lucene Quick Start

Posted in PHP, Programming and Zend Framework on Tuesday, the 3rd of June, 2008.

I recently had a spontaneous urge to add a search form to my weblog - this one you're reading right now - and it seemed like a good opportunity to have a look at Zend_Search_Lucene.

I'm really impressed with the simplicity and power of the module. Sadly the documentation, whilst extensive, isn't particularly clear - so here's a quick overview of getting Zend_Search_Lucene up and running.

For the uninitiated, Apache Lucene is an open-source indexing and search tool written in Java, and Zend_Search_Lucene is the pure PHP 5 implementation of Lucene [1] that ships with Zend Framework.

Indexing

Before we can do any searching, we need to initialise an index. This is done through the Zend_Search_Lucene::create() method. Indexes are stored on disk, so we will need to create a directory which is readable and writeable by whichever user the script will run as. I've imaginatively called that /path/to/index for the purposes of this post.

Here's an example script which initialises the index, and adds three documents to it, ready for searching:

<?php

// Assumes Zend Framework is on your include_path.
require_once 'Zend/Search/Lucene.php';

// Create a new index in the given directory.
$index = Zend_Search_Lucene::create('/path/to/index/');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(
    Zend_Search_Lucene_Field::unIndexed('title', 'Item number 1'));
$doc->addField(
    Zend_Search_Lucene_Field::text('contents', 'cow elephant dog hamster'));
$index->addDocument($doc);

$doc = new Zend_Search_Lucene_Document();
$doc->addField(
    Zend_Search_Lucene_Field::unIndexed('title', 'Item number 2'));
$doc->addField(
    Zend_Search_Lucene_Field::text('contents', 'cow aardvark dog hamster'));
$index->addDocument($doc);

$doc = new Zend_Search_Lucene_Document();
$doc->addField(
    Zend_Search_Lucene_Field::unIndexed('title', 'Item number 3'));
$doc->addField(
    Zend_Search_Lucene_Field::text('contents', 'cow elephant dog esquilax elephant'));
$index->addDocument($doc);

// Write the pending changes to disk.
$index->commit();

It's important not to overlook that final call to commit() - nothing will work without that. The 'title' field is unIndexed: it is stored alongside the document so that we can display it in our list of results, but we can't search on it. The 'contents' field is text, which means it is tokenised and indexed for searching.

Where you get your document data from is completely up to you. It might be an RSS feed, a website crawler or - as in my case - a tiny PHP cron script which queries the weblog table in my database.
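As a rough illustration, a cron-driven indexer along those lines might look something like the following. Treat it as a sketch under stated assumptions rather than a finished script: the PDO connection details and the 'posts' table with its 'title' and 'body' columns are invented for the example, and it assumes Zend Framework is on your include_path.

```php
<?php

// Assumes Zend Framework is on your include_path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical connection details; substitute your own.
$db = new PDO('mysql:host=localhost;dbname=weblog', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Start a fresh index on every run, so the index always mirrors the table.
$index = Zend_Search_Lucene::create('/path/to/index/');

// 'posts', 'title' and 'body' are hypothetical table and column names.
foreach ($db->query('SELECT title, body FROM posts') as $row) {
    $doc = new Zend_Search_Lucene_Document();
    // Stored for display only.
    $doc->addField(
        Zend_Search_Lucene_Field::unIndexed('title', $row['title']));
    // Tokenised and indexed for searching.
    $doc->addField(
        Zend_Search_Lucene_Field::text('contents', $row['body']));
    $index->addDocument($doc);
}

$index->commit();
```

Rebuilding the whole index each time is the simplest approach and is fine for a small weblog; a larger site would probably want to index only what has changed.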

Either way, that's our index created. Since an index is no use unless you query it, let's have a look at how we can do that.

Searching

Here's about the simplest search you can possibly do with Zend_Search_Lucene:

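<?php

// Assumes Zend Framework is on your include_path.
require_once 'Zend/Search/Lucene.php';

$index   = Zend_Search_Lucene::open('/path/to/index/');
$results = $index->find('contents:elephant');

foreach ($results as $result) {
    echo $result->score . ' :: ' . $result->title . "\n";
}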

The 'contents:elephant' query specifies that we wish to search for documents whose 'contents' field contains the term 'elephant'. That runs in a flash, and produces the following output:

0.61871843353823 :: Item number 3
0.5 :: Item number 1

As you can see, the two Zend_Search_Lucene_Document objects which contain the word 'elephant' are returned, ordered by descending 'score'. Item 3 contains the word twice, which is why it receives the highest score.
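To see roughly where those two numbers come from, here is a back-of-the-envelope sketch in plain PHP. It assumes the default scoring follows the classic Lucene formula, where a single-term score is the square root of the term frequency multiplied by a length norm of 1 / sqrt(number of terms in the field); with only three documents, two of which match, the inverse document frequency factor should work out to 1 and drop out. The sketchScore() function is a hypothetical helper of my own, not part of Zend Framework, and the 0.4375 is an assumption: 1 / sqrt(5) is about 0.447, and 0.4375 appears to be the reduced-precision value the index stores for it.

```php
<?php

// Hypothetical helper: classic Lucene-style score for a single-term query,
// ignoring factors that equal 1 for this tiny index.
function sketchScore($termFrequency, $storedLengthNorm)
{
    return sqrt($termFrequency) * $storedLengthNorm;
}

// Item 1: 'elephant' appears once in a four-word field; 1 / sqrt(4) = 0.5.
$item1 = sketchScore(1, 0.5);

// Item 3: 'elephant' appears twice in a five-word field; 1 / sqrt(5) is
// about 0.447, assumed to be stored as 0.4375.
$item3 = sketchScore(2, 0.4375);

printf("%.4f :: Item number 3\n", $item3); // prints "0.6187 :: Item number 3"
printf("%.4f :: Item number 1\n", $item1); // prints "0.5000 :: Item number 1"
```

So the repeated term helps Item 3, but not by a factor of two: the square root dampens the frequency, and the longer field is penalised slightly by its length norm.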

Of course, there are far more features than I've even hinted at here, so I'll more than likely return to Zend_Search_Lucene in a further post looking at some of the more advanced stuff, but for now, that's your lot.

Footnotes

[1] Incidentally, the index files created by Zend_Search_Lucene are entirely compatible with those created by Apache Lucene, allowing the two implementations to interoperate happily, should the need arise.

Comments

Posted by Ciaran McNulty on Sunday, the 8th of June, 2008.

Out of interest, why index on a schedule rather than on an update?

Posted by Simon on Saturday, the 28th of June, 2008.

Absolutely no reason other than simplicity! I'm not using any kind of CMS as it stands, so there's not really anywhere to hook the indexer in.

If I were integrating Zend_Search_Lucene with a CMS I'd want to look at - as you say - triggering the indexer on an update event, and having it run asynchronously. The Zend Platform (of which, more later!) "job queue" looks quite neat for that kind of thing.

Posted by Clive on Friday, the 11th of July, 2008.

Short, sweet, simple and super - thank you for this. :-)
