Lucene tutorial

Abstract: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially a cross-platform one. Apache Lucene is an open source project available for free download. Release 2.0.0 became available on 26 May 2006.

Lucene installation

- Download the latest binary Lucene release from the download area of the Jakarta web site: http://jakarta.apache.org. The latest version is 2.0.0. Download either the .zip or the .tar.gz file, whichever format is most convenient for your environment.
- Extract the binary file to the directory of your choice on your file system. The archive contains a top-level directory named lucene-2.0.0, so it’s safe to extract to c:\ on Windows or your home directory on UNIX. On Windows, if you have WinZip handy, use it to open the .zip file and extract its contents to c:\. If you’re on UNIX or using cygwin on Windows, unzip and untar the .tar.gz file in your home directory (tar zxvf lucene-2.0.0.tar.gz).
- Under the created lucene-2.0.0 directory, you’ll find lucene-core-2.0.0.jar. This is the only file required to introduce Lucene into your applications.
- Include Lucene’s JAR file in your application’s distribution appropriately. For example, a web application using Lucene would include lucene-core-2.0.0.jar in the WEB-INF/lib directory. For command-line applications, be sure the JAR is on the classpath when launching the JVM.
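
A quick way to confirm the last step is to ask the JVM whether it can see a Lucene class. The following is a minimal sketch, not part of the Lucene distribution: ClasspathCheck and isOnClasspath are hypothetical names chosen for illustration, and org.apache.lucene.index.IndexWriter is used as the probe only because the examples in this tutorial rely on it.

```java
public class ClasspathCheck {
    // Returns true if the named class can be located on the current classpath.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe class: the IndexWriter used by the examples in this tutorial.
        String probe = "org.apache.lucene.index.IndexWriter";
        if (isOnClasspath(probe)) {
            System.out.println("Lucene found on the classpath.");
        } else {
            System.out.println("Lucene not found; check the classpath or WEB-INF/lib.");
        }
    }
}
```

Run it with the same classpath settings as your application; if it reports that Lucene is missing, the JAR is not where the JVM expects it.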

How to integrate your application with Lucene

The figure below shows how a typical application integrates with Lucene.

An example about Lucene

Create an index

The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

    import org.apache.lucene.index.IndexWriter;

    public class CreateIndex {
        // usage: CreateIndex index-directory
        public static void main(String[] args) throws Exception {
            String indexPath = args[0];
            // An index is created by opening an IndexWriter with the
            // create argument set to true.
            IndexWriter writer = new IndexWriter(indexPath, null, true);
            writer.close();
        }
    }

Index text documents

IndexFiles.java shows how to add documents (the files named on the command line) to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or a Reader, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata, such as name, author, or modification date, with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling." In the code below, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexFiles {
        // usage: IndexFiles index-path file . . .
        public static void main(String[] args) throws Exception {
            String indexPath = args[0];
            // Open the existing index (create argument set to false).
            IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            for (int i = 1; i < args.length; i++) {
                System.out.println("Indexing file " + args[i]);
                InputStream is = new FileInputStream(args[i]);
                // We create a Document with two Fields, one which contains
                // the file path, and one the file's contents.
                Document doc = new Document();
                // path is stored for retrieval but not indexed.
                doc.add(new Field("path", args[i], Field.Store.YES, Field.Index.NO));
                // body is tokenized and indexed from the Reader but not stored.
                doc.add(new Field("body", new InputStreamReader(is)));
                writer.addDocument(doc);
                is.close();
            }
            writer.close();
        }
    }

Search

Search.java provides an example of how to search the index. While the org.apache.lucene.search package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object: a collection of Document objects, one for each document matched by the query, each with an associated relevance score, sorted by score.

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class Search {
        public static void main(String[] args) throws Exception {
            String indexPath = args[0], queryString = args[1];
            Searcher searcher = new IndexSearcher(indexPath);
            // Parse the query against the "body" field by default.
            Query query = new QueryParser("body", new SimpleAnalyzer()).parse(queryString);
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
            }
            searcher.close();
        }
    }
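
The indexing section mentions fielded queries such as "subject contains Java AND author contains Gosling." The sketch below is a conceptual model of that idea in plain Java; it is not how Lucene stores or searches documents. FieldedSearchSketch, its methods, and the sample messages are all hypothetical, chosen only to show a document as a set of named fields and a query that must match in two of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FieldedSearchSketch {
    // Builds a toy "document": a map from field name to field text.
    public static Map<String, String> document(String subject, String author, String body) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("subject", subject);
        doc.put("author", author);
        doc.put("body", body);
        return doc;
    }

    // True if the named field contains the term as a whole word, ignoring case.
    public static boolean contains(Map<String, String> doc, String field, String term) {
        String text = doc.get(field);
        if (text == null) {
            return false;
        }
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        return Arrays.asList(words).contains(term.toLowerCase(Locale.ROOT));
    }

    // Returns the subjects of documents where the subject field contains
    // subjectTerm AND the author field contains authorTerm.
    public static List<String> search(List<Map<String, String>> docs,
                                      String subjectTerm, String authorTerm) {
        List<String> matches = new ArrayList<String>();
        for (Map<String, String> doc : docs) {
            if (contains(doc, "subject", subjectTerm) && contains(doc, "author", authorTerm)) {
                matches.add(doc.get("subject"));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample messages for illustration only.
        List<Map<String, String>> mailbox = new ArrayList<Map<String, String>>();
        mailbox.add(document("Java language update", "James Gosling", "Notes on the release."));
        mailbox.add(document("Java build question", "Someone Else", "How do I set the classpath?"));
        mailbox.add(document("Lunch on Friday", "James Gosling", "Java is also a coffee."));
        System.out.println(search(mailbox, "Java", "Gosling")); // prints [Java language update]
    }
}
```

Only the first message matches, because the term must appear in the right field: the third message mentions Java in its body, not its subject. With Lucene's query parser, the comparable query string would be subject:java AND author:gosling, assuming the index was built with fields of those names.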

Conclusion

Lucene is the most flexible and convenient open source search toolkit I've ever used. Doug Cutting, Lucene's original author, describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result. Lucene's design seems so simple, you might suspect it is just the obvious way to design a search toolkit. We should all be so lucky as to craft such obvious designs for our own software.