Re: Best Practices for Distributing Lucene Indexing and Searching
Yonik Seeley wrote:
> 6. Index locally and synchronize changes periodically. This is an
> interesting idea and bears looking into. Lucene can combine multiple
> indexes into a single one, which can be written out somewhere else, and
> then distributed back to the search nodes to replace their existing
> index.

This is a promising idea for handling a high update volume because it avoids all of the search nodes having to do the analysis phase.

A clever way to do this is to take advantage of Lucene's index file structure. Indexes are directories of files. As the index changes through additions and deletions, most files in the index stay the same. So you can efficiently synchronize multiple copies of an index by only copying the files that change.

The way I did this for Technorati was to:

1. On the index master, periodically checkpoint the index. Every minute or so the IndexWriter is closed and a 'cp -lr index index.DATE' command is executed from Java, where DATE is the current date and time. This efficiently makes a copy of the index while it is in a consistent state by constructing a tree of hard links. If Lucene re-writes any files (e.g., the segments file), a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new checkpoints. When a new index.DATE is found, use 'cp -lr index index.DATE' to prepare a local copy, then use 'rsync -W --delete master:index.DATE index.DATE' to get the incremental index changes. Then atomically install the updated index with a symbolic link ('ln -fsn index.DATE index').

3. In Java on the slave, re-open 'index' when its version changes. This is best done in a separate thread that periodically checks the index version. When it changes, the new version is opened and a few typical queries are performed on it to pre-load Lucene's caches. Then, in a synchronized block, the Searcher variable used in production is updated.

4. In a crontab on the master, periodically remove the oldest checkpoint indexes.
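The checkpoint step in (1) can be sketched with stdlib Java alone. The class and method names below are hypothetical, and the closing of the Lucene IndexWriter beforehand is elided; this is only the hard-link copy itself:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of step 1 above: make a hard-link checkpoint of the index directory.
// (In production the IndexWriter would be closed first so the index is consistent.)
public class IndexCheckpoint {

    /** Runs 'cp -lr index index.DATE' and returns the checkpoint path. */
    public static Path checkpoint(Path indexDir) throws IOException, InterruptedException {
        String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
        Path copy = indexDir.resolveSibling(indexDir.getFileName() + "." + stamp);
        // Hard links mean existing segment files are shared, not re-copied; any
        // file Lucene later rewrites (e.g. 'segments') gets a new inode, so the
        // checkpoint keeps pointing at the old, consistent version.
        Process p = new ProcessBuilder("cp", "-lr", indexDir.toString(), copy.toString())
                .inheritIO().start();
        if (p.waitFor() != 0) throw new IOException("cp -lr failed");
        return copy;
    }
}
```

Because the copy is all hard links, the checkpoint is nearly free in both time and disk space regardless of index size.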
Technorati's Lucene index is updated this way every minute. A mergeFactor of 2 is used on the master in order to minimize the number of segments in production. The master has a hot spare.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Best Practices for Distributing Lucene Indexing and Searching
> 6. Index locally and synchronize changes periodically. This is an
> interesting idea and bears looking into. Lucene can combine multiple
> indexes into a single one, which can be written out somewhere else, and
> then distributed back to the search nodes to replace their existing
> index.

This is a promising idea for handling a high update volume because it avoids all of the search nodes having to do the analysis phase. Unfortunately, the way addIndexes() is implemented looks like it's going to present some new problems:

    public synchronized void addIndexes(Directory[] dirs) throws IOException {
        optimize();                                   // start with zero or 1 seg
        for (int i = 0; i < dirs.length; i++) {
            SegmentInfos sis = new SegmentInfos();    // read infos from dir
            sis.read(dirs[i]);
            for (int j = 0; j < sis.size(); j++) {
                segmentInfos.addElement(sis.info(j)); // add each info
            }
        }
        optimize();                                   // final cleanup
    }

We need to deal with some very large indexes (40G+), and an optimize rewrites the entire index, no matter how few documents were added. Since our strategy calls for deleting some docs on the primary index before calling addIndexes(), this means *both* calls to optimize() will end up rewriting the entire index! The ideal behavior would be that of addDocument() - segments are only merged occasionally.

That said, I'll throw out a replacement implementation that probably doesn't work, but hopefully will spur someone with more knowledge of Lucene internals to take a look at this:

    public synchronized void addIndexes(Directory[] dirs) throws IOException {
        // REMOVED: optimize();
        for (int i = 0; i < dirs.length; i++) {
            SegmentInfos sis = new SegmentInfos();    // read infos from dir
            sis.read(dirs[i]);
            for (int j = 0; j < sis.size(); j++) {
                segmentInfos.addElement(sis.info(j)); // add each info
            }
        }
        maybeMergeSegments();                         // replaces optimize()
    }

-Yonik
Re: Best Practices for Distributing Lucene Indexing and Searching
: We have a requirement for a new version of our software that it run in a
: clustered environment. Any node should be able to go down but the
: application must keep functioning.

My application is looking at similar problems. We aren't yet live, but the only practical solution we have implemented so far is the "apply all adds/deletes to all instances in parallel or sequence" model, which we don't really like very much. I don't consider it a viable option for our launch given the volume of updates we need to be able to handle in a timely manner. I'm also curious as to what ideas people on this list have about reliable index replication. I've included my thoughts on some of the possible solutions below...

: 2. Don't distribute indexing. Searching is distributed by storing the
: index on NFS. A single indexing node would process all requests.
: However, using Lucene on NFS is *not* recommended. See:

I don't really consider reading/writing to an NFS-mounted FSDirectory to be viable for the very reasons you listed; but I haven't really found any evidence of problems if you take the approach that a single "writer" node indexes to local disk, which is NFS-mounted by all of your other nodes for doing queries. Concurrent updates/queries may still not be safe (I'm not sure), but you could have the writer node "clone" the entire index into a new directory, apply the updates, and then signal the other nodes to stop using the old FSDirectory and start using the new one.

: 3. Distribute indexing and searching into separate indexes for each
: node. Combine results using ParallelMultiSearcher. If a box went down, a
: piece of the index would be unavailable. Also, there would be serious

I haven't really considered this option because it would be unacceptable for my application.

: 4. Distribute indexing and searching, but index everything at each node.
: Each node would have a complete copy of the index. Indexing would be
: slower. We could move to a 5 or 15 minute batch approach.
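Signaling the other nodes to "stop using the old FSDirectory and start using the new one" can be done without any messaging at all, by atomically repointing a symlink (the Java equivalent of the 'ln -fsn index.DATE index' trick described elsewhere in this thread). A minimal stdlib sketch; the class name is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: atomically repoint the live 'index' symlink at a new index directory.
public class AtomicInstall {

    public static void install(Path link, Path newIndexDir) throws IOException {
        Path tmp = link.resolveSibling(link.getFileName() + ".tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, newIndexDir);
        // rename() is atomic on POSIX, so readers resolving 'index' always see
        // either the old directory or the new one, never a broken link.
        Files.move(tmp, link, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Readers that already have files open under the old directory keep working (the directory itself is untouched); only new opens resolve to the new index.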
As I said, this is our current "last resort", but there are some serious issues I worry about with this under high concurrent update/query load. They are the same issues you would face if you only had one box -- but frankly, one of the main goals I see for a distributed solution is to reduce the total amount of processing that needs to be done -- not multiply it by the number of boxes, so I'm trying to find something better.

: 5. Index centrally and push updated indexes to search nodes on a
: periodic basis. This would be easy and might avoid the problems with
: using NFS.
:
: 6. Index locally and synchronize changes periodically. This is an
: interesting idea and bears looking into. Lucene can combine multiple
: indexes into a single one, which can be written out somewhere else, and
: then distributed back to the search nodes to replace their existing
: index.

Agreed. These are two of the most promising ideas we're currently considering, but we haven't actually tried implementing them yet.

The other thing we have considered is having a pool of "updater" nodes which process batches of additions into a small index, which is then copied out to all of the other nodes. These nodes can then either MultiSearch between their existing index and the new one, or they can actually merge the new one in (based on their current load). The concern I have with approaches like this is that they still require the individual nodes to all duplicate the work of merging, and ultimately, optimizing. That's something I don't want them to have to do, especially under potentially heavy query load.

What I'd really like is a single "primary indexer" box that builds up lots of small RAMDirectory indexes as updates come in, and periodically writes them to files to be copied over to "warm standby indexer" boxes. All of the indexer boxes eventually merge these small indexes into the master, which is versioned on a regular basis.
The primary indexer would also be the main guy to decide how often to do an optimize(). If the primary indexer goes down, one of the warm standby indexers can take over with minimal loss of updates. Then the various "query boxes" can periodically copy the most recent rev of the index over whenever they want, close their existing IndexReader, and open a new one pointed at the new rev.

Problems that come up:

1) For indexes big enough to warrant these kinds of reliability concerns, you need a lot of bandwidth to copy that much data around.

2) Our application has an expectation that issuing the same query to two different nodes in the cluster at the same time should give you the same results. For that to be true, an approach like the one I described would require some coordination mechanism to know what the highest rev# of the index has been copied to all of the boxes, and then signal them all to start using that rev at the same time.

-Hoss

--
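The reader swap on the query boxes (close the old IndexReader, open one pointed at the new rev) boils down to atomically publishing a new searcher to query threads. A stdlib sketch with a stand-in Searcher interface in place of Lucene's IndexSearcher; all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: hot-swap the searcher a query box uses, without blocking queries.
public class SearcherHolder {

    /** Stand-in for Lucene's IndexSearcher/IndexReader pair. */
    public interface Searcher {
        String search(String query);
        void close();
    }

    private final AtomicReference<Searcher> current = new AtomicReference<>();

    /** Called by the polling thread when a new index rev has been installed. */
    public void swap(Searcher fresh) {
        fresh.search("warmup");                  // pre-load caches with a typical query
        Searcher old = current.getAndSet(fresh); // atomic publish to query threads
        if (old != null) {
            // NB: real code must also wait for in-flight queries on 'old' to finish
            old.close();
        }
    }

    public Searcher get() {
        return current.get();
    }
}
```

Query threads call get() per request; they see either the old or the new searcher, never a half-swapped state.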
Best Practices for Distributing Lucene Indexing and Searching
Lucene Users,

We have a requirement for a new version of our software that it run in a clustered environment. Any node should be able to go down but the application must keep functioning. Currently, we use Lucene on a single node, but this won't meet our failover requirements. If we can't find a solution, we'll have to stop using Lucene and switch to something else, like full-text indexing inside the database.

So I'm looking for best practices on distributing Lucene indexing and searching. I'd like to hear from those of you using Lucene in a multi-process environment what is working for you. I've done some research, and based on what I've seen so far, here's a bit of brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the last resort.]

2. Don't distribute indexing. Searching is distributed by storing the index on NFS. A single indexing node would process all requests. However, using Lucene on NFS is *not* recommended. See:
http://lucenebook.com/search?query=nfs
...it can result in the "stale NFS file handle" problem:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html
So we'd have to investigate this option. Indexing could use a JMS queue so if the box goes down, when it comes back up, indexing could resume where it left off.

3. Distribute indexing and searching into separate indexes for each node. Combine results using ParallelMultiSearcher. If a box went down, a piece of the index would be unavailable. Also, there would be serious issues making sure assets are indexed in the right place to prevent duplicates, stale results, or deleted assets from showing up in the index. Another possibility would be a hashing scheme for indexing: assets could be put into buckets based on their IDs to prevent duplication. Keeping results consistent as you change the number of buckets while nodes come up and down would be a challenge, though.

4. Distribute indexing and searching, but index everything at each node. Each node would have a complete copy of the index. Indexing would be slower. We could move to a 5 or 15 minute batch approach.

5. Index centrally and push updated indexes to search nodes on a periodic basis. This would be easy and might avoid the problems with using NFS.

6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else, and then distributed back to the search nodes to replace their existing index.

7. Create a JDBCDirectory implementation and let the database handle the clustering. A JDBCDirectory exists (http://ppinew.mnis.com/jdbcdirectory/), but has only been tested with MySQL. It would probably require modification (the code is under the LGPL). At one time, an OracleDirectory implementation existed, but that was in 2000 and so it is surely badly outdated. In principle, though, the concept is possible. However, these database-based directories are slower at indexing and searching than the traditional style, probably mostly due to BLOB handling.

8. Can the Berkeley DB-based DBDirectory help us? I am not sure what advantages it would bring over the traditional FSDirectory, but maybe someone else has some ideas.

Please let me know if you've got any other ideas or a best practice to follow.

Thanks,
Luke Francl
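The ID-bucketing idea in option 3 is simple to sketch; Math.floorMod keeps the bucket index non-negative even when hashCode() is negative. Class and method names here are hypothetical:

```java
// Sketch of option 3's hashing scheme: route each asset to a fixed bucket so
// the same document is always (re)indexed on the same node, avoiding duplicates.
public class AssetBuckets {

    public static int bucketFor(String assetId, int numBuckets) {
        // floorMod, unlike %, never returns a negative index for negative hashes.
        return Math.floorMod(assetId.hashCode(), numBuckets);
    }
}
```

A consistent-hashing variant would reduce how many assets move when the bucket count changes, which is exactly the rebalancing challenge noted above when nodes come and go.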