Best Practices for Distributing Lucene Indexing and Searching
Lucene Users,

We have a requirement for a new version of our software that it run in a clustered environment. Any node should be able to go down, but the application must keep functioning. Currently, we use Lucene on a single node, but this won't meet our failover requirements. If we can't find a solution, we'll have to stop using Lucene and switch to something else, like full-text indexing inside the database.

So I'm looking for best practices on distributing Lucene indexing and searching. I'd like to hear from those of you using Lucene in a multi-process environment about what is working for you. I've done some research, and based on what I've seen so far, here's a bit of brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the last resort.]

2. Don't distribute indexing. Searching is distributed by storing the index on NFS. A single indexing node would process all requests. However, using Lucene on NFS is *not* recommended. See: http://lucenebook.com/search?query=nfs ...it can result in the stale NFS file handle problem: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html So we'd have to investigate this option. Indexing could use a JMS queue so that if the box goes down, indexing could resume where it left off when it comes back up.

3. Distribute indexing and searching into separate indexes for each node. Combine results using ParallelMultiSearcher (see the sketch after this list). If a box went down, a piece of the index would be unavailable. Also, there would be serious issues making sure assets are indexed in the right place to prevent duplicates, stale results, or deleted assets from showing up in the index. Another possibility would be a hashing scheme for indexing: assets could be put into buckets based on their IDs to prevent duplication. Keeping results consistent as you change the number of buckets as nodes come up and down would be a challenge, though.

4. Distribute indexing and searching, but index everything at each node. Each node would have a complete copy of the index. Indexing would be slower. We could move to a 5- or 15-minute batch approach.

5. Index centrally and push updated indexes to search nodes on a periodic basis. This would be easy and might avoid the problems with using NFS.

6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else and then distributed back to the search nodes to replace their existing index.

7. Create a JDBCDirectory implementation and let the database handle the clustering. A JDBCDirectory exists (http://ppinew.mnis.com/jdbcdirectory/), but it has only been tested with MySQL. It would probably require modification (the code is under the LGPL). At one time, an OracleDirectory implementation existed, but that was in 2000, so it is surely badly outdated. In principle, though, the concept is possible. However, these database-based directories are slower at indexing and searching than the traditional style, probably mostly due to BLOB handling.

8. Can the Berkeley DB-based DBDirectory help us? I am not sure what advantages it would bring over the traditional FSDirectory, but maybe someone else has some ideas.

Please let me know if you've got any other ideas or a best practice to follow.

Thanks,
Luke Francl
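For option 3, a minimal sketch of combining per-node indexes, assuming Lucene 1.4's ParallelMultiSearcher (the index paths and field name are hypothetical):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class FederatedSearchSketch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per node's index; the paths are hypothetical.
        Searchable[] nodes = {
            new IndexSearcher("/indexes/node1"),
            new IndexSearcher("/indexes/node2"),
        };
        // ParallelMultiSearcher queries each index in its own thread
        // and merges the results into a single Hits object.
        Searcher searcher = new ParallelMultiSearcher(nodes);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        System.out.println(hits.length() + " total hits across nodes");
        searcher.close();
    }
}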
Re: RangeQuery With Date
Your dates need to be stored in lexicographic order for the RangeQuery to work. Index them using the date format YYYYMMDD. Also, I'm not sure if the QueryParser can handle range queries with only one endpoint. You may need to create this query programmatically.

Regards,
Luke Francl
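A minimal sketch of building the range query programmatically (the "date" field name and values are made up); in the Lucene 1.x API, RangeQuery also accepts null for one term to express an open-ended range:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class DateRangeSketch {
    public static void main(String[] args) {
        // Terms compare lexicographically, so YYYYMMDD keeps dates in order.
        Term begin = new Term("date", "20040101");
        Term end   = new Term("date", "20041231");
        RangeQuery between = new RangeQuery(begin, end, true); // true = inclusive

        // Open-ended range: everything on or after 2004-01-01.
        RangeQuery after = new RangeQuery(begin, null, true);

        System.out.println(between + " / " + after);
    }
}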
Re: Why IndexReader.lastModified(index) is deprecated?
On Wed, 2005-01-19 at 23:24, Otis Gospodnetic wrote:
> To answer the original question, yes, I think it would be handy to have this method back. Perhaps we should revive it/them, ha?

LIMO and Luke use this method (even though it is deprecated) to show the user when the index was last updated. I think it would be nice to have it back, but it should be clearly noted that it is for informational purposes _only_. If you want to see if the index has changed, use the version number.

Luke Francl
LIMO co-developer
http://limo.sourceforge.net
Re: Multi-threading problem: couldn't delete segments
I didn't get any response to this post, so I wanted to follow up (you can read the full description of my problem in the archives: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=11986).

Here's an additional piece of information: I wrote a small program to confirm that on Windows, you can't rename a file while another thread has it open. If I am performing a search, is it possible that the IndexReader is holding open the segments file when there is an attempt by my indexing code to overwrite it with File.renameTo()?

Thanks,
Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote:
> We are having a problem with Lucene in a high concurrency create/delete/search situation. I thought I fixed all these problems, but I guess not.
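The test program itself isn't included in the post; a sketch of the kind of check described (file names are made up, pure JDK) might look like this:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class WindowsFileLockTest {
    public static void main(String[] args) throws IOException {
        File f = new File("segments.test");
        f.createNewFile();
        FileInputStream in = new FileInputStream(f); // simulate a reader holding the file open

        // Both of these typically fail on Windows while the stream is open:
        System.out.println("rename: " + f.renameTo(new File("segments.renamed")));
        System.out.println("delete: " + f.delete());

        in.close();
        System.out.println("delete after close: " + f.delete()); // succeeds once released
    }
}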
RE: Multi-threading problem: couldn't delete segments
On Thu, 2005-01-13 at 12:33, David Townsend wrote:
> Just read your old post. I'm not quite sure whether I've read this correctly. Is the search worker thread also doing deletes from the index?
>
> > a test script is going that is hitting the search part of our application (I think the script also updates and deletes Documents, but I am not sure.)
>
> Deleting also locks the index, so maybe the IndexWriter is waiting for the search thread to release the lock.

I checked with my co-worker, and his script is doing a search, modifying assets (which deletes and re-inserts them), and then deleting them. This is going on while new Documents are being added to the index from another thread. (Due to some weirdness in our application, it is also trying to delete Documents that don't exist before inserting them -- that should be harmless, though.)

I control access to the index with a lock object during all write accesses to the index, including deletes. You can see the code here: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=2068605attachId=1

Luke
Re: How do you handle dynamic html pages?
On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
> How is anyone managing reindexing of pages that change? Just periodically reindex everything, or do you try to determine the frequency of changes to each page and/or site?

If you are using a CMS, your best bet is to integrate Lucene with the CMS's content update mechanism. That way, your index will always be up to date.

Otherwise, I would say reindexing everything is easiest, provided it doesn't take too long. If it's ~15 minutes or less, you could schedule a process to do it at a low-activity period (2 AM or whenever) every day, and that would probably handle your needs.

Regards,
Luke Francl
Re: Use search engine technology for object persistence
On Fri, 2005-01-07 at 08:05, Erik Hatcher wrote:
> Interesting article: http://www.javaworld.com/javaworld/jw-01-2005/jw-0103-search_p.html

Sort of off-topic, but does this mean JavaWorld is publishing again? I had read Bill Venners's post from back in January '04 that they shut down.

Luke
Re: Check to see if index is optimized
On Fri, 2005-01-07 at 13:24, Crump, Michael wrote:
> Is there a simple way to check and see if an index is already optimized? What happens if optimize is called on an already optimized index - does the call basically do a no-op? Or is it still an expensive call?

If an index has no deletions, it does not need to be optimized. You can find out if it has deletions with IndexReader.hasDeletions. I am not sure what the cost of optimization is if the index doesn't need it. Perhaps someone else on this list knows.

Regards,
Luke Francl
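A minimal sketch of the hasDeletions check described above (the index path is made up); note that this heuristic says nothing about whether multiple segments could still be merged:

import org.apache.lucene.index.IndexReader;

public class OptimizeCheckSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        try {
            // If true, an optimize() pass would reclaim the deleted docs.
            System.out.println("has deletions: " + reader.hasDeletions());
        } finally {
            reader.close();
        }
    }
}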
Multi-threading problem: couldn't delete segments
We are having a problem with Lucene in a high-concurrency create/delete/search situation. I thought I fixed all these problems, but I guess not. Here's what's happening.

We are conducting load testing on our application. On a Windows 2000 server using lucene-1.3-final with compound file enabled, a worker thread is creating new Documents as it ingests content. Meanwhile, a test script is running that is hitting the search part of our application (I think the script also updates and deletes Documents, but I am not sure. My colleague who wrote it has left for the day, so I can't ask him.).

The scripted test passes with 1, 5, and 10 users hitting the application. At 20 users, we get this exception:

[Task Worker1] ERROR com.ancept.ams.search.lucene.LuceneIndexer - Caught exception closing IndexReader in finally block
java.io.IOException: couldn't delete segments
        at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:236)
        at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java(Compiled Code))
        at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:179)
        at org.apache.lucene.store.Lock$With.run(Lock.java:148)
        at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java(Compiled Code))
        at org.apache.lucene.index.IndexReader.close(IndexReader.java(Inlined Compiled Code))
        at org.apache.lucene.index.SegmentsReader.doClose(SegmentsReader.java(Compiled Code))
        at org.apache.lucene.index.IndexReader.close(IndexReader.java(Compiled Code))
        at com.ancept.ams.search.lucene.LuceneIndexer.delete(LuceneIndexer.java:266)

All write access to the index is controlled in that LuceneIndexer class by synchronizing on a static lock object. Searching is handled in another part of the code, which creates new IndexSearchers as necessary when the index changes. I do not rely on finalization to clean up these searchers because we found it to be unreliable. I keep track of the threads using each searcher and then close it when that number drops to 0 if the searcher is outdated.

My problem seems similar to what Robert Leftwich asked about on this mailing list in January 2001. Google cache: http://64.233.179.104/search?q=cache:1D4h1vSh5AQJ:www.geocrawler.com/mail/msg.php3%3Fmsg_id%3D5020057++lucene+multithreading+problems+site:geocrawler.comhl=en

Doug Cutting replied to him saying that he should synchronize calls to IndexReader.open() and IndexReader.close(). Google cache: http://64.233.179.104/search?q=cache:arztiytQ42QJ:www.geocrawler.com/archives/3/2624/2001/1/0/5020870/++lucene+multithreading+problems+site:geocrawler.comhl=en

Robert Leftwich then found a problem with his code and eliminated a second IndexReader that was messing stuff up. Google cache: http://64.233.179.104/search?q=cache:jSIsi6t9KH8J:www.geocrawler.com/mail/msg.php3%3Fmsg_id%3D5037517++lucene+multithreading+problems+site:geocrawler.comhl=en

However, there are differences between Leftwich's design and mine, and besides, that thread is four years old. (Are there even existing archives for lucene-user throughout 2001 anywhere?) So any advice would be appreciated. Do I need to synchronize _all_ IndexReader.open() and IndexReader.close() calls? Or is it more likely that I'm missing something in my class that modifies the index? The code is attached.
Thank you,
Luke Francl

// $Id: LuceneIndexer.java 20473 2004-10-19 17:20:10Z lfrancl $
package com.ancept.ams.search.lucene;

import com.ancept.ams.asset.AssetUtils;
import com.ancept.ams.asset.AttributeValue;
import com.ancept.ams.asset.IAsset;
import com.ancept.ams.asset.IAssetIdentifier;
import com.ancept.ams.asset.IAssetList;
import com.ancept.ams.asset.ITimeMetadataAsset;
import com.ancept.ams.asset.IVideoAssetView;
import com.ancept.ams.controller.RelayFactory;
import com.ancept.ams.enums.AttributeNamespace;
import com.ancept.ams.enums.AttributeType;
import com.ancept.ams.enums.TimeMetadataType;
import com.ancept.ams.relay.IAssetRelay;
import com.ancept.ams.search.Indexer;
import com.ancept.ams.search.Fields;
import com.ancept.ams.util.SystemConfig;
import com.ancept.ams.util.PerformanceMonitor;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

import java.io.File;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

/**
 * Controls access to the Lucene index.
 *
 * @author Luke Francl
 **/
public final class LuceneIndexer implements Indexer {

    private static final Logger l4j = Logger.getLogger
Re: LIMO problems
On Thu, 2004-12-09 at 07:32, Daniel Cortes wrote:
> Hi, I'm trying Limo (the Lucene Index Monitor) and I have a problem. It's probably a silly problem, but I don't have a solution yet. Can someone tell me what structure the limo.properties file has? I don't have an example. If you know another web application for administering Lucene indexes, let me know. Thanks for all, and excuse my silly questions.

Daniel,

Julien or I will be happy to help you, but I need more information. What version of LIMO are you using?

In LIMO 0.5.2, Julien added a new feature which allows you to configure the LIMO web application while it is running, through the limo.properties file. This file is in the standard Java properties file format:

<index name>=<filesystem location>

However, you shouldn't need to care about this detail, as there is a method to add indexes from the web application. If you have any other questions, please don't hesitate to ask.

Regards,
Luke Francl
LIMO developer

P.S.: LIMO 0.5.2 adds a new index file browser that shows you some interesting details about your index files. Check it out!
Re: LIMO problems
On Thu, 2004-12-09 at 10:07, Daniel Cortes wrote:
> I have the latest version of LIMO. It is running in Tomcat, and I can't add any index; it doesn't load the index that I created earlier from the console (java org.apache.lucene.demo.IndexFiles ...). This is the reason I asked about the structure of limo.properties: the file doesn't exist, and with it I could force the location of the index files to load. Thanks for your time.

Ah, this probably means that LIMO cannot write to this location. If you give the user you are running Tomcat as permission to write files to your webapps/limo directory (or whatever it's called; I don't actually use Tomcat), it should work. If you don't want to do that for security reasons, simply create the file and put it there yourself. It should be at the same level as the index.jsp file.

Regards,
Luke Francl
LIMO developer
Re: partial updating of lucene
On Thu, 2004-12-09 at 09:00, Erik Hatcher wrote:
> Have a look at the tool Luke (Google for "luke lucene" :) and see how it does its Reconstruct and Edit facility. It is possible, though potentially lossy, to reconstruct a document and add it again.

Or look at LIMO's implementation of that feature, which to my eyes is a little easier to read (of course, that's probably because I wrote it... ;): http://cvs.sourceforge.net/viewcvs.py/limo/limo/src/net/sourceforge/limo/LimoUtils.java?rev=1.6view=markup (check out LimoUtils.reconstructDocument())

However, if you're doing analysis on your text to remove stopwords and things like that, this WILL be lossy. I consider it more of an aid for debugging than a way to re-index documents, though I suppose it would work for that as well. However, I believe the process would be highly resource-intensive, so I wouldn't recommend it.

The better solution is to add a stored keyword field that stores the location of your document, and then re-index it from the source.

Regards,
Luke Francl
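A minimal sketch of that last suggestion in the Lucene 1.x API (the "path" and "contents" field names and the helper class are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class SourceFieldSketch {
    // At index time: remember where the document came from.
    public static Document build(String path, String bodyText) {
        Document doc = new Document();
        doc.add(Field.Keyword("path", path)); // stored and indexed, not analyzed
        doc.add(Field.Text("contents", bodyText));
        return doc;
    }

    // At update time: drop the stale copy, then re-parse the source
    // and add a fresh Document with an IndexWriter.
    public static void removeStale(String indexDir, String path) throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        try {
            reader.delete(new Term("path", path));
        } finally {
            reader.close();
        }
    }
}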
Re: Document-Map, Hits-List
On Wed, 2004-12-01 at 10:27, Otis Gospodnetic wrote:
> This is very similar to what I do - I create a List of Maps from Hits and its Documents. So I think this change may be handy, if doable (I didn't look into changing the two Lucene classes, actually).

How do you avoid the problem Eric just mentioned, iterating through all the Hits at once to populate this data structure?

I do a similar thing, creating a List of asset references from a field in each Lucene Document in my Hits list (the actual data for display is retrieved from a separate datastore). I was not aware of any performance problems from doing this, but now I am wondering about the implications.

Thanks,
Luke
Re: modifying existing index
On Tue, 2004-11-23 at 13:59, Santosh wrote:
> I am using Lucene for indexing. When I create the index, the documents are added. But when I modify a single existing document and re-index it, it is treated as a new document and added one more time, so I get the same document twice in the results. To overcome this, I am deleting the existing index and recreating the whole index. Is it possible to index the modified document again and overwrite the existing document without deleting and recreating? If so, how?

You do not need to recreate the whole index. Just mark the document as deleted using the IndexReader and then add it again with the IndexWriter. Remember to close your IndexReader and IndexWriter after doing this. The deleted document will be removed the next time you optimize your index.

Luke Francl
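A minimal sketch of the delete-then-re-add cycle in the Lucene 1.4 API (the index path, analyzer, and "id" field are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocumentSketch {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/index";

        // 1. Mark the old copy deleted, keyed by a unique stored field.
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("id", "doc-42"));
        reader.close();

        // 2. Add the modified document (false = append to the existing index).
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("id", "doc-42"));
        doc.add(Field.Text("contents", "the modified text"));
        writer.addDocument(doc);
        writer.close();
    }
}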
Re: Limo 0.5
On Mon, 2004-11-22 at 02:27, Chandrashekhar wrote:
> Hi, with Limo 0.5, can I find out whether a certain word from some document is indexed or not?

This feature doesn't exist as such. You could search for the word, and if results come up, then it is indexed in the documents returned. I'll add enumerating the terms in an index to my list of things to add.

Regards,
Luke Francl
LIMO 0.5 released
I am pleased to announce that version 0.5 of LIMO, the Lucene Index Monitor, has been released. LIMO is a web application that allows you to browse your Lucene indexes remotely. It is an ideal companion for Lucene applications that run in a servlet container.

The 0.5 release adds some cool new features, such as:

* More index summary statistics, including index version number, deletion status, number of documents, number of fields, number of indexed fields, and number of unindexed fields.
* Querying the index.
* Display of expanded wildcard and range queries (using Query.rewrite) with a term count, so you can see how many terms a complex query is expanded to. This is particularly helpful if you are trying to track down an annoying TooManyClauses exception.
* Query timing, to show how expensive queries are.
* Estimated query memory consumption (as given by the formula in this message: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1757461).
* Query result count.
* Query result explanation.
* Stored field reconstruction, as in Luke.
* Highlighting of matching terms in search results and reconstructed documents, using Mark Harwood's library.

LIMO requires Java 1.4 or later and a servlet container. Download it from SourceForge: http://sourceforge.net/projects/limo/

LIMO is still ready to go out of the box (er, war file). Just edit the web.xml to point LIMO to your indexes.

Thanks to Julien Nioche for starting a great and very useful project and letting me join it, and to Andrzej Bialecki for Luke, from which I appropriated several ideas and his GrowableStringArray class. If you are interested in getting involved, LIMO is now available in SourceForge CVS.

Regards,
Luke Francl
RE: disadvantages
Well, that really depends on how big your index is and what they search for, now doesn't it? ;)

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Sun 11/21/2004 2:52 PM
To: Lucene Users List
Subject: Re: disadvantages

On Nov 21, 2004, at 12:00 PM, Miguel Angel wrote:
> What are disadvantages the Lucene??

The users of your system won't have time to get coffee when running searches.

Erik
Re: Too many files exception
On Thu, 2004-11-18 at 07:09, Neelam Bhatnagar wrote:
> Hello all, we have been using the Lucene 3.1 version with Tomcat 4.0 and JDK 1.4. It seems that sometimes we see a "Too many files open" exception which completely garbles the whole index, and the whole search functionality crashes on the web site. It has also been known to crash the complete JSP container of Tomcat.

(I'm assuming you meant Lucene 1.3.)

This exception happens when your process has too many file handles open. The limits differ by operating system. With Lucene, this is caused by having a number of IndexReaders open: each IndexReader will open each file in your index. If you do not close your IndexReaders, this exception can happen, especially if you have a lot of heap and the IndexReaders are not getting garbage collected. My guess is that you are creating a new IndexSearcher for each search request and then not closing it after the search is complete.

Lucene 1.3 added a feature called compound index files that much alleviates this problem by greatly reducing the number of files required in an index. You can use it by turning it on with IndexWriter.setUseCompoundFile(true): http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#getUseCompoundFile()

Combined with closing your IndexReaders, this should fix the problem.

Regards,
Luke Francl
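A minimal sketch of both fixes together (the index path, analyzer, and query are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class FileHandleSketch {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/index";

        // Writing: the compound file format cuts the per-segment file count.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.setUseCompoundFile(true);
        // ... addDocument() calls ...
        writer.close();

        // Searching: always release the file handles when done.
        IndexSearcher searcher = new IndexSearcher(indexDir);
        try {
            Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
            System.out.println(hits.length() + " hits");
        } finally {
            searcher.close();
        }
    }
}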
Re: _4c.fnm missing
On Tue, 2004-11-16 at 14:57, Luke Shannon wrote:
> This is the latest error I have received: IndexReader out of date and no longer valid for delete, undelete, or setNorm operations

What you need to do is check the version number of the index to determine whether you need to open a new IndexReader for deletes. Remember, this must be synchronized with the same lock you are using to control write access to the index.

Luke
Re: BooleanQuery - TooManyClauses Issue
On Tue, 2004-11-16 at 16:32, Paul Elschot wrote:
> Once you approach 1000 days, you'll get the same problem again, so you might want to use a filter for the dates. See DateFilter and the archives on YYYYMMDD.

Can anyone point to a good example of how to use the DateFilter?

Thanks,
Luke
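No example appears in this thread; a sketch of how DateFilter.Between is typically applied in Lucene 1.4 (field names, dates, and index path are made up, and it assumes the "modified" field was indexed with DateField encoding):

import java.util.Date;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DateFilter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class DateFilterSketch {
    public static void main(String[] args) throws Exception {
        // A filter never expands into BooleanQuery clauses, so it avoids
        // TooManyClauses no matter how wide the date range is.
        Date from = new Date(104, 0, 1);   // 2004-01-01 (java.util.Date years offset from 1900)
        Date to   = new Date(104, 11, 31); // 2004-12-31
        DateFilter filter = DateFilter.Between("modified", from, to);

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")), filter);
        System.out.println(hits.length() + " hits in range");
        searcher.close();
    }
}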
RE: Index File
On Fri, 2004-11-12 at 19:07, Richard Greenane wrote:
> You might want to look at Luke @ http://www.getopt.org/luke/ -- a great tool for checking the index to make sure that everything is there.

There is also a web-based tool that you can run in your servlet container called LIMO. I've added some query features to it in CVS, which you can check out from SourceForge: http://sourceforge.net/projects/limo

But I will second what Otis said: you must (or rather, your colleague must) check to see if the index has been updated before a search (use IndexReader.getCurrentVersion), and if it has, close the IndexSearcher and create a new one.

Luke
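A minimal sketch of that version check (the index path and caching class are made up); note that closing a searcher other threads may still be using needs the kind of reference counting discussed elsewhere in this archive:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCacheSketch {
    private static final String INDEX_DIR = "/path/to/index";
    private static IndexSearcher searcher;
    private static long version;

    // Call before each search; reopens only when the index has changed.
    public static synchronized IndexSearcher getSearcher() throws Exception {
        long current = IndexReader.getCurrentVersion(INDEX_DIR);
        if (searcher == null) {
            searcher = new IndexSearcher(INDEX_DIR);
            version = current;
        } else if (current != version) {
            searcher.close(); // NOTE: unsafe if another thread is mid-search
            searcher = new IndexSearcher(INDEX_DIR);
            version = current;
        }
        return searcher;
    }
}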
Re: Index File
On Mon, 2004-11-15 at 09:52, Luke Shannon wrote:
> Once this was modified to create a new IndexSearcher for every search request, all my problems went away.

Be careful with this. You could conceivably run out of file handles. This problem got a lot better in Lucene 1.3 with the compound file format, but it could still happen if you have a lot of heap and aren't garbage collecting very often. So close the old one when you're done with it.

Also, creating a new IndexSearcher only when the index has been modified will give you a performance boost, because you do not have to open the index with every search.

Luke Francl
Re: Lucene : avoiding locking (incremental indexing)
This is how I implemented incremental indexing. If anyone sees anything wrong, please let me know.

Our motivation is similar to John Eichel's. We have a digital asset management system, and when users update, delete, or create a new asset, they need to see their results immediately.

The most important thing to know about incremental indexing is that multiple threads cannot share the same IndexWriter, and only one IndexWriter can be open on an index at a time. Therefore, what I did was control access to the IndexWriter through a singleton wrapper class that synchronizes access to the IndexWriter and IndexReader (for deletes). After you finish writing to the index, you must close the IndexWriter to flush the changes to the index. If you do this, you will be fine.

However, opening and closing the index takes time, so we had to look for some ways to speed up the indexing. The most obvious thing is that you should do as much work as possible outside of the synchronized block. For example, in my application, the creation of Lucene Document objects is not synchronized. Only the part of the code between opening the IndexWriter and closing it needs to be synchronized.

The other easy thing I did to improve performance was to batch the changes in a transaction together for indexing. If a user changes 50 assets, they will all be indexed using one Lucene IndexWriter.

So far, we haven't had to explore further performance enhancements, but if we do, the next thing I will do is create a thread that gathers assets that need to be indexed and performs a batch job every five minutes or so.

Hope this is helpful,
Luke
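The wrapper class itself isn't shown in the post; a minimal sketch of the pattern described (the class name, method, path, and analyzer are made up) might look like this:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public final class IndexGate {
    private static final Object LOCK = new Object(); // guards all writes
    private static final String INDEX_DIR = "/path/to/index";

    private IndexGate() {}

    // Documents are built by the caller, outside the synchronized block.
    public static void addBatch(Document[] docs) throws IOException {
        synchronized (LOCK) {
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), false);
            try {
                for (int i = 0; i < docs.length; i++) {
                    writer.addDocument(docs[i]);
                }
            } finally {
                writer.close(); // flushes the changes and releases the write lock
            }
        }
    }
}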
Re: Lucene : avoiding locking
Luke,

I also integrated Lucene into a content management application with incremental updates and ran into the same problem you did. You need to make sure only one process (which means no multiple copies of the application writing to the index simultaneously) or thread ever writes to the index. That includes deletes, as in your code below, so make sure that is synchronized, too.

Also, you will find that opening and closing the index for writing is very costly, especially on a large index, so it pays to batch up all the changes in a transaction (inserts and deletes) together in one go at the Lucene index. If this still isn't enough, you can batch up 5 minutes' worth of changes and apply them at once. We haven't got to that point yet.

I am curious, though, how many people on this list are using Lucene in the incremental update case. Most examples I've seen all assume batch indexing.

Regards,
Luke Francl

On Thu, 2004-11-11 at 18:33, Luke Shannon wrote:

Synchronizing the method didn't seem to help. The lock is being detected right here in the code:

while (uidIter.term() != null && uidIter.term().field() == "uid"
       && uidIter.term().text().compareTo(uid) < 0) {
    // delete stale docs
    if (deleting) {
        reader.delete(uidIter.term());
    }
    uidIter.next();
}

This runs fine on my own site, so I am confused. For now I think I am going to remove the deleting of stale files etc. and just rebuild the index each time to see what happens.

----- Original Message -----
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:56 PM
Subject: Re: Lucene : avoiding locking

I'm working on a similar project... Make sure that only one call to the index method is occurring at a time. Synchronizing that method should do it.

--- Luke Shannon [EMAIL PROTECTED] wrote:

Hi All;

I have hit a snag in my Lucene integration and don't know what to do. My company has a content management product. Each time someone changes the directory structure or a file within it, that portion of the site needs to be re-indexed so the changes are reflected in future searches (indexing must happen during run time).

I have written an Indexer class with a static Index() method. The idea is to call the method every time something changes and the index needs to be re-examined. I am hoping the logic put in by Doug Cutting surrounding the UID will make indexing efficient enough to be called this frequently.

This class works great when I tested it on my own little site (I have about 2000 files). But when I drop the functionality into the QA environment, I get a locking error. I can't access the stack trace; all I can get at is a log file the application writes to. Here is the section my class wrote. It was right in the middle of indexing and, bang, lock issue. I don't know if the problem is in my code or something in the existing application.

Error message:

ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
|INFO|INDEXING INFO: Start Indexing new content.
|INFO|INDEXING INFO: Index Folder Did Not Exist. Start Creation Of New Index
|INFO|INDEXING INFO: Beginnging Incremental update comparisions (this line repeated 17 times)
|INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock
|ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)

Here is my code. You will recognize it pretty much as the IndexHTML class from the Lucene demo written by Doug Cutting. I have put a ton of comments in an attempt to understand what is going on. Any help would
Re: Lucene : avoiding locking
On Fri, 2004-11-12 at 09:51, Luke Shannon wrote:
> Hi Luke; currently I am experimenting with checking whether the index is locked, using IndexReader.isLocked, before creating a writer. If this turns out to be the case, I was thinking of just unlocking the file. Do you think this is a good strategy?

No, because if the index is locked, that means another thread or process is writing to it. If you're getting spurious locks, stop your application and clean out the /tmp directory (you should see files named *lucene* -- these are the lock files).

Luke
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Thu, 2004-11-11 at 14:48, Daniel Naber wrote:
> On Thursday 11 November 2004 20:57, Sanyi wrote:
> > What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024 variations.
>
> That's the point: there is no query optimizer in Lucene.

Would it be possible to write one? I would be very interested in this feature.

I poked around in the index and search packages today to see if it could be done. I think it would take a big change to Query.rewrite and the related code in the IndexReaders to make the results of the required and prohibited parts of the query available. Again, I don't know if that's even possible. But it would be a great feature.

Luke
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Fri, 2004-11-12 at 14:52, Daniel Naber wrote:
> There are two different issues: first, reorder the query so that those terms with fewer matches appear first, because as soon as the first term with 0 matches occurs, the search stops. There will probably be a not-so-difficult implementation for that, but I guess it will have more overhead than it saves.

It could be done only for searches that have expansions (RangeQuery, WildcardQuery, etc.) to prevent unnecessary work...

> The other thing is that prefix queries get expanded first, then the search happens. And that TooManyClauses exception happens when expanding the query, not during the search. I'm not sure, but I think that's difficult to change, at least in a clean way.

Expansion (and thus TooManyClauses) happens during Query.weight(), which is right before the search. Maybe after the query is rewritten it could be tested for instanceof BooleanQuery, and then checked to see if BooleanQuery.getClauses().length > BooleanQuery.maxClauseCount?
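In practice the expansion itself throws, so rather than counting clauses afterward, the test amounts to catching the exception at rewrite time. A minimal sketch, assuming Lucene 1.4's BooleanQuery.TooManyClauses (the helper class is made up):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class ClauseCountSketch {
    // rewrite() expands prefix/range/wildcard terms before any searching,
    // so the overflow surfaces here rather than during the search itself.
    public static boolean fitsClauseLimit(Query query, IndexReader reader) throws IOException {
        try {
            query.rewrite(reader);
            return true; // expansion stayed within BooleanQuery.getMaxClauseCount()
        } catch (BooleanQuery.TooManyClauses e) {
            return false; // trim the query, or use a Filter (e.g. DateFilter) instead
        }
    }
}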
What is the difference between these searches?
Hi,

I've implemented a converter to translate our system's internal Query objects to Lucene's Query model. I recently realized that my implementation of OR NOT was not working as I would expect, and I was wondering if anyone on this list could give me some advice.

I am converting a query that means "foo or not bar" into the following:

+item_type:xyz +(field_name:foo -field_name:bar)

This returns only Documents where field_name contains foo. I would expect it to return all the Documents where field_name contains foo or field_name doesn't contain bar.

Fiddling around with the Lucene Index Toolbox, I think that this query does what I want:

+item_type:xyz field_name:foo -field_name:bar

Can someone explain to me why these queries return different results?

Thanks,
Luke Francl
Re: What is the difference between these searches?
On Tue, 2004-11-09 at 15:48, Erik Hatcher wrote:
> This last query has a required clause, which is what BooleanQuery requires when there is a NOT clause. You're getting what you want here because you've got an item_type:xyz clause as required. In your first example, you're requiring field_name:foo, whereas in this one it is not mandatory.

So, essentially, my query:

+item_type:xyz +(field_name:foo -field_name:bar)

gets translated to:

+item_type:xyz +field_name:foo -field_name:bar

whereas the more lenient one does not require field_name:foo and returns what I expect. Is that right?

Now, to decide whether to try to make this work the way I thought it would, or just document that it doesn't. ;)
Re: What is the difference between these searches?
On Tue, 2004-11-09 at 16:00, Paul Elschot wrote:
> Lucene has no provision for matching by being prohibited only. This can be achieved by indexing something for each document that can be used in queries to match always, combined with something prohibited in a query. But doing this is bad for performance when querying larger numbers of docs.

I'm familiar with Lucene's restrictions on prohibited queries, and I have a required clause for a field that will always be part of the query (it's not a nonsense value; it's the item type of the object in a CMS).

My problem is that I had only been considering the whole query object that I've generated. Every BooleanQuery that's a part of my finished query must also have a required clause if it has a prohibited clause.

I'm thinking of refactoring my code so that instead of joining together Query objects into a large BooleanQuery, it passes around BooleanClauses and assembles them into a single BooleanQuery.

Thanks for your help,
Luke
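A minimal sketch of that flat assembly in the Lucene 1.x API, where BooleanQuery.add() takes (query, required, prohibited) flags (the helper class is made up; the field values come from the earlier message):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FlatQuerySketch {
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        // add(query, required, prohibited)
        query.add(new TermQuery(new Term("item_type", "xyz")),  true,  false); // always required
        query.add(new TermQuery(new Term("field_name", "foo")), false, false); // optional
        query.add(new TermQuery(new Term("field_name", "bar")), false, true);  // prohibited
        // Renders as: +item_type:xyz field_name:foo -field_name:bar
        return query;
    }
}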
Re: Thread safety of QueryParser
Thank you for the update, Doug.

On Tue, 2003-08-26 at 11:57, Doug Cutting wrote:
> This method constructs a new query parser each time it is called, so it is thread safe.

Perhaps the jGuru FAQ should be updated...

Luke
Thread safety of QueryParser
[Note: I sent this email before I received my subscription confirmation message, and I have not seen it in the archives yet. If you received this message twice, my apologies. -- Luke]

According to the jGuru FAQ, QueryParser is not thread safe: http://www.jguru.com/faq/view.jsp?EID=492389

However, this information is several years old. Is this still true?

The answer to the question suggests using a new parser for every thread, but the QueryParser.parse(String query, String field, Analyzer analyzer) method is static, and I don't see any way to set the default field on an instance of the QueryParser. Is that what the f parameter of the QueryParser(String f, Analyzer a) constructor is for?

Thanks for your advice,
Luke Francl
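For reference, a minimal sketch of the instance-based usage being asked about -- the f parameter is indeed the default field (the analyzer, field name, and query string here are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ParserSketch {
    public static void main(String[] args) throws Exception {
        // One parser per thread; "contents" is the default field for
        // terms that don't name a field explicitly.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query q = parser.parse("title:lucene AND indexing"); // throws ParseException on bad syntax
        System.out.println(q);
    }
}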