Re: OutOfMemoryError with Lucene 1.4 final
You probably need to increase the amount of RAM available to your JVM. See the parameters:

-Xmx : maximum memory usable by the JVM
-Xms : initial memory allocated to the JVM

My params are: -Xmx2048m -Xms128m (2G max, 128M initial)

On Fri, 10 Dec 2004 11:17:29 -0600, Sildy Augustine [EMAIL PROTECTED] wrote:

I think you should close your files in a finally clause in case of exceptions with the file system, and also print out the exception. You could be running out of file handles.

-Original Message-
From: Jin, Ying [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 11:15 AM
To: [EMAIL PROTECTED]
Subject: OutOfMemoryError with Lucene 1.4 final

Hi, Everyone,

We're trying to index ~1500 archives but get an OutOfMemoryError about halfway through the indexing process. I've tried to run the program under two different Redhat Linux servers: one with 256M memory and 365M swap space, the other with 512M memory and 1G swap space. However, both got the OutOfMemoryError at the same place (at record 898).

Here is my code for indexing:
===
Document doc = new Document();
doc.add(Field.UnIndexed("path", f.getPath()));
doc.add(Field.Keyword("modified",
    DateField.timeToString(f.lastModified())));
doc.add(Field.UnIndexed("eprintid", id));
doc.add(Field.Text("metadata", metadata));

FileInputStream is = new FileInputStream(f); // the text file
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuffer stringBuffer = new StringBuffer();
String line = "";
try {
    while ((line = reader.readLine()) != null) {
        stringBuffer.append(line);
    }
    doc.add(Field.Text("contents", stringBuffer.toString()));
    // release the resources
    reader.close();
    is.close();
} catch (java.io.IOException e) {}
===

Is there anything wrong with my code, or do I need more memory? Thanks for any help!

Ying

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
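The suggestion above to close the streams in a finally clause is worth showing concretely: even if readLine() throws, the file handle is released, which avoids leaking descriptors across ~1500 archives. A minimal sketch using only the standard library (class and file names are hypothetical):

```java
import java.io.*;

public class ReadFileSafely {
    // Read a whole text file into a String, closing the reader even on error.
    static String readContents(File f) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f)));
        StringBuffer sb = new StringBuffer();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            reader.close(); // closes the underlying FileInputStream too
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".txt");
        Writer w = new FileWriter(tmp);
        w.write("hello\nworld\n");
        w.close();
        System.out.println(readContents(tmp).trim().replace('\n', ' '));
        tmp.delete();
    }
}
```

Closing in finally also means the catch block can no longer silently swallow a failure that leaves the stream open, which matches the "running out of file handles" theory.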
Re: partial updating of lucene
Your unstored fields were not stored in the index; only their terms were. When you get the document from the index and modify it, those fields are lost when you add the document again. You can either simply create a new document, populate all the fields, and add that document to the index, or you can re-add the unstored fields to the document retrieved in step 1.

On Wed, 8 Dec 2004 17:53:26 -0500, Praveen Peddi [EMAIL PROTECTED] wrote:

Hi all, I have a question about updating a Lucene document. I know that there is no API to do that now. So this is what I am doing in order to update the document with the field "title":

1) Get the document from the Lucene index.
2) Remove a field called "title" and add the same field with a modified value.
3) Remove the document (based on one of our fields) using a Reader, and then close the Reader.
4) Add the document that was obtained in 1 and modified in 2.

I am not sure if this is the right way of doing it, but I am having problems searching for that document after updating it. The problem is only with the unstored fields. For example, I search with "description:boy", where description is an unstored, indexed, tokenized field in the document. I find 1 document. Now I update the document's title as described above and repeat the same search "description:boy", and now I don't find any results. I have not touched the field "description" at all. I just updated the field "title". Is this expected behaviour? If not, is it a bug? If I change the field "description" to stored, indexed and tokenized, the search works fine before and after updating.

Praveen
**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email:[EMAIL PROTECTED]
Tel: 401.854.3475 Fax: 401.861.3596
web: http://www.contextmedia.com
**
Context Media - The Leader in Enterprise Content Integration
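Since the terms of an unstored field cannot be recovered from a retrieved document, an update has to rebuild the whole document from the original source data. A sketch of the delete-then-re-add cycle against the 1.4-era API (the field names, the "uid" key, and the source variables are hypothetical):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import java.io.IOException;

// Delete the old copy by its unique key, then re-add a fully
// repopulated document; unstored fields come from the source data.
static void updateTitle(String indexPath, String uid, String newTitle,
                        String descriptionFromSource) throws IOException {
    IndexReader reader = IndexReader.open(indexPath);
    reader.delete(new Term("uid", uid)); // remove the old copy
    reader.close();                      // releases the write lock

    Document doc = new Document();
    doc.add(Field.Keyword("uid", uid));
    doc.add(Field.Text("title", newTitle));
    // unstored field: must be re-supplied, not copied from the old doc
    doc.add(Field.UnStored("description", descriptionFromSource));

    IndexWriter writer = new IndexWriter(indexPath,
            new StandardAnalyzer(), false); // false = append, don't create
    writer.addDocument(doc);
    writer.close();
}
```

The key point is that every field, stored or not, is rebuilt from the source of truth rather than from the document fetched out of the index.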
corrupted index
Somehow today one of my indexes became corrupted. I get the following IOException when trying to open the index:

Exception in thread main java.io.IOException: read past EOF
 at org.en.lucene.store.InputStream.refill(InputStream.java:154)
 at org.en.lucene.store.InputStream.readByte(InputStream.java:43)
 at org.en.lucene.store.InputStream.readVInt(InputStream.java:83)
 at org.en.lucene.index.FieldInfos.read(FieldInfos.java:195)
 at org.en.lucene.index.FieldInfos.init(FieldInfos.java:55)
 at org.en.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
 at org.en.lucene.index.SegmentReader.init(SegmentReader.java:94)
 at org.en.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
 at org.en.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
 at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
 at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)
 at org.en.global.indexer2.Minnow.main(Minnow.java:142)

Any ideas on what could cause this type of corruption, and what I can do to avoid it in the future? Also, any ideas on repairing the index if this happens? I removed the index directory and marked the rows to be reindexed from the database, but the data is unavailable to my users while the index rebuilds.
Re: Thread safety
You can only have one open writer at a time. A writer is either an IndexWriter object, or an IndexReader object that has modified the index, by deleting documents for instance. You must close your existing writer before you open a new one. You should not get lock exceptions with IndexSearchers. The only time the locks come into play is when you try to open a writer while a writer process already holds the lock, or when the writer process died without removing the lock, leaving a stale lock behind. I've run into FileNotFound exceptions on occasion, and have pretty much pinned them down to modifying the index on a slow device (NFS) with a very large index while trying to instantiate a new searcher. I worked around it by catching the exception and creating the searcher again. On Fri, 03 Dec 2004 08:58:41 +0100, sergiu gordea [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: 1. yes 2. yes error, meaningful, it depends what you find meaningful :) 3. the searcher will still find the document, unless you close it and reopen it (the searcher) ... What about LockException? I tried to index objects in a thread and to use an IndexSearcher to search objects, but I have had problems with this. I tried to create a new IndexSearcher object when the index version changed, but unfortunately I got some LockExceptions and FileNotFoundExceptions. If answer number 3 is correct, then why did I get these exceptions? Sergiu Otis --- Zhang, Lisheng [EMAIL PROTECTED] wrote: Hi, I have an urgent question about thread safety in Lucene; from the Lucene docs and code I could not get a clear answer. 1. Is Searcher (IndexSearcher, MultiSearcher ..) thread safe? Can multiple users call search(..) on the same object at the same time? 2. If, on the same object, one user calls close() and another calls search(..), I assume we should get a meaningful error message? 3.
what would happen if one user calls Searcher.search(..), but at the same time another user tries to delete that document from the index files by calling IndexReader.delete(..) (either through two threads or two separate processes)? A brief answer would be good enough for me for now, thanks very much in advance! Lisheng
Re: What is the best file system for Lucene?
On Tue, 30 Nov 2004 12:07:46 -, Pete Lewis [EMAIL PROTECTED] wrote: Also, unless you turn your hyperthreading off, with just one index you are searching with just one half of the CPU - so your desktop is actually using a 1.5GHz CPU for the search. So, taking account of this, it's not too surprising that they are searching at comparable speeds. HTH Pete

Actually, that isn't how hyperthreading works. The second CPU in a hyperthreaded system should only run threads when the main CPU is waiting on another task, like a memory access. The second, or sub, CPU is only a virtual processor; there aren't really two chips on board. New multicore processors will actually have more than one processor core in one chip. Problems can arise when you are using an HT processor on an operating system that doesn't know about HT technology. The OS should only schedule jobs to run on the sub CPU under very specific circumstances. This is one of the major reasons for the scheduler overhaul in Linux 2.6. The default scheduler in 2.4 would assign threads to the sub CPU that shouldn't have been assigned there, and those threads would suffer from resource starvation.
Re: What is the best file system for Lucene?
As a generalisation, SuSE itself is not a lot slower than Windows XP, and I very much doubt that the filesystem is a factor. If you want to test without filesystem involvement, simply load your index into a RAMDirectory instead of using FSDirectory. That precludes filesystem overhead in searches. There are quite a number of factors involved that could be affecting performance. First off, 1.8GHz Pentium-M machines are supposed to run at about the speed of a 2.4GHz desktop chip; the clock speeds on the mobile chips are lower, but they tend to perform much better than rated. I recommend you take a general benchmark of both machines, testing both disk speed and CPU speed, to get a baseline performance comparison. I also suggest turning off HT for your benchmarks and performance testing. Secondly, while the second machine appears to be twice as fast, the disk could actually perform slower on the Linux box, especially if the notebook drive has a big (8M) cache like most 7200RPM ATA disk drives do. I imagine that if you hit the index with lots of simultaneous searches, the Linux box would hold its own for much longer than the XP box, simply due to the random seek performance of the SCSI disk combined with SCSI command queueing. RAM speed is a factor too. Is the P4 a Xeon processor? The older HT Xeons have a much slower bus than the newer P4-M processors. Memory speed will be affected accordingly. I haven't heard of a hard disk referred to as a winchester disk in a very long time :) Once you have an idea of how the two machines actually compare performance-wise, you can then judge how they perform index operations. Until then, all your measurements are subjective and you don't gain much by comparing the two indexing processes. Justin On Tue, 30 Nov 2004 02:04:46 -0800 (PST), Sanyi [EMAIL PROTECTED] wrote: Hi! I'm testing Lucene 1.4.2 on two very different configs, but with the same index.
I'm very surprised by the results: both systems are searching at about the same speed, but I'd expect (and I really need) Lucene to run a lot faster on my stronger config.

Config #1 (a notebook): WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
Config #2 (a desktop PC): SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM, 15000RPM U320 SCSI winchester

You can see that the hardware of #2 is at least twice as good/fast as #1. I'm searching for the reason, and for a solution to take advantage of the better hardware compared to the poor notebook. Currently #2 can't amazingly outperform the notebook (#1). The question is: what can be worse on #2 than on the poor notebook? I can imagine only software problems. Which are the software parts then? 1. The OS. Is SuSE 9.1 a LOT slower than WinXP Pro? 2. The file system. Is reiserfs a LOT slower than NTFS? Regards, Sanyi
Re: Index in RAM - is it realy worthy?
My indexes are stored on a NetApp filer via NFS, and the indexer process updates the indexes over NFS. I have multiple indexes. My search process determines whether the NFS indexes have been updated and, if they have, loads the index into a RAMDirectory. RAMDirectory is of course much faster than searching over NFS. This way, I can also have multiple search servers running easily. The drawback, of course, is startup time: it takes a few minutes to start each search server because it has to load the data into memory. RAMDirectory also seems to be somewhat memory inefficient, using a lot more memory than the data actually consumes on disk. On Wed, 24 Nov 2004 14:26:40 -0800, Jonathan Hager [EMAIL PROTECTED] wrote: When comparing RAMDirectory and FSDirectory it is important to mention what OS you are using. When using Linux, it will cache the most recent disk accesses in memory. Here is a good article that describes its strategy: http://forums.gentoo.org/viewtopic.php?t=175419 The 2% difference you are seeing is the memory copy. With other OSes you may see a speedup when using the RAMDirectory, because not all OSes contain a disk cache in memory and must access the disk to read the index. Another consideration is that there is currently a 2GB limitation on the size of the RAMDirectory. Indexes over 2GB cause an overflow in the int used to create the buffer. [see int len = (int) is.length(); in RAMDirectory] I ended up using RAMDirectory for a very different reason. The index is 1 to 2MB and is rebuilt every few hours. It takes 3 to 4 minutes to query the database and rebuild the index, but the search should be available 100% of the time.
Since the index is so small I do the following:

on server startup:
- look for the semaphore; if it is there, delete the index
- if there is no index, build it to FSDirectory
- load the index from FSDirectory into RAMDirectory

on reindex:
- create the semaphore
- rebuild the index to FSDirectory
- delete the semaphore
- load the index from FSDirectory into RAMDirectory

to search:
- search the RAMDirectory

RAMDirectory could be replaced by a regular FSDirectory, but it seemed silly to copy the index from disk to disk when it ultimately needs to be in memory. FSDirectory could be replaced by a RAMDirectory, but that would mean the server takes 3 to 4 minutes longer to start up every time. By persisting the index, this time is only necessary if indexing was interrupted. Jonathan On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code. Yes... I performed the same benchmark, and in my situation RAMDirectory for searches was about 2% slower. I'm willing to bet that it has to do with the fact that it's a Hashtable and not a HashMap (which isn't synchronized). Also, adding a constructor for the term size could make loading a RAMDirectory faster, since you could prevent rehashing. If you're on a modern machine your filesystem cache will end up buffering your disk anyway, which I'm sure was happening in my situation. Kevin -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo!
If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: java.io.FileNotFoundException: ... (No such file or directory)
Is it possible that, while my searcher process is reading the directory, the index writer process performs a merge? If so, then I think the merge could remove segment files before they are read by the reader. When the reader tries to read one of the now-missing segment files, it throws the IOException: the file was listed in the segments file when the RAMDirectory started loading the directory, but now it is missing because of the merge. This would most likely not affect small indexes, but large indexes like mine, especially over a network filesystem, could definitely be affected. If this is what is happening, a way around it would be to open all the files listed in the segments file at the moment the segments file is read. Valid file handles would then be maintained for all the files that need to be read; if the index writer process removes a segment, the file handle should still be valid. This might only work for local filesystems though; I'm not sure whether NFS works that way or not. On Thu, 18 Nov 2004 19:16:46 -0500, Will Allen [EMAIL PROTECTED] wrote: I have gotten this a few times. I am also using an NFS mount, but have seen it in cases where a mount wasn't involved. I cannot speak to why this is happening, but I have posted to this forum before a way of repairing your index by modifying the segments file. Search for wallen. The other thing I have done is use code to copy the documents that can be read by a reader to a new index. I suppose I should submit those tools to open source! Anyway, this error will break the searcher, but the index can still be read with an IndexReader.
-Will

Here is the source of a method that should get you started (logger is a log4j object):

public void transferDocuments() throws IOException {
    IndexReader reader = IndexReader.open(brokenDir);
    logger.debug(reader.numDocs() + "");
    IndexWriter writer = new IndexWriter(newIndexDir, PopIndexer.popAnalyzer(), true);
    writer.minMergeDocs = 50;
    writer.mergeFactor = 200;
    writer.setUseCompoundFile(true);
    int docCount = reader.numDocs();
    Date start = new Date();
    //docCount = Math.min(docCount, 500);
    for (int x = 0; x < docCount; x++) {
        try {
            if (!reader.isDeleted(x)) {
                Document doc = reader.document(x);
                if (x % 1000 == 0) {
                    logger.debug(doc.get("subject"));
                }
                //remove the new fields if they exist, and add new value
                //TODO test not having this in
                /*
                for (Enumeration newFields = doc.fields(); newFields.hasMoreElements(); ) {
                    Field newField = (Field) newFields.nextElement();
                    doc.removeFields(newField.name());
                    doc.add(newField);
                }
                */
                doc.removeFields("counter");
                doc.add(Field.Keyword("counter", counter));
                // reinsert old document
                writer.addDocument(doc);
            }
        } catch (IOException ioe) {
            logger.error("doc: " + x + " failed, " + ioe.getMessage());
        } catch (IndexOutOfBoundsException ioobe) {
            logger.error("INDEX OUT OF BOUNDS! " + ioobe.getMessage());
            ioobe.printStackTrace();
        }
    }
    reader.close();
    //logger.debug("done, about to optimize");
    //writer.optimize();
    writer.close();
    long time = ((new Date()).getTime() - start.getTime()) / 1000;
    logger.info("done optimizing: " + time + " seconds or " + (docCount / time) + " rec/sec");
}

-Original Message-
From: Justin Swanhart [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 18, 2004 5:00 PM
To: Lucene Users List
Subject: java.io.FileNotFoundException: ... (No such file or directory)

I have two index processes. One is an index server, the other is a search server. The processes run on different machines. The index server is a single-threaded process that reads from the database and adds unindexed rows to the index as needed.
It sleeps for a couple of minutes between each batch to allow newly added/updated rows to accumulate. The searcher process keeps an open cache of IndexSearcher objects and is multithreaded. It accepts connections on a TCP port, runs the query, and stores the results in a database. After a set interval, the server checks to see if the index on disk is a newer version. If it is, it loads the index into a new IndexSearcher as a RAMDirectory. Every once in a while, the index reader process gets a FileNotFoundException
java.io.FileNotFoundException: ... (No such file or directory)
I have two index processes. One is an index server, the other is a search server. The processes run on different machines. The index server is a single-threaded process that reads from the database and adds unindexed rows to the index as needed. It sleeps for a couple of minutes between each batch to allow newly added/updated rows to accumulate. The searcher process keeps an open cache of IndexSearcher objects and is multithreaded. It accepts connections on a TCP port, runs the query, and stores the results in a database. After a set interval, the server checks to see if the index on disk is a newer version. If it is, it loads the index into a new IndexSearcher as a RAMDirectory. Every once in a while, the index reader process gets a FileNotFoundException:

20041118 1378 1383 (index number, old version, new version)
[newer version found] Loading index directory into RAM: 20041118
java.io.FileNotFoundException: /path/omitted/for/obvious/reasons/_4zj6.cfs (No such file or directory)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.init(RandomAccessFile.java:204)
 at org.en.lucene.store.FSInputStream$Descriptor.init(FSDirectory.java:376)
 at org.en.lucene.store.FSInputStream.init(FSDirectory.java:405)
 at org.en.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
 at org.en.lucene.store.RAMDirectory.init(RAMDirectory.java:60)
 at org.en.lucene.store.RAMDirectory.init(RAMDirectory.java:89)
 at org.en.global.searchserver.UpdateSearchers.createIndexSearchers(Search.java:89)
 at org.en.global.searchserver.UpdateSearchers.run(Search.java:54)

The code being called at that point is:

//add the directory to the HashMap of IndexSearchers (dir# = IndexSearcher)
indexSearchers.put(subDirs[i], new IndexSearcher(new RAMDirectory(indexDir + "/" + subDirs[i])));

The indexes are located on an NFS mountpoint. Could this be the problem? Or should I be looking elsewhere... Should I just check for an IOException and try reloading the index if I get an error?
Re: Index copy
You could lock your index for writes, then copy the files using operating system copy commands. Another way would be to lock your index, make a filesystem snapshot, then unlock your index. You can then safely copy the snapshot without interrupting further index operations. On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote: What's the best way to copy an index from one directory to another? I tried opening an IndexWriter at the new location and used addIndexes to read from the old index, but that was very slow. Thanks in advance, Ravi.
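Once writes are locked out (or a snapshot taken), the copy is just a file-by-file transfer of the index directory. A stdlib-only sketch of that copy step, in the Java 1.4 idiom of the era (paths are hypothetical; holding the write lock while it runs is left to the caller):

```java
import java.io.*;

public class CopyIndexDir {
    // Copy every regular file from src to dst. The index must not be
    // modified while this runs (hold the write lock, or snapshot first).
    static void copyDir(File src, File dst) throws IOException {
        if (!dst.exists()) dst.mkdirs();
        File[] files = src.listFiles();
        if (files == null) throw new IOException("not a directory: " + src);
        byte[] buf = new byte[8192];
        for (int i = 0; i < files.length; i++) {
            if (!files[i].isFile()) continue;
            InputStream in = new FileInputStream(files[i]);
            OutputStream out =
                new FileOutputStream(new File(dst, files[i].getName()));
            try {
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            } finally {
                in.close();
                out.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // build a tiny fake "index" directory and copy it
        File src = new File(System.getProperty("java.io.tmpdir"), "idx-src");
        File dst = new File(System.getProperty("java.io.tmpdir"), "idx-dst");
        src.mkdirs();
        Writer w = new FileWriter(new File(src, "segments"));
        w.write("demo");
        w.close();
        copyDir(src, dst);
        System.out.println(new File(dst, "segments").exists());
    }
}
```

Because Lucene segment files are written once and never modified in place, a copy taken under the write lock is internally consistent.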
Re: Something missing !!???
The HEAD version of Lucene in CVS supports gzip compression. You will need to check it out using cvs if you want to use it. On Wed, 17 Nov 2004 21:43:36 +0200, abdulrahman galal [EMAIL PROTECTED] wrote: I noticed recently that a lot of people discuss the bugs of Lucene with each other, but something is missing. I consider Lucene an indexing tool for text files and so on, but there are a lot of tools that do this kind of indexing. What about compression: compressing the original text files and their indexes, and performing indexing on them, like the (MG) system, which is efficient in compression and indexing? Where is all of that in Lucene? Please help me if these requirements are satisfied in Lucene; please, anyone, notify me and send a link to the new version. Thanks a lot.
Re: version documents
Split the filename into a base filename and a version, and make each a Keyword field. Sort your query by version descending, and only use the first basefile you encounter. On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon [EMAIL PROTECTED] wrote: Hey all; I have run into an interesting case. Our system has notes. These need to be indexed. They are XML files called default.xml and are easily parsed and indexed. No problem, have been doing it all week. The problem is that if someone edits a note, the system doesn't update the default.xml. It creates a new file, default_1.xml (every edit creates a new file with an incremented number; the system only displays the content from the highest number). My problem is that I index all the documents and end up with terms that were taken out of a note several versions ago still showing up in queries. From my point of view this makes sense, because the files are still in the content. But to a user it is confusing, because they have no idea that every change they make to a note spawns a new file, and now they are seeing a term they removed from their note 2 weeks ago showing up in a query. I have started modifying my incremental update to look for multiple versions of the default.xml, but it is more work than I thought and is going to make things complex. Maybe there is an easier way? If I just let it run and create the index, can somebody suggest a way I could easily scan the index folder, ensuring only the default.xml with the highest number in its filename remains (only for folders where there is more than one default.xml file)? Or is this wishful thinking? Thanks, Luke
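The "highest default_N.xml wins" rule is easy to get wrong with a plain string sort (default_10.xml sorts before default_2.xml lexically), so the version number should be parsed out and compared numerically. A stdlib-only sketch in the pre-generics style of the era (treating a bare default.xml as version 0):

```java
import java.util.*;

public class LatestVersion {
    // Parse the version out of names like default.xml / default_3.xml;
    // a plain default.xml counts as version 0.
    static int version(String name) {
        int us = name.lastIndexOf('_');
        int dot = name.lastIndexOf('.');
        if (us < 0 || dot < us) return 0;
        return Integer.parseInt(name.substring(us + 1, dot));
    }

    // Return the filename with the highest version number.
    static String latest(List names) {
        String best = null;
        for (Iterator it = names.iterator(); it.hasNext();) {
            String name = (String) it.next();
            if (best == null || version(name) > version(best)) best = name;
        }
        return best;
    }

    public static void main(String[] args) {
        List names = Arrays.asList(new String[] {
            "default.xml", "default_1.xml", "default_10.xml", "default_2.xml"
        });
        System.out.println(latest(names));
    }
}
```

The same comparison works whether you use it to prune stale files before indexing or to pick the winning document out of a result set sorted by a version field.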
Re: QueryParser: [stopword] AND something throws Exception
Try using 1.4.2. The changes file says that ArrayIndexOutOfBoundsExceptions have been fixed in the QueryParser. On Fri, 12 Nov 2004 12:04:31 -0500, Will Allen [EMAIL PROTECTED] wrote: Holy cow! This does happen! -Original Message- From: Peter Pimley [mailto:[EMAIL PROTECTED] Sent: Friday, November 12, 2004 11:52 AM To: Lucene Users List Subject: QueryParser: [stopword] AND something throws Exception [this is using lucene-1.4-final] Hello. I have just encountered a way to get the QueryParser to throw an ArrayIndexOutOfBoundsException. It can be recreated with the demo org.apache.lucene.demo.SearchFiles program. The way to trigger it is to parse a query of the form "a AND b", where 'a' is a stop word. For example, "the AND vector". It only happens when the -first- term is a stop word. You could search for "vector AND the" or "vector AND the AND class", and it works as you would expect (i.e. the stop words are ignored). Unfortunately I am up against a deadline right now, so I can't fix this myself. I'm just going to filter out stop words before feeding them to the query parser. I'll try to have a look at it in roughly 2 weeks time if nobody else has solved it. Peter Pimley, Semantico Here is the stack trace.
java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.Vector.elementAt(Vector.java:434)
 at org.apache.lucene.queryParser.QueryParser.addClause(QueryParser.java:181)
 at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:529)
 at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
 at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
 at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
 at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
 at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
 at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
Re: Searching in keyword field ?
You can add the category keyword multiple times to a document. Instead of separating your categories with a delimiter, just add the keyword multiple times:

doc.add(Field.Keyword("category", "ABC"));
doc.add(Field.Keyword("category", "DEF GHI"));

On Tue, 9 Nov 2004 17:18:19 +0100, Thierry Ferrero (Itldev.info) [EMAIL PROTECTED] wrote: Hi All, Can I search for only one word in a keyword field which contains a few words? I know a keyword field isn't tokenized. After many tests, I think it is impossible. Can someone confirm this? Why don't I use a text field? Because the users choose the category from a list (ex: category ABC, category DEF GHI, category JKL ...) and the keyword field 'category' can contain several terms (ABC, DEF GHI, OPQ RST). I use a SnowballAnalyzer for text fields in indexing. Perhaps the better way for me is to use a text field with the value ABC DEF_GHI JKL_NOPQ, where categories are concatenated with a _. Thanks for your reply! Thierry.
Re: Windows Bug?
The reason this is failing is that you are trying to create a new index in the directory. It works on *nix filesystems because you can delete an open file on those operating systems, something you can't do under Windows. If you change the create parameter to false on your second call, everything should work as you expect. On 8 Nov 2004 18:27:12 -, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, My understanding is that I can have an IndexReader open for searching (as long as it doesn't delete) while an IndexWriter is updating the index. I wrote a simple test app to prove this, and it works great on Mac OS X, Java 1.4.2 and Lucene 1.4.2. It fails on Windows XP, Java 1.4.2 and Lucene 1.4.2. I tried other versions of Lucene and it failed on those too. This is the app that fails on Windows:

public static void main(String[] args) throws Exception {
    String indexFolder = "/TestIndex";

    // add a document to the index
    IndexWriter indexWriter = new IndexWriter(indexFolder, new StandardAnalyzer(), true);
    Document document = new Document();
    Field field = new Field("foo", "bar", true, true, true);
    document.add(field);
    indexWriter.addDocument(document);
    indexWriter.close();

    // open an index reader but don't close it
    IndexReader indexReader = IndexReader.open(indexFolder);

    // open an index writer
    indexWriter = new IndexWriter(indexFolder, new StandardAnalyzer(), true);
    indexWriter.close();
}

On Windows XP this throws an exception as soon as it tries to open the IndexWriter after the IndexReader has been opened. Here's the stack trace:

Exception in thread main java.io.IOException: Cannot delete _1.cfs
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
 at scratch.TestLuceneLocks.main(TestLuceneLocks.java:17)

Is this a bug? Thanks.
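The fix described in the reply is a one-character change to the quoted test app: passing create=false on the second open appends to the existing index instead of trying to delete the segment files the reader still holds open. A sketch against the 1.4-era API:

```java
// Open an index writer against the EXISTING index. With create=false,
// Lucene does not wipe the directory, so it never attempts to delete
// the .cfs file that the still-open IndexReader is holding on Windows.
indexWriter = new IndexWriter(indexFolder, new StandardAnalyzer(), false);
indexWriter.close();
```

Passing create=true is only appropriate the very first time the index is built (as in the first block of the test app).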
Re: IndexSearch
You can write to the index and read from it at the same time. You can only have one IndexWriter open at any one time. An IndexSearcher will only see documents that were added to the index before it was instantiated, so you need to create new searchers periodically to see new documents. On Mon, 8 Nov 2004 14:26:40 -0800, Ramon Aseniero [EMAIL PROTECTED] wrote: Hi All, Can IndexSearcher be persisted? Are there any limitations on index updates while searches are in progress? Any file locking issues? Thanks, Ramon
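Periodically re-creating the searcher, as described above, could be wrapped up like this. A minimal sketch assuming Lucene 1.4's IndexSearcher(String) constructor; the class and method names are illustrative, not from the original post:

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

// Holds the current searcher and swaps in a fresh one after writes,
// so queries start seeing newly added documents.
public class SearcherHolder {
    private final String indexPath;
    private IndexSearcher searcher;

    public SearcherHolder(String indexPath) throws IOException {
        this.indexPath = indexPath;
        this.searcher = new IndexSearcher(indexPath);
    }

    // call after the IndexWriter has closed/committed a batch
    public synchronized void refresh() throws IOException {
        IndexSearcher old = searcher;
        searcher = new IndexSearcher(indexPath); // sees the new documents
        old.close();
    }

    public synchronized IndexSearcher current() {
        return searcher;
    }
}
```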
Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?
You should exclude your Lucene index from the CVS repository. This is the same thing you would do if you had a process that generated files in your source tree from other files. The generated files wouldn't have any meaning in the repository, and can be regenerated at any time, so you would want to exclude them. You should be able to do this in your CVS modules file. Check the CVS manual for details, but I think you can just add !/path/to/exclude to the list of paths in the module file. For example:

modulename -a !/exclude/this/path /include/this/path

On Fri, 5 Nov 2004 09:03:00 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

Sergiu, The Lucene index is not in CVS -- neither the directory nor the files. But it is a subdirectory of a directory that is in CVS, and it needs to be structured that way due to the directory structure constraints of Tomcat and the way Netbeans automates Tomcat app development and deployment (which uses a development directory layout that directly parallels the Tomcat runtime layout). I want to be able to update the entire repository to make sure I've got all of the latest changes, which means doing CVS Update on an ancestor directory of the Lucene index directory. Even though the index directory is not in CVS, doing the update on the ancestor directory consistently causes CVS to insert a CVS subdirectory into the index directory, causing the problem. Both WinCVS and the Netbeans CVS client have this same behavior. I have not been able to find any option to stop this -- do you know of one? Also, I can't just move the CVS directory out of the index directory, unless I'm very careful to move it back before every CVS Update. For similar reasons I can't just delete it either. CVS (and Netbeans) get very upset if there are pointers to this directory but it isn't there. The pointer exists in the CVS Entries file (and another for Netbeans in a cache file) in the CVS subdirectory of the parent directory of the index directory.
So, I have to manually eliminate those if I want to delete the index directory's CVS directory. And then they come back after the next update! All in all very frustrating. I'm going to try the code patch that Otis suggested. If anybody knows some way in CVS to avoid this problem, I'd love to hear about it. Thanks, Chuck

-Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Friday, November 05, 2004 1:43 AM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

Chuck Williams wrote: Otis, thanks for looking at this. The stack trace of the exception is below. I looked at the code. It wants to delete every file in the index directory, but fails to delete the CVS subdirectory entry (presumably because it is marked read-only; the specific exception is swallowed). Even if it could delete the CVS subdirectory, this would just cause another problem with Netbeans/CVS, since it wouldn't know how to fix up the pointers in the parent CVS subdirectory. Is there a change I could make that would cause it to safely leave this alone?

Why do you have the Lucene index in CVS? From what I know, the Lucene index folder shouldn't contain any other folder, just the Lucene files. I think it won't be any problem to delete the CVS folder from the Lucene index and to remove the index from CVS. If you are afraid to do that, you can move the CVS subfolder from the Lucene index into another folder and restore it if you have any problems. I'm sure you will have no problem, but this is just for your peace of mind. Sergiu

This problem only arises on a full index (incremental == false, i.e. create == true). Incremental indexes work fine in my app.
Chuck

java.io.IOException: Cannot delete CVS
    at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
    at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:128)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
    at [my app]...

-Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 1:54 PM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

Hm, as far as I know, a CVS sub-directory in an index directory should not bother Lucene. As a matter of fact, I tested this (I used a file, not a directory) for Lucene in Action. What error are you getting? I know there is a -I CVS option for ignoring files; perhaps it
Re: one huge index or many small ones?
First off, I think you should make a decision about what you want to store in your index and how you go about searching it. The less information you store in your index, the better, for performance reasons. If you can store the messages in an external database you probably should. I would create a table that contains a clob and an associated id that can be used to get the message at any time. Assuming mail is in SMTP RFC format, I would suggest:

Unstored: Subject
Keyword: From
Keyword: To
Stored, Unindexed: ID -- this would be the ID of the message in your database
Unstored: Body
Keyword: Month
Keyword: Day
Keyword: Year
(and any other keywords you might use)

Your Lucene query would then look something like:

+From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004

Use the stored ID field to get the message contents from your database. If you want to break your index down into multiple indexes, based on some criteria such as time frame, you could do that too. You would then use a MultiSearcher or ParallelMultiSearcher to process the multiple indexes.

On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote: Thanks Erik and Giulio for the fast reply. I am just starting to look at lucene so forgive me if I got some ideas wrong. I understand your concerns about one index per email. But having one index only is also (I guess) out of the question. I am building an email archive. Email will be kept indefinitely available for search, adding new email every day. Imagine a company with millions of emails per day (been there), keep it growing for years, adding stuff to the index while using it for searches continuously... That's why my idea is to decide on a time frame (a day, a month... an extreme would be an instant, that is a single email, my original idea) and build the index for all the email in that timeframe. After the timeframe is finished no more stuff will be ever added.
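The field layout suggested above could be built like this with Lucene 1.4's Field factory methods. A sketch only; the method and variable names are illustrative, and it assumes the Lucene 1.4 jar on the classpath:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MailDocuments {
    // Build one Lucene Document per mail message, storing only the
    // database ID and indexing everything else unstored, as the
    // field list above suggests.
    static Document makeMailDoc(String subject, String from, String to,
                                String databaseId, String body,
                                String month, String day, String year) {
        Document doc = new Document();
        doc.add(Field.UnStored("Subject", subject)); // indexed, tokenized, not stored
        doc.add(Field.Keyword("From", from));        // indexed as a single term
        doc.add(Field.Keyword("To", to));
        doc.add(Field.UnIndexed("ID", databaseId));  // stored only, for the DB lookup
        doc.add(Field.UnStored("Body", body));
        doc.add(Field.Keyword("Month", month));
        doc.add(Field.Keyword("Day", day));
        doc.add(Field.Keyword("Year", year));
        return doc;
    }
}
```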
Before the Lucene search, emails are selected based on other conditions (we store the from, to, date etc. in a database as well, and these conditions are enforced with a SQL query first, so I would not need to enforce them in the Lucene search again; also, that query can be quite sophisticated and I guess would not be easily possible in Lucene by itself). That first db step gives me a group of emails that maybe I have to further narrow down based on a Lucene search (of body and attachment contents). Having an index for more than one email means that after the search I would have to get only the overlapping emails from the two searches... Maybe this is better than keeping the same info I have in the db in Lucene fields as well. An example: I want all the email from [EMAIL PROTECTED] from Jan to Dec containing the word 'money'. I run the db query that returns a list with john's email for that period of time, then (let's assume I have one index per day) I iterate on every day, looking for emails that contain 'money', and from the results returned by Lucene I keep only those that are also in the first list. Does that sound better?

On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli [EMAIL PROTECTED] wrote: Hi Javier, I suggest you build a single index, with all the information you need to find the right mail you are looking for. You can then use Lucene alone to find your messages. Giulio Cesare

On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote: Hi, We are going to move from a just-in-time perl based search to using Lucene in our project. I have to index emails (bodies and also attachments). I keep in the filesystem all the bodies and attachments for a long period of time. I have to find emails that fulfill certain conditions; some of the conditions are taken care of at a different level, so in the end I have a SUBSET of emails I have to run through Lucene. I was assuming that the best way would be to create an index for each email.
Having a unique index for a group of emails (say a day's worth of email) seems too coarse grained; imagine a day has 1 emails, and some queries will only need to look in a handful of them. But the problem with having one index per email is the massive number of emails... imagine having 10 indexes. Anyway, any idea about that? I just wanted to check whether someone feels I am wrong. Thanks
prefix wildcard matching options (*blah)
I'm thinking about making a separate field in my index for prefix wildcard searches. I would chop off x characters from the front to create subtokens for the prefix matches. For the term republican, the terms created would be:

republican
epublican
publican
ublican
blican

My query parser would then intelligently decide if there is a term that has a wildcard as the first character of the term. Instead of searching the normal field, it would then remove the wildcard from the start of the term and search on the prefix field instead. A search for *pub* would be converted to pub* in the prefix field. A search for *blican would be converted to blican. Does this sound like an intelligent way to create fast prefix querying ability? Can I index the prefix field with a separate analyzer that makes the prefix tokens, or should I just do the index-time expansion manually? I wouldn't need to search with this analyzer, just index with it, because the searching doesn't have to expand all those terms. If using a separate analyzer for the prefix field makes more sense, how do I make a tokenizer that returns multiple tokens for one word?
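The index-time expansion described above — chopping characters off the front of each term — can be sketched in plain Java. The class and method names are illustrative; inside Lucene this logic would live in a TokenFilter that emits one token per suffix:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixTokens {
    // Generate the suffix tokens for one term, down to a minimum
    // length, so a leading-wildcard query like *blican can be
    // rewritten as a plain term query on the prefix field.
    public static List<String> suffixes(String term, int minLength) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i <= term.length() - minLength; i++) {
            tokens.add(term.substring(i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // republican, epublican, publican, ublican, blican
        System.out.println(suffixes("republican", 6));
    }
}
```

The trade-off is index size: every term in the prefix field costs several extra tokens, in exchange for leading-wildcard queries becoming ordinary (fast) prefix or term queries.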
Re: Search speed
If you know all the phrases you are going to search for, you could modify an analyzer to make those phrases into whole terms when you are analyzing. Other than that, you can test the speed of breaking the phrase query up into term queries. You would have to do an AND on all the words in the phrase, get the documents that match all the terms, then do a substring search for your exact phrase. Any documents that match would then be returned:

search: "death notice"
for each hit:
    if contents contains "death notice":
        add hit to final result list

On Tue, 2 Nov 2004 18:07:26 +0100, Paul Elschot [EMAIL PROTECTED] wrote: On Tuesday 02 November 2004 17:50, Jeff Munson wrote: Thanks for the info Paul. The requirements of my search engine are that I need to search for phrases like death notice or world war ii. You suggested that I break the phrases into words. Is there a way to break the phrases into words, do the search, and just return the documents with the phrase? I'm just looking for a way to speed up the phrase searches. If you know the phrases in advance, i.e. before indexing, you can index and search them as terms with a special purpose analyzer. It's an unusual solution, though. Regards, Paul Elschot
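The post-filtering step in the pseudocode above is just a substring check over the stored contents of each candidate hit. In plain Java (names illustrative) it could look like:

```java
import java.util.ArrayList;
import java.util.List;

public class PhrasePostFilter {
    // Keep only the candidate documents whose contents actually
    // contain the exact phrase. The AND of the individual term
    // queries has already narrowed the candidates, so this loop
    // only touches documents that contain every word.
    public static List<String> filterByPhrase(List<String> contents, String phrase) {
        List<String> matches = new ArrayList<>();
        for (String doc : contents) {
            if (doc.indexOf(phrase) >= 0) {
                matches.add(doc);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("a death notice was published");
        docs.add("notice of death"); // both terms, but not the phrase
        System.out.println(filterByPhrase(docs, "death notice"));
        // prints [a death notice was published]
    }
}
```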
When do document ids change
Given an FSDirectory based index A. Documents are added to A with an IndexWriter with minMergeDocs = 2 and mergeFactor = 3. Documents are never deleted. Once the RAMDirectory merges documents to the index: a) will the documentID values for index A ever change? b) can a mapping between a term in the document and the newly created documentID be made?

Why I am asking this question: I have a database with about 10M rows in it. My search engine needs to be able to quickly get all the rows back from the database that match a query. All the rows need to be returned at once, because the entire result set is sorted based on user input. What I want to do: when a documentID gets assigned to a document, I want to update the database row that matches the document's id field with the Lucene documentID. That way, I can use a HitCollector to gather just the documentID values from the search and insert them into a temporary cache table, then grab the matching rows from the database. This will work assuming the documentID values for a given document never change. Currently, running an IndexSearcher.search() and getting all the rows back takes between 5 and 30 seconds for most queries, which is certainly not fast enough. The time it takes to collect the documentIDs however is less than 1 second. All the time is taken by calling hits.doc() for each document to get the id field to insert into the database. So finally, will what I want to do work, and if so, how can I go about updating the database when the documentID is created?
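The fast ID-collection step described above — gathering documentIDs without the per-document hits.doc() calls — would look roughly like this with Lucene 1.4's HitCollector. A sketch under the assumption of the 1.4 API, not a tested program:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class IdCollector {
    // Collect only the Lucene document IDs for a query. No stored
    // fields are loaded, which is why this path runs in under a
    // second while hits.doc() on every hit takes 5-30 seconds.
    public static List collectIds(IndexSearcher searcher, Query query)
            throws IOException {
        final List ids = new ArrayList();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                ids.add(new Integer(doc)); // document ID only
            }
        });
        return ids;
    }
}
```

Note the caveat implicit in the question itself: this scheme only holds if the documentIDs stay stable, and segment merges can renumber documents, so the mapping in the database would need to be rebuilt after merges.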
Re: Searching for a phrase that contains quote character
Have you tried making a term query by hand and testing to see if it works?

Term t = new Term("field", "this is a \"test\"");
PhraseQuery pq = new PhraseQuery(t);
...

On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen [EMAIL PROTECTED] wrote: I am having this same problem, but cannot find any help! I have a keyword field that sometimes includes double quotes, but I am unable to search for that field because the escape for a quote doesn't work! I have tried a number of things:

myfield:"lucene is \"cool\"" AND myfield:"lucene is \\cool\\"

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=7351

From: [EMAIL PROTECTED] [EMAIL PROTECTED] Subject: Searching for a phrase that contains quote character Date: Wed, 24 Mar 2004 21:25:16 + I'd like to search for a phrase that contains the quote character. I've tried escaping the quote character, but am receiving a ParseException from the QueryParser. For example, to search for the phrase this is a "test" I'm trying the following:

QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field", new StandardAnalyzer());

This results in:

org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 31. Encountered: EOF after :
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
    ...

What is the proper way to accomplish this? --Dan
Re: Searching for a phrase that contains quote character
Absolutely correct. Sorry about that. Shouldn't code before coffee :)

On Thu, 28 Oct 2004 20:16:16 +0200, Daniel Naber [EMAIL PROTECTED] wrote: On Thursday 28 October 2004 19:03, Justin Swanhart wrote: Have you tried making a term query by hand and testing to see if it works? Term t = new Term("field", "this is a \"test\""); PhraseQuery pq = new PhraseQuery(t); That's not a proper PhraseQuery: it searches for *one* term "this is a test", which is probably not what one wants. You have to add the terms one by one to a PhraseQuery. Regards Daniel -- http://www.danielnaber.de
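Building the PhraseQuery term by term, as Daniel describes, would look roughly like this under Lucene 1.4 (a sketch; the helper name is illustrative, and each Term must match a token the analyzer produced at index time, hence the lowercased words):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class BuildPhrase {
    // One Term per analyzed word, all on the same field, added in
    // order: this is what makes it a real phrase query rather than
    // a single term containing the whole phrase.
    public static PhraseQuery phrase(String field, String[] words) {
        PhraseQuery pq = new PhraseQuery();
        for (int i = 0; i < words.length; i++) {
            pq.add(new Term(field, words[i]));
        }
        return pq;
    }

    // e.g. phrase("field", new String[] {"this", "is", "a", "test"})
}
```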
Re: IndexWriter Constructor question
You could always modify your own local copy if you want to change the behavior of the parameter, or just do:

IndexWriter w = new IndexWriter(indexDirectory, new StandardAnalyzer(),
    !IndexReader.indexExists(indexDirectory));

If you do that, then if an index exists it will not be recreated, otherwise it will be created.

On Wed, 27 Oct 2004 12:26:29 -0500, Armbrust, Daniel C. [EMAIL PROTECTED] wrote: Wouldn't it make more sense if the constructor for the IndexWriter always created an index if it doesn't exist -- and the boolean parameter should be clear (instead of create)? So instead of this (from the javadoc):

public IndexWriter(Directory d, Analyzer a, boolean create) throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any. Parameters: d - the index directory; a - the analyzer to use; create - true to create the index or overwrite the existing one; false to append to the existing index. Throws: IOException - if the directory cannot be read/written to, or if it does not exist and create is false.

We would have this:

public IndexWriter(Directory d, Analyzer a, boolean clear) throws IOException

Constructs an IndexWriter for the index in d. Text will be analyzed with a. If clear is true, and an index exists at location d, then it will be erased, and a new, empty index will be created in d. Parameters: d - the index directory; a - the analyzer to use; clear - true to overwrite the existing one; false to append to the existing index. Throws: IOException - if the directory cannot be read/written to, or if it does not exist.

Its current behavior is kind of annoying, because I have an app that should never clear an existing index; it should always append. So I want create set to false. But when I am starting a brand new index, I have to change the create flag to keep it from throwing an exception...
I guess for now I will have to write code to check whether an index actually has content yet, and if it doesn't, change the flag on the fly.
Re: Stopwords in Exact phrase
Your analyzer will have removed the stopwords when you indexed your documents, so Lucene won't be able to do this for you. You will need to implement a second pass over the results returned by Lucene and check whether the stopword is included, perhaps with String.indexOf().

On Wed, 27 Oct 2004 14:36:14 -0500, Ravi [EMAIL PROTECTED] wrote: Is there a way to include stopwords in an exact phrase search? For example, when I search on Melbourne IT, Lucene only searches for Melbourne, ignoring IT. Thanks, Ravi.
Re: Multi + Parallel
The overhead of creating that many searcher objects is going to far outweigh any performance benefit you could possibly hope to gain by splitting your index up.

On Thu, 14 Oct 2004 04:42:27 -0700 (PDT), Otis Gospodnetic [EMAIL PROTECTED] wrote: Search a single merged index. Otis

--- Karthik N S [EMAIL PROTECTED] wrote: Hi, Apologies.. Can somebody provide me approximate answers [which is the better choice]: a search of 10,000 subindexes using MultiSearcher, or a search on one single merged index [merged from 10,000 subindexes]?

a) Subindexes: 10,000 (future)
b) Fields to be searched upon: 4
c) Field types present in indexed format: 15
d) RAM: 1GB
e) OS: Linux [clustered environment]
f) Processor make: AMD [probably high end]
g) Web server: Tomcat 5.0.x

1) Which would be faster? 2) If not, what may be the probable solution? Karthik

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 13, 2004 3:53 PM To: Lucene Users List Subject: Re: Multi + Parallel

On Oct 13, 2004, at 3:14 AM, Karthik N S wrote: I was curious to know the difference between ParallelMultiSearcher and MultiSearcher. 1) Is the internal functionality of these the same or different?

They are different internally. Externally they should return identical results and not appear different at all. Internally, ParallelMultiSearcher searches each index in a separate thread (searches wait until all threads finish before returning). In MultiSearcher, each index is searched serially. You will not likely see a benefit to using ParallelMultiSearcher unless your environment is specialized to accommodate multi-threading (multiple CPUs, indexes on separate drives that can operate independently, etc).

2) In terms of the time domain, do these differ when searching the same number of fields/words? 3) What are the features of each API? There is no external difference to using either implementation.
Benchmark searches using both and see what is best, but generally MultiSearcher will be better in most environments as it avoids the overhead of starting up and managing multiple threads. Erik
Re: Indexing Strategy for 20 million documents
It depends on a lot of factors. I myself use multiple indexes for about 10M documents. My documents are transient: each day I get about 400K and I remove about 400K. I always remove an entire day's documents at one time. It is much faster/easier to delete the Lucene index for the day that I am removing than to loop through one big index and remove the entries with the IndexReader. Since my data is also partitioned by day in my database, I essentially do the same thing there with truncate table. I use a ParallelMultiSearcher object to search the indexes. I store my indexes on a 14 disk 15k rpm fibre channel RAID 1+0 array (striped mirrors). I get very good performance in both updating and searching indexes.

On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic [EMAIL PROTECTED] wrote: Jeff, These questions are difficult to answer, because the answer depends on a number of factors, such as: - hardware (memory, disk speed, number of disks...) - index complexity and size (number of fields and their size) - number of queries/second - complexity of queries etc. I would try putting everything in a single index first, and split it up only if I see performance issues. Going from 1 index to N indices is not a lot of work (not a lot of Lucene-related code). If searching 1 big index is too slow, split your index, put each index on a separate disk, and use ParallelMultiSearcher (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html) to search your indices. Otis

--- Jeff Munson [EMAIL PROTECTED] wrote: I am a new user of Lucene. I am looking to index over 20 million documents (and a lot more someday) and am looking for ideas on the best indexing/search strategy. Which will optimize the Lucene search: one index or multiple indexes? Do I create multiple indexes and merge them all together? Or do I create multiple indexes and search on the multiple indexes? Any helpful ideas would be appreciated!
Re: Analyzer reuse
Yes, you can reuse analyzers. The only performance gain will come from not having to create the objects and not having garbage collection overhead. I create one for each of my index reading threads.

On Thu, 07 Oct 2004 16:59:38 +, sam s [EMAIL PROTECTED] wrote: Hi, Can an instance of an analyzer be reused? If yes, will it give any performance gain? sam
multiple threads
As I understand it, if two writers try to access the same index for writing, one of the writers should block waiting for a lock until the lock timeout period expires, and then it will get a lock wait timeout exception. I have a multithreaded indexing application that writes into one of multiple indexes depending on a hash value, and I intend to merge all the hashes when the indexing finishes. Locking usually works, but sometimes it doesn't and I get IO exceptions such as the following:

java.io.IOException: Cannot delete _19.fnm
    at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
    at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)
    at org.en.global.indexer.IndexGroup.run(IndexGroup.java:387)

Any idea why this could be happening? I am using NFS currently, but the problem appears on the local filesystem as well.