Re: Compound file format file size question
Otis, Thanks for the response. Yeah, I was copying the file to a brand-new hard drive that was formatted FAT32 by default, which is probably why it couldn't handle the 13GB file. I'm converting the drive to NTFS now, which should get me through temporarily. In the future, though, I may break the index up into smaller sub-indexes so that I can distribute them across separate physical disks for better disk IO. Thanks for your help! Jim

--- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello,

--- James Dunn [EMAIL PROTECTED] wrote: Hello all, I have an index that's about 13GB on disk. I'm using 1.4 rc3, which uses the compound file format by default. Once I run optimize on my index, it creates one 13GB .cfs file. This isn't a problem on Linux (yet), but I'm having some trouble copying the file over to my Windows XP box.

What is the exact problem? The sheer size of it or something else? Just curious...

Is there some way using the compound file format to set the maximum file size and have Lucene break the index into multiple files once it hits that limit?

Can't be done with Lucene, but I seem to recall some discussion about it. Nothing concrete, though.

Or do I need to go back to using the non-compound file format?

The total size should be (about) the same, but you could certainly do that, if having more smaller files is better for you. Otis

Another solution, I suppose, would be to break up my index into separate smaller indexes. This would be my second choice, however. Thanks a lot, Jim

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
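As an aside for anyone hitting the same copy failure: the FAT32 ceiling Jim ran into can be sketched numerically. The class and method names below are illustrative, not part of Lucene; the 4GiB limit is a property of the FAT32 filesystem itself.

```java
public class Fat32LimitCheck {
    // FAT32 caps a single file at 2^32 - 1 bytes (just under 4GiB),
    // regardless of how much free space the volume has.
    static final long FAT32_MAX_FILE_BYTES = (1L << 32) - 1;

    static boolean fitsOnFat32(long fileSizeBytes) {
        return fileSizeBytes <= FAT32_MAX_FILE_BYTES;
    }

    public static void main(String[] args) {
        long cfsSize = 13L * 1024 * 1024 * 1024; // the ~13GB compound .cfs file
        // prints "false": NTFS (or splitting the index) is required
        System.out.println(fitsOnFat32(cfsSize));
    }
}
```

Converting the volume to NTFS, as Jim does, removes the per-file limit for all practical purposes.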
Re: Memory usage
Otis, My app does run within Tomcat. But when I started getting these OutOfMemoryErrors I wrote a little unit test to watch the memory usage without Tomcat in the middle, and I still see the same memory growth. Thanks, Jim

--- Otis Gospodnetic [EMAIL PROTECTED] wrote: Sorry if I'm stating the obvious. Is this happening in some stand-alone unit tests, or are you running things from some application and in some environment, like Tomcat, Jetty or in some non-web app? Your queries are pretty big (although I recall some people using even bigger ones... but it all depends on the hardware they had), but are you sure running out of memory is due to Lucene, or could it be a leak in the app from which you are running queries? Otis

--- James Dunn [EMAIL PROTECTED] wrote: Doug, We only search on analyzed text fields. There are a couple of additional fields in the index like OBJECT_ID that are keywords, but we don't search against those; we only use them once we get a result back to find the thing that document represents. Thanks, Jim

--- Doug Cutting [EMAIL PROTECTED] wrote: It is cached by the IndexReader and lives until the index reader is garbage collected. 50-70 searchable fields is a *lot*. How many are analyzed text, and how many are simply keywords? Doug

James Dunn wrote: Doug, Thanks! I just asked a question regarding how to calculate the memory requirements for a search. Does this memory only get used during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim

--- Doug Cutting [EMAIL PROTECTED] wrote: James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries.

Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM.
If you're using unanalyzed fields, then an easy workaround to reduce the number of fields is to combine many in a single field. So, instead of, e.g., using an f1 field with value abc, and an f2 field with value efg, use a single field named f with values 1_abc and 2_efg.

We could optimize this in Lucene. If no values of an indexed field are analyzed, then we could store no norms for the field and hence read none into memory. This wouldn't be too hard to implement... Doug
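Doug's workaround can be sketched in plain Java. The helper below is illustrative (not a Lucene API); the Lucene 1.x indexing calls it would feed appear only as comments.

```java
public class CombinedKeywordField {
    // Prefix each value with its field's ordinal so many keyword fields
    // can share one Lucene field (and therefore one norms array).
    static String combine(int fieldOrdinal, String value) {
        return fieldOrdinal + "_" + value;
    }

    public static void main(String[] args) {
        // Instead of indexing f1:abc and f2:efg as separate keyword fields,
        // index both under a single field f:
        System.out.println(combine(1, "abc")); // 1_abc
        System.out.println(combine(2, "efg")); // 2_efg
        // With Lucene 1.x this would be added along the lines of:
        //   doc.add(Field.Keyword("f", combine(1, "abc")));
        // and queried as f:1_abc instead of f1:abc.
    }
}
```

The trade-off: queries must use the prefixed form, but norms memory drops from one byte per document per field to one byte per document.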
Memory usage
Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. Also, I search across about 50 fields, but I don't use wildcard or range queries. Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume? Thanks, Jim
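Doug's answer later in this thread (one byte of RAM per document per searched field, for the norms) gives a back-of-envelope way to check whether the 500MB Jim observes is expected. This is an illustrative sketch of the arithmetic, not Lucene code:

```java
public class NormsMemoryEstimate {
    // One byte per document per searched field, held by the IndexReader
    // for as long as it stays open.
    static long normsBytes(long numDocs, int searchedFields) {
        return numDocs * searchedFields;
    }

    public static void main(String[] args) {
        // 10 million docs x 50 searched fields -> 500,000,000 bytes (~500MB).
        System.out.println(normsBytes(10_000_000L, 50));
    }
}
```

So for an index of Jim's scale, ~500MB of steady-state usage is consistent with norms caching rather than a leak.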
RE: Memory usage
Will, Thanks for your response. It may be an object leak. I will look into that. I just ran some more tests, and this time I created a 20GB index by repeatedly merging my large index into itself. When I ran my test query against that index I got an OutOfMemoryError on the very first query. I have my heap set to 512MB. Should a query against a 20GB index require that much memory? I page through the results 100 at a time, so I should never have more than 100 Document objects in memory. Any help would be appreciated, thanks! Jim

--- [EMAIL PROTECTED] wrote: This sounds like a memory leakage situation. If you are using Tomcat, I would suggest you make sure you are on a recent version, as it is known to have some memory leaks in version 4. It doesn't make sense that repeated queries would use more memory than the most demanding query unless objects are not getting freed from memory. -Will

-----Original Message----- From: James Dunn [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 3:02 PM To: [EMAIL PROTECTED] Subject: Memory usage

Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. Also, I search across about 50 fields, but I don't use wildcard or range queries. Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume? Thanks, Jim
Re: Problem Indexing Large Document Field
Gilberto, Look at the IndexWriter class. It has a property, maxFieldLength, which you can set to determine the maximum number of terms (not characters) that will be indexed for a single field. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html Jim

--- Gilberto Rodriguez [EMAIL PROTECTED] wrote: I am trying to index a field in a Lucene document with about 90,000 characters. The problem is that it only indexes part of the document. It seems to only index about 65,000 characters. So, if I search on terms that are at the beginning of the text, the search works, but it fails for terms that are at the end of the document. Is there a limitation on how many characters can be stored in a document field? Any help would be appreciated, thanks. Gilberto Rodriguez, Software Engineer, Conviveon (www.conviveon.com)
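For context, the default cap in Lucene 1.3/1.4 is 10,000 terms per field, and at typical English term lengths that works out to roughly the 65,000 characters Gilberto observes. A sketch of the arithmetic, where the average term-plus-separator length is an assumption of this example, not a Lucene constant:

```java
public class FieldTruncationEstimate {
    // Default number of terms indexed per field in Lucene 1.3/1.4.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10_000;

    // Rough character count indexed before truncation, given an assumed
    // average length of a term plus its trailing separator.
    static long charsIndexed(int maxTerms, double avgTermLenWithSeparator) {
        return (long) (maxTerms * avgTermLenWithSeparator);
    }

    public static void main(String[] args) {
        // ~6.5 chars per term+space -> ~65,000 chars, matching the report.
        System.out.println(charsIndexed(DEFAULT_MAX_FIELD_LENGTH, 6.5));
        // Raising the cap in Lucene 1.3/1.4 (maxFieldLength is a public field):
        //   IndexWriter writer = new IndexWriter(dir, analyzer, true);
        //   writer.maxFieldLength = 100000;
    }
}
```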
Re: Memory usage
Erik, Thanks for the response. My actual documents are fairly small. Most docs only have about 10 fields. Some of those fields are stored, however, like the OBJECT_ID, NAME and DESC fields. The stored fields are pretty small as well. None should be more than 4KB, and very few will approach that limit. I'm also using the default maxFieldLength. I'm not caching Hits, either. Could it be my query? I have about 80 total unique fields in the index, although no document has all 80. My query ends up looking like this: +(F1:test F2:test ... F80:test) From previous mails that doesn't look like an enormous number of fields to be searching against. Is there some formula for the amount of memory required for a query based on the number of clauses and terms? Jim

--- Erik Hatcher [EMAIL PROTECTED] wrote: How big are your actual Documents? Are you caching Hits? It stores, internally, up to 200 documents. Erik

On May 26, 2004, at 4:08 PM, James Dunn wrote: Will, Thanks for your response. It may be an object leak. I will look into that. I just ran some more tests, and this time I created a 20GB index by repeatedly merging my large index into itself. When I ran my test query against that index I got an OutOfMemoryError on the very first query. I have my heap set to 512MB. Should a query against a 20GB index require that much memory? I page through the results 100 at a time, so I should never have more than 100 Document objects in memory. Any help would be appreciated, thanks! Jim

--- [EMAIL PROTECTED] wrote: This sounds like a memory leakage situation. If you are using Tomcat, I would suggest you make sure you are on a recent version, as it is known to have some memory leaks in version 4. It doesn't make sense that repeated queries would use more memory than the most demanding query unless objects are not getting freed from memory.
-Will

-----Original Message----- From: James Dunn [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 3:02 PM To: [EMAIL PROTECTED] Subject: Memory usage

Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. Also, I search across about 50 fields, but I don't use wildcard or range queries. Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume? Thanks, Jim
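Jim's query shape can be reproduced with a small illustrative helper (field names F1..Fn mirror his example; this is not a Lucene API):

```java
public class MultiFieldQueryString {
    // Expand a single term into one optional clause per field, F1..Fn,
    // producing the +(F1:test F2:test ... Fn:test) shape from the thread.
    static String buildQuery(String term, int numFields) {
        StringBuilder sb = new StringBuilder("+(");
        for (int i = 1; i <= numFields; i++) {
            if (i > 1) sb.append(' ');
            sb.append('F').append(i).append(':').append(term);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("test", 3)); // +(F1:test F2:test F3:test)
        // Each clause touches one field's norms array (one byte per document),
        // so memory grows with the number of fields searched, not with the
        // number of hits returned.
    }
}
```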
Re: Memory usage
Doug, Thanks! I just asked a question regarding how to calculate the memory requirements for a search. Does this memory only get used during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim

--- Doug Cutting [EMAIL PROTECTED] wrote: James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries.

Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If you're using unanalyzed fields, then an easy workaround to reduce the number of fields is to combine many in a single field. So, instead of, e.g., using an f1 field with value abc, and an f2 field with value efg, use a single field named f with values 1_abc and 2_efg.

We could optimize this in Lucene. If no values of an indexed field are analyzed, then we could store no norms for the field and hence read none into memory. This wouldn't be too hard to implement... Doug
Re: Memory usage
Doug, We only search on analyzed text fields. There are a couple of additional fields in the index like OBJECT_ID that are keywords, but we don't search against those; we only use them once we get a result back to find the thing that document represents. Thanks, Jim

--- Doug Cutting [EMAIL PROTECTED] wrote: It is cached by the IndexReader and lives until the index reader is garbage collected. 50-70 searchable fields is a *lot*. How many are analyzed text, and how many are simply keywords? Doug

James Dunn wrote: Doug, Thanks! I just asked a question regarding how to calculate the memory requirements for a search. Does this memory only get used during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim

--- Doug Cutting [EMAIL PROTECTED] wrote: James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries.

Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If you're using unanalyzed fields, then an easy workaround to reduce the number of fields is to combine many in a single field. So, instead of, e.g., using an f1 field with value abc, and an f2 field with value efg, use a single field named f with values 1_abc and 2_efg.

We could optimize this in Lucene. If no values of an indexed field are analyzed, then we could store no norms for the field and hence read none into memory. This wouldn't be too hard to implement... Doug
Re: Preventing duplicate document insertion during optimize
Kevin, I have a similar issue. The only solution I have been able to come up with is, after the merge, to open an IndexReader against the merged index, iterate over all the docs, and delete duplicate docs based on my primary key field. Jim

--- Kevin A. Burton [EMAIL PROTECTED] wrote: Let's say you have two indexes, each with the same document literal. All the fields hash the same and the document is a binary duplicate of a different document in the second index. What happens when you do a merge to create a 3rd index from the first two? I assume you now have two documents that are identical in one index. Is there any way to prevent this? It would be nice to figure out if there's a way to flag a field as a primary key so that if it has already been added it is just skipped. Kevin
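The dedup pass Jim describes can be sketched in plain Java. The key-tracking logic below is runnable as-is; the Lucene 1.x reader calls that would supply the keys and perform the deletes appear only as comments, and the field name PRIMARY_KEY is illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateDocFinder {
    // Walk documents in index order; any doc whose primary key was already
    // seen is a duplicate, and its doc id is collected for deletion.
    static List<Integer> duplicateIds(List<String> primaryKeys) {
        Set<String> seen = new HashSet<>();
        List<Integer> dups = new ArrayList<>();
        for (int i = 0; i < primaryKeys.size(); i++) {
            // In real code the key would come from the index, e.g.:
            //   String key = reader.document(i).get("PRIMARY_KEY");
            if (!seen.add(primaryKeys.get(i))) {
                dups.add(i); // followed by reader.delete(i)
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Docs 2 and 4 repeat keys "a" and "b" seen earlier.
        System.out.println(duplicateIds(Arrays.asList("a", "b", "a", "c", "b")));
    }
}
```

Note this keeps the first copy of each document and removes later ones; deletions in Lucene only take effect once the reader is closed.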
Re: Problems From the Word Go
Alex, Could you send along whatever error messages you are receiving? Thanks, Jim

--- Alex Wybraniec [EMAIL PROTECTED] wrote: I'm sorry if this is not the correct place to post this, but I'm very confused, and getting towards the end of my tether. I need to install/compile and run Lucene on a Windows XP Pro based machine, running J2SE 1.4.2, with ANT. I downloaded both the source code and the pre-compiled versions, and as yet have not been able to get either running. I've been through the documentation, and still I can find little to help me set it up properly. All I want to do (to start with) is compile and run the demo version. I'm sorry to ask such a newbie question, but I'm really stuck. So if anyone can point me to an idiot's guide, or offer me some help, I would be most grateful. Once I get past this stage, I'll have all sorts of juicier questions for you, but at the minute I can't even get past stage 1. Thank you in advance. Alex
RE: ArrayIndexOutOfBoundsException
Philippe, thanks for the reply. I didn't FTP my index anywhere, but your response does make it seem that my index is in fact corrupted somehow. Does anyone know of a tool that can verify the validity of a Lucene index, and/or possibly repair it? If not, does anyone have any idea how difficult it would be to write one? Thanks, Jim

--- Phil brunet [EMAIL PROTECTED] wrote: Hi. I had this problem when I transferred a Lucene index by FTP in ASCII mode. Using binary mode, I never had such a problem. Philippe

From: James Dunn [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: ArrayIndexOutOfBoundsException Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)

Hello all, I have a web site whose search is driven by Lucene 1.3. I've been doing some load testing using JMeter, and occasionally I will see the exception below when the search page is under heavy load. Has anyone seen similar errors during load testing? I've seen some posts with similar exceptions, and the general consensus is that this error means that the index is corrupt. I'm not sure my index is corrupt, however. I can run all the queries I use for load testing under normal load and I don't appear to get this error. Is there any way to verify that a Lucene index is corrupt or not? Thanks, Jim

java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
 at java.util.Vector.elementAt(Vector.java:431)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
 at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
 at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
 at org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
 at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
 at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
 at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
 at org.apache.lucene.search.Hits.doc(Hits.java:130)
Re: 'Lock obtain timed out' even though NO locks exist...
Which version of Lucene are you using? In 1.2, I believe the lock file was located in the index directory itself. In 1.3, it's in your system's tmp folder. Perhaps it's a permission problem on either one of those folders. Maybe your process doesn't have write access to the correct folder and is thus unable to create the lock file? You can also pass Lucene a system property to increase the lock timeout interval, like so: -Dorg.apache.lucene.commitLockTimeout=60000 or -Dorg.apache.lucene.writeLockTimeout=60000 The above sets the timeout to one minute (the values are in milliseconds). Hope this helps, Jim

--- Kevin A. Burton [EMAIL PROTECTED] wrote: I've noticed this really strange problem on one of our boxes. It's happened twice already. We have indexes where, when Lucene starts, it says 'Lock obtain timed out'... however NO locks exist for the directory. There are no other processes present and no locks in the index dir or /tmp. Is there any way to figure out what's going on here? Looking at the index it seems just fine... but this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) Lucene would give me a better error than 'Lock obtain timed out'. Kevin
ArrayIndexOutOfBoundsException
Hello all, I have a web site whose search is driven by Lucene 1.3. I've been doing some load testing using JMeter, and occasionally I will see the exception below when the search page is under heavy load. Has anyone seen similar errors during load testing? I've seen some posts with similar exceptions, and the general consensus is that this error means that the index is corrupt. I'm not sure my index is corrupt, however. I can run all the queries I use for load testing under normal load and I don't appear to get this error. Is there any way to verify that a Lucene index is corrupt or not? Thanks, Jim

java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
 at java.util.Vector.elementAt(Vector.java:431)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
 at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
 at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
 at org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
 at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
 at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
 at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
 at org.apache.lucene.search.Hits.doc(Hits.java:130)