Re: Opening up one large index takes 940M of memory?
Doug Cutting wrote:
> Kevin A. Burton wrote:
> > Is there any way to reduce this footprint? The index is fully
> > optimized... I'm willing to take a performance hit if necessary. Is
> > this documented anywhere?
>
> You can increase TermInfosWriter.indexInterval. You'll need to re-write
> the .tii file for this to take effect. The simplest way to do this is to
> use IndexWriter.addIndexes(), adding your index to a new, empty
> directory. This will of course take a while for a 60GB index...

(Note: when this works I'll record my findings on a wiki page for future developers.)

Two more questions:

1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing target index, which I assume will re-use the same settings as before. I did this last night and it still seems to use the same amount of memory. Above you assert that I should use a new, empty directory, and I'll try that tonight.

2. This isn't destructive, is it? I mean, I'll be able to move BACK to a TermInfosWriter.indexInterval of 128, right?

Thanks!

Kevin

-- 
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod!

Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote:
> 1. Do I have to do this with a NEW directory? Our nightly index merger
> uses an existing target index, which I assume will re-use the same
> settings as before. I did this last night and it still seems to use the
> same amount of memory. Above you assert that I should use a new, empty
> directory, and I'll try that tonight.

You need to re-write the entire index using a modified TermInfosWriter.java. Optimize rewrites the entire index but is destructive. Merging into a new, empty directory is a non-destructive way to do this.

> 2. This isn't destructive, is it? I mean, I'll be able to move BACK to a
> TermInfosWriter.indexInterval of 128, right?

Yes, you can go back if you re-optimize or re-merge again.

Also, there's no need to CC my personal email address.

Doug
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote:
> Is there any way to reduce this footprint? The index is fully
> optimized... I'm willing to take a performance hit if necessary. Is this
> documented anywhere?

You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index...

Doubling TermInfosWriter.indexInterval should halve the Term memory usage and double the time required to look up terms in the dictionary. With an index this large the latter is probably not an issue, since processing term frequency and proximity data probably overwhelmingly dominates search performance.

Perhaps we should make this public by adding an IndexWriter method?

Also, you can list the size of your .tii file by using the main() from CompoundFileReader.

Doug
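Doug's merge-into-an-empty-directory suggestion can be sketched roughly as follows. This is a hedged sketch against the Lucene 1.4-era API, not a tested recipe: the paths are placeholders, and it assumes the index interval has already been raised in a locally modified TermInfosWriter build (the thread notes the value cannot be changed through a public API).

```java
// Sketch only: assumes a locally modified TermInfosWriter with a larger
// index interval (e.g. 256), and Lucene 1.4-era classes. Paths are placeholders.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteTii {
    public static void main(String[] args) throws Exception {
        // Existing 60GB index (opened read-only here; it is never modified).
        Directory source = FSDirectory.getDirectory("/var/ksa/index-old", false);
        // New, empty target directory ('true' creates it fresh).
        Directory target = FSDirectory.getDirectory("/var/ksa/index-new", true);

        IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
        // Re-merges every segment into the target, rewriting all files,
        // including the .tii term index, with the new interval.
        writer.addIndexes(new Directory[] { source });
        writer.close();
    }
}
```

Because addIndexes() copies and re-merges every segment, the new .tii is written with whatever interval the (modified) TermInfosWriter uses, while the source index is left untouched, which is what makes this approach non-destructive.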
Re: Opening up one large index takes 940M of memory?
Sounds interesting. (Is there a B-tree serialization impl in Java?)

.V

jian chen wrote:
> Hi,
>
> If it is really the case that every 128th term is loaded into memory,
> could you use a relational database or B-tree to do the work of indexing
> the terms instead? Even if you create another level of indexing on top
> of the .tii file, it is just a hack and would not scale well. I would
> think a B/B+ tree based approach is the way to go for better memory
> utilization.
>
> Cheers,
> Jian
>
> On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:
> > There Kevin, that's what I was referring to, the .tii file.
> >
> > Otis
> >
> > --- Paul Elschot [EMAIL PROTECTED] wrote:
> > > [...]
> > > It would be possible to add another level of indexing to the terms.
> > > No one has done this yet, so I guess it's preferred to buy RAM
> > > instead...
> > >
> > > Regards,
> > > Paul Elschot

-- 
RiA-SoA w/JDNC http://www.SandraSF.com
forums - help develop a community
My blog http://www.sandrasf.com/adminBlog
Re: Opening up one large index takes 940M of memory?
On Jan 24, 2005, at 00:10, Vic wrote:
> (Is there a B-tree serialization impl in Java?)

http://jdbm.sourceforge.net/

Cheers

-- 
PA
http://alt.textdrive.com/
Re: Opening up one large index takes 940M of memory?
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
> Kevin A. Burton wrote:
> > We have one large index right now... it's about 60GB... When I open it
> > the Java VM used 940M of memory. The VM does nothing else besides open
> > this index.
>
> After thinking about it I guess 1.5% of memory per index really isn't
> THAT bad. What would be nice is if there was a way to do this from disk
> and then use a buffer (either via the filesystem or in-VM memory) to
> access these variables.

It's even documented. From http://jakarta.apache.org/lucene/docs/fileformats.html :

"The term info index, or .tii file. This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file."

My guess is that this is what you see happening. To see the actual .tii file, you need the non-default file format. Once searching starts you'll also see that the field norms are loaded; these take one byte per searched field per document.

> This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
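Paul's description makes a back-of-the-envelope estimate possible: the number of Term entries held resident scales as numTerms / indexInterval. The figures below are illustrative assumptions (the term count and the per-Term JVM overhead are guesses, not measurements from Kevin's index):

```java
public class TiiFootprintEstimate {
    public static void main(String[] args) {
        // Assumed figures for illustration only -- not measured values.
        long numTerms = 100_000_000L;  // terms in the .tis file (assumption)
        int indexInterval = 128;       // default TermInfosWriter interval
        long bytesPerEntry = 1000L;    // rough per-Term object overhead (assumption)

        // Every indexInterval-th term from .tis is held in memory via .tii.
        long entriesInMemory = numTerms / indexInterval;
        long estimatedBytes = entriesInMemory * bytesPerEntry;

        System.out.println("entries held in memory: " + entriesInMemory);
        System.out.println("estimated footprint (MB): " + estimatedBytes / (1024 * 1024));

        // Doubling the interval halves the number of resident entries.
        long entriesAt256 = numTerms / (indexInterval * 2);
        System.out.println("entries at interval 256: " + entriesAt256);
    }
}
```

With these made-up inputs the estimate lands in the hundreds of megabytes, the same order of magnitude Kevin reports; the norms add roughly one more byte per searched field per document on top of that once searching starts.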
Re: Opening up one large index takes 940M of memory?
It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that.

The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with index seeks at search time. I wonder if this is what's using your memory. The number '128' can't be modified just like that, but somebody (Julien?) has modified the code in the past to make this variable. That's the only thing I can think of right now, and it may or may not be an idea in the right direction.

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
> We have one large index right now... it's about 60GB... When I open it
> the Java VM used 940M of memory. The VM does nothing else besides open
> this index. Here's the code:
>
>     System.out.println( "opening..." );
>
>     long before = System.currentTimeMillis();
>     Directory dir = FSDirectory.getDirectory( "/var/ksa/index-1078106952160/", false );
>     IndexReader ir = IndexReader.open( dir );
>     System.out.println( ir.getClass() );
>     long after = System.currentTimeMillis();
>     System.out.println( "opening...done - duration: " + (after-before) );
>
>     System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() );
>     System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() );
>
> Is there any way to reduce this footprint? The index is fully
> optimized... I'm willing to take a performance hit if necessary. Is this
> documented anywhere?
>
> Kevin
Re: Opening up one large index takes 940M of memory?
There Kevin, that's what I was referring to, the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:
> On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
> > Kevin A. Burton wrote:
> > > We have one large index right now... it's about 60GB... When I open
> > > it the Java VM used 940M of memory. The VM does nothing else besides
> > > open this index.
> >
> > After thinking about it I guess 1.5% of memory per index really isn't
> > THAT bad. What would be nice is if there was a way to do this from
> > disk and then use a buffer (either via the filesystem or in-VM memory)
> > to access these variables.
>
> It's even documented. From
> http://jakarta.apache.org/lucene/docs/fileformats.html :
>
> "The term info index, or .tii file. This contains every IndexInterval-th
> entry from the .tis file, along with its location in the .tis file. This
> is designed to be read entirely into memory and used to provide random
> access to the .tis file."
>
> My guess is that this is what you see happening. To see the actual .tii
> file, you need the non-default file format. Once searching starts you'll
> also see that the field norms are loaded; these take one byte per
> searched field per document.
>
> > This would be similar to the way the MySQL index cache works...
>
> It would be possible to add another level of indexing to the terms. No
> one has done this yet, so I guess it's preferred to buy RAM instead...
>
> Regards,
> Paul Elschot
Re: Opening up one large index takes 940M of memory?
Hi,

If it is really the case that every 128th term is loaded into memory, could you use a relational database or B-tree to do the work of indexing the terms instead? Even if you create another level of indexing on top of the .tii file, it is just a hack and would not scale well. I would think a B/B+ tree based approach is the way to go for better memory utilization.

Cheers,
Jian

On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:
> There Kevin, that's what I was referring to, the .tii file.
>
> Otis
>
> --- Paul Elschot [EMAIL PROTECTED] wrote:
> > [...]
> > My guess is that this is what you see happening. Once searching starts
> > you'll also see that the field norms are loaded; these take one byte
> > per searched field per document.
> >
> > It would be possible to add another level of indexing to the terms. No
> > one has done this yet, so I guess it's preferred to buy RAM instead...
> >
> > Regards,
> > Paul Elschot
Re: Opening up one large index takes 940M of memory?
Paul Elschot wrote:
> > This would be similar to the way the MySQL index cache works...
>
> It would be possible to add another level of indexing to the terms. No
> one has done this yet, so I guess it's preferred to buy RAM instead...

The problem, I think, for everyone right now is that 32 bits just doesn't cut it in production systems... 2GB of memory per process and you really start to feel it.

Kevin
Re: Opening up one large index takes 940M of memory?
Chris Hostetter wrote:
> : We have one large index right now... it's about 60GB... When I open it
> : the Java VM used 940M of memory. The VM does nothing else besides open
>
> Just out of curiosity, have you tried turning on the verbose gc log, and
> putting in some thread sleeps after you open the reader, to see if the
> memory footprint settles down after a little while? You're currently
> checking the memory usage immediately after opening the index, and some
> of that memory may be used holding transient data that will get freed up
> after some GC iterations.

Actually I haven't, but to be honest the numbers seem dead on. The VM heap wouldn't reallocate if it didn't need that much memory, and this is almost exactly the behavior I'm seeing in production.

Though I guess it wouldn't hurt ;)

Kevin
Re: Opening up one large index takes 940M of memory?
Otis Gospodnetic wrote:
> It would be interesting to know _what_exactly_ uses your memory. Running
> under a profiler should tell you that. The only thing that comes to mind
> is... can't remember the details now, but when the index is opened, I
> believe every 128th term is read into memory. This, I believe, helps
> with index seeks at search time. I wonder if this is what's using your
> memory. The number '128' can't be modified just like that, but somebody
> (Julien?) has modified the code in the past to make this variable.
> That's the only thing I can think of right now, and it may or may not be
> an idea in the right direction.

I loaded it into a profiler a long time ago. Most of the memory was due to Term objects being loaded into memory. I might try to get some time to load it into a profiler on Monday...

Kevin
Re: Opening up one large index takes 940M of memory?
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote:
> The problem, I think, for everyone right now is that 32 bits just
> doesn't cut it in production systems... 2GB of memory per process and
> you really start to feel it.

Hmmm... no... no pain at all... or perhaps you are implying that your entire system is running on one puny JVM instance... in that case, this is perhaps more of a design problem than an implementation one... YMMV...

Cheers

-- 
PA
http://alt.textdrive.com/
Re: Opening up one large index takes 940M of memory?
Yes, I remember your email about the large number of Terms. If it can be avoided and you figure out how to do it, I'd love to patch something. :)

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
> Otis Gospodnetic wrote:
> > It would be interesting to know _what_exactly_ uses your memory.
> > Running under a profiler should tell you that. The only thing that
> > comes to mind is... can't remember the details now, but when the index
> > is opened, I believe every 128th term is read into memory. This, I
> > believe, helps with index seeks at search time. I wonder if this is
> > what's using your memory. The number '128' can't be modified just like
> > that, but somebody (Julien?) has modified the code in the past to make
> > this variable.
>
> I loaded it into a profiler a long time ago. Most of the memory was due
> to Term objects being loaded into memory. I might try to get some time
> to load it into a profiler on Monday...
>
> Kevin
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote:
> We have one large index right now... it's about 60GB... When I open it
> the Java VM used 940M of memory. The VM does nothing else besides open
> this index.

After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there was a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these variables.

This would be similar to the way the MySQL index cache works...

Kevin
Re: Opening up one large index takes 940M of memory?
: We have one large index right now... it's about 60GB... When I open it
: the Java VM used 940M of memory. The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose gc log, and putting in some thread sleeps after you open the reader, to see if the memory footprint settles down after a little while? You're currently checking the memory usage immediately after opening the index, and some of that memory may be used holding transient data that will get freed up after some GC iterations.

:     IndexReader ir = IndexReader.open( dir );
:     System.out.println( ir.getClass() );
:     long after = System.currentTimeMillis();
:     System.out.println( "opening...done - duration: " + (after-before) );
:
:     System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() );
:     System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() );

-Hoss
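Hoss's suggestion, letting the heap settle before measuring, can be sketched as below. This is a generic JVM snippet, not part of Lucene; note that System.gc() is only a hint to the collector, so the resulting numbers are still approximate.

```java
public class SettledMemory {
    // Force a few GC passes with short pauses so transient allocations
    // from index opening are collected before measuring.
    static long settledUsedBytes() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < 5; i++) {
            System.gc();          // a hint only; the VM may ignore it
            Thread.sleep(100);
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws Exception {
        // ... open the IndexReader here, as in Kevin's snippet ...
        long used = settledUsedBytes();
        System.out.println("settled used memory (MB): " + used / (1024 * 1024));
    }
}
```

Comparing this settled figure with the one taken immediately after IndexReader.open() would show how much of the 940M is transient versus genuinely resident (e.g. the .tii term index).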