Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

You can increase TermInfosWriter.indexInterval.  You'll need to 
re-write the .tii file for this to take effect.  The simplest way to 
do this is to use IndexWriter.addIndexes(), adding your index to a 
new, empty, directory.  This will of course take a while for a 60GB 
index...
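What the indexInterval controls can be sketched self-containedly. This is a plain-Java model (not the actual Lucene file code) of what the .tii holds: every indexInterval-th entry of the sorted term dictionary, so rewriting with a larger interval shrinks the in-memory index proportionally.

```java
import java.util.ArrayList;
import java.util.List;

public class TiiSketch {
    // Model of the .tii: keep every indexInterval-th term from the
    // sorted dictionary (.tis). Rewriting the index with a larger
    // interval shrinks this in-memory list proportionally.
    public static List<String> buildTermIndex(List<String> sortedTerms, int indexInterval) {
        List<String> tii = new ArrayList<String>();
        for (int i = 0; i < sortedTerms.size(); i += indexInterval) {
            tii.add(sortedTerms.get(i));
        }
        return tii;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<String>();
        for (int i = 0; i < 1024; i++) {
            terms.add(String.format("term%04d", i));
        }
        // 1024 terms at interval 128 -> 8 cached entries; at 256 -> 4.
        System.out.println(buildTermIndex(terms, 128).size());
        System.out.println(buildTermIndex(terms, 256).size());
    }
}
```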

(Note... when this works I'll note my findings in a wiki page for future 
developers)

Two more questions:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index, which I assume re-uses the same 
settings as before.  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new, empty 
directory; I'll try that tonight.

2. This isn't destructive, is it?  I mean, I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128, right?

Thanks!
Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Doug Cutting
Kevin A. Burton wrote:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index, which I assume re-uses the same 
settings as before.  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new, empty 
directory; I'll try that tonight.
You need to re-write the entire index using a modified 
TermInfosWriter.java.  Optimize rewrites the entire index but is 
destructive.  Merging into a new empty directory is a non-destructive 
way to do this.

2. This isn't destructive, is it?  I mean, I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128, right?
Yes, you can go back if you re-optimize or re-merge again.
Also, there's no need to CC my personal email address.
Doug


Re: Opening up one large index takes 940M of memory?

2005-01-27 Thread Doug Cutting
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?
You can increase TermInfosWriter.indexInterval.  You'll need to re-write 
the .tii file for this to take effect.  The simplest way to do this is 
to use IndexWriter.addIndexes(), adding your index to a new, empty, 
directory.  This will of course take a while for a 60GB index...

Doubling TermInfosWriter.indexInterval should halve the Term memory usage 
and double the time required to look up terms in the dictionary.  With 
an index this large, the latter is probably not an issue, since 
processing term frequency and proximity data probably overwhelmingly 
dominates search performance.
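Doug's halving claim is just arithmetic. A hedged sketch; the term count and per-entry overhead below are invented illustration numbers, not measurements from Kevin's index:

```java
public class TermIndexMemory {
    // Rough model: in-RAM term-index size ~= (totalTerms / indexInterval)
    // * bytesPerCachedEntry. Doubling the interval halves the estimate.
    public static long estimateBytes(long totalTerms, int indexInterval,
                                     int bytesPerCachedEntry) {
        return (totalTerms / indexInterval) * bytesPerCachedEntry;
    }

    public static void main(String[] args) {
        long totalTerms = 100_000_000L; // hypothetical term count
        int perEntry = 64;              // hypothetical bytes per cached Term
        System.out.println("interval 128: "
                + estimateBytes(totalTerms, 128, perEntry) + " bytes");
        System.out.println("interval 256: "
                + estimateBytes(totalTerms, 256, perEntry) + " bytes");
    }
}
```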

Perhaps we should make this public by adding an IndexWriter method?
Also, you can list the size of your .tii file by using the main() from 
CompoundFileReader.

Doug


Re: Opening up one large index takes 940M of memory?

2005-01-23 Thread Vic
Sounds interesting. (Is there a btree serialization impl in Java?)
.V
jian chen wrote:
Hi,
If it is really the case that every 128th term is loaded into memory,
could you use a relational database or a b-tree to do the work of
indexing the terms instead?
Even if you create another level of indexing on top of the .tii file,
it is just a hack and would not scale well.
I would think a b/b+ tree based approach is the way to go for better
memory utilization.
Cheers,
Jian

--
RiA-SoA w/JDNC http://www.SandraSF.com forums
- help develop a community
My blog http://www.sandrasf.com/adminBlog


Re: Opening up one large index takes 940M of memory?

2005-01-23 Thread petite_abeille
On Jan 24, 2005, at 00:10, Vic wrote:
(Is there a btree serialization impl in Java?)
http://jdbm.sourceforge.net/
Cheers
--
PA
http://alt.textdrive.com/


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Paul Elschot
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
 Kevin A. Burton wrote:
 
  We have one large index right now... its about 60G ... When I open it 
  the Java VM used 940M of memory.  The VM does nothing else besides 
  open this index.
 
 After thinking about it I guess 1.5% of memory per index really isn't 
 THAT bad.  What would be nice is if there were a way to do this from disk 
 and then use a buffer (either via the filesystem or in-VM memory) to 
 access these variables.

It's even documented. From:
http://jakarta.apache.org/lucene/docs/fileformats.html :

The term info index, or .tii file. 
This contains every IndexIntervalth entry from the .tis file, along with its
location in the tis file. This is designed to be read entirely into memory
and used to provide random access to the tis file. 

My guess is that this is what you see happening.
To see the actual .tii file, you need the non-default file format.

Once searching starts you'll also see that the field norms are loaded;
these take one byte per searched field per document.
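That norms figure is easy to turn into numbers. A sketch with hypothetical document and field counts (the thread never states Kevin's):

```java
public class NormsMemory {
    // Field norms: one byte per searched field per document,
    // loaded once searching starts.
    public static long normsBytes(long numDocs, int searchedFields) {
        return numDocs * (long) searchedFields;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;  // hypothetical document count
        int fields = 3;           // hypothetical number of searched fields
        System.out.println(normsBytes(docs, fields) + " bytes for norms");
    }
}
```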

 This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards,
Paul Elschot
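The two-level lookup Paul describes (binary-search the small in-memory index, then scan forward in the full dictionary) can be sketched with plain collections. This models the behavior of the .tii/.tis split, not Lucene's actual on-disk code:

```java
import java.util.Map;
import java.util.TreeMap;

public class TwoLevelTermLookup {
    // In-memory index: every interval-th term -> its position in the
    // full dictionary. Mirrors the .tii/.tis split.
    private final TreeMap<String, Integer> tii = new TreeMap<String, Integer>();
    private final String[] tis; // full sorted dictionary ("on disk")

    public TwoLevelTermLookup(String[] sortedTerms, int interval) {
        this.tis = sortedTerms;
        for (int i = 0; i < sortedTerms.length; i += interval) {
            tii.put(sortedTerms[i], i);
        }
    }

    // Find a term: jump to the nearest preceding indexed entry, then
    // scan at most `interval` entries forward. A larger interval means
    // a smaller in-memory index but a longer scan per lookup.
    public int find(String term) {
        Map.Entry<String, Integer> e = tii.floorEntry(term);
        int start = (e == null) ? 0 : e.getValue();
        for (int i = start; i < tis.length; i++) {
            int cmp = tis[i].compareTo(term);
            if (cmp == 0) return i;
            if (cmp > 0) break; // passed it: term not present
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        for (int i = 0; i < 1000; i++) terms[i] = String.format("t%03d", i);
        TwoLevelTermLookup lookup = new TwoLevelTermLookup(terms, 128);
        System.out.println(lookup.find("t500")); // prints 500
        System.out.println(lookup.find("zzz"));  // prints -1
    }
}
```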





Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
It would be interesting to know _what_exactly_ uses your memory. 
Running under a profiler should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.

Otis


--- Kevin A. Burton [EMAIL PROTECTED] wrote:
 We have one large index right now... its about 60G ... When I open it
 
 the Java VM used 940M of memory.  The VM does nothing else besides
 open 
 this index.
 
 Here's the code:
 
 System.out.println( "opening..." );
 
 long before = System.currentTimeMillis();
 Directory dir = FSDirectory.getDirectory( 
 "/var/ksa/index-1078106952160/", false );
 IndexReader ir = IndexReader.open( dir );
 System.out.println( ir.getClass() );
 long after = System.currentTimeMillis();
 System.out.println( "opening...done - duration: " + 
 (after-before) );
 
 System.out.println( "totalMemory: " + 
 Runtime.getRuntime().totalMemory() );
 System.out.println( "freeMemory: " + 
 Runtime.getRuntime().freeMemory() );
 
 Is there any way to reduce this footprint?  The index is fully 
 optimized... I'm willing to take a performance hit if necessary.  Is 
 this documented anywhere?
 
 Kevin
 
 



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
There Kevin, that's what I was referring to, the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
  Kevin A. Burton wrote:
  
   We have one large index right now... its about 60G ... When I open it 
   the Java VM used 940M of memory.  The VM does nothing else besides 
   open this index.
  
  After thinking about it I guess 1.5% of memory per index really isn't 
  THAT bad.  What would be nice is if there were a way to do this from disk 
  and then use a buffer (either via the filesystem or in-VM memory) to 
  access these variables.
 
 It's even documented. From:
 http://jakarta.apache.org/lucene/docs/fileformats.html :
 
 The term info index, or .tii file. 
 This contains every IndexIntervalth entry from the .tis file, along with its
 location in the tis file. This is designed to be read entirely into memory
 and used to provide random access to the tis file. 
 
 My guess is that this is what you see happening.
 To see the actual .tii file, you need the non-default file format.
 
 Once searching starts you'll also see that the field norms are loaded;
 these take one byte per searched field per document.
 
  This would be similar to the way the MySQL index cache works...
 
 It would be possible to add another level of indexing to the terms.
 No one has done this yet, so I guess it's preferred to buy RAM instead...
 
 Regards,
 Paul Elschot
 



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread jian chen
Hi,

If it is really the case that every 128th term is loaded into memory,
could you use a relational database or a b-tree to do the work of
indexing the terms instead?

Even if you create another level of indexing on top of the .tii file,
it is just a hack and would not scale well.

I would think a b/b+ tree based approach is the way to go for better
memory utilization.

Cheers,

Jian


On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 There Kevin, that's what I was referring to, the .tii file.
 
 Otis
 





Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Paul Elschot wrote:
This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...
The problem I think for everyone right now is that 32 bits just doesn't 
cut it in production systems... at 2G of memory per process you 
really start to feel it.

Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Chris Hostetter wrote:
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open
Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
 

Actually I haven't, but to be honest the numbers seem dead on. The VM 
heap wouldn't grow that large if it didn't need the memory, and this is 
almost exactly the behavior I'm seeing in production.

Though I guess it wouldn't hurt ;)
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
It would be interesting to know _what_exactly_ uses your memory. 
Running under a profiler should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.
 

I loaded it into a profiler a long time ago. Most of the memory was due to 
Term objects being loaded into memory.

I might try to get some time to load it into a profiler on Monday...
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread petite_abeille
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote:
The problem I think for everyone right now is that 32 bits just doesn't 
cut it in production systems... at 2G of memory per process you 
really start to feel it.
Hmmm... no... no pain at all... or perhaps you are implying that your 
entire system is running on one puny JVM instance... in that case, this 
is perhaps more of a design problem than an implementation one... 
YMMV...

Cheers
--
PA
http://alt.textdrive.com/


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
Yes, I remember your email about the large number of Terms.  If it can
be avoided and you figure out how to do it, I'd love to patch
something. :)

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:

 Otis Gospodnetic wrote:
 
 It would be interesting to know _what_exactly_ uses your memory. 
 Running under a profiler should tell you that.
 
 The only thing that comes to mind is... can't remember the details now,
 but when the index is opened, I believe every 128th term is read into
 memory.  This, I believe, helps with index seeks at search time.  I
 wonder if this is what's using your memory.  The number '128' can't be
 modified just like that, but somebody (Julien?) has modified the code
 in the past to make this variable.  That's the only thing I can think
 of right now and it may or may not be an idea in the right direction.
 
 I loaded it into a profiler a long time ago. Most of the memory was due to 
 Term objects being loaded into memory.
 
 I might try to get some time to load it into a profiler on Monday...
 
 Kevin
 



Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote:
We have one large index right now... its about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides 
open this index.
After thinking about it I guess 1.5% of memory per index really isn't 
THAT bad.  What would be nice is if there were a way to do this from disk 
and then use a buffer (either via the filesystem or in-VM memory) to 
access these variables.

This would be similar to the way the MySQL index cache works...
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Chris Hostetter
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
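Hoss's measurement caveat can be sketched as follows. Note that System.gc() is only a hint to the VM, so this is a best-effort "settled" reading, not a precise measurement:

```java
public class SettledMemory {
    // Nudge the collector a few times and pause, so transient
    // allocations (e.g. from opening an index) are mostly reclaimed
    // before sampling the footprint.
    public static long settledUsedBytes() throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            System.gc();          // only a hint; the VM may ignore it
            Thread.sleep(100);
        }
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("settled used bytes: " + settledUsedBytes());
    }
}
```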


: IndexReader ir = IndexReader.open( dir );
: System.out.println( ir.getClass() );
: long after = System.currentTimeMillis();
: System.out.println( "opening...done - duration: " +
: (after-before) );
:
: System.out.println( "totalMemory: " +
: Runtime.getRuntime().totalMemory() );
: System.out.println( "freeMemory: " +
: Runtime.getRuntime().freeMemory() );





-Hoss

