index size before and after commit

2009-10-01 Thread Phillip Farber
I am trying to automate a build process that adds documents to 10 shards 
over 5 machines and need to limit the size of a shard to no more than 
200GB because I only have 400GB of disk available to optimize a given shard.


Why does the size (du) of an index typically decrease after a commit?  
I've observed a decrease in size of as much as from 296GB down to 151GB 
or as little as from 183GB to 182GB.  Is that size after a commit close 
to the size the index would be after an optimize?  For that matter, are 
there cases where optimization can take more than 2x?  I've heard of 
cases but have not observed them in my system.  I only do adds to the 
shards, never query them. An LVM snapshot of the shard receives the queries.


Is doing a commit before I take a du a reliable way to gauge the size of 
the shard?  It is really bad news to allow a shard to go over 200GB in 
my use case.  How do others manage this problem of 2x space needed to 
optimize with limited disk space?
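
(A minimal sketch of the commit-then-measure check, assuming a Lucene
2.9-era API and direct filesystem access to the shard's index directory;
with Solr the equivalent is a commit request followed by du on data/index.)

    import java.io.File;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class ShardSizeCheck {
        // Sum the sizes of all files in the index directory -- roughly
        // what "du" reports for the shard.
        static long indexSizeBytes(File indexDir) {
            long total = 0;
            for (File f : indexDir.listFiles()) {
                if (f.isFile()) total += f.length();
            }
            return total;
        }

        public static void main(String[] args) throws Exception {
            File indexDir = new File(args[0]);  // path to the shard's index dir
            IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new SimpleAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
            // ... add documents here ...
            writer.commit();  // publish a new commit point; files referenced
                              // only by the previous commit become deletable
            System.out.println("shard size after commit: "
                + indexSizeBytes(indexDir) / (1024L * 1024 * 1024) + " GB");
            writer.close();
        }
    }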


Advice greatly appreciated.

Phil



Re: index size before and after commit

2009-10-01 Thread Grant Ingersoll
It may take some time before resources are released and garbage  
collected, so that may be part of the reason why things hang around  
and du doesn't report much of a drop.


On Oct 1, 2009, at 8:54 AM, Phillip Farber wrote:

I am trying to automate a build process that adds documents to 10  
shards over 5 machines and need to limit the size of a shard to no  
more than 200GB because I only have 400GB of disk available to  
optimize a given shard.


Why does the size (du) of an index typically decrease after a  
commit?  I've observed a decrease in size of as much as from 296GB  
down to 151GB or as little as from 183GB to 182GB.  Is that size  
after a commit close to the size the index would be after an  
optimize?  For that matter, are there cases where optimization can  
take more than 2x?  I've heard of cases but have not observed them  
in my system.


I seem to recall a case where it can be 3x, but I don't know that it  
has been observed much.


I only do adds to the shards, never query them. An LVM snapshot of  
the shard receives the queries.


Is doing a commit before I take a du a reliable way to gauge the  
size of the shard?  It is really bad news to allow a shard to go  
over 200GB in my use case.  How do others manage this problem of 2x  
space needed to optimize with limited disk space?


Do you need to optimize at all?


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: index size before and after commit

2009-10-01 Thread Mark Miller
Phillip Farber wrote:
 I am trying to automate a build process that adds documents to 10
 shards over 5 machines and need to limit the size of a shard to no
 more than 200GB because I only have 400GB of disk available to
 optimize a given shard.

 Why does the size (du) of an index typically decrease after a commit? 
 I've observed a decrease in size of as much as from 296GB down to
 151GB or as little as from 183GB to 182GB.  Is that size after a
 commit close to the size the index would be after an optimize?  
Likely. Until you commit or close the Writer, the unoptimized index is
the live index. And then you also have the optimized index. Once you
commit and make the optimized index the live index, the unoptimized
index can be removed (depending on your delete policy, which by default
only keeps the latest commit point).
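
A minimal sketch of that sequence at the Lucene level (assuming the default
KeepOnlyLastCommitDeletionPolicy and a 2.9-era API); the drop in du happens
at the commit, when the previous commit point's files stop being referenced:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    class OptimizeThenCommit {
        static void run(File indexDir) throws IOException {
            IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new SimpleAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);

            writer.optimize();  // writes fully merged segments alongside the
                                // old ones -- this is where the ~2x peak comes from
            writer.commit();    // the optimized segments become the live commit
                                // point; the default deletion policy now removes
                                // the old segments and disk usage falls
            writer.close();
        }
    }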
 For that matter, are there cases where optimization can take more than
 2x?  I've heard of cases but have not observed them in my system.  I
 only do adds to the shards, never query them. An LVM snapshot of the
 shard receives the queries.
There are cases where it takes over 2x - but they involve using reopen.
If you have more than one Reader on the index, and only reopen some of
them, the new Readers created can hold open the partially optimized
segments that existed at that moment, creating a need for greater than 2x.
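
A rough sketch of that reopen scenario (assuming Lucene's IndexReader.reopen());
the key point is that every reader left open pins whatever segment files
existed when it was opened:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    class ReopenPinning {
        static void illustrate(Directory dir) throws IOException {
            // Pins the pre-optimize segment files.
            IndexReader before = IndexReader.open(dir);

            // ... an optimize runs concurrently in an IndexWriter ...

            // A reopen taken mid-optimize can pin the partially merged
            // segments that exist at that instant.
            IndexReader during = before.reopen();

            // If "before" is never closed, the directory can hold pre-optimize,
            // intermediate, and final segments at once -- more than 2x.
            if (during != before) {
                before.close();
            }
            during.close();
        }
    }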

 Is doing a commit before I take a du a reliable way to gauge the size
 of the shard?  It is really bad news to allow a shard to go over 200GB
 in my use case.  How do others manage this problem of 2x space needed
 to optimize with limited dosk space?
Get more disk space ;) Or don't optimize. A lower mergeFactor can make
optimizations less necessary.
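
For instance (a sketch against Lucene's IndexWriter API; in Solr the
corresponding knob is the mergeFactor setting in solrconfig.xml):

    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    class LowMergeFactor {
        static IndexWriter openWriter(Directory dir) throws IOException {
            IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
            // Default is 10; a smaller value merges more aggressively while
            // indexing, so the index carries fewer segments and an explicit
            // optimize buys less (at the cost of more merge I/O up front).
            writer.setMergeFactor(4);
            return writer;
        }
    }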

 Advice greatly appreciated.

 Phil



-- 
- Mark

http://www.lucidimagination.com





Re: index size before and after commit

2009-10-01 Thread Mark Miller
Whoops - the way I have mail come in, it's not easy to tell if I'm replying
to Lucene or Solr list ;)

The way Solr works with Searchers and reopen, it shouldn't run into a
situation that requires greater than
2x to optimize. I won't guarantee it ;) But based on what I know, it
shouldn't happen under normal circumstances.

Mark Miller wrote:
 Phillip Farber wrote:
   
 I am trying to automate a build process that adds documents to 10
 shards over 5 machines and need to limit the size of a shard to no
 more than 200GB because I only have 400GB of disk available to
 optimize a given shard.

 Why does the size (du) of an index typically decrease after a commit? 
 I've observed a decrease in size of as much as from 296GB down to
 151GB or as little as from 183GB to 182GB.  Is that size after a
 commit close to the size the index would be after an optimize?  
 
 Likely. Until you commit or close the Writer, the unoptimized index is
 the live index. And then you also have the optimized index. Once you
 commit and make the optimized index the live index, the unoptimized
 index can be removed (depending on your delete policy, which by default
 only keeps the latest commit point).
   
 For that matter, are there cases where optimization can take more than
 2x?  I've heard of cases but have not observed them in my system.  I
 only do adds to the shards, never query them. An LVM snapshot of the
 shard receives the queries.
 
 There are cases where it takes over 2x - but they involve using reopen.
 If you have more than one Reader on the index, and only reopen some of
 them, the new Readers created can hold open the partially optimized
 segments that existed at that moment, creating a need for greater than 2x.
   
 Is doing a commit before I take a du a reliable way to gauge the size
 of the shard?  It is really bad news to allow a shard to go over 200GB
 in my use case.  How do others manage this problem of 2x space needed
 to optimize with limited disk space?
 
 Get more disk space ;) Or don't optimize. A lower mergeFactor can make
 optimizations less necessary.
   
 Advice greatly appreciated.

 Phil

 


   


-- 
- Mark

http://www.lucidimagination.com





Re: index size before and after commit

2009-10-01 Thread Mark Miller
bq. and reindex without any merges.

That's actually quite a hoop to jump through as well - though if you're determined
and you have tons of RAM, it's somewhat doable.

Mark Miller wrote:
 Nice one ;) It's not technically a case where optimize requires > 2x
 though, in case the user asking gets confused. It's a case unrelated to
 optimize that can grow your index. Then you need < 2x for the optimize,
 since you won't copy the deletes.

 It also requires that you jump hoops to delete everything. If you delete
 everything with *:*, that is smart enough not to just do a delete on
 every document - it just creates a new index, allowing the removal of
 the old very efficiently.

 Def agree on the more disk space.

 Walter Underwood wrote:
   
 Here is how you need 3X. First, index everything and optimize. Then
 delete everything and reindex without any merges.

 You have one full-size index containing only deleted docs, one
 full-size index containing reindexed docs, and need that much space
 for a third index.

 Honestly, disk is cheap, and there is no way to make Lucene work
 reliably with less disk. 1TB is a few hundred dollars. You have a free
 search engine, buy some disk.

 wunder

 On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:

 
 151GB or as little as from 183GB to 182GB.  Is that size after a
 commit close to the size the index would be after an optimize?  For
 that matter, are there cases where optimization can take more than
 2x?  I've heard of cases but have not observed them in my system.
 
 I seem to recall a case where it can be 3x, but I don't know that it
 has been observed much.
   


   


-- 
- Mark

http://www.lucidimagination.com





Re: index size before and after commit

2009-10-01 Thread Walter Underwood
I've now worked on three different search engines and they all have a 3X worst
case on space, so I'm familiar with this case. --wunder

On Oct 1, 2009, at 7:15 AM, Mark Miller wrote:


Nice one ;) It's not technically a case where optimize requires > 2x
though, in case the user asking gets confused. It's a case unrelated to
optimize that can grow your index. Then you need < 2x for the optimize,
since you won't copy the deletes.

It also requires that you jump hoops to delete everything. If you delete
everything with *:*, that is smart enough not to just do a delete on
every document - it just creates a new index, allowing the removal of
the old very efficiently.

Def agree on the more disk space.

Walter Underwood wrote:

Here is how you need 3X. First, index everything and optimize. Then
delete everything and reindex without any merges.

You have one full-size index containing only deleted docs, one
full-size index containing reindexed docs, and need that much space
for a third index.

Honestly, disk is cheap, and there is no way to make Lucene work
reliably with less disk. 1TB is a few hundred dollars. You have a free
search engine, buy some disk.

wunder

On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:


151GB or as little as from 183GB to 182GB.  Is that size after a
commit close to the size the index would be after an optimize?  For
that matter, are there cases where optimization can take more than
2x?  I've heard of cases but have not observed them in my system.


I seem to recall a case where it can be 3x, but I don't know that it
has been observed much.





--
- Mark

http://www.lucidimagination.com







Re: index size before and after commit

2009-10-01 Thread Lance Norskog
I've heard there is a new partial optimize feature in Lucene, but it
is not mentioned in the Solr or Lucene wikis so I cannot advise you
how to use it.

On a previous project we had a 500GB index for 450m documents. It took
14 hours to optimize. We found that Solr worked well (given enough RAM
for sorting and faceting requests) but that the IT logistics of a 500G
fileset were too much.

Also, if you want your query servers to continue serving while
propagating the newly optimized index, you need 2X space to store both
copies on the slave during the transfer. For us this took 35 minutes over
1G ethernet.

On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood wun...@wunderwood.org wrote:
 I've now worked on three different search engines and they all have a 3X
 worst
 case on space, so I'm familiar with this case. --wunder

 On Oct 1, 2009, at 7:15 AM, Mark Miller wrote:

 Nice one ;) It's not technically a case where optimize requires > 2x
 though, in case the user asking gets confused. It's a case unrelated to
 optimize that can grow your index. Then you need < 2x for the optimize,
 since you won't copy the deletes.

 It also requires that you jump hoops to delete everything. If you delete
 everything with *:*, that is smart enough not to just do a delete on
 every document - it just creates a new index, allowing the removal of
 the old very efficiently.

 Def agree on the more disk space.

 Walter Underwood wrote:

 Here is how you need 3X. First, index everything and optimize. Then
 delete everything and reindex without any merges.

 You have one full-size index containing only deleted docs, one
 full-size index containing reindexed docs, and need that much space
 for a third index.

 Honestly, disk is cheap, and there is no way to make Lucene work
 reliably with less disk. 1TB is a few hundred dollars. You have a free
 search engine, buy some disk.

 wunder

 On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:

 151GB or as little as from 183GB to 182GB.  Is that size after a
 commit close to the size the index would be after an optimize?  For
 that matter, are there cases where optimization can take more than
 2x?  I've heard of cases but have not observed them in my system.

 I seem to recall a case where it can be 3x, but I don't know that it
 has been observed much.



 --
 - Mark

 http://www.lucidimagination.com








-- 
Lance Norskog
goks...@gmail.com


Re: index size before and after commit

2009-10-01 Thread Lance Norskog
Ha! Searching for 'partial optimize' on
http://www.lucidimagination.com/search , we discover SOLR-603, which
gives the 'maxSegments' option to the optimize command. The text
does not include the word 'partial'.

It's on http://wiki.apache.org/solr/UpdateXmlMessages. The command
takes a target number of Lucene segments, and I have no idea how this will
translate to disk space. To minimize disk space, you could run it
repeatedly with the number of segments decreasing to one.
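
At the Lucene level the same knob is IndexWriter.optimize(int maxNumSegments),
so a hedged sketch of that "step down gradually" idea might look like the
following; whether it actually limits peak disk usage for a given index is
something to verify rather than assume:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;

    class SteppedOptimize {
        // Merge down in stages instead of straight to one segment, committing
        // between steps so the segments replaced by each merge can be deleted
        // before the next, larger merge starts.
        static void optimizeInSteps(IndexWriter writer) throws IOException {
            for (int maxSegments : new int[] {16, 8, 4, 2, 1}) {
                writer.optimize(maxSegments);  // Lucene's "partial optimize"
                writer.commit();
            }
        }
    }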

On Thu, Oct 1, 2009 at 11:49 AM, Lance Norskog goks...@gmail.com wrote:
 I've heard there is a new partial optimize feature in Lucene, but it
 is not mentioned in the Solr or Lucene wikis so I cannot advise you
 how to use it.

 On a previous project we had a 500GB index for 450m documents. It took
 14 hours to optimize. We found that Solr worked well (given enough RAM
 for sorting and faceting requests) but that the IT logistics of a 500G
 fileset were too much.

 Also, if you want your query servers to continue serving while
 propagating the newly optimized index, you need 2X space to store both
 copies on the slave during the transfer. For us this took 35 minutes over
 1G ethernet.

 On Thu, Oct 1, 2009 at 7:36 AM, Walter Underwood wun...@wunderwood.org 
 wrote:
 I've now worked on three different search engines and they all have a 3X
 worst
 case on space, so I'm familiar with this case. --wunder

 On Oct 1, 2009, at 7:15 AM, Mark Miller wrote:

 Nice one ;) It's not technically a case where optimize requires > 2x
 though, in case the user asking gets confused. It's a case unrelated to
 optimize that can grow your index. Then you need < 2x for the optimize,
 since you won't copy the deletes.

 It also requires that you jump hoops to delete everything. If you delete
 everything with *:*, that is smart enough not to just do a delete on
 every document - it just creates a new index, allowing the removal of
 the old very efficiently.

 Def agree on the more disk space.

 Walter Underwood wrote:

 Here is how you need 3X. First, index everything and optimize. Then
 delete everything and reindex without any merges.

 You have one full-size index containing only deleted docs, one
 full-size index containing reindexed docs, and need that much space
 for a third index.

 Honestly, disk is cheap, and there is no way to make Lucene work
 reliably with less disk. 1TB is a few hundred dollars. You have a free
 search engine, buy some disk.

 wunder

 On Oct 1, 2009, at 6:25 AM, Grant Ingersoll wrote:

 151GB or as little as from 183GB to 182GB.  Is that size after a
 commit close to the size the index would be after an optimize?  For
 that matter, are there cases where optimization can take more than
 2x?  I've heard of cases but have not observed them in my system.

 I seem to recall a case where it can be 3x, but I don't know that it
 has been observed much.



 --
 - Mark

 http://www.lucidimagination.com








 --
 Lance Norskog
 goks...@gmail.com




-- 
Lance Norskog
goks...@gmail.com