Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-22 Thread Otis Gospodnetic
Hi Renee,

Here's what I'd do:
* Check how many open files your system is set up for (ulimit -n).  You likely
want to increase that (1024 seems to be a common default under Linux, and in
the past I've set it to 30k+ without issues).
* Look at your mergeFactor.  If it's high, consider lowering it (this will slow
down indexing a bit).
* Consider using the compound file format (cfs), but if you do the above right,
you can avoid using it.
* Consider a better Solr monitoring tool.
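
A rough sketch of the first check, assuming a Linux host with /proc available. The function names and the idea of passing in the servlet container's PID are illustrative, not something from this thread:

```shell
#!/bin/sh
# Count the file descriptors a process currently holds (Linux /proc only).
# $1 = PID (hypothetical: the PID of the JVM running Solr)
open_fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Integer percentage of the open-file limit that is in use.
# $1 = descriptors in use, $2 = limit (for example the output of `ulimit -n`)
usage_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

Something like `usage_percent "$(open_fd_count "$SOLR_PID")" "$(ulimit -n)"` would then give a rough headroom figure. Note that `ulimit -n` reports the limit of the shell it runs in, which is not necessarily the limit the Solr process was started with; on Linux, /proc/PID/limits shows the limits of a running process.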

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message -----
 From: Renee Sun renee_...@mcafee.com
 To: solr-user@lucene.apache.org
 Sent: Fri, April 15, 2011 3:41:28 PM
 Subject: Re: partial optimize does not reduce the segment number to maxNumSegments

 sorry, I should have elaborated on that earlier...

 in our production environment, we have multiple cores and they ingest
 continuously all day long; we only optimize periodically, once a day at
 midnight.

 So sometimes we see a 'too many open files' error. To prevent it from
 happening, in production we maintain a script that monitors the total number
 of segment files across all cores and sends out warnings if that number
 exceeds a threshold... it is a kind of preventive measure.  Currently we are
 using a linux command to count the files. We are wondering if we can simply
 use some formula to figure out this number; that would be better. It seems we
 could use the stats URL to get the segment number and multiply it by 8 (that
 is what we get given our schema).

 Any better way to approach this? thanks a lot!
 Renee
 


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-15 Thread Renee Sun
thanks! 

It seems the file count in the index directory is segment# * 8 in my dev
environment...

I see there are .fnm .frq .fdt .fdx .nrm .prx .tii .tis (8) file extensions,
and each extension has as many files as there are segments.

Is it always safe to calculate the file count as the segment number multiplied
by 8? of course this excludes the segments_N, segments.gen and xxx_del files.

I found that most of the cores have a file count that matches this formula,
but a few cores do not...

thanks
Renee

--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813419.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-15 Thread Renee Sun
yeah, I can figure out the segment number by going to the stats page of solr...
but my question was how to figure out the exact total number of files in the
'index' folder for each core.

Like I mentioned in my previous message, I currently have 8 files per segment
(.prx .tii etc), but it seems this might change if I use term vectors, for
example.  So I need suggestions on how to accurately figure out the total
number of files.

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2817912.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-15 Thread Erick Erickson
Why do you care? You haven't outlined why having the precise numbers
here is necessary. Perhaps with a higher-level statement of the problem
you're trying to solve we could make some better suggestions.

Best
Erick

On Wed, Apr 13, 2011 at 5:23 PM, Renee Sun renee_...@mcafee.com wrote:

 yeah, I can figure out the segment number by going to the stats page of solr...
 but my question was how to figure out the exact total number of files in the
 'index' folder for each core.

 Like I mentioned in my previous message, I currently have 8 files per segment
 (.prx .tii etc), but it seems this might change if I use term vectors, for
 example.  So I need suggestions on how to accurately figure out the total
 number of files.

 thanks




Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-15 Thread Renee Sun
sorry, I should have elaborated on that earlier...

in our production environment, we have multiple cores and they ingest
continuously all day long; we only optimize periodically, once a day at
midnight.

So sometimes we see a 'too many open files' error. To prevent it from
happening, in production we maintain a script that monitors the total number
of segment files across all cores and sends out warnings if that number
exceeds a threshold... it is a kind of preventive measure.  Currently we are
using a linux command to count the files. We are wondering if we can simply
use some formula to figure out this number; that would be better. It seems we
could use the stats URL to get the segment number and multiply it by 8 (that
is what we get given our schema).
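
For reference, a minimal sketch of the kind of monitoring script described above. It assumes a DATA_DIR/CORE/index layout like the paths quoted elsewhere in this thread; the function names and the threshold value are hypothetical:

```shell
#!/bin/sh
# Count regular files in every <data_dir>/<core>/index directory.
# $1 = Solr data directory (assumed layout: $1/<core id>/index)
total_index_files() {
    total=0
    for dir in "$1"/*/index; do
        [ -d "$dir" ] || continue          # skip if the glob matched nothing
        n=$(find "$dir" -type f | wc -l)   # files belonging to this core
        total=$((total + n))
    done
    echo "$total"
}

# Print a warning and return 1 when the total exceeds the threshold.
# $1 = Solr data directory, $2 = threshold (hypothetical value chosen by you)
check_threshold() {
    count=$(total_index_files "$1")
    if [ "$count" -gt "$2" ]; then
        echo "WARNING: $count index files, threshold is $2"
        return 1
    fi
    echo "OK: $count index files"
}
```

Counting the files directly stays correct whatever the schema or file format, which a fixed files-per-segment multiplier does not.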

Any better way to approach this? thanks a lot!
Renee

--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2825736.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-13 Thread Renee Sun
ok I dug more into this and realized the file extensions can vary depending on
the schema, right?
for instance we don't have *.tvx, *.tvd, *.tvf (not using term vectors)... and
I suspect the file extensions may change with future lucene releases?

now it seems we can't just count the files using a formula; we have to list
all files in that directory and count them that way... any insight will be
appreciated.
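
One way to see how the extensions vary per core is to tally them. This is only a sketch; the function name is made up, and files without a dot in the name (such as segments_N) are skipped:

```shell
#!/bin/sh
# Print "count extension" for each file extension in an index directory.
# $1 = path to a core's index directory
extension_counts() {
    ls "$1" | sed -n 's/.*\.//p' | sort | uniq -c | awk '{print $1, $2}'
}
```

Running it against two cores with different schemas should show directly whether extensions such as tvx, tvd or tvf are present.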
thanks
Renee

--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813561.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-13 Thread Jay Hill
As Hoss mentioned earlier in the thread, you can use the statistics page
from the admin console to view the current number of segments. But if you
want to know by looking at the files, each segment will have a unique
prefix, such as _u. There will be one unique prefix for every segment in
the index.
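
A small sketch of counting those prefixes from the file listing. It assumes segment files are named with an underscore followed by lowercase letters and digits (as in _u); the function name is illustrative:

```shell
#!/bin/sh
# Count distinct segment prefixes (e.g. _u) among the files in an index directory.
# $1 = path to a core's index directory
segment_count() {
    ls "$1" \
        | sed -n 's/^\(_[0-9a-z][0-9a-z]*\).*/\1/p' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

Files that do not start with an underscore (segments_N, segments.gen) are ignored. The result reflects what is on disk, so it may briefly differ from the stats page while a merge or optimize is in progress.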

-Jay


On Tue, Apr 12, 2011 at 3:16 PM, Renee Sun renee_...@mcafee.com wrote:

 ok I dug more into this and realized the file extensions can vary depending on
 the schema, right?
 for instance we don't have *.tvx, *.tvd, *.tvf (not using term vectors)... and
 I suspect the file extensions may change with future lucene releases?

 now it seems we can't just count the files using a formula; we have to list
 all files in that directory and count them that way... any insight will be
 appreciated.
 thanks
 Renee




Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-12 Thread Renee Sun
Hi Hoss,
thanks for your response...

you are right, I had a typo in my question, but I did use maxSegments, and
here is the exact url I used:

 curl 'http://localhost:8080/solr/97/update?optimize=true&maxSegments=10&waitFlush=true'

I used jconsole and du -sk to monitor each partial optimize, and I am sure
the optimize was done; it always reduces the segment files from 130+ to 65+
when I start with maxSegments=10; when I run it again with maxSegments=9, it
reduces them to somewhere in the 50s.

when I use maxSegments=2, it always reduces the segment files to 18; and
maxSegments=1 (full optimize) always reduces the core to 10 segment files.

this has been repeated about a dozen times.

I think the resulting number of files depends on the size of the core. I
have a core that takes 10GB of disk space, and it has 4 million documents.

It perhaps also depends on other solr/lucene configurations? let me know if
I should give you any data from our solr config.

Here is the actual data from the test I ran lately for your reference; you
can see it definitely finished each partial optimize, and the time spent is
also included (please note I am using a core id there which is different
from yours):

/tmp # ls /xxx/solr/data/32455077/index | wc   <--- this is the start point, 150 seg files
 150  150 946
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=10&waitFlush=true'
real    0m36.050s
user    0m0.002s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc   <- after first partial optimize (10), reduced to 82
 82  82 746
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=9&waitFlush=true'
real    1m54.364s
user    0m0.003s
sys     0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
 74  74 674
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=8&waitFlush=true'
real    2m0.443s
user    0m0.002s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
 66  66 602
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=7&waitFlush=true'
<?xml version="1.0" encoding="UTF-8"?>
real    3m22.201s
user    0m0.002s
sys     0m0ls

/tmp # ls /xxx/solr/data/32455077/index | wc
 58  58 530
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=6&w
real    3m29.277s
user    0m0.001s
sys     0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
 50  50 458
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=5&w
real    3m41.514s
user    0m0.003s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
 42  42 386
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=4&w
real    5m35.697s
user    0m0.003s
sys     0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
 34  34 314
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=3&wa
real    7m8.773s
user    0m0.003s
sys     0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
 26  26 242
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=2&w
real    9m18.814s
user    0m0.004s
sys     0m0.001s

/tmp # ls /xxx/solr/data/32455077/index | wc
 18  18 170
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=1&w   (full optimize)
real    16m6.599s
user    0m0.003s
sys     0m0.004s

Disk Space Usage:
first 3 runs took about 20% extra 
middle couple runs took about 50% extra 
last full optimize took 100% extra
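
The stepwise run above could be scripted roughly as follows. This is a sketch only: it assumes the `&` separators in the logged URLs were lost in the archive, and the function names and argument order are made up for illustration:

```shell
#!/bin/sh
# Build the partial-optimize URL used in the runs above.
# $1 = base URL (e.g. http://localhost:8080/solr), $2 = core id, $3 = maxSegments
optimize_url() {
    echo "$1/$2/update?optimize=true&maxSegments=$3&waitFlush=true"
}

# Step maxSegments down from $3 to 1 and report the file count after each pass.
# $1 = base URL, $2 = core id, $3 = starting maxSegments, $4 = index directory
step_optimize() {
    n=$3
    while [ "$n" -ge 1 ]; do
        curl -s "$(optimize_url "$1" "$2" "$n")" >/dev/null
        echo "maxSegments=$n files=$(ls "$4" | wc -l | tr -d ' ')"
        n=$((n - 1))
    done
}
```

For example, `step_optimize http://localhost:8080/solr 32455077 10 /xxx/solr/data/32455077/index` would repeat the sequence logged above. The file count it prints is a count of files, not of segments.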


--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2812415.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-12 Thread Chris Hostetter

: /tmp # ls /xxx/solr/data/32455077/index | wc   --- this is the 
start point, 150 seg files
:  150  150 946
: /tmp # time curl


the number of files in the index directory is not the number of
segments

the number of segments is an internal lucene concept that impacts the
number of files, but it is not an actual file count.  A segment can
consist of multiple files depending on how your schema.xml is configured
(and whether you are using the compound file format)

You can see the current number of segments by looking at the stats page...

http://localhost:8983/solr/admin/stats.jsp
SolrIndexReader{this=64a7c45e,r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10}
 

...that's from the solr example, where the index directory at the 
time of that request actually contained 93 files.
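
If you want that number in a script, a sketch like the following pulls it out of the stats output. It assumes the page text contains a `segments=N` token as in the line above; the function name is hypothetical:

```shell
#!/bin/sh
# Read stats page text on stdin and print the first segments=N value found.
parse_segments() {
    sed -n 's/.*segments=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Usage would be along the lines of `curl -s http://localhost:8983/solr/admin/stats.jsp | parse_segments`.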


-Hoss


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-11 Thread Chris Hostetter

: I have a core with 120+ segment files and I tried partial optimize specify
: maxNumSegments=10, after the optimize the segment files reduced to 64 files;

a) the option you want to specify is maxSegments .. not maxNumSegments

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22

b) i can't reproduce this ... i just created an index with 200 segments 
and when i hit the example url from the wiki...

curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'

...my index was correctly optimized down to 10 segments.

is it possible that you just didn't wait long enough and you were 
observing the number of segments while the optimize was still taking 
place?


-Hoss


partial optimize does not reduce the segment number to maxNumSegments

2011-03-15 Thread Renee Sun
I have a core with 120+ segment files and I tried partial optimize specify
maxNumSegments=10, after the optimize the segment files reduced to 64 files;

I did the same optimize again, and it reduced them to 30-something;

this keeps going and eventually it drops to a number in the teens.

I was expecting the optimize to result in exactly 10 segment files or
somewhere near that; why do I have to manually repeat the optimize to reach
that number?

thanks
Renee 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2682195.html
Sent from the Solr - User mailing list archive at Nabble.com.