Re: partial optimize does not reduce the segment number to maxNumSegments
Hi Renee,

Here's what I'd do:

* Check how many open files your system is set up for (ulimit -n). You likely want to increase that (1024 seems to be a common default under Linux, and in the past I've set that to 30k+ without issues).
* Look at your mergeFactor. If it's high, consider lowering it (this will slow down indexing a bit).
* Consider using cfs (the compound file format), but if you do the above right, you can avoid using it.
* Consider a better Solr monitoring tool.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Renee Sun renee_...@mcafee.com
To: solr-user@lucene.apache.org
Sent: Fri, April 15, 2011 3:41:28 PM
Subject: Re: partial optimize does not reduce the segment number to maxNumSegments

> Sorry, I should have elaborated on that earlier...
>
> In our production environment, we have multiple cores and ingest runs continuously all day long; we only optimize periodically, once a day at midnight. So sometimes we see the 'too many open files' error. To prevent that from happening, in production we maintain a script that monitors the total number of segment files across all cores and sends out warnings if that number exceeds a threshold... it is a kind of preventive measure.
>
> Currently we use a Linux command to count the files. We are wondering whether we can simply use a formula to figure out this number instead; that would be better. It seems we could use the stats URL to get the segment count and multiply it by 8 (that is what we get given our schema).
>
> Any better way to approach this?
>
> Thanks a lot!
> Renee
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2825736.html
> Sent from the Solr - User mailing list archive at Nabble.com.
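The open-file limit from the first suggestion can also be read programmatically. The following is a minimal sketch, assuming a Unix-like system and Python's standard `resource` module; the helper name `open_file_headroom` is hypothetical. Note that it reports the limit of the process that runs it, which is not necessarily the limit applied to the JVM running Solr.

```python
import resource


def open_file_headroom(files_in_use, soft_limit=None):
    """Return how many more files could be opened before the soft limit is hit.

    Returns None when the limit is unlimited. If soft_limit is not given,
    the soft RLIMIT_NOFILE of the current process is used.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - files_in_use


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    # 150 is the starting file count reported elsewhere in this thread.
    print("headroom with 150 index files open:", open_file_headroom(150))
```

With the common default of 1024, a single index holding 150 files leaves limited room once several cores and network sockets are counted, which is why raising the limit is suggested first.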
Re: partial optimize does not reduce the segment number to maxNumSegments
Thanks! It seems the file count in the index directory is segment# * 8 in my dev environment... I see there are 8 file extensions (.fnm .frq .fdt .fdx .nrm .prx .tii .tis), and each extension has as many files as there are segments.

Is it always safe to calculate the file count by multiplying the segment number by 8? Of course this excludes the segments_N, segments.gen and xxx_del files.

I found that most of the cores have a file count that can be calculated with the above formula, but a few cores do not match...

Thanks
Renee
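One way to see why a few cores do not match the formula is to tally the files by extension rather than only counting them. This is a minimal sketch; the function name `count_by_extension` and the sample file names are illustrative assumptions, not output from a real index.

```python
import os
from collections import Counter


def count_by_extension(filenames):
    """Tally file names by extension.

    Files without an extension (for example segments_N) are tallied
    under their full name so they stay visible in the result.
    """
    counts = Counter()
    for name in filenames:
        _stem, ext = os.path.splitext(name)
        counts[ext or name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical listing: two segments, one of them missing its .frq file.
    sample = ["_0.fnm", "_0.frq", "_1.fnm", "segments_2", "segments.gen"]
    for ext, n in sorted(count_by_extension(sample).items()):
        print(ext, n)
```

For a real core, pass `os.listdir(index_dir)` as the argument. Any extension whose tally differs from the segment count points at what breaks the multiply-by-8 rule.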
Re: partial optimize does not reduce the segment number to maxNumSegments
Yeah, I can figure out the segment number by going to the stats page of Solr... but my question was how to figure out the exact total number of files in the 'index' folder for each core.

Like I mentioned in my previous message, I currently have 8 files per segment (.prx, .tii, etc.), but it seems this might change if I use term vectors, for example. So I need suggestions on how to accurately figure out the total file count.

Thanks
Re: partial optimize does not reduce the segment number to maxNumSegments
Why do you care? You haven't outlined why having the precise numbers here is necessary. Perhaps with a higher-level statement of the problem you're trying to solve we could make some better suggestions.

Best
Erick

On Wed, Apr 13, 2011 at 5:23 PM, Renee Sun renee_...@mcafee.com wrote:

> yeah, I can figure out the segment number by going to stat page of solr... but my question was how to figure out exact total number of files in 'index' folder for each core.
>
> Like I mentioned in previous message, I currently have 8 files per segment (.prx .tii etc), but it seems this might change if I use term vector for example. So I need suggestions on how to accurately figure out the total file number.
>
> thanks
Re: partial optimize does not reduce the segment number to maxNumSegments
Sorry, I should have elaborated on that earlier...

In our production environment, we have multiple cores and ingest runs continuously all day long; we only optimize periodically, once a day at midnight. So sometimes we see the 'too many open files' error. To prevent that from happening, in production we maintain a script that monitors the total number of segment files across all cores and sends out warnings if that number exceeds a threshold... it is a kind of preventive measure.

Currently we use a Linux command to count the files. We are wondering whether we can simply use a formula to figure out this number instead; that would be better. It seems we could use the stats URL to get the segment count and multiply it by 8 (that is what we get given our schema).

Any better way to approach this?

Thanks a lot!
Renee
Re: partial optimize does not reduce the segment number to maxNumSegments
OK, I dug more into this and realized the file extensions can vary depending on the schema, right? For instance, we don't have *.tvx, *.tvd, *.tvf (we are not using term vectors)... and I suspect the file extensions may change with future Lucene releases?

Now it seems we can't count the files using any formula; we have to list all the files in that directory and count them that way...

Any insight will be appreciated. Thanks
Renee
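Listing and counting, as concluded above, does not depend on the schema or the Lucene version. The sketch below assumes the `<data root>/<core>/index` layout visible in the paths quoted in this thread; the function names and the threshold are hypothetical and would need adapting to a real deployment.

```python
import os


def index_file_counts(data_root):
    """Return {core name: number of regular files in its index directory}.

    Entries under data_root that have no 'index' subdirectory are skipped.
    """
    counts = {}
    for core in sorted(os.listdir(data_root)):
        index_dir = os.path.join(data_root, core, "index")
        if os.path.isdir(index_dir):
            counts[core] = sum(
                1 for entry in os.scandir(index_dir) if entry.is_file()
            )
    return counts


def over_threshold(counts, threshold):
    """Return (exceeded, total) for the summed file count of all cores."""
    total = sum(counts.values())
    return total > threshold, total
```

A cron job could call `over_threshold(index_file_counts(root), limit)` and send a warning when the first element of the result is true.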
Re: partial optimize does not reduce the segment number to maxNumSegments
As Hoss mentioned earlier in the thread, you can use the statistics page from the admin console to view the current number of segments. But if you want to know by looking at the files, each segment will have a unique prefix, such as _u. There will be one unique prefix for every segment in the index.

-Jay

On Tue, Apr 12, 2011 at 3:16 PM, Renee Sun renee_...@mcafee.com wrote:

> ok I dug more into this and realize the file extensions can vary depending on schema, right? for instance we dont have *.tvx, *.tvd, *.tvf (not using term vector)... and I suspect the file extensions may change with future lucene releases?
>
> now it seems we can't just count the file using any formula, we have to list all files in that directory and count that way... any insight will be appreciated.
>
> thanks
> Renee
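The prefix rule described above can be sketched as follows. It assumes segment files look like `_u.frq` and that deletion files carry a generation suffix such as `_u_1.del`; that naming is an assumption about this Lucene generation, and the function name `segment_prefixes` is hypothetical.

```python
def segment_prefixes(filenames):
    """Return the set of unique segment prefixes (e.g. '_u') in a listing.

    Files that do not start with '_' (segments_N, segments.gen) are ignored.
    """
    prefixes = set()
    for name in filenames:
        if not name.startswith("_"):
            continue
        stem = name.split(".", 1)[0]  # '_u_1.del' -> '_u_1', '_u.frq' -> '_u'
        # Drop an optional generation suffix: 'u_1' -> 'u'
        prefixes.add("_" + stem[1:].split("_", 1)[0])
    return prefixes


if __name__ == "__main__":
    sample = ["_u.fnm", "_u.frq", "_u_1.del", "_v.fnm", "segments_3", "segments.gen"]
    print(len(segment_prefixes(sample)), "segments")
```

The length of the returned set is the segment count as seen from the file system, which can be compared with the number shown on the statistics page.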
Re: partial optimize does not reduce the segment number to maxNumSegments
Hi Hoss,

Thanks for your response...

You are right, I had a typo in my question, but I did use maxSegments, and here is the exact URL I used:

curl 'http://localhost:8080/solr/97/update?optimize=true&maxSegments=10&waitFlush=true'

I used jconsole and du -sk to monitor each partial optimize, and I am sure the optimize finished. It always reduces the segment files from 130+ to 65+ when I start with maxSegments=10; when I run it again with maxSegments=9, it reduces to somewhere in the 50s. When I use maxSegments=2, it always reduces to 18 segment files, and maxSegments=1 (full optimize) always reduces the core to 10 segment files. This has been repeated about a dozen times.

I think the resulting file count depends on the size of the core. I have a core that takes 10GB of disk space and has 4 million documents. Perhaps it also depends on other Solr/Lucene configuration? Let me know if I should give you any data from our Solr config.

Here is the actual data from a test I ran lately, for your reference. You can see it definitely finished each partial optimize, and the time spent is also included (please note I am using a core id there which is different from yours):

/tmp # ls /xxx/solr/data/32455077/index | wc    --- this is the start point, 150 seg files
150 150 946

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=10&waitFlush=true'
real 0m36.050s
user 0m0.002s
sys 0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc    --- after first partial optimize (10), reduced to 82
82 82 746

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=9&waitFlush=true'
real 1m54.364s
user 0m0.003s
sys 0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
74 74 674

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=8&waitFlush=true'
real 2m0.443s
user 0m0.002s
sys 0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
66 66 602

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=7&waitFlush=true'
<?xml version="1.0" encoding="UTF-8"?>
real 3m22.201s
user 0m0.002s
sys 0m0ls

/tmp # ls /xxx/solr/data/32455077/index | wc
58 58 530

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=6&w
real 3m29.277s
user 0m0.001s
sys 0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
50 50 458

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=5&w
real 3m41.514s
user 0m0.003s
sys 0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
42 42 386

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=4&w
real 5m35.697s
user 0m0.003s
sys 0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
34 34 314

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=3&wa
real 7m8.773s
user 0m0.003s
sys 0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
26 26 242

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=2&w
real 9m18.814s
user 0m0.004s
sys 0m0.001s

/tmp # ls /xxx/solr/data/32455077/index | wc
18 18 170

/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=1&w    (full optimize)
real 16m6.599s
user 0m0.003s
sys 0m0.004s

Disk Space Usage:
first 3 runs took about 20% extra
middle couple runs took about 50% extra
last full optimize took 100% extra
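The file counts in the listing above drop by exactly 8 for every step of maxSegments, which fits a simple model: 8 files per segment plus 2 index-level files. The sketch below checks that model against the reported numbers. The figure of 8 files per segment is taken from elsewhere in this thread; reading the remaining 2 as segments_N and segments.gen is an assumption.

```python
# maxSegments value -> file count reported by `ls | wc` after that run
observed = {10: 82, 9: 74, 8: 66, 7: 58, 6: 50, 5: 42, 4: 34, 3: 26, 2: 18}

FILES_PER_SEGMENT = 8   # .fnm .frq .fdt .fdx .nrm .prx .tii .tis
INDEX_LEVEL_FILES = 2   # assumption: segments_N and segments.gen


def expected_files(segments):
    """File count predicted for an index holding the given number of segments."""
    return segments * FILES_PER_SEGMENT + INDEX_LEVEL_FILES


# Collect every run whose observed count disagrees with the model.
mismatches = {m: n for m, n in observed.items() if expected_files(m) != n}
print(mismatches)  # prints {}
```

Every run matches, and `expected_files(1)` gives 10, the count reported for the full optimize. Under this reading each partial optimize did reach the requested number of segments, and the larger numbers are file counts rather than segment counts.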
Re: partial optimize does not reduce the segment number to maxNumSegments
: /tmp # ls /xxx/solr/data/32455077/index | wc --- this is the start point, 150 seg files
: 150 150 946
: /tmp # time curl

the number of files in the index directory is not the number of segments.

the number of segments is an internal lucene concept that impacts the number of files, but it is not an actual file count. A segment can consist of multiple files depending on how your schema.xml is configured (and whether you are using the compound file format).

You can see the current number of segments by looking at the stats page...

http://localhost:8983/solr/admin/stats.jsp

SolrIndexReader{this=64a7c45e,r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10}

...that's from the solr example, where the index directory at the time of that request actually contained 93 files.

-Hoss
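For a monitoring script, the segment count can be pulled out of a reader description like the one quoted above. This is a minimal sketch that only parses text already fetched from the stats page; fetching the page is left out, and the function name `parse_segment_count` is hypothetical.

```python
import re


def parse_segment_count(stats_text):
    """Return the integer after 'segments=' in stats_text, or None if absent."""
    match = re.search(r"\bsegments=(\d+)", stats_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = ("SolrIndexReader{this=64a7c45e,"
            "r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10}")
    print(parse_segment_count(line))
```

The exact text of the reader description may differ between Solr versions, so the pattern should be checked against the output of the installed release.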
Re: partial optimize does not reduce the segment number to maxNumSegments
: I have a core with 120+ segment files and I tried partial optimize specify
: maxNumSegments=10, after the optimize the segment files reduced to 64 files;

a) the option you want to specify is maxSegments .. not maxNumSegments

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22

b) i can't reproduce this ... i just created an index with 200 segments and when i hit the example url from the wiki...

curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'

...my index was correctly optimized down to 10 segments.

is it possible that you just didn't wait long enough and you were observing the number of segments while the optimize was still taking place?

-Hoss
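Since the parameter name is easy to get wrong, it can help to assemble the request URL in one place. The sketch below builds the same query string as the wiki example; the helper name `optimize_url` is hypothetical, and the base URL is the one used in this thread.

```python
from urllib.parse import urlencode


def optimize_url(base, max_segments, wait_flush=False):
    """Build a partial-optimize update URL for a Solr core.

    Uses the maxSegments parameter named in this thread (not maxNumSegments).
    """
    params = {
        "optimize": "true",
        "maxSegments": max_segments,
        "waitFlush": str(wait_flush).lower(),
    }
    return f"{base}/update?{urlencode(params)}"


if __name__ == "__main__":
    print(optimize_url("http://localhost:8983/solr", 10))
```

When the resulting URL is passed to curl on a shell command line, it needs to be quoted so that the `&` separators are not interpreted by the shell.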
partial optimize does not reduce the segment number to maxNumSegments
I have a core with 120+ segment files, and I tried a partial optimize specifying maxNumSegments=10. After the optimize, the segment files were reduced to 64 files; I ran the same optimize again and it reduced to 30-something; this keeps going, and eventually it drops to a number in the teens.

I was expecting the optimize to result in exactly 10 segment files, or somewhere near that. Why do I have to manually repeat the optimize to reach that number?

Thanks
Renee