On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
Dear all,
I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been building a small index (fewer than 20k
documents), once with a mergeFactor of 10 and once with 100.
In both cases, indexing took about 11.5 min:
mergeFactor: 10
<str name="Time taken ">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken ">0:11:44.441</str>
Tomcat restart
<str name="Time taken ">0:11:34.143</str>
This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).
Now, I have three questions:
1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the
up-to-date view of the file on the file system. I tested that. But
it's not necessarily what the current SOLR core is using, is it?
Is there a way to check the mergeFactor that is actually in use
(while indexing is running)?
It could very well be the case that you aren't seeing any merges with
only 20K docs. Ultimately, if you really want to, you can look in
your data.dir and count the files. If you have indexed a lot and have
an MF of 100 and haven't done an optimize, you will see a lot more
index files.
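
The file count suggested above can be scripted. What follows is a
minimal sketch, not an official Solr tool. It assumes the index files
sit in one directory under your data.dir (the path is passed in by
you), and that per-segment files start with an underscore followed by
the segment name, e.g. _0.cfs or _1a.fdt. The function name
summarize_index_dir is made up for this example.

```python
import os
import re
import sys

# Assumption: per-segment files look like "_<name>.<ext>" or
# "_<name>_<gen>.<ext>", where <name> is a lowercase base-36 string.
SEGMENT_FILE = re.compile(r"^_([0-9a-z]+)[._]")


def summarize_index_dir(index_dir):
    """Return (number of files, number of distinct segment names)."""
    files = [
        name for name in os.listdir(index_dir)
        if os.path.isfile(os.path.join(index_dir, name))
    ]
    segments = set()
    for name in files:
        match = SEGMENT_FILE.match(name)
        if match:
            segments.add(match.group(1))
    return len(files), len(segments)


if __name__ == "__main__":
    # Usage: python count_segments.py /path/to/data/index
    file_count, segment_count = summarize_index_dir(sys.argv[1])
    print("files:", file_count, "segments:", segment_count)
```

Running it after a full import with each mergeFactor, and before any
optimize, gives two counts to compare. If both runs report the same
small number of segments, no merge was triggered in either run.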
2. I changed the mergeFactor in both available settings (default and
main index) in the solrconfig.xml file of the core I am reindexing.
Is that the correct place? Should a change in performance be
noticeable when increasing from 10 to 100? Or is the change not
perceivable if the requests for data are taking far longer than all
the indexing itself?
Likely, but not guaranteed. Typically, larger merge factors are good
for batch indexing, but a lot of that has changed with Lucene's new
background merger, such that I don't know if it matters as much anymore.
3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
(Or some other setting?)
No, those are separate things. The ramBufferSizeMB (although, I like
the thought of a "rum"BufferSizeMB too! ;-) ) controls how many docs
Lucene holds in memory before it has to flush. MF controls how many
segments can accumulate on disk before they are merged.
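
To make the difference concrete, here is a small simulation. It is a
deliberately simplified model, not Lucene's actual merge policy: it
assumes every flush writes one new segment, and that as soon as
mergeFactor segments exist on one level they are merged into a single
segment on the next level. Segment sizes and background merging are
ignored, and the flush counts are made-up numbers.

```python
def segments_after_flushes(num_flushes, merge_factor):
    """Count segments left on disk in a simplified level-based merge model."""
    levels = []  # levels[i] = number of segments currently on level i
    for _ in range(num_flushes):
        level = 0
        while True:
            if level == len(levels):
                levels.append(0)
            levels[level] += 1  # a flushed or freshly merged segment arrives
            if levels[level] < merge_factor:
                break
            # merge_factor segments on this level: merge them into one
            # segment and push that segment up to the next level.
            levels[level] = 0
            level += 1
    return sum(levels)


if __name__ == "__main__":
    for flushes in (5, 250):
        for mf in (10, 100):
            print(flushes, "flushes, mergeFactor", mf, "->",
                  segments_after_flushes(flushes, mf), "segments")
```

In this model, 250 flushes leave 7 segments with a mergeFactor of 10
and 52 with a mergeFactor of 100. With only 5 flushes both settings
leave 5 segments and no merge ever runs, which would match identical
indexing times on a small index.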
(I am still trying to get profiling information on how much
application time is eaten up by db connection/requests/processing.
The root entity query takes about 20ms on average. The child entity
query takes less than 10ms.
I have my custom entity processor running on the child entity; it
populates the map using a multi-row result set. I have also attached
one regex transformer and one script transformer.)
Thank you for any tips!
Chantal
--
Chantal Ackermann
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search