> avg-cpu:  %user   %nice    %sys  %iowait   %idle
>            1.23    0.00    0.03     0.03   98.71

I agree, real bad statistics, actually.
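One thing worth keeping in mind: a bare "iostat" call only prints averages
since boot, so it says very little about what happens during the import
itself. Sampling at an interval while the DIH run is in progress is more
telling - for example (assuming the sysstat package is installed):

  iostat -x 5    # extended per-device I/O statistics every 5 seconds
  vmstat 5       # CPU, memory and swap activity at the same interval

If those stay near idle while the import is running, the bottleneck is most
likely outside Solr (database round trips, network), not disk or CPU.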
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.

To me the former appears to be too high and the latter too low (for your
machine configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -
1. The stock solrconfig.xml comes with two sections, <indexDefaults> and
   <mainIndex>. Options in the latter override the former. Just make sure
   that you have the right values in the right place.
2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database level optimization (creating views, in-memory tables ...)
   might hold the answer.
3. Tried playing around with the JDBC parameters in the data source?
   Setting the "batchSize" property to a considerable value might help -
   a rough sketch of 1. and 3. follows below.
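For illustration only - the numbers are placeholders you would need to tune
for your data and your 8G box, and the driver/url in the dataSource stand in
for whatever database you are using:

  <!-- solrconfig.xml: make sure the values end up in <mainIndex>,
       not only in <indexDefaults> -->
  <mainIndex>
    <ramBufferSizeMB>512</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </mainIndex>

  <!-- data-config.xml: batchSize is handed to the JDBC driver as the
       statement fetch size, so more rows come back per round trip -->
  <dataSource type="JdbcDataSource"
              driver="your.jdbc.Driver"
              url="jdbc:..."
              user="..." password="..."
              batchSize="10000"/>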
Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:02 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi all,
>
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine now, which is faster and less occupied.
>
> The new machine is a 64bit 8Gig-RAM RedHat, JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour so far,
> which means at least 1.5 hours for 200k - which is as fast/slow as
> before (on the less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>
> iostat
> Linux 2.6.9-67.ELsmp 08/03/2009
>
> avg-cpu:  %user   %nice    %sys  %iowait   %idle
>            1.23    0.00    0.03     0.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that
> from my own machine, and did only a ping from the linux box to the db
> server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
> Thanks!
> Chantal
>
> Chantal Ackermann schrieb:
>
>> Hi again!
>>
>> Thanks for the answer, Grant.
>>
>> > It could very well be the case that you aren't seeing any merges with
>> > only 20K docs. Ultimately, if you really want to, you can look in
>> > your data.dir and count the files. If you have indexed a lot and have
>> > an MF of 100 and haven't done an optimize, you will see a lot more
>> > index files.
>>
>> Do you mean that 20k is not representative enough to test those
>> settings? I've chosen the smaller data set so that the index can run
>> completely but doesn't take too long at the same time.
>> If it were faster to begin with, I could use a larger data set, of
>> course. I still can't believe that 11 minutes is normal (I haven't
>> managed to make it run faster or slower than that; that duration is
>> very stable).
>>
>> It "feels kinda" slow to me...
>> Out of your experience - what would you expect as duration for an index
>> with:
>> - 21 fields, some using a text type with 6 filters
>> - database access using DataImportHandler with a query of (far) less
>>   than 20ms
>> - 2 transformers
>>
>> If I knew that indexing time should be shorter than that, at least, I
>> would know that something is definitely wrong with what I am doing or
>> with the environment I am using.
>>
>> > Likely, but not guaranteed. Typically, larger merge factors are good
>> > for batch indexing, but a lot of that has changed with Lucene's new
>> > background merger, such that I don't know if it matters as much
>> > anymore.
>>
>> Ok. I also read some postings where it basically said that the default
>> parameters are ok, and one shouldn't mess around with them.
>>
>> The thing is that our current search setup uses Lucene directly, and
>> the indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500).
>> The fields are different, the complete setup is different. But it will
>> be hard to advertise a new implementation/setup where indexing is three
>> times slower - unless I can give some reasons why that is.
>>
>> The full index should be fairly fast because the backing data is
>> updated every few hours. I want to put in place an incremental/partial
>> update as the main process, but full indexing might have to be done at
>> certain times if data has changed completely, or the schema has to be
>> changed/extended.
>>
>> > No, those are separate things. The ramBufferSizeMB (although, I like
>> > the thought of a "rum"BufferSizeMB too! ;-) ) controls how many docs
>> > Lucene holds in memory before it has to flush. MF controls how many
>> > segments are on disk.
>>
>> alas! the rum. I had that typo on the commandline before. that's my
>> subconscious telling me what I should do when I get home, tonight...
>>
>> So, increasing ramBufferSize should lead to higher memory usage,
>> shouldn't it? I'm not seeing that. :-(
>>
>> I'll try once more with MF 10 and a higher rum... well, you know... ;-)
>>
>> Cheers,
>> Chantal
>>
>> Grant Ingersoll schrieb:
>>
>>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to find out which settings give the best full index
>>>> performance for my setup.
>>>> Therefore, I have been running a small index (less than 20k
>>>> documents) with a mergeFactor of 10 and 100.
>>>> In both cases, indexing took about 11.5 min:
>>>>
>>>> mergeFactor: 10
>>>> <str name="Time taken ">0:11:46.792</str>
>>>> mergeFactor: 100
>>>> /admin/cores?action=RELOAD
>>>> <str name="Time taken ">0:11:44.441</str>
>>>> Tomcat restart
>>>> <str name="Time taken ">0:11:34.143</str>
>>>>
>>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
>>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
>>>> ATA disk).
>>>>
>>>> Now, I have three questions:
>>>>
>>>> 1. How can I check which mergeFactor is really being used? The
>>>> solrconfig.xml that is displayed in the admin application is the
>>>> up-to-date view on the file system. I tested that. But it's not
>>>> necessarily what the current SOLR core is using, is it?
>>>> Is there a way to check on the actually used mergeFactor (while the
>>>> index is running)?
>>>
>>> It could very well be the case that you aren't seeing any merges with
>>> only 20K docs. Ultimately, if you really want to, you can look in
>>> your data.dir and count the files. If you have indexed a lot and have
>>> an MF of 100 and haven't done an optimize, you will see a lot more
>>> index files.
>>>
>>>> 2. I changed the mergeFactor in both available settings (default and
>>>> main index) in the solrconfig.xml file of the core I am reindexing.
>>>> Is that the correct place? Should a change in performance be
>>>> noticeable when increasing from 10 to 100? Or is the change not
>>>> perceivable if the requests for data are taking far longer than all
>>>> the indexing itself?
>>>
>>> Likely, but not guaranteed. Typically, larger merge factors are good
>>> for batch indexing, but a lot of that has changed with Lucene's new
>>> background merger, such that I don't know if it matters as much
>>> anymore.
>>>
>>>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>>> (Or some other setting?)
>>>
>>> No, those are separate things. The ramBufferSizeMB (although, I like
>>> the thought of a "rum"BufferSizeMB too! ;-) ) controls how many docs
>>> Lucene holds in memory before it has to flush. MF controls how many
>>> segments are on disk.
>>>
>>>> (I am still trying to get profiling information on how much
>>>> application time is eaten up by db connection/requests/processing.
>>>> The root entity query is about (average) 20ms. The child entity
>>>> query is less than 10ms.
>>>> I have my custom entity processor running on the child entity that
>>>> populates the map using a multi-row result set. I have also attached
>>>> one regex and one script transformer.)
>>>>
>>>> Thank you for any tips!
>>>> Chantal
>>>>
>>>> --
>>>> Chantal Ackermann
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
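P.S. Regarding the earlier question about checking which mergeFactor is
actually in effect: the quickest sanity check is the one Grant describes -
watch the number of files in the index directory while the import runs.
Something along these lines, with the path adjusted to your dataDir setting:

  watch -n 10 "ls /path/to/core/data/index | wc -l"

With a high mergeFactor the file count keeps climbing between (rare) merges;
with a low one it collapses back regularly as segments get merged.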