On 9/25/2015 10:10 PM, Ravi Solr wrote:
> thank you for taking time to help me out. Yes I was not using cursorMark, I
> will try that next. This is what I was doing, its a bit shabby coding but
> what can I say my brain was fried :-) FYI this is a side process just to
> correct a messed up string. The actual indexing process was working all the
> time as our business owners are a bit petulant about stopping indexing. My
> autocommit conf and code is given below, as you can see autocommit should
> fire every 100 docs anyway

It took a while, but I finally saw how this pages through the docs:
you are filtering on the very text that you are removing, so each pass
does indeed require that the previous changes are committed before the
loop runs again.  Switching to cursorMark is probably not necessary if
you optimize your query and your commits.
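
For reference, in case you do later switch to cursorMark, the flow uses
standard Solr deep-paging parameters ("id" here is a placeholder for
your uniqueKey field):

    # First request: sort must end on the uniqueKey field, cursorMark starts at *
    q=<your query>&rows=1000&sort=id asc&cursorMark=*

    # Every response includes a nextCursorMark; send it back verbatim
    q=<your query>&rows=1000&sort=id asc&cursorMark=<nextCursorMark from previous response>

You stop when the nextCursorMark you get back equals the cursorMark you
just sent.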

My advice incorporates some of what Erick said, and some ideas of my own:

I think you should remove autoSoftCommit, and set autoCommit to a
maxTime of 300000 (five minutes) and do not include maxDocs.

    <autoCommit>
       <maxTime>300000</maxTime>
    </autoCommit>

Remove the 5-second sleep from the code.  I would also increase the
number of documents for each loop beyond 100 ... to a minimum of 1000,
possibly more like 10000.  The call to getDocs inside the loop should
not use the size of the previous result; it should use the number of
docs you define for the loop.  After the "add" call in your processDocs
method, send a soft commit, so the code looks like this:

  client.add(inList);
  // commit(waitFlush, waitSearcher, softCommit) -- the third "true" makes it a soft commit
  client.commit(true, true, true);

The autoCommit will ensure your transaction log never gets very large,
and the soft commit in your code will take care of change visibility as
quickly as possible.  You might find that some loops take longer than
five seconds, but it should work.

You need to remove the "uuid:[* TO *]" filter.  That range query does
unnecessary (and fairly slow) work on the server side; your other
filter already guarantees that the results match the range filter, so
the range filter is redundant.  I assume you have tried the query
manually, so you know it actually works?
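
To illustrate, the change is simply dropping the range clause and
keeping whatever filter you already have on the bad text:

    # before: two filters, the range filter doing redundant work
    fq=uuid:[* TO *]&fq=<your existing filter on the bad text>

    # after: the remaining filter already implies that uuid has a value
    fq=<your existing filter on the bad text>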

I'm guessing that uuid is a StrField, not an actual UUID type.  I'm
reasonably certain that if it were a UUID type, it would not have
accepted the class name that you are trying to remove.

What is your uniqueKey field?  I hope it's not uuid.  I think that you
would not get the results you want if that were the case.  Your code
excerpt hints that the uniqueKey is another field.

I pulled your code into a new Eclipse project and made the recommended
changes, plus a few other very small modifications.  The results are here:

http://apaste.info/w48

I had no context for the "systemid" variable, so I defined it to get rid
of the compiler error.  It is only used for logging.  I also had to
define the "log" variable to get the code to validate, which I think
you've already done in your own class, so that can be removed from my
workup.  The code is formatted to my company's standard formatting,
which probably doesn't match your own standard.

Something I just noticed:  You could probably remove the sort from the
query, which might reduce the amount of memory used on the Solr server
and speed everything up.
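
In query terms (the sort field here is a guess -- substitute whatever
your code actually sorts on):

    # current: sorting builds per-field sort structures in server memory
    q=<your query>&sort=uuid asc&rows=1000

    # simpler: no sort needed, because the filter (not document order)
    # determines which docs each pass picks up
    q=<your query>&rows=1000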

If the modified code still runs into problems, there might be a more
serious issue on the server side of your Solr install.

Thanks,
Shawn
