Re: [Q] Faster Atomic Updates - use docValues?

2019-12-11 Thread Erick Erickson
GCEasy works fine. GCViewer is something you can run on your local machine;
if you have very large GC logs, uploading them to a web service can take quite a
while.
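
If you want to poke at the logs locally first, note that a stock install is
already writing them; something like this (the path assumes the default
bin/solr layout; adjust for yours):

ls /path/to/solr/server/logs/solr_gc.log*   # bin/solr enables GC logging by default
# open these in GCViewer, or upload one to GCEasy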

The next step, if you can’t find anything satisfactory is to put a profiler on
the running Solr instance, which will tell you where the time is being spent.

Do note that indexing is an I/O-intensive operation, especially when segments
are being merged, so if you were swapping, I’d expect I/O to go from
very high to extremely high….

Good luck!

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-11 Thread Paras Lehana
Hi Erick,

You're right - IO was extraordinarily high. But something odd happened. To
actually establish a correlation, I tried different heap sizes with the
default solrconfig.xml values, as you recommended.

   1. Increased heap to 4G: speed 8500k.
   2. Decreased to 2G: back to the old 65k.
   3. Increased back to 4G: speed 50k.
   4. Decreased to 3G: speed 50k.
   5. Increased to 10G: speed 8500k.

The speed is a 1-minute average taken after indexing starts. With the last
10G run, as (maybe) expected, I got a java.lang.NullPointerException at
org.apache.solr.handler.component.RealTimeGetComponent.getInputDocument
before committing. I'm not getting the faster speeds with any of the heap
sizes now. I will continue digging deeper and, in the meantime, I will be
getting the 24G RAM. Currently giving Solr a 6G heap (speed is 55k - too
low).
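
For the record, here is roughly how I'm switching heap sizes between runs (a
sketch, assuming the stock bin/solr script):

bin/solr stop -all
bin/solr start -m 4g   # -m sets both -Xms and -Xmx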

After the earlier progress this may be a step backward, but I do believe I
will take two steps forward soon. All credit to you. Getting into GC logs
now. I'm a newbie here - I know GC theory but have never analyzed the logs.
What tool do you prefer? I'm planning to upload Solr's current GC log to
GCeasy.

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-11 Thread Erick Erickson
I doubt GC alone would make nearly that difference. More likely
it’s I/O interacting with MMapDirectory. Lucene uses OS memory
space for much of its index, i.e. the RAM left over
after that used for the running Solr process (and any other
processes of course). See:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

So if you don’t leave much OS memory space for Lucene’s
use via MMap, that can lead to swapping. My bet is that was
what was happening, and your CPU utilization was low; Lucene and
thus Solr was spending all its time waiting around for I/O. If that theory
is true, your disk I/O should have been much higher before you reduced
your heap.
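
If you want to confirm the swapping theory while a batch is running, the
usual OS tools are enough; for example, on Linux (iostat comes from the
sysstat package):

free -h        # RAM left over for the OS page cache
vmstat 5       # si/so columns show swap-in/swap-out activity
iostat -x 5    # per-device I/O utilization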

IOW, I claim if you left the java heap at 12G and increased the physical
memory to 24G you’d see an identical (or nearly) speedup. GC for a 12G
heap is rarely a bottleneck. That said you want to use as little heap for
your Java process as possible, but if you reduce it too much you wind up
with other problems. OOM for one, and I’ve also seen GC take an inordinate
amount of time when it’s _barely_ enough to run. You hit a GC that recovers,
say, 10M of heap, which is barely enough to continue for a few milliseconds
before hitting another GC… As you can tell, “this is more art than science”…

Glad to hear you’re making progress!
Erick

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-11 Thread Paras Lehana
Just to update: I kept the defaults. The indexing got only a little boost,
but I have decided to continue with the defaults and do incremental
experiments only. To my surprise, our development server had only 12GB RAM,
of which 8G was allocated to Java. Because I could not increase the RAM, I
tried decreasing the heap to 4G instead and guess what! My indexing speed got
a boost of over *50x*. Erick, thanks for helping. I think I should do more
homework on GC too. Your GC guess seems to be valid. I have raised the
request to increase RAM on the development server to 24GB.

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-09 Thread Erick Erickson
Note that that article is from 2011. That was in the Solr 3x days when many, 
many, many things were different. There was no SolrCloud for instance. Plus 
Tom’s problem space is indexing _books_. Whole, complete, books. Which is, 
actually, not “normal” indexing at all as most Solr indexes are much smaller 
documents. Books are a perfectly reasonable use-case of course, but have a 
whole bunch of special requirements.

get-by-id should be very efficient, _except_ that the longer you spend before 
opening a new searcher, the larger the internal data buffers supporting 
get-by-id need to be.
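
You can watch get-by-id directly through the implicit real-time get handler;
for example (host, port and the id value are placeholders; collection name
taken from your earlier mail):

curl "http://localhost:8983/solr/product/get?id=SOME_ID"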

Anyway, best of luck
Erick

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-08 Thread Paras Lehana
Hi Erick,

I have reverted to the original values and yes, I did see improvement. I
will collect more stats. *Thank you for helping. :)*

Also, here is the reference article that I had referred to when changing
values:
https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1

The article was perhaps about normal indexing and thus suggested increasing
mergeFactor and then finally optimizing. In my case, could a large number of
segments have impacted the get-by-id step of atomic updates? Just being
curious.
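
To check whether the segment count is actually exploding, I'm planning to
pull the segment info between batches; something like this (assuming the
implicit /admin/segments handler; host and port are placeholders):

curl "http://localhost:8983/solr/product/admin/segments"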

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-06 Thread Paras Lehana
Hey Erick,

We have just upgraded to 8.3 before starting the indexing. We were on 6.6
before that.

Thank you for your continued support and resources. Again, I have already
taken your suggestion to start afresh, and that's what I'm going to do.
Don't get me wrong; I have just been asking questions. I will surely get
back with my experience after performing the full indexing.

Thanks again! :)

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-06 Thread Erick Erickson
Nothing implicitly handles optimization; you must continue to do that
externally.

Until you get to the bottom of your indexing slowdown, I wouldn’t bother
with it at all; trying to do all these things at once is what led to your
problem in the first place, so please change one thing at a time. You say:

“For a full indexing, optimizations occurred 30 times between batches”.

This is horrible. I’m not sure what version of Solr you’re using. If it’s
7.4 or earlier, this means the entire index was rewritten 30 times.
The first time it would condense all segments into a single segment, or
1/30 of the total. The second time it would rewrite all that, 2/30 of the
index into a new segment. The third time 3/30. And so on.

If Solr 7.5 or later, it wouldn’t be as bad, assuming your index was over
5G. But still.

See: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ 
for 7.4 and earlier,
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ for 7.5 and 
later

Eventually you can optimize by sending in an http or curl request like this:
../solr/collection/update?optimize=true
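
For example (host, port and collection name are placeholders; maxSegments is
optional and defaults to 1):

curl "http://localhost:8983/solr/product/update?optimize=true&maxSegments=1"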

You also changed to using StandardDirectory. The default has heuristics built in
to choose the best directory implementation.
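
For reference, the stock solrconfig.xml entry looks something like this; left
alone, Solr typically ends up with an NRTCachingDirectory wrapping an
MMapDirectory on 64-bit systems:

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>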

I can’t emphasize enough that you’re changing lots of things at one time. I
_strongly_ urge you to go back to the standard setup, make _no_ modifications
and change things one at a time. Some very bright people have done a lot
of work to try to make Lucene/Solr work well.

Make one change at a time. Measure. If that change isn’t helpful, undo it and
move to the next one. You’re trying to second-guess the Lucene/Solr
developers who have years of understanding how this all works. Assume they
picked reasonable options for defaults and that Lucene/Solr performs reasonably
well. When I get inexplicably poor results, I usually assume it was the last 
thing I changed….

Best,
Erick
Re: [Q] Faster Atomic Updates - use docValues?

2019-12-05 Thread Paras Lehana
Hi Erick,

I believed optimizing explicitly merges segments, and that's why I was
expecting it to give a performance boost. I know that optimizations should
not be done very frequently. For a full indexing, optimizations occurred 30
times between batches. I take your suggestion to undo all the changes and
that's what I'm going to do. I mentioned the optimizations giving an
indexing boost (for some time) only to support your point about my
mergePolicy backfiring. I will certainly read again about the merge process.

Taking your suggestions: commits would be handled by autoCommit. What
implicitly handles optimizations? I think the merge policy, or is there any
other setting I'm missing?

I'm indexing via curl on the same server. The current speed reported by curl
is only 50k (down from 1300k for the first batch). I think the documents are
being indexed while curl is still transmitting the XML; only that would
explain such a low transfer speed. I don't think the whole XML is taking up
memory - I remember I had to change the curl options to get rid of a
transmission error for large files.

This is my curl request:

curl "http://localhost:$port/solr/product/update?commit=true" -T
batch1.xml -X POST -H 'Content-type: text/xml'

Although we have been doing this for ages, I think I should now consider
using the Solr post tool (since the indexing files stay on the same
server) or using Solarium (we use PHP to build the XMLs).
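
For example, with the bundled post tool the whole invocation would shrink to
something like this (core name from my setup; assumes the stock Solr layout):

bin/post -c product batch1.xml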

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-05 Thread Erick Erickson
>  I think I should have also done optimize between batches, no?

No, no, no, no. Absolutely not. Never. Never, never, never between batches.
I don’t recommend optimizing at _all_ unless there are demonstrable
improvements.

Please don’t take this the wrong way, the whole merge process is really
hard to get your head around. But the very fact that you’d suggest
optimizing between batches shows that the entire merge process is
opaque to you. I’ve seen many people just start changing things and
get themselves into a bad place, then try to change more things to get
out of that hole. Rinse. Repeat.

I _strongly_ recommend that you undo all your changes. Neither
commit nor optimize from outside Solr. Set your autocommit
settings to something like 5 minutes with openSearcher=true.
Set all autowarm counts in your caches in solrconfig.xml to 0,
especially filterCache and queryResultCache.

Do not set soft commit at all, leave it at -1.

Repeat do _not_ commit or optimize from the client! Just let your
autocommit settings do the commits.
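
A minimal sketch of what I mean, inside the <updateHandler> section of
solrconfig.xml (the 5-minute value is illustrative; tune to taste):

<autoCommit>
  <maxTime>300000</maxTime>          <!-- 5 minutes -->
  <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>-1</maxTime>              <!-- soft commits disabled -->
</autoSoftCommit>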

It’s also pushing things to send 5M docs in a single XML packet.
That all has to be held in memory and then indexed, adding to
pressure on the heap. I usually index from SolrJ in batches
of 1,000. See:
https://lucidworks.com/post/indexing-with-solrj/
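
A bare-bones SolrJ sketch of that batching pattern (URL, collection and field
names are placeholders; error handling omitted):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/product").build()) {
      List<SolrInputDocument> batch = new ArrayList<>(1000);
      for (int i = 0; i < 5_000_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("suggestion", "value " + i); // placeholder field
        batch.add(doc);
        if (batch.size() == 1000) { // send 1,000 docs at a time
          client.add(batch);        // no explicit commit; autoCommit handles it
          batch.clear();
        }
      }
      if (!batch.isEmpty()) client.add(batch); // flush the remainder
    }
  }
}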

Simply put, your slowdown should not be happening. I strongly
believe that it’s something in your environment, most likely
1> your changes eventually shoot you in the foot OR
2> you are running in too little memory and eventually GC is killing you. 
Really, analyze your GC logs. OR
3> you are running on underpowered hardware which just can’t take the load OR
4> something else in your environment

I’ve never heard of a Solr installation with such a massive slowdown during
indexing that was fixed by tweaking things like the merge policy etc.

Best,
Erick
Re: [Q] Faster Atomic Updates - use docValues?

2019-12-04 Thread Paras Lehana
Hey Erick,

> This is a huge red flag to me: "(but I could only test for the first few
> thousand documents”.


Yup, that's probably where the culprit lies. I could only test for the
starting batch because I had to wait for a day to actually compare. I
tweaked the merge values and kept whatever gave a speed boost. My first
batch of 5 million docs took only 40 minutes (atomic updates included) and
the last batch of 5 million took more than 18 hours. If this is an issue of
mergePolicy, I think I should have also done an optimize between batches, no?
I remember that when I indexed a single XML of 80 million documents, after
optimizing the core already indexed with 30 XMLs of 5 million each, posting
those 80 million took a whole day.



> The indexing rate you’re seeing is abysmal unless these are _huge_
> documents


Documents only contain the suggestion name, possible titles,
phonetics/spellcheck/synonym fields and numerical fields for boosting. They
are far smaller than what a Search Document would contain. Auto-Suggest is
only concerned about suggestions so you can guess how simple the documents
would be.


> Some data is held on the heap and some in the OS RAM due to MMapDirectory


I'm using StandardDirectory (which will make Solr choose the right
implementation). Also, I'm planning to read more about these (looking forward
to using MMap). Thanks for the article!


You're right. I should change one thing at a time. Let me experiment and
then I will summarize here what I tried. Thank you for your responses. :)

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-04 Thread Erick Erickson
This is a huge red flag to me: "(but I could only test for the first few 
thousand documents”

You’re probably right that that would speed things up, but pretty soon when 
you’re indexing
your entire corpus there are lots of other considerations.

The indexing rate you’re seeing is abysmal unless these are _huge_ documents, 
but you
indicate that at the start you’re getting 1,400 docs/second so I don’t think 
the complexity
of the docs is the issue here.

Do note that when we’re throwing RAM figures out, we need to draw a sharp 
distinction
between Java heap and total RAM. Some data is held on the heap and some in the 
OS
RAM due to MMapDirectory, see Uwe’s excellent article:
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Uwe recommends about 25% of your available physical RAM be allocated to Java as
a starting point. Your particular Solr installation may need a larger percent, 
IDK.

But basically I’d go back to all default settings and change one thing at a 
time.
First, I’d look at GC performance. Is it taking all your CPU? In which case you 
probably need to 
increase your heap. I pick this first because it’s very common that this is a 
root cause.

Next, I’d put a profiler on it to see exactly where I’m spending time. 
Otherwise you wind
up making random changes and hoping one of them works.

Best,
Erick

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-04 Thread Paras Lehana
Hi Erick,

Thank you for replying.


> Do you have empirical evidence that all these parameter changes are doing
> you any good?


When I played with the merge configuration last year, the new values gave
an instant boost to indexing (but I could only test for the first few
thousand documents). Increasing autoCommit made the indexing speed high for
the interval between commits. But when I tried even higher values, we had to
revert due to OOMs during indexing.


> The first thing I note is that 8G for a 250M document index is a red flag.


Yes, we have initiated a request to increase the RAM on the server to 32G.
I will also monitor the GC status. By now, Solr has atomically indexed over
115 million documents and has been running for a week. Indexing 5 million
documents is now taking more than 15 hours (~92 docs/s), while those 5
million documents would have taken only 1 hour at the start (1400 docs/s).
Of the total 12G on my system, none is free but 3G is available. With 4
cores, server CPU usage is under 150%.


> The second thing is you have no searchers being opened.


Oh! I had set it to false because showing real-time updates to the user was
not our concern. I had read somewhere that this could speed up indexing.
But you're probably right - I'm doing atomic updates and this could impact
the performance of get-by-id. That's what you are saying, right?


Why did you increase RamBufferSizeMB?


2G seems to be the hard limit (because of an int overflow) - I read this
somewhere. Also, I should have mentioned this: referring to this (and this),
increasing RamBufferSizeMB may help keep uncommitted documents in memory, but
that mainly matters for NRT, which I'm not using. :P
Actually, this was increased only for this indexing run - I wanted to measure
the impact, if any. But I agree, my issue certainly lies somewhere else.
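
So reverting is a one-line change in solrconfig.xml (100 is the shipped
default; Lucene itself caps the setting below 2048MB, which is where that 2G
"hard limit" comes from):

    <ramBufferSizeMB>100</ramBufferSizeMB>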



> The third thing is that you have changed the TieredMergePolicy extensively.


You're right about the large number of segments. I agree this may be adding
time for Solr to get and update the existing document. Increasing these
settings did give us an indexing boost in the early days, but as I said, I had
only tried it for the first batch of documents. It may be giving diminishing
or even negative returns once tens of millions of documents are indexed, which
could actually explain the exponential decrease in my speed as the number of
documents grows. For this indexing run, I only added maxMergeAtOnce, as per
Shawn's suggestion in some other thread. But he was using a maxMergeAtOnce of
35, and I think I should start from a lower value too. I will certainly work
on this now.
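
Something like this is what I plan to try first (a sketch in the Solr 7+
config style; the values shown are just the Lucene defaults, not a
recommendation):

    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>   <!-- default -->
      <int name="segmentsPerTier">10</int>  <!-- default -->
    </mergePolicyFactory>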


If it was a uniqueKey lookup I should think it’d be a problem for the
> first 50M docs.


I had assumed a uniqueKey lookup would behave like looking up an associative
array or a hash by its key. Because I don't know exactly how atomic updates
work at the code level, I was wondering whether something about the lookup
strategy/algorithm (which may be different from what I imagine) makes
get-by-id slower as the total document set grows.
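
In other words (my rough mental model, not taken from the Solr source): if
each get-by-id has to probe the terms index of every live segment rather than
one big hash table, the cost per update looks more like

    cost(get-by-id) ~ (number of live segments) x (one terms-index probe each)

which grows with segment count - and that would tie back to the merge
settings discussed above.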


Thank you again, Erick! :)

On Tue, 3 Dec 2019 at 18:50, Erick Erickson  wrote:

> Do you have empirical evidence that all these parameter changes are doing
> you any good?
>
> The first thing I note is that 8G for a 250M document index is a red flag.
> If you’re running on
> a larger machine, I’d increase that to 16G as a test. I’ve seen GC start
> to take up more and
> more CPU as you get closer to the max, sometimes to the point of having
> 90% or more of the CPU consumed by GC.
>
> The second thing is you have no searchers being opened. Solr has to keep
> certain in-memory structures in place to support Real Time Get, and those
> only get reclaimed when a new searcher is opened. Perhaps that’s chewing up
> memory and getting to a tipping point.
>
> Why did you increase RamBufferSizeMB?  I’ve rarely found much increase in
> throughput over the default 100M. It’s probably not very useful anyway,
> since your autocommit limits mean that unless you’re using that full 2G
> within 100,000 docs or 2 minutes, it won’t be used up anyway.
>
> The third thing is that you have changed the TieredMergePolicy
> extensively. When
> background merges kick in, they’ll be HUGE. Further, the settings will
> probably
> cause you to have a lot of segments, which is not ideal.
>
> Fourth, why do you think the lookup of the uniqueKey has anything to do
> with your slowdown? If I’m reading this right, you do atomic updates on 50M
> docs _then_ things get slow. If it was a uniqueKey lookup I should think
> it’d be a problem for the first 50M docs.
>
> So here’s what I’d do:
> 1> go back to the defaults for TieredMergePolicy and RamBufferSizeMB
> 2> measure first, tweak later. Analyze your GC logs to see whether
>  you’re

Re: [Q] Faster Atomic Updates - use docValues?

2019-12-03 Thread Erick Erickson
Do you have empirical evidence that all these parameter changes are doing you 
any good?

The first thing I note is that 8G for a 250M document index is a red flag. If 
you’re running on
a larger machine, I’d increase that to 16G as a test. I’ve seen GC start to 
take up more and
more CPU as you get closer to the max, sometimes to the point of having 90% or
more of the CPU consumed by GC.

The second thing is you have no searchers being opened. Solr has to keep
certain in-memory structures in place to support Real Time Get, and those only
get reclaimed when a new searcher is opened. Perhaps that’s chewing up memory
and getting to a tipping point.

Why did you increase RamBufferSizeMB?  I’ve rarely found much increase in
throughput over the default 100M. It’s probably not very useful anyway, since
your autocommit limits mean that unless you’re using that full 2G within
100,000 docs or 2 minutes, it won’t be used up anyway.

The third thing is that you have changed the TieredMergePolicy extensively. When
background merges kick in, they’ll be HUGE. Further, the settings will probably
cause you to have a lot of segments, which is not ideal.

Fourth, why do you think the lookup of the uniqueKey has anything to do with
your slowdown? If I’m reading this right, you do atomic updates on 50M docs
_then_ things get slow. If it was a uniqueKey lookup I should think it’d
be a problem for the first 50M docs.

So here’s what I’d do:
1> go back to the defaults for TieredMergePolicy and RamBufferSizeMB
2> measure first, tweak later. Analyze your GC logs to see whether
 you’re taking an inordinate amount of time doing GC coincident with
 your slowness. If so, adjust your heap.
3> If it’s not GC, put a profiler on it and find out where, exactly, you’re
 spending your time.

Best,
Erick


> We occasionally reindex whole data to our Auto-Suggest corpus. Total
> documents to be indexed are around 250 million while, due to atomic
> updates, total unique documents after full indexing converges to 60
> million.
> 
> We have to atomically index documents to store different names for the same
> product (like "bag" and "bags"), to increase their demand counts, and to
> store the months they were searched for in the past. One approach could be
> to calculate all of this beforehand and then index normally to Solr
> (non-atomic).
> 
> Once the atomic updates have processed over 50 million documents, the
> indexing speed drops by more than 10x from the initial speed.
> 
> From what I have learnt, an atomic update fetches the matching document by
> uniqueKey and then does a normal index using the information in the fetched
> document. Is this fetch actually what takes the time? As the number of
> documents increases, Solr might be taking longer to fetch the stored
> document.
> 
> But shouldn't the fetch by uniqueKey take O(1) time? If the lookup really is
> the bottleneck, can we use docValues for the id field (our uniqueKey)? The
> field is of type string.
> 
> 
> 
> I'm pasting my config lines that may impact this:
> 
> --
> 
> -Xmx8g -Xms8g
> 
>  omitNorms="false" multiValued="false" />
> id
> 
> 2000
> 
> 
> 50
> 50
> 150
> 
> 
> 
>10
>12
>false
> 
> 
> --
> 
> 
> 
> A normal indexing run that should take less than 1 day actually takes over
> 5 days with atomic updates. Any experience or suggestion will help. How do
> you expedite your indexing process, specifically for atomic updates? I know
> this might have been asked many times, and I have actually read/implemented
> all of the recommendations. My question is specific to atomic updates and
> whether something exclusive to them can make indexing faster.
> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 



[Q] Faster Atomic Updates - use docValues?

2019-12-03 Thread Paras Lehana
Hi Community,

We occasionally reindex whole data to our Auto-Suggest corpus. Total
documents to be indexed are around 250 million while, due to atomic
updates, total unique documents after full indexing converges to 60
million.

We have to atomically index documents to store different names for the same
product (like "bag" and "bags"), to increase their demand counts, and to
store the months they were searched for in the past. One approach could be
to calculate all of this beforehand and then index normally to Solr
(non-atomic).

Once the atomic updates have processed over 50 million documents, the
indexing speed drops by more than 10x from the initial speed.

From what I have learnt, an atomic update fetches the matching document by
uniqueKey and then does a normal index using the information in the fetched
document. Is this fetch actually what takes the time? As the number of
documents increases, Solr might be taking longer to fetch the stored document.
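
For context, one of our atomic updates in Solr's XML update format looks
roughly like this (the field names are illustrative, not our real schema):

    <add>
      <doc>
        <field name="id">bag-123</field>
        <!-- update="add" appends a value; "set" replaces, "inc" increments -->
        <field name="alt_names" update="add">bags</field>
        <field name="searched_months" update="add">2019-11</field>
      </doc>
    </add>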

But shouldn't the fetch by uniqueKey take O(1) time? If the lookup really is
the bottleneck, can we use docValues for the id field (our uniqueKey)? The
field is of type string.
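
That is, would a schema change along these lines make the id fetch cheaper
(a sketch; the field definition is a guess at our actual one):

    <field name="id" type="string" indexed="true" stored="true" docValues="true"/>
    <uniqueKey>id</uniqueKey>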



I'm pasting my config lines that may impact this:

--

-Xmx8g -Xms8g


id

2000


 50
 50
150
 


10
12
false


--



A normal indexing run that should take less than 1 day actually takes over
5 days with atomic updates. Any experience or suggestion will help. How do
you expedite your indexing process, specifically for atomic updates? I know
this might have been asked many times, and I have actually read/implemented
all of the recommendations. My question is specific to atomic updates and
whether something exclusive to them can make indexing faster.


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
