Re: Solr takes time to warm up core with huge data

2020-06-09 Thread Erick Erickson
I’d ignore the form of the query for the present, I think that’s a red herring.

Start by taking all your sort clauses off. Then add them back one by one (you 
have
to restart Solr between these experiments). My bet: your problem is 
“uninverting” and you’ll see your startup speed get worse the more clauses you 
add.
I don’t expect every field to contribute equally; ones with more unique values 
will probably be worse.

Or, if you have the ability, recreate your index and add docValues=true to 
_all_ fields
that you use to sort.
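In schema.xml terms, that change might look like the sketch below (field names are borrowed from the query later in this thread; the types are assumptions, so check your own schema, and remember that changing docValues requires a full reindex):

```xml
<!-- before: sortable only by uninverting onto the heap at search time -->
<field name="MODIFY_TS" type="pdate"  indexed="true" stored="true"/>

<!-- after: the uninverted structure is built at index time instead -->
<field name="MODIFY_TS" type="pdate"  indexed="true" stored="true" docValues="true"/>
<field name="PHY_KEY1"  type="string" indexed="true" stored="true" docValues="true"/>
```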

indexed=true is great for searches, i.e. for answering “for term X in field Y, 
what docs contain it?”.

It’s rotten for sorting, though, where the question is “for doc X, what term is 
in field Y?”

So to sort when indexed=true is all you have, the entire field has to be read 
into memory and “uninverted”. Basically this is a table scan that builds the 
structure on the heap, which is a very expensive operation.

Setting docValues=true means that this uninverted structure is built at index 
time and 
serialized to disk. So rather than uninvert the indexed data for a field that’s 
being
used for sorting (or faceting, or grouping, or function queries) on the heap, 
the 
uninverted structure is just read in off disk, which is much, much, much faster.

That also reduces the pressure on heap memory because Lucene keeps most of the 
index in MMapDirectory space, see:
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

> On Jun 8, 2020, at 10:10 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi Shawn,
> 
> It's a vague question and I haven't tried it out yet.
> 
> Can I instead mention query as below:
> 
> Basically instead of
> 
> 
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"BAMBOOROSE"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> 
> 
> pass
> 
> 
> 
> q=PHY_KEY2:" HQ012206"+AND+PHY_KEY1:" BAMBOOROSE 
> "&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&rows=1000&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID 
> desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 
> asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 
> asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> 
> Instead of q=*:* I pass only those fields which I want to retrieve. Will this 
> be faster?
> 
> Related to earlier question:
> We are using 8.4.1 version
> All the fields that I'm using on sorting are all string data type(modify ts 
> date) with indexed=true stored=true
> 
> 
> Thanks,
> Srinivas
> 
> 
> On 05-Jun-2020 9:50 pm, Shawn Heisey  wrote:
> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
>> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
>> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
>> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 
>> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>> 
>> This was the original query. Since there were lot of sorting fields, we 
>> decided to not do on the solr side, instead fetch the query response and do 
>> the sorting outside solr. This eliminated the need of more JVM memory which 
>> was allocated. Every time we ran this query, solr would crash exceeding the 
>> JVM memory. Now we are only running filter queries.
> 
> What Solr version, and what is the definition of each of the fields
> you're sorting on? If the definition doesn't include docValues, then a
> large on-heap memory structure will be created for sorting (VERY large
> with 500 million docs), and I wouldn't be surprised if it's created even
> if it is never used. The definition for any field you use for sorting
> should definitely include docValues. In recent Solr versions, docValues
> defaults to true for most field types. Some field classes, TextField in
> particular, cannot have docValues.
> 
> There's something else to discuss about sort params -- each sort field
> will only be used if ALL of the previous sort fields are identical for
> two documents in the full numFound result set. Having more than two or
> three sort fields is usually pointless. My guess (which I know could be
> wrong) is that most queries with this HUGE sort parameter will never use
> anything beyond TRACK_ID.
> 
>> And regarding the filter cache, it is in default setup: (we are using 
>> default solrconfig.xml, and we have only added the request handler for DIH)
>> 
>> <filterCache size="512" initialSize="512" autowarmCount="0"/>
> 
> This is way too big for your index, and a prime candidate for why your
> heap requirements are so high. Like I said before, if the filterCache
> on your system actually reaches this max size, it will require 30GB of
> memory JUST for the filterCache on this core. Can you check the admin
> UI to 

Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Srinivas Kashyap
Hi Shawn,

It's a vague question and I haven't tried it out yet.

Can I instead mention query as below:

Basically instead of



q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"BAMBOOROSE"&rows=1000&sort=MODIFY_TS 
desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc



pass



q=PHY_KEY2:" HQ012206"+AND+PHY_KEY1:" BAMBOOROSE 
"&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&rows=1000&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID 
desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 
asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 
asc,FIELD_NAME asc


Instead of q=*:* I pass only those fields which I want to retrieve. Will this 
be faster?

Related to earlier question:
We are using 8.4.1 version
All the fields that I'm using on sorting are all string data type(modify ts 
date) with indexed=true stored=true


Thanks,
Srinivas


On 05-Jun-2020 9:50 pm, Shawn Heisey  wrote:
On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.

What Solr version, and what is the definition of each of the fields
you're sorting on? If the definition doesn't include docValues, then a
large on-heap memory structure will be created for sorting (VERY large
with 500 million docs), and I wouldn't be surprised if it's created even
if it is never used. The definition for any field you use for sorting
should definitely include docValues. In recent Solr versions, docValues
defaults to true for most field types. Some field classes, TextField in
particular, cannot have docValues.

There's something else to discuss about sort params -- each sort field
will only be used if ALL of the previous sort fields are identical for
two documents in the full numFound result set. Having more than two or
three sort fields is usually pointless. My guess (which I know could be
wrong) is that most queries with this HUGE sort parameter will never use
anything beyond TRACK_ID.

> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
>
> <filterCache size="512" initialSize="512" autowarmCount="0"/>

This is way too big for your index, and a prime candidate for why your
heap requirements are so high. Like I said before, if the filterCache
on your system actually reaches this max size, it will require 30GB of
memory JUST for the filterCache on this core. Can you check the admin
UI to determine what the size is and what hit ratio it's getting? (1.0
is 100% on the hit ratio). I'd probably start with a size of 32 or 64
on this cache. With a size of 64, a little less than 4GB would be the
max heap allocated for the cache. You can experiment... but with 500
million docs, the filterCache size should be pretty small.

You're going to want to carefully digest this part of that wiki page
that I linked earlier. Hopefully email will preserve this link completely:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements

Thanks,
Shawn



Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Colvin Cowie
Great, thanks Erick

On Mon, 8 Jun 2020 at 13:22, Erick Erickson  wrote:

> It’s _bounded_ by MaxDoc/8 + (some overhead). The overhead is
> both the map overhead and the representation of the query.
>
> This is an upper bound; the full bitset is not stored if there
> are few entries that match the filter. In that case the
> doc IDs themselves are stored. Consider: if maxDoc is 1M and only 2 docs
> match the query, it’s much more efficient to store two ints
> rather than 1M/8 bytes.
>
> You can also limit the RAM used by specifying maxRamMB.
>
> Best,
> Erick
>
> > On Jun 8, 2020, at 4:59 AM, Colvin Cowie 
> wrote:
> >
> > Sorry to hijack this a little bit. Shawn, what's the calculation for the
> > size of the filter cache?
> > Is that 1 bit per document in the core / shard?
> > Thanks
> >
> > On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:
> >
> >> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> >>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
> >> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
> >> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
> >> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
> >> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> >>>
> >>> This was the original query. Since there were lot of sorting fields, we
> >> decided to not do on the solr side, instead fetch the query response
> and do
> >> the sorting outside solr. This eliminated the need of more JVM memory
> which
> >> was allocated. Every time we ran this query, solr would crash exceeding
> the
> >> JVM memory. Now we are only running filter queries.
> >>
> >> What Solr version, and what is the definition of each of the fields
> >> you're sorting on?  If the definition doesn't include docValues, then a
> >> large on-heap memory structure will be created for sorting (VERY large
> >> with 500 million docs), and I wouldn't be surprised if it's created even
> >> if it is never used.  The definition for any field you use for sorting
> >> should definitely include docValues.  In recent Solr versions, docValues
> >> defaults to true for most field types.  Some field classes, TextField in
> >> particular, cannot have docValues.
> >>
> >> There's something else to discuss about sort params -- each sort field
> >> will only be used if ALL of the previous sort fields are identical for
> >> two documents in the full numFound result set.  Having more than two or
> >> three sort fields is usually pointless.  My guess (which I know could be
> >> wrong) is that most queries with this HUGE sort parameter will never use
> >> anything beyond TRACK_ID.
> >>
> >>> And regarding the filter cache, it is in default setup: (we are using
> >> default solrconfig.xml, and we have only added the request handler for
> DIH)
> >>> 
> >>> <filterCache size="512" initialSize="512" autowarmCount="0"/>
> >>
> >> This is way too big for your index, and a prime candidate for why your
> >> heap requirements are so high.  Like I said before, if the filterCache
> >> on your system actually reaches this max size, it will require 30GB of
> >> memory JUST for the filterCache on this core.  Can you check the admin
> >> UI to determine what the size is and what hit ratio it's getting? (1.0
> >> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
> >> on this cache.  With a size of 64, a little less than 4GB would be the
> >> max heap allocated for the cache.  You can experiment... but with 500
> >> million docs, the filterCache size should be pretty small.
> >>
> >> You're going to want to carefully digest this part of that wiki page
> >> that I linked earlier.  Hopefully email will preserve this link
> completely:
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
> >>
> >> Thanks,
> >> Shawn
> >>
>
>


Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Erick Erickson
It’s _bounded_ by MaxDoc/8 + (some overhead). The overhead is
both the map overhead and the representation of the query.

This is an upper bound; the full bitset is not stored if there
are few entries that match the filter. In that case the
doc IDs themselves are stored. Consider: if maxDoc is 1M and only 2 docs
match the query, it’s much more efficient to store two ints
rather than 1M/8 bytes.
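The bitset-versus-doc-IDs tradeoff can be sketched numerically (illustrative arithmetic only; the real per-entry overhead differs):

```python
# A cached filter over maxDoc documents is at most one bit per doc.
max_doc = 1_000_000
bitset_bytes = max_doc // 8       # 125,000 bytes for the full bitset

# If only a handful of docs match, storing their IDs is far cheaper.
matching_docs = 2
id_bytes = matching_docs * 4      # 8 bytes as plain int doc IDs
```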

You can also limit the RAM used by specifying maxRamMB.
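A RAM-bounded cache entry along those lines might look like this in solrconfig.xml (the value is illustrative, and the class shown is the 8.x default CaffeineCache; check what your config actually uses):

```xml
<filterCache class="solr.CaffeineCache"
             maxRamMB="256"
             autowarmCount="0"/>
```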

Best,
Erick

> On Jun 8, 2020, at 4:59 AM, Colvin Cowie  wrote:
> 
> Sorry to hijack this a little bit. Shawn, what's the calculation for the
> size of the filter cache?
> Is that 1 bit per document in the core / shard?
> Thanks
> 
> On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:
> 
>> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
>>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
>> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
>> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
>> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
>> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>>> 
>>> This was the original query. Since there were lot of sorting fields, we
>> decided to not do on the solr side, instead fetch the query response and do
>> the sorting outside solr. This eliminated the need of more JVM memory which
>> was allocated. Every time we ran this query, solr would crash exceeding the
>> JVM memory. Now we are only running filter queries.
>> 
>> What Solr version, and what is the definition of each of the fields
>> you're sorting on?  If the definition doesn't include docValues, then a
>> large on-heap memory structure will be created for sorting (VERY large
>> with 500 million docs), and I wouldn't be surprised if it's created even
>> if it is never used.  The definition for any field you use for sorting
>> should definitely include docValues.  In recent Solr versions, docValues
>> defaults to true for most field types.  Some field classes, TextField in
>> particular, cannot have docValues.
>> 
>> There's something else to discuss about sort params -- each sort field
>> will only be used if ALL of the previous sort fields are identical for
>> two documents in the full numFound result set.  Having more than two or
>> three sort fields is usually pointless.  My guess (which I know could be
>> wrong) is that most queries with this HUGE sort parameter will never use
>> anything beyond TRACK_ID.
>> 
>>> And regarding the filter cache, it is in default setup: (we are using
>> default solrconfig.xml, and we have only added the request handler for DIH)
>>> 
>>> <filterCache size="512" initialSize="512" autowarmCount="0"/>
>> 
>> This is way too big for your index, and a prime candidate for why your
>> heap requirements are so high.  Like I said before, if the filterCache
>> on your system actually reaches this max size, it will require 30GB of
>> memory JUST for the filterCache on this core.  Can you check the admin
>> UI to determine what the size is and what hit ratio it's getting? (1.0
>> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
>> on this cache.  With a size of 64, a little less than 4GB would be the
>> max heap allocated for the cache.  You can experiment... but with 500
>> million docs, the filterCache size should be pretty small.
>> 
>> You're going to want to carefully digest this part of that wiki page
>> that I linked earlier.  Hopefully email will preserve this link completely:
>> 
>> 
>> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
>> 
>> Thanks,
>> Shawn
>> 



Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Colvin Cowie
Sorry to hijack this a little bit. Shawn, what's the calculation for the
size of the filter cache?
Is that 1 bit per document in the core / shard?
Thanks

On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:

> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> > q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> >
> > This was the original query. Since there were lot of sorting fields, we
> decided to not do on the solr side, instead fetch the query response and do
> the sorting outside solr. This eliminated the need of more JVM memory which
> was allocated. Every time we ran this query, solr would crash exceeding the
> JVM memory. Now we are only running filter queries.
>
> What Solr version, and what is the definition of each of the fields
> you're sorting on?  If the definition doesn't include docValues, then a
> large on-heap memory structure will be created for sorting (VERY large
> with 500 million docs), and I wouldn't be surprised if it's created even
> if it is never used.  The definition for any field you use for sorting
> should definitely include docValues.  In recent Solr versions, docValues
> defaults to true for most field types.  Some field classes, TextField in
> particular, cannot have docValues.
>
> There's something else to discuss about sort params -- each sort field
> will only be used if ALL of the previous sort fields are identical for
> two documents in the full numFound result set.  Having more than two or
> three sort fields is usually pointless.  My guess (which I know could be
> wrong) is that most queries with this HUGE sort parameter will never use
> anything beyond TRACK_ID.
>
> > And regarding the filter cache, it is in default setup: (we are using
> default solrconfig.xml, and we have only added the request handler for DIH)
> >
> > <filterCache size="512" initialSize="512" autowarmCount="0"/>
>
> This is way too big for your index, and a prime candidate for why your
> heap requirements are so high.  Like I said before, if the filterCache
> on your system actually reaches this max size, it will require 30GB of
> memory JUST for the filterCache on this core.  Can you check the admin
> UI to determine what the size is and what hit ratio it's getting? (1.0
> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
> on this cache.  With a size of 64, a little less than 4GB would be the
> max heap allocated for the cache.  You can experiment... but with 500
> million docs, the filterCache size should be pretty small.
>
> You're going to want to carefully digest this part of that wiki page
> that I linked earlier.  Hopefully email will preserve this link completely:
>
>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
>
> Thanks,
> Shawn
>


Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Shawn Heisey

On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 
asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

This was the original query. Since there were lot of sorting fields, we decided 
to not do on the solr side, instead fetch the query response and do the sorting 
outside solr. This eliminated the need of more JVM memory which was allocated. 
Every time we ran this query, solr would crash exceeding the JVM memory. Now we 
are only running filter queries.


What Solr version, and what is the definition of each of the fields 
you're sorting on?  If the definition doesn't include docValues, then a 
large on-heap memory structure will be created for sorting (VERY large 
with 500 million docs), and I wouldn't be surprised if it's created even 
if it is never used.  The definition for any field you use for sorting 
should definitely include docValues.  In recent Solr versions, docValues 
defaults to true for most field types.  Some field classes, TextField in 
particular, cannot have docValues.


There's something else to discuss about sort params -- each sort field 
will only be used if ALL of the previous sort fields are identical for 
two documents in the full numFound result set.  Having more than two or 
three sort fields is usually pointless.  My guess (which I know could be 
wrong) is that most queries with this HUGE sort parameter will never use 
anything beyond TRACK_ID.
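The tie-breaking behavior can be demonstrated in a few lines (a plain Python sketch; field names are borrowed from the query, data invented for illustration):

```python
# sort=TRACK_ID desc,PHY_KEY1 asc: PHY_KEY1 is consulted only when
# two docs have identical TRACK_ID values.
docs = [
    {"TRACK_ID": 3, "PHY_KEY1": "B"},
    {"TRACK_ID": 1, "PHY_KEY1": "Z"},
    {"TRACK_ID": 2, "PHY_KEY1": "A"},
]

with_tiebreak = sorted(docs, key=lambda d: (-d["TRACK_ID"], d["PHY_KEY1"]))
without = sorted(docs, key=lambda d: -d["TRACK_ID"])

# TRACK_ID is unique here, so the secondary key never changes the order.
assert with_tiebreak == without
```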



And regarding the filter cache, it is in default setup: (we are using default 
solrconfig.xml, and we have only added the request handler for DIH)

<filterCache size="512" initialSize="512" autowarmCount="0"/>

This is way too big for your index, and a prime candidate for why your 
heap requirements are so high.  Like I said before, if the filterCache 
on your system actually reaches this max size, it will require 30GB of 
memory JUST for the filterCache on this core.  Can you check the admin 
UI to determine what the size is and what hit ratio it's getting? (1.0 
is 100% on the hit ratio).  I'd probably start with a size of 32 or 64 
on this cache.  With a size of 64, a little less than 4GB would be the 
max heap allocated for the cache.  You can experiment... but with 500 
million docs, the filterCache size should be pretty small.
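The 30GB figure follows directly from one bit per document per cached filter (a back-of-envelope sketch using the numbers from this thread):

```python
max_doc = 500_000_000                   # docs in this core
bytes_per_entry = max_doc // 8          # one bit per doc = 62.5 MB per cached filter

worst_case_512 = bytes_per_entry * 512  # default size=512
worst_case_64 = bytes_per_entry * 64    # suggested size=64

print(worst_case_512 / 2**30)           # ~29.8 GiB, the "30GB" above
print(worst_case_64 / 2**30)            # ~3.7 GiB, "a little less than 4GB"
```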


You're going to want to carefully digest this part of that wiki page 
that I linked earlier.  Hopefully email will preserve this link completely:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements

Thanks,
Shawn


Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Erick Erickson
My suspicion, as others have said, is that you simply have too much data on
too little hardware. Solr definitely should not be taking this long. Or rather,
if Solr is taking this long to start up you have a badly undersized system and
until you address that you’ll just be going ‘round in circles.

Lucene uses MMapDirectory to use OS memory space for almost all of the
actual index, see: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
and you have 82G of index, and you only have 8G of OS memory space to hold it.

It’s certainly worth looking at how you use your index and whether you can 
make it smaller, but I’d say you simply won’t get satisfactory performance on 
such
constrained hardware.

You really need to go through “the sizing exercise” to see what your hardware 
and
usage patterns are, see: 
https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

> On Jun 5, 2020, at 3:48 AM, Srinivas Kashyap 
>  wrote:
> 
> Hi Jörn,
> 
> I think, you missed my explanation. We are not using sorting now:
> 
> The original query:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> But now, I have removed sorting as shown below. The sorting is being done 
> outside solr:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
> 
> Also, we are writing custom code to index by discarding DIH too. When I 
> restart the solr, this core with huge data takes time to even show up the 
> query admin GUI console. It takes around 2 hours to show.
> 
> My question is, even for the simple query with filter query mentioned as 
> shown above, it is consuming JVM memory. So, how much memory or what 
> configuration should I be doing on solrconfig.xml to make it work.
> 
> Thanks,
> Srinivas
> 
> From: Jörn Franke 
> Sent: 05 June 2020 12:30
> To: solr-user@lucene.apache.org
> Subject: Re: Solr takes time to warm up core with huge data
> 
> I think DIH is the wrong solution for this. If you do an external custom load 
> you will be probably much faster.
> 
> You have too much JVM memory from my point of view. Reduce it to eight or 
> similar.
> 
> It seems you are just exporting data, so you are better off using the export 
> handler.
> Add docValues to the fields for this. It looks like you have no text field to 
> be searched but only simple fields (string, date etc).
> 
> You should not use the normal handler to return many results at once. If you 
> cannot use the export handler, then use cursors:
> 
> https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors
> 
> Both work to sort large result sets without consuming the whole memory
> 
>> Am 05.06.2020 um 08:18 schrieb Srinivas Kashyap 
>> mailto:srini...@bamboorose.com.invalid>>:
>> 
>> Thanks Shawn,
>> 
>> The filter queries are not complex. Below are the filter queries I’m running 
>> for the corresponding schema entry:
>> 
>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
>> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
>> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
>> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 
>> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>> 
>> This was the original query. Since there were lot of sorting fields, we 
>> decided to not do on the solr side, instead fetch the query response and do 
>> the sorting outside solr. This eliminated the need of more JVM memory which 
>> was allocated. Every time we ran this query, solr would crash exceeding the 
>> JVM memory. Now we are only running filter queries.
>> 
>> And regarding the filter cache, it is in default setup: (we are using 
>> default solrconfig.xml, and we have only added the request handler for DIH)
>> 
>> <filterCache size="512" initialSize="512" autowarmCount="0"/>
>> 
>> Now that you’re aware of the size and numbers, can you please let me know 
>> what values/size that I need to increase? Is there an advantage of moving 
>> this single core to solr cloud? If yes, can you let us know, how many 
>> shards/replica do we require

RE: Solr takes time to warm up core with huge data

2020-06-05 Thread Srinivas Kashyap
Hi Jörn,

I think, you missed my explanation. We are not using sorting now:

The original query:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

But now, I have removed sorting as shown below. The sorting is being done 
outside solr:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000

Also, we are writing custom code to index by discarding DIH too. When I restart 
the solr, this core with huge data takes time to even show up the query admin 
GUI console. It takes around 2 hours to show.

My question is, even for the simple query with filter query mentioned as shown 
above, it is consuming JVM memory. So, how much memory or what configuration 
should I be doing on solrconfig.xml to make it work.

Thanks,
Srinivas

From: Jörn Franke 
Sent: 05 June 2020 12:30
To: solr-user@lucene.apache.org
Subject: Re: Solr takes time to warm up core with huge data

I think DIH is the wrong solution for this. If you do an external custom load 
you will be probably much faster.

You have too much JVM memory from my point of view. Reduce it to eight or 
similar.

It seems you are just exporting data, so you are better off using the export 
handler.
Add docValues to the fields for this. It looks like you have no text field to 
be searched but only simple fields (string, date etc).

You should not use the normal handler to return many results at once. If you 
cannot use the export handler, then use cursors:

https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors

Both work to sort large result sets without consuming the whole memory

> Am 05.06.2020 um 08:18 schrieb Srinivas Kashyap 
> mailto:srini...@bamboorose.com.invalid>>:
>
> Thanks Shawn,
>
> The filter queries are not complex. Below are the filter queries I’m running 
> for the corresponding schema entry:
>
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.
>
> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
>
> <filterCache size="512" initialSize="512" autowarmCount="0"/>
>
> Now that you’re aware of the size and numbers, can you please let me know 
> what values/size that I need to increase? Is there an advantage of moving 
> this single core to solr cloud? If yes, can you let us know, how many 
> shards/replica do we require for this core considering we allow it to grow as 
> users transact. The updates to this core is not thru DIH delta import rather, 
> we are using SolrJ to push the changes.
>
> 
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>  omitTermFreqAndPositions="true" />
>
>
> Thanks,
> Srinivas
>
>
>
>> On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
>> We are on solr 8.4.1 and In standalone server mode. We have a core with 
>> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>>
>> The disk occupancy is shown below:
>>
>> 82G /var/solr/data//data/index
>>
>> When I restarted solr instance and went to this core to query on solr admin 
>> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
>> So

Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Jörn Franke
I think DIH is the wrong solution for this. If you do an external custom load 
you will probably be much faster.

You have too much JVM memory, from my point of view. Reduce it to 8 GB or 
similar.

It seems you are just exporting data, so you are better off using the export 
handler. Add docValues to the fields for this. It looks like you have no text 
fields to be searched, only simple fields (string, date, etc.).
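
For the sort/export use case that means redefining the sort fields with 
docValues and reindexing. A sketch of one such definition, using MODIFY_TS from 
the query below as the example (the same change would apply to the other sort 
fields):

```xml
<field name="MODIFY_TS" type="date" indexed="true" stored="true"
       docValues="true" omitTermFreqAndPositions="true" />
```

With docValues="true", sorting, faceting, and /export read a pre-built, 
disk-resident column structure instead of uninverting the indexed terms onto 
the heap at query time.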

You should not use the normal /select handler to return many results at once. 
If you cannot use the export handler, then use cursors:

https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors

Both can return large, sorted result sets without holding the whole set in memory.
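
The cursor protocol itself is simple: start with cursorMark=*, use a sort that 
ends with the uniqueKey field as a tiebreaker, and feed each response's 
nextCursorMark back into the next request until it stops changing. A minimal 
sketch of that loop, simulated in Python — fake_select stands in for an HTTP 
request to /select, and real cursor marks are opaque strings, not offsets:

```python
def fake_select(docs, params):
    """Stand-in for a Solr /select call: returns one page plus nextCursorMark."""
    cursor, rows = params["cursorMark"], params["rows"]
    start = 0 if cursor == "*" else int(cursor)  # real marks are opaque, not offsets
    page = docs[start:start + rows]
    # Solr signals completion by echoing the same cursorMark back
    next_cursor = cursor if not page else str(start + len(page))
    return {"docs": page, "nextCursorMark": next_cursor}

def fetch_all(docs, rows=2):
    """Deep-page through every result using the cursorMark protocol."""
    params = {"q": "*:*",
              "sort": "MODIFY_TS desc, id asc",  # MUST end with the uniqueKey
              "rows": rows,
              "cursorMark": "*"}
    results = []
    while True:
        rsp = fake_select(docs, params)
        results.extend(rsp["docs"])
        if rsp["nextCursorMark"] == params["cursorMark"]:
            break  # cursor did not advance: all pages consumed
        params["cursorMark"] = rsp["nextCursorMark"]
    return results

all_docs = [{"id": i} for i in range(5)]
assert fetch_all(all_docs) == all_docs
```

Unlike start/rows paging, each page costs roughly the same no matter how deep 
into the result set it is, and no large sort state accumulates on the heap.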

> On 05.06.2020 at 08:18, Srinivas Kashyap wrote:
> 
> Thanks Shawn,
> 
> The filter queries are not complex. Below are the filter queries I’m running 
> for the corresponding schema entry:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.
> 
> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
> 
> <filterCache class="solr.CaffeineCache"
> size="512"
> initialSize="512"
> autowarmCount="0"/>
> 
> Now that you’re aware of the size and numbers, can you please let me know 
> what values/size that I need to increase? Is there an advantage of moving 
> this single core to solr cloud? If yes, can you let us know, how many 
> shards/replica do we require for this core considering we allow it to grow as 
> users transact. The updates to this core is not thru DIH delta import rather, 
> we are using SolrJ to push the changes.
> 
> 
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="MODIFY_TS" type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> 
> 
> Thanks,
> Srinivas
> 
> 
> 
>> On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
>> We are on solr 8.4.1 and In standalone server mode. We have a core with 
>> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>> 
>> The disk occupancy is shown below:
>> 
>> 82G /var/solr/data//data/index
>> 
>> When I restarted solr instance and went to this core to query on solr admin 
>> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
>> Solr instance". But when I go back to dashboard, instance is up and I'm able 
>> to query other cores.
>> 
>> Also, querying on this core is eating up JVM memory allocated(24GB)/(32GB 
>> RAM). A query(*:*) with filterqueries is overshooting the memory with OOM.
> 
> You're going to want to have a lot more than 8GB available memory for
> disk caching with an 82GB index. That's a performance thing... with so
> little caching memory, Solr will be slow, but functional. That aspect
> of your setup will NOT lead to out of memory.
> 
> If you are experiencing Java "OutOfMemoryError" exceptions, you will
> need to figure out what resource is running out. It might be heap
> memory, but it also might 

RE: Solr takes time to warm up core with huge data

2020-06-05 Thread Srinivas Kashyap
Thanks Shawn,

The filter queries are not complex. Below are the filter queries I’m running 
for the corresponding schema entry:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

This was the original query. Since there were a lot of sorting fields, we decided 
not to sort on the Solr side; instead we fetch the query response and do the 
sorting outside Solr. This eliminated the need for the extra JVM memory that had 
been allocated: every time we ran this query, Solr would crash by exceeding the 
JVM heap. Now we are only running filter queries.

And regarding the filterCache, it is the default setup (we are using the default 
solrconfig.xml, and have only added the request handler for DIH):

<filterCache class="solr.CaffeineCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
Now that you’re aware of the size and numbers, can you please let me know what 
values/sizes I need to increase? Is there an advantage to moving this single core 
to SolrCloud? If yes, can you let us know how many shards/replicas we require for 
this core, considering we allow it to grow as users transact? The updates to this 
core are not through DIH delta import; rather, we are using SolrJ to push the 
changes.

<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="MODIFY_TS" type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="…" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
Thanks,
Srinivas



On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
> We are on solr 8.4.1 and In standalone server mode. We have a core with 
> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>
> The disk occupancy is shown below:
>
> 82G /var/solr/data//data/index
>
> When I restarted solr instance and went to this core to query on solr admin 
> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
> Solr instance". But when I go back to dashboard, instance is up and I'm able 
> to query other cores.
>
> Also, querying on this core is eating up JVM memory allocated(24GB)/(32GB 
> RAM). A query(*:*) with filterqueries is overshooting the memory with OOM.

You're going to want to have a lot more than 8GB available memory for
disk caching with an 82GB index. That's a performance thing... with so
little caching memory, Solr will be slow, but functional. That aspect
of your setup will NOT lead to out of memory.

If you are experiencing Java "OutOfMemoryError" exceptions, you will
need to figure out what resource is running out. It might be heap
memory, but it also might be that you're hitting the process/thread
limit of your operating system. And there are other possible causes for
that exception too. Do you have the text of the exception available?
It will be absolutely critical for you to determine what resource is
running out, or you might focus your efforts on the wrong thing.

If it's heap memory (something that I can't really assume), then Solr is
requiring more than the 24GB heap you've allocated.

Do you have faceting or grouping on those queries? Are any of your
filters really large or complex? These are the things that I would
imagine as requiring lots of heap memory.

What is the size of your filterCache? With about 500 million documents
in the core, each entry in the filterCache will consume nearly 60
megabytes of memory. If your filterCache has the default example size
of 512, and it actually gets that big, then that single cache will
require nearly 30 gigabytes of heap memory (on top of the other things
in Solr that require heap) ... and you only have 24GB. That could cause
OOME exceptions.

Does the server run things other than Solr?

Look here for some valuable info about performance and memory:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn


Re: Solr takes time to warm up core with huge data

2020-06-04 Thread Shawn Heisey

On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:

We are on solr 8.4.1 and In standalone server mode. We have a core with 
497,767,038 Records indexed. It took around 32Hours to load data through DIH.

The disk occupancy is shown below:

82G /var/solr/data//data/index

When I restarted the Solr instance and went to this core to query on the Solr 
admin GUI, it hangs and shows "Connection to Solr lost. Please check the Solr 
instance". But when I go back to the dashboard, the instance is up and I'm able 
to query other cores.

Also, querying on this core is eating up the allocated JVM memory (24 GB heap on 
32 GB RAM). A query (*:*) with filter queries is overshooting the memory with an OOM.


You're going to want to have a lot more than 8GB available memory for 
disk caching with an 82GB index.  That's a performance thing... with so 
little caching memory, Solr will be slow, but functional.  That aspect 
of your setup will NOT lead to out of memory.
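
The arithmetic behind that warning, as a quick sketch (figures taken from this 
thread):

```python
# RAM consumed by the JVM heap is unavailable to the OS page cache,
# which Lucene relies on (via MMapDirectory) for fast index reads.
ram_gb, heap_gb, index_gb = 32, 24, 82
page_cache_gb = ram_gb - heap_gb
print(page_cache_gb)                    # 8 GB left for caching index data
print(round(index_gb / page_cache_gb))  # the index is ~10x the available cache
```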


If you are experiencing Java "OutOfMemoryError" exceptions, you will 
need to figure out what resource is running out.  It might be heap 
memory, but it also might be that you're hitting the process/thread 
limit of your operating system.  And there are other possible causes for 
that exception too.  Do you have the text of the exception available? 
It will be absolutely critical for you to determine what resource is 
running out, or you might focus your efforts on the wrong thing.


If it's heap memory (something that I can't really assume), then Solr is 
requiring more than the 24GB heap you've allocated.


Do you have faceting or grouping on those queries?  Are any of your 
filters really large or complex?  These are the things that I would 
imagine as requiring lots of heap memory.


What is the size of your filterCache?  With about 500 million documents 
in the core, each entry in the filterCache will consume nearly 60 
megabytes of memory.  If your filterCache has the default example size 
of 512, and it actually gets that big, then that single cache will 
require nearly 30 gigabytes of heap memory (on top of the other things 
in Solr that require heap) ... and you only have 24GB.  That could cause 
OOME exceptions.
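
That estimate can be checked with a couple of lines of arithmetic: a full 
filterCache entry is, at worst, a bitset with one bit per document in the core.

```python
num_docs = 497_767_038          # document count reported in this thread
entry_bytes = num_docs / 8      # one bit per document
entry_mb = entry_bytes / (1024 ** 2)
print(round(entry_mb))          # ~59 MB per filterCache entry

cache_size = 512                # default filterCache size
total_gb = cache_size * entry_bytes / (1024 ** 3)
print(round(total_gb))          # ~30 GB if the cache actually fills
```

Solr stores sparse filter results more compactly, so this is an upper bound per 
entry, but with 512 entries it is enough to exhaust a 24 GB heap.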


Does the server run things other than Solr?

Look here for some valuable info about performance and memory:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn