Facet Advice

2019-10-14 Thread Moyer, Brett
Hello, looking for some advice. I suspect we are doing facets all wrong. We 
host financial information and recently "tagged" our pages with appropriate 
facets. We have built a flat design. Are we going about it the wrong way?

In Solr we have a "Tags" field, based on some magic we tagged each page on the 
site with a number of the below example Facets. We have the UI team sending 
queries in the form of 1) q=get a loan=Tags:Retirement, 2) q=get a 
loan=Tags:Retirement AND Tags:Move Money. This restricts the resultset 
hopefully guiding the user to their desired result. Something about it doesn’t 
seem right. Is this right with a flat single level pattern like what we have? 
Should each doc have multiple Fields to map to different values? Any help is 
appreciated. Thanks!
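
The query strings above look as though a parameter separator was stripped by 
the mail archive; presumably the Tags restriction is sent as a filter query 
(fq) alongside q. For illustration only, here is a minimal SolrJ sketch of that 
pattern, assuming a hypothetical "pages" collection; keeping the tag values in 
fq rather than in q lets Solr cache each filter independently and keeps them 
out of relevance scoring:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetedSearchSketch {
  public static void main(String[] args) throws Exception {
    // Collection name and URL are made up for this example.
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/pages").build()) {
      SolrQuery q = new SolrQuery("get a loan");
      // Narrow the result set with filter queries instead of folding the
      // tag values into the main query.
      q.addFilterQuery("Tags:Retirement");
      q.addFilterQuery("Tags:\"Move Money\"");
      // Return counts for the remaining Tags values so the UI can show
      // which refinements are still available.
      q.addFacetField("Tags");
      q.setFacetMinCount(1);
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResults().getNumFound() + " hits");
      rsp.getFacetField("Tags").getValues().forEach(
          c -> System.out.println(c.getName() + " (" + c.getCount() + ")"));
    }
  }
}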

Example Facets:
Brokerage
Retirement
Open an Account
Move Money
Estate Planning
Etc..

Brett


Lemmatizer for indexing

2019-10-14 Thread Shamik Bandopadhyay
Hi,
  I'm trying to use a lemmatizer in my analysis chain. Just wondering what
is the recommended way of achieving this. I've come across a few different
implementations, which are listed below:

OpenNLP -->
https://lucene.apache.org/solr/guide/7_5/language-analysis.html#opennlp-lemmatizer-filter

https://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.lemmatizer

KStem Filter -->
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#kstem-filter

There are a couple of third-party libraries, but I'm not sure whether they are
still maintained or compatible with the Solr version I'm using (7.5).

https://github.com/nicholasding/solr-lemmatizer
https://github.com/bejean/solr-lemmatizer

Currently, I'm looking for English-only lemmatization. I also need the ability
to update the lemma dictionary to add custom terms specific to our organization
(not sure if the KStem filter can do that).
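
As far as I know, KStem's dictionaries are compiled into the Lucene jar, so 
adding organization-specific terms there is not straightforward, whereas the 
OpenNLP lemmatizer filter is documented to accept an external dictionary file, 
which sounds closer to what you need. If it helps, here is a small 
self-contained sketch that runs KStem directly through Lucene, so you can see 
what it does to your own terms (the sample words are arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KStemDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("running studies organizations"));
    TokenStream stream = new KStemFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term);   // typically prints: run, study, organization
    }
    stream.end();
    stream.close();
  }
}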

Any pointers will be appreciated.

Regards,
Shamik


solr 8.1.1 many times slower returning query results than solr 4.10.4 or solr 6.5.1

2019-10-14 Thread Russell Bahr
Hello,
I am sorry in advance, as this will be a lengthy email; I will try to provide 
proper details.
We currently have 2 Solr cloud deployments and are hoping to upgrade them to 
Solr 8.x, but we are running into severe performance problems with Solr 8.1.1. 
I am hoping for some guidance in troubleshooting and overcoming this problem.

Current setup

Backend email processing.
Used for predefined queries that produce email results for our clients: 
approximately 35000 emails, distributed over different times of the day based 
on client preferences.
solr-spec 4.10.4
lucene-spec 4.10.4
Runtime Oracle Corporation OpenJDK 64-Bit Server VM (1.8.0_222 25.222-b10)
1 collection 6 shards 5 replicas per shard 17,919,889 current documents (35 
days worth of documents) - indexing new documents regularly throughout the day, 
deleting aged out documents nightly.

Frontend for website.
Used for customer searches, sometimes runs same query as is defined for email 
processing.
solr-spec 6.5.1
lucene-spec 6.5.1
Runtime Oracle Corporation OpenJDK 64-Bit Server VM 1.8.0_222 25.222-b10
1 collection 6 shards 3 replicas per shard 50,821,086 current documents (213 
days (7months) worth of documents) - indexing new documents regularly 
throughout the day, deleting aged out documents nightly.

Backend replacement of solr4 and hopefully Frontend replacement as well.
solr-spec 8.1.1
lucene-spec 8.1.1
Runtime Oracle Corporation OpenJDK 64-Bit Server VM 12 12+33
1 collection 6 shards 5 replicas per shard 17,919,889 current documents (35 
days worth of documents) - indexing new documents regularly throughout the day, 
deleting aged out documents nightly.

We are trying to solve a couple of issues with this upgrade of solr version.

1. Using 2 different Solr clouds with different versions causes different 
results to come back for our clients in their email and when they search on the 
front end.
2. When a previous person attempted to build out Solr 6.5.1 for the backend, it 
would crash in the middle of running through the search that creates the 
content for our client emails.
3. We want to bring both backend and frontend up to the current Solr version 
and, if possible, run both off of a single Solr cloud instead of 2 with the 
same content indexed to them.

Problem 1

When I run the backend process in a test with all 35000 email queries dumped 
into a queue on the current Solr 4 cloud deployment, it takes approximately 7-8 
hours to complete. (This is the minimum performance target for the new Solr 
cloud deployment.)
When I run the same backend process in a test with all 35000 email queries 
dumped into a queue on the new Solr 8 cloud deployment, it takes greater than 
24 hours to complete. (It must be less than 8 hours in order for email 
deliveries to be timely.)

Problem 2 (likely same core issue as Problem 1, but much easier to work with)

When I run one of our normal queries against the Solr 6 cloud deployment, the 
results return in less than half a second.
When I run the same queries against the Solr 8 cloud deployment, the results 
return in more than 16 seconds.
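
One way to narrow down where the extra time goes is to send the same query to 
both clusters with debug timing enabled and compare the per-component numbers. 
Below is a rough SolrJ sketch of that comparison; the URLs and the query string 
are placeholders, not taken from the attached files:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CompareQTime {
  public static void main(String[] args) throws Exception {
    String[] urls = {
        "http://solr6-host:8983/solr/collection1",   // placeholder
        "http://solr8-host:8983/solr/collection1"    // placeholder
    };
    for (String url : urls) {
      try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
        SolrQuery q = new SolrQuery("one of our normal queries");  // placeholder
        q.set("debug", "timing");   // ask each search component to report its time
        QueryResponse rsp = client.query(q);
        System.out.println(url + " QTime=" + rsp.getQTime() + " ms");
        System.out.println(rsp.getDebugMap().get("timing"));
      }
    }
  }
}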

Link to Dropbox folder containing the following 
( https://www.dropbox.com/sh/2x2k5c9db7d4pt9/AADnHwuJc7a9Fh4KmUD15rS0a?dl=0 ):

"one of our normal queries"

Solr 6 query results
Solr 8 query results

Solr 4 solrconfig.xml
Solr 4 schema.xml
Solr 4 solr.in.sh

Solr 6 solrconfig.xml
Solr 6 schema.xml
Solr 6 solr.in.sh

Solr 8 solrconfig.xml
Solr 8 schema.xml
Solr 8 solr.in.sh

Thank you in advance for any guidance and advice that you can give me,

Russell Bahr
Lead Infrastructure Engineer

Manzama
a MODERN GOVERNANCE company



Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Shawn Heisey

On 10/14/2019 7:18 AM, Vassil Velichkov (Sensika) wrote:

After the migration from 6.x to 7.6 we kept the default GC for a couple of 
weeks, then we started experimenting with G1 and we've managed to achieve 
less frequent OOM crashes, but not by much.


Changing your GC settings will never prevent OOMs.  The only way to 
prevent them is to either increase the resource that's running out or 
reconfigure the program to use less of that resource.



As I explained in my previous e-mail, the unused filterCache entries are not 
discarded, even after a new SolrSearcher is started. The Replicas are synced 
with the Masters every 5 minutes, the filterCache is auto-warmed and the JVM 
heap utilization keeps going up. Within 1 to 2 hours a 64GB heap is being 
exhausted. The GC log entries clearly show that there are more and more 
humongous allocations piling up.


While it is true that the generation-specific collectors for G1 do not 
clean up humungous allocations from garbage, eventually Java will 
perform a full GC, which will be slow, but should clean them up.  If a 
full GC is not cleaning them up, that's a different problem, and one 
that I would suspect is actually a problem with your installation.  We 
have had memory leak bugs in Solr, but I am not aware of any that are as 
serious as your observations suggest.


You could be running into a memory leak ... but I really doubt that it 
is directly related to the filterCache or the humungous allocations. 
Upgrading to the latest release that you can would be advisable -- the 
latest 7.x version would be my first choice, or you could go all the way 
to 8.2.0.


Are you running completely stock Solr, or have you added custom code? 
One of the most common problems with custom code is leaking searcher 
objects, which will cause Java to retain the large cache entries.  We 
have seen problems where one Solr version will work perfectly with 
custom code, but when Solr is upgraded, the custom code has memory leaks.
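
For reference, the usual shape of that leak in custom code is borrowing a 
searcher and never releasing it; here is a minimal sketch of the correct 
pattern (the class and method names are illustrative, not from this thread):

import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class SearcherUsageSketch {
  long countDocs(SolrCore core) {
    // getSearcher() returns a reference-counted handle to the current searcher.
    RefCounted<SolrIndexSearcher> ref = core.getSearcher();
    try {
      return ref.get().getIndexReader().maxDoc();
    } finally {
      // Skipping this decref() keeps old searchers -- and their caches -- alive
      // after a new searcher is opened, which looks exactly like a memory leak.
      ref.decref();
    }
  }
}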



We have a really stressful use-case: a single user opens a live-report with 
20-30 widgets, each widget performs a Solr Search or facet aggregations, 
sometimes with 5-15 complex filter queries attached to the main query, so the 
end results are visualized as pivot charts. So, one user could trigger hundreds 
of queries in a very short period of time and when we have several analysts 
working on the same time-period, we usually end-up with OOM. This logic used to 
work quite well on Solr 6.x. The only other difference that comes to my mind is 
that with Solr 7.6 we've started using DocValues. I could not find 
documentation about DocValues memory consumption, so it might be related.


For cases where docValues are of major benefit, which is primarily 
facets and sorting, Solr will use less memory with docValues than it 
does with indexed terms.  Adding docValues should not result in a 
dramatic increase in memory requirements, and in many cases, should 
actually require less memory.



Yep, but I plan to generate some detailed JVM trace-dumps, so we could analyze 
which class / data structure causes the OOM. Any recommendations about what 
tool to use for a detailed JVM dump?


Usually the stacktrace itself is not helpful in diagnosing OOMs -- 
because the place where the error is thrown can be ANY allocation, not 
necessarily the one that is the major resource hog.


What I'm interested in here is the message immediately after the OOME, 
not the stacktrace.  Which I'll admit is slightly odd, because for many 
problems I *am* interested in the stacktrace.  OutOfMemoryError is one 
situation where the stacktrace is not very helpful, but the short 
message the error contains is useful.  I only asked for the stacktrace 
because collecting it will usually mean that nothing else in the message 
has been modified.


Here are two separate examples of what I am looking for:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Caused by: java.lang.OutOfMemoryError: unable to create new native thread


Also, not sure if I could send attachments to the mailing list, but there must 
be a way to share logs...?


There are many websites that facilitate file sharing.  One example, and 
the one that I use most frequently, is dropbox.  Sending attachments to 
the list rarely works.


Thanks,
Shawn


RE: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Vassil Velichkov (Sensika)
Hi Shawn,

My answers are in-line below...

Cheers,
Vassil

-Original Message-
From: Shawn Heisey  
Sent: Monday, October 14, 2019 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

On 10/14/2019 6:18 AM, Vassil Velichkov (Sensika) wrote:
> We have 1 x Replica with 1 x Solr Core per JVM and each JVM runs in a 
> separate VMware VM.
> We have 32 x JVMs/VMs in total, containing between 50M to 180M documents per 
> replica/core/JVM.

With 180 million documents, each filterCache entry will be 22.5 megabytes in 
size.  They will ALL be this size.

> Oops, I didn't know that, but this makes things even worse. By looking 
> at the GC log, it seems evicted entries are never discarded.

> In our case most filterCache entities (maxDoc/8 + overhead) are typically 
> more than 16MB, which is more than 50% of the max setting for 
> "XX:G1HeapRegionSize" (which is 32MB). That's why I am so interested in Java 
> 13 and ZGC, because ZGC does not have this weird limitation and collects even 
> _large_ garbage pieces :-). We have almost no documentCache or queryCache 
> entities.

I am not aware of any Solr testing with the new garbage collector.  I'm 
interested in knowing whether it does a better job than CMS and G1, but do not 
have any opportunities to try it.

> Currently we have some 2TB free RAM on the cluster, so I guess we could 
> test it in the next coming days. The plan is to re-index at least 2B 
> documents in a separate cluster and stress-test the new cluster with real 
> production data and real production code with Java 13 and ZGC.

Have you tried letting Solr use its default garbage collection settings instead 
of G1?  Have you tried Java 11?  Java 9 is one of the releases without long 
term support, so as Erick says, it is not recommended.

> After the migration from 6.x to 7.6 we kept the default GC for a couple 
> of weeks, then we started experimenting with G1 and we've managed to 
> achieve less frequent OOM crashes, but not by much.

> By some time tonight all shards will be rebalanced (we've added 6 more) and 
> will contain up to 100-120M documents (14.31MB + overhead should be < 16MB), 
> so hopefully this will help us to alleviate the OOM crashes.

It doesn't sound to me like your filterCache can cause OOM.  The total size of 
256 filterCache entries that are each 22.5 megabytes should be less than 6GB, 
and I would expect the other Solr caches to be smaller.
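
As a back-of-the-envelope check of those numbers (the maxDoc and cache-size 
values are the ones discussed in this thread, and each entry is assumed to be a 
full bitset of maxDoc/8 bytes):

public class FilterCacheMath {
  public static void main(String[] args) {
    long maxDoc = 180_000_000L;              // documents in the largest core
    int cacheEntries = 256;                  // default filterCache size
    double entryBytes = maxDoc / 8.0;        // one bit per document
    double entryMb = entryBytes / 1_000_000;                      // ~22.5 MB per entry
    double totalGb = entryBytes * cacheEntries / 1_000_000_000;   // ~5.8 GB for a full cache
    System.out.printf("per entry: %.1f MB, full cache: %.1f GB%n", entryMb, totalGb);
  }
}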

> As I explained in my previous e-mail, the unused filterCache entries are 
> not discarded, even after a new SolrSearcher is started. The Replicas are 
> synced with the Masters every 5 minutes, the filterCache is auto-warmed 
> and the JVM heap utilization keeps going up. Within 1 to 2 hours a 64GB 
> heap is being exhausted. The GC log entries clearly show that there are 
> more and more humongous allocations piling up. 
 
If you are hitting OOMs, then some other aspect of your setup is the reason 
that's happening.  I would not normally expect a single core with
180 million documents to need more than about 16GB of heap, and 31GB should 
definitely be enough.  Hitting OOM with the heap sizes you have described is 
very strange.

>> We have a really stressful use-case: a single user opens a live-report 
>> with 20-30 widgets, each widget performs a Solr Search or facet 
>> aggregations, sometimes with 5-15 complex filter queries attached to the 
>> main query, so the end results are visualized as pivot charts. So, one 
>> user could trigger hundreds of queries in a very short period of time 
>> and when we have several analysts working on the same time-period, we 
>> usually end-up with OOM. This logic used to work quite well on Solr 6.x. 
>> The only other difference that comes to my mind is that with Solr 7.6 
>> we've started using DocValues. I could not find documentation about 
>> DocValues memory consumption, so it might be related.

Perhaps the root cause of your OOMs is not heap memory, but some other system 
resource.  Do you have log entries showing the stacktrace on the OOM?

>> Yep, but I plan to generate some detailed JVM trace-dumps, so we could 
>> analyze which class / data structure causes the OOM. Any recommendations 
>> about what tool to use for a detailed JVM dump? 
Also, not sure if I could send attachments to the mailing list, but there must 
be a way to share logs...?

Thanks,
Shawn


Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Shawn Heisey

On 10/14/2019 6:18 AM, Vassil Velichkov (Sensika) wrote:

We have 1 x Replica with 1 x Solr Core per JVM and each JVM runs in a separate 
VMware VM.
We have 32 x JVMs/VMs in total, containing between 50M to 180M documents per 
replica/core/JVM.


With 180 million documents, each filterCache entry will be 22.5 
megabytes in size.  They will ALL be this size.



In our case most filterCache entities (maxDoc/8 + overhead) are typically more than 16MB, 
which is more than 50% of the max setting for "XX:G1HeapRegionSize" (which is 
32MB). That's why I am so interested in Java 13 and ZGC, because ZGC does not have this 
weird limitation and collects even _large_ garbage pieces :-). We have almost no 
documentCache or queryCache entities.


I am not aware of any Solr testing with the new garbage collector.  I'm 
interested in knowing whether it does a better job than CMS and G1, but 
do not have any opportunities to try it.


Have you tried letting Solr use its default garbage collection settings 
instead of G1?  Have you tried Java 11?  Java 9 is one of the releases 
without long term support, so as Erick says, it is not recommended.



By some time tonight all shards will be rebalanced (we've added 6 more) and will 
contain up to 100-120M documents (14.31MB + overhead should be < 16MB), so 
hopefully this will help us to alleviate the OOM crashes.


It doesn't sound to me like your filterCache can cause OOM.  The total 
size of 256 filterCache entries that are each 22.5 megabytes should be 
less than 6GB, and I would expect the other Solr caches to be smaller. 
If you are hitting OOMs, then some other aspect of your setup is the 
reason that's happening.  I would not normally expect a single core with 
180 million documents to need more than about 16GB of heap, and 31GB 
should definitely be enough.  Hitting OOM with the heap sizes you have 
described is very strange.


Perhaps the root cause of your OOMs is not heap memory, but some other 
system resource.  Do you have log entries showing the stacktrace on the OOM?


Thanks,
Shawn


RE: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Vassil Velichkov (Sensika)
Hi Erick,

We have 1 x Replica with 1 x Solr Core per JVM and each JVM runs in a separate 
VMware VM.
We have 32 x JVMs/VMs in total, containing between 50M to 180M documents per 
replica/core/JVM.
In our case most filterCache entities (maxDoc/8 + overhead) are typically more 
than 16MB, which is more than 50% of the max setting for "XX:G1HeapRegionSize" 
(which is 32MB). That's why I am so interested in Java 13 and ZGC, because ZGC 
does not have this weird limitation and collects even _large_ garbage pieces 
:-). We have almost no documentCache or queryCache entities.

By some time tonight all shards will be rebalanced (we've added 6 more) and 
will contain up to 100-120M documents (14.31MB + overhead should be < 16MB), so 
hopefully this will help us to alleviate the OOM crashes.

Cheers,
Vassil


-Original Message-
From: Erick Erickson  
Sent: Monday, October 14, 2019 3:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

The filterCache isn’t a single huge allocation, it’s made up of _size_ entries, 
each individual entry shouldn’t be that big, each entry should cap around 
maxDoc/8 bytes + some overhead.

I just scanned the e-mail, I’m not clear how many _replicas_ per JVM you have, 
nor how many JVMs per server you’re running. One strategy to deal with large 
heaps if you have a lot of replicas is to run multiple JVMs, each with a 
smaller heap.

One peculiarity of heaps is that at 32G, one must use long pointers, so a 32G 
heap actually has less available memory than a 31G heap if many of the objects 
are small.


> On Oct 14, 2019, at 7:00 AM, Vassil Velichkov (Sensika) 
>  wrote:
> 
> Thanks Jörn,
> 
> Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, 
> but that's not quite optimal in our use-case.
> 
> We've tried with various ratios between JVM Heap / OS RAM (up to 128GB / 
> 256GB) and we have the same Java Heap OOM crashes.
> For example, a BitSet of 160M documents is > 16MB and when we look at the G1 
> logs, it seems it never discards the humongous allocations, so they keep 
> piling. Forcing a full-garbage collection is just not practical - it takes 
> forever and the shard is not usable. Even when a new Searcher is started 
> (every several minutes) the old large filterCache entries are not freed and 
> sooner or later the JVM crashes.
> 
> On the other hand ZGC has a completely different architecture and does not 
> have the hard-coded threshold of 16MB for *humongous allocations*:
> https://wiki.openjdk.java.net/display/zgc/Main
> 
> Anyway, we will be probably testing Java 13 and ZGC with the real data, we 
> just have to reindex 30+ shards to new Solr servers, which will take a couple 
> of days :-)
> 
> Cheers,
> Vassil
> 
> -Original Message-
> From: Jörn Franke  
> Sent: Monday, October 14, 2019 1:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
> tests with Java 13 and the new ZGC?
> 
> I would try JDK11 - it works much better than JDK9 in general. 
> 
> I don‘t think JDK13 with ZGC will bring you better results. There seems to be 
> sth strange with the JDk version or Solr version and some settings. 
> 
> Then , make sure that you have much more free memory for the os cache than 
> the heap. Nearly 100 gb for Solr heap sounds excessive. Try to reduce it to 
> much less.
> 
> Try the default options of Solr and use the latest 7.x version or 8.x version 
> of Solr.
> 
> Additionally you can try to shard more.
> 
>> Am 14.10.2019 um 19:19 schrieb Vassil Velichkov (Sensika) 
>> :
>> 
>> Hi Everyone,
>> 
>> Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 
>> we have frequent OOM crashes on specific nodes.
>> 
>> All investigations (detailed below) lead to a hard-coded limitation in the 
>> G1 garbage collector. The Java Heap is exhausted due to too many filterCache 
>> allocations that are never discarded by the G1.
>> 
>> Our hope is to use Java 13 with the new ZGC, which is specifically designed 
>> for large heap-sizes, and supposedly would handle and dispose larger 
>> allocations. The Solr release notes claim that Solr 7.6 builds are tested 
>> with Java 11 / 12 / 13 (pre-release).
>> Does anyone use Java 13 in production and has experience with the new ZGC 
>> and large heap sizes / large document sets of more than 150M documents per 
>> shard?
>> 
>>> Some background information and reference to the possible 
>>> root-cause, described by Shawn Heisey in Solr 1.4 documentation 
>>> >
>> 
>> Our current setup is as follows:
>> 
>> 1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
>> Solr 7.6
>> 
>> 2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
>> 50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
>> DocValues ON
>> 
>> 3. 

Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Erick Erickson
The filterCache isn’t a single huge allocation, it’s made up of _size_ entries, 
each individual entry shouldn’t be that big, each entry should cap around 
maxDoc/8 bytes + some overhead.

I just scanned the e-mail, I’m not clear how many _replicas_ per JVM you have, 
nor how many JVMs per server you’re running. One strategy to deal with large 
heaps if you have a lot of replicas is to run multiple JVMs, each with a 
smaller heap.

One peculiarity of heaps is that at 32G, one must use long pointers, so a 32G 
heap actually has less available memory than a 31G heap if many of the objects 
are small.


> On Oct 14, 2019, at 7:00 AM, Vassil Velichkov (Sensika) 
>  wrote:
> 
> Thanks Jörn,
> 
> Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, 
> but that's not quite optimal in our use-case.
> 
> We've tried with various ratios between JVM Heap / OS RAM (up to 128GB / 
> 256GB) and we have the same Java Heap OOM crashes.
> For example, a BitSet of 160M documents is > 16MB and when we look at the G1 
> logs, it seems it never discards the humongous allocations, so they keep 
> piling. Forcing a full-garbage collection is just not practical - it takes 
> forever and the shard is not usable. Even when a new Searcher is started 
> (every several minutes) the old large filterCache entries are not freed and 
> sooner or later the JVM crashes.
> 
> On the other hand ZGC has a completely different architecture and does not 
> have the hard-coded threshold of 16MB for *humongous allocations*:
> https://wiki.openjdk.java.net/display/zgc/Main
> 
> Anyway, we will be probably testing Java 13 and ZGC with the real data, we 
> just have to reindex 30+ shards to new Solr servers, which will take a couple 
> of days :-)
> 
> Cheers,
> Vassil
> 
> -Original Message-
> From: Jörn Franke  
> Sent: Monday, October 14, 2019 1:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
> tests with Java 13 and the new ZGC?
> 
> I would try JDK11 - it works much better than JDK9 in general. 
> 
> I don‘t think JDK13 with ZGC will bring you better results. There seems to be 
> sth strange with the JDk version or Solr version and some settings. 
> 
> Then , make sure that you have much more free memory for the os cache than 
> the heap. Nearly 100 gb for Solr heap sounds excessive. Try to reduce it to 
> much less.
> 
> Try the default options of Solr and use the latest 7.x version or 8.x version 
> of Solr.
> 
> Additionally you can try to shard more.
> 
>> Am 14.10.2019 um 19:19 schrieb Vassil Velichkov (Sensika) 
>> :
>> 
>> Hi Everyone,
>> 
>> Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 
>> we have frequent OOM crashes on specific nodes.
>> 
>> All investigations (detailed below) lead to a hard-coded limitation in the 
>> G1 garbage collector. The Java Heap is exhausted due to too many filterCache 
>> allocations that are never discarded by the G1.
>> 
>> Our hope is to use Java 13 with the new ZGC, which is specifically designed 
>> for large heap-sizes, and supposedly would handle and dispose larger 
>> allocations. The Solr release notes claim that Solr 7.6 builds are tested 
>> with Java 11 / 12 / 13 (pre-release).
>> Does anyone use Java 13 in production and has experience with the new ZGC 
>> and large heap sizes / large document sets of more than 150M documents per 
>> shard?
>> 
>>> Some background information and reference to the possible 
>>> root-cause, described by Shawn Heisey in Solr 1.4 documentation 
>>> >
>> 
>> Our current setup is as follows:
>> 
>> 1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
>> Solr 7.6
>> 
>> 2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
>> 50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
>> DocValues ON
>> 
>> 3.   The only “hot” and frequently used cache is filterCache, configured 
>> with the default value of 256 entries. If we increase the setting to 512 or 
>> 1024 entries, we are getting 4-5 times better hit-ratio, but the OOM crashes 
>> become too frequent.
>> 
>> 4.   Regardless of the Java Heap size (we’ve tested with even larger 
>> heaps and VM sizing up to 384GB), all nodes that have approx. more than 
>> 120-130M documents crash with OOM under heavy load (hundreds of simultaneous 
>> searches with a variety of Filter Queries).
>> 
>> FilterCache is really frequently used and some of the BitSets are spanning 
>> across 80-90% of the Docset of each shard, so in many cases the FC entries 
>> become larger than 16MB. We believe we’ve pinpointed the problem to the G1 
>> Garbage Collector and the hard-coded limit for "-XX:G1HeapRegionSize", which 
>> allows setting a maximum of 32MB, regardless if it is auto-calculated or set 
>> manually in the JVM startup options. The JVM memory allocation algorithm 
>> tracks every memory 

Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Vassil Velichkov (DGM)
Hi Everyone,

Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 we 
have frequent OOM crashes on specific nodes.

All investigations (detailed below) lead to a hard-coded limitation in the G1 
garbage collector and the Java Heap is exhausted due to too many filterCache 
allocations that are never discarded by G1.

Our hope is to use Java 13 with the new ZGC, which is specifically designed for 
large heap-sizes, and supposedly would handle and dispose larger allocations. 
The Solr release notes claim that Solr 7.6 builds are tested with Java 11 / 12 
/ 13 (pre-release), but does anyone use Java 13 in production and have 
experience with the new ZGC and large heap sizes / large document sets of more 
than 150M documents per shard?

> Some background information and reference to the possible root-cause, 
> described by Shawn Heisey in Solr 1.4 documentation.

Our current setup is as follows:

1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
Solr 7.6

2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
DocValues ON

3.   The only “hot” and frequently used cache is filterCache, configured 
with the default value of 256 entries. If we increase the setting to 512 or 
1024 entries, we are getting 4-5 times better hit-ratio, but the OOM crashes 
become too frequent.

4.   Regardless of the Java Heap size (we’ve tested with even larger heaps 
and VM sizing up to 384GB), all nodes that have approx. more than 120-130M 
documents crash with OOM under heavy load (hundreds of simultaneous searches 
with a variety of Filter Queries).

FilterCache is really frequently used and some of the BitSets are spanning 
across 80-90% of the Docset of each shard, so in many cases the FC entries 
become larger than 16MB. We believe we’ve pinpointed the problem to the G1 
Garbage Collector and the hard-coded limit for "-XX:G1HeapRegionSize", which 
allows setting a maximum of 32MB, regardless if it is auto-calculated or set 
manually in the JVM startup options. The JVM memory allocation algorithm tracks 
every memory allocation request and if the request exceeds 50% of 
G1HeapRegionSize, it is considered humongous allocation (he-he, extremely large 
allocation in 2019?!?), so it is not scanned and evaluated during standard 
garbage collection cycles. Unused humongous allocations are basically freed 
only during Full Garbage Collection cycles, which are never really invoked by 
the G1 garbage collector, before it is too late and the JVM crashes with OOM.
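
To make the 50% rule above concrete, here is a small sketch of the arithmetic 
(assuming a filterCache entry is a bitset of maxDoc/8 bytes and the 32MB 
region-size cap described above):

public class HumongousCheck {
  public static void main(String[] args) {
    long regionSize = 32L * 1024 * 1024;   // -XX:G1HeapRegionSize capped at 32MB
    long threshold = regionSize / 2;       // allocations of half a region or more are "humongous"
    for (long maxDoc : new long[]{100_000_000L, 130_000_000L, 160_000_000L}) {
      long entryBytes = maxDoc / 8;        // one bit per document in the filter bitset
      System.out.printf("maxDoc=%,d -> filter entry=%,d bytes, humongous=%b%n",
          maxDoc, entryBytes, entryBytes >= threshold);
    }
  }
}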

Now we are rebalancing the cluster to have up to 100-120M  documents per shard, 
following an ancient, but probably still valid, limitation suggested in Solr 
1.4 documentation by Shawn 
Heisey: “If you 
have an index with about 100 million documents in it, you'll want to use a 
region size of 32MB, which is the maximum possible size. Because of this 
limitation of the G1 collector, we recommend always keeping a Solr index below 
a maxDoc value of around 100 to 120 million.”

Cheers,
Vassil Velichkov


RE: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Vassil Velichkov (Sensika)
Thanks Jörn,

Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, 
but that's not quite optimal in our use-case.

We've tried with various ratios between JVM Heap / OS RAM (up to 128GB / 256GB) 
and we have the same Java Heap OOM crashes.
For example, a BitSet of 160M documents is > 16MB and when we look at the G1 
logs, it seems it never discards the humongous allocations, so they keep 
piling. Forcing a full-garbage collection is just not practical - it takes 
forever and the shard is not usable. Even when a new Searcher is started (every 
several minutes) the old large filterCache entries are not freed and sooner or 
later the JVM crashes.

On the other hand ZGC has a completely different architecture and does not have 
the hard-coded threshold of 16MB for *humongous allocations*:
https://wiki.openjdk.java.net/display/zgc/Main

Anyway, we will probably be testing Java 13 and ZGC with the real data; we just 
have to reindex 30+ shards to new Solr servers, which will take a couple of 
days :-)

Cheers,
Vassil

-Original Message-
From: Jörn Franke  
Sent: Monday, October 14, 2019 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

I would try JDK11 - it works much better than JDK9 in general. 

I don‘t think JDK13 with ZGC will bring you better results. There seems to be 
sth strange with the JDk version or Solr version and some settings. 

Then , make sure that you have much more free memory for the os cache than the 
heap. Nearly 100 gb for Solr heap sounds excessive. Try to reduce it to much 
less.

Try the default options of Solr and use the latest 7.x version or 8.x version 
of Solr.

Additionally you can try to shard more.

> Am 14.10.2019 um 19:19 schrieb Vassil Velichkov (Sensika) 
> :
> 
> Hi Everyone,
> 
> Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 
> we have frequent OOM crashes on specific nodes.
> 
> All investigations (detailed below) lead to a hard-coded limitation in the G1 
> garbage collector. The Java Heap is exhausted due to too many filterCache 
> allocations that are never discarded by the G1.
> 
> Our hope is to use Java 13 with the new ZGC, which is specifically designed 
> for large heap-sizes, and supposedly would handle and dispose larger 
> allocations. The Solr release notes claim that Solr 7.6 builds are tested 
> with Java 11 / 12 / 13 (pre-release).
> Does anyone use Java 13 in production and has experience with the new ZGC and 
> large heap sizes / large document sets of more than 150M documents per shard?
> 
>> Some background information and reference to the possible 
>> root-cause, described by Shawn Heisey in Solr 1.4 documentation >
> 
> Our current setup is as follows:
> 
> 1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
> Solr 7.6
> 
> 2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
> 50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
> DocValues ON
> 
> 3.   The only “hot” and frequently used cache is filterCache, configured 
> with the default value of 256 entries. If we increase the setting to 512 or 
> 1024 entries, we are getting 4-5 times better hit-ratio, but the OOM crashes 
> become too frequent.
> 
> 4.   Regardless of the Java Heap size (we’ve tested with even larger 
> heaps and VM sizing up to 384GB), all nodes that have approx. more than 
> 120-130M documents crash with OOM under heavy load (hundreds of simultaneous 
> searches with a variety of Filter Queries).
> 
> FilterCache is really frequently used and some of the BitSets are spanning 
> across 80-90% of the Docset of each shard, so in many cases the FC entries 
> become larger than 16MB. We believe we’ve pinpointed the problem to the G1 
> Garbage Collector and the hard-coded limit for "-XX:G1HeapRegionSize", which 
> allows setting a maximum of 32MB, regardless if it is auto-calculated or set 
> manually in the JVM startup options. The JVM memory allocation algorithm 
> tracks every memory allocation request and if the request exceeds 50% of 
> G1HeapRegionSize, it is considered humongous allocation (he-he, extremely 
> large allocation in 2019?!?), so it is not scanned and evaluated during 
> standard garbage collection cycles. Unused humongous allocations are 
> basically freed only during Full Garbage Collection cycles, which are never 
> really invoked by the G1 garbage collector, before it is too late and the JVM 
> crashes with OOM.
> 
> Now we are rebalancing the cluster to have up to 100-120M  documents per 
> shard, following and ancient, but probably still valid limitation suggested 
> in Solr 1.4 documentation by Shawn 
> Heisey: “If you 
> have an index with about 100 million documents in it, you'll want to use a 
> region size of 32MB, which 

Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Jörn Franke
I would try JDK11 - it works much better than JDK9 in general. 

I don't think JDK13 with ZGC will bring you better results. There seems to be 
something strange with the JDK version or Solr version and some settings. 

Then, make sure that you have much more free memory for the OS cache than the 
heap. Nearly 100 GB for Solr heap sounds excessive. Try to reduce it to much 
less.

Try the default options of Solr and use the latest 7.x version or 8.x version 
of Solr.

Additionally you can try to shard more.

> Am 14.10.2019 um 19:19 schrieb Vassil Velichkov (Sensika) 
> :
> 
> Hi Everyone,
> 
> Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 
> we have frequent OOM crashes on specific nodes.
> 
> All investigations (detailed below) lead to a hard-coded limitation in the G1 
> garbage collector. The Java Heap is exhausted due to too many filterCache 
> allocations that are never discarded by the G1.
> 
> Our hope is to use Java 13 with the new ZGC, which is specifically designed 
> for large heap-sizes, and supposedly would handle and dispose larger 
> allocations. The Solr release notes claim that Solr 7.6 builds are tested 
> with Java 11 / 12 / 13 (pre-release).
> Does anyone use Java 13 in production and has experience with the new ZGC and 
> large heap sizes / large document sets of more than 150M documents per shard?
> 
>> Some background information and reference to the possible 
>> root-cause, described by Shawn Heisey in Solr 1.4 documentation >
> 
> Our current setup is as follows:
> 
> 1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
> Solr 7.6
> 
> 2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
> 50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
> DocValues ON
> 
> 3.   The only “hot” and frequently used cache is filterCache, configured 
> with the default value of 256 entries. If we increase the setting to 512 or 
> 1024 entries, we are getting 4-5 times better hit-ratio, but the OOM crashes 
> become too frequent.
> 
> 4.   Regardless of the Java Heap size (we’ve tested with even larger 
> heaps and VM sizing up to 384GB), all nodes that have approx. more than 
> 120-130M documents crash with OOM under heavy load (hundreds of simultaneous 
> searches with a variety of Filter Queries).
> 
> FilterCache is really frequently used and some of the BitSets are spanning 
> across 80-90% of the Docset of each shard, so in many cases the FC entries 
> become larger than 16MB. We believe we’ve pinpointed the problem to the G1 
> Garbage Collector and the hard-coded limit for "-XX:G1HeapRegionSize", which 
> allows setting a maximum of 32MB, regardless if it is auto-calculated or set 
> manually in the JVM startup options. The JVM memory allocation algorithm 
> tracks every memory allocation request and if the request exceeds 50% of 
> G1HeapRegionSize, it is considered humongous allocation (he-he, extremely 
> large allocation in 2019?!?), so it is not scanned and evaluated during 
> standard garbage collection cycles. Unused humongous allocations are 
> basically freed only during Full Garbage Collection cycles, which are never 
> really invoked by the G1 garbage collector, before it is too late and the JVM 
> crashes with OOM.
> 
> Now we are rebalancing the cluster to have up to 100-120M  documents per 
> shard, following and ancient, but probably still valid limitation suggested 
> in Solr 1.4 documentation by Shawn 
> Heisey: “If you 
> have an index with about 100 million documents in it, you'll want to use a 
> region size of 32MB, which is the maximum possible size. Because of this 
> limitation of the G1 collector, we recommend always keeping a Solr index 
> below a maxDoc value of around 100 to 120 million.”
> 
> Cheers,
> Vassil Velichkov


RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-14 Thread Retro
Hello, thanks for the answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format). The
content of these messages is then indexed in Solr. But Solr cannot process the
attachments within those MSG files and cannot OCR them. This is what I need:
to OCR the attachments and get their content indexed in Solr.
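
If pre-processing outside Solr is an option, one possible approach is to run 
each message through Apache Tika with the Tesseract OCR parser enabled, then 
index the extracted text yourself. A rough sketch, assuming Tika is on the 
classpath, the tesseract binary is installed, and "message.msg" is a 
placeholder path:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrAttachmentExtractor {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);                 // recurse into embedded attachments
    context.set(TesseractOCRConfig.class, new TesseractOCRConfig());
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);            // hand images inside PDFs to Tesseract
    context.set(PDFParserConfig.class, pdfConfig);

    BodyContentHandler handler = new BodyContentHandler(-1);   // no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("message.msg"))) {  // placeholder
      parser.parse(in, handler, metadata, context);
    }
    System.out.println(handler);   // extracted text, including OCR'd attachment content
  }
}

The extracted text could then be posted to Solr as a regular field, which 
avoids relying on Solr Cell to do the OCR step.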

Davis, Daniel (NIH/NLM) [C] wrote
> Nuance and ABBYY provide OCR capabilities as well.
> Looking at higher level solutions, both indexengines.com and Comvault can
> do email remediation for legal issues.
>> AJ Weber wrote
>> > There are alternative, paid, libraries to parse and extract attachments
>> > from EML files as well
>> > EML attachments will have a mimetype associated with their metadata.







Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

2019-10-14 Thread Vassil Velichkov (Sensika)
Hi Everyone,

Since we’ve upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6 we 
have frequent OOM crashes on specific nodes.

All investigations (detailed below) lead to a hard-coded limitation in the G1 
garbage collector. The Java Heap is exhausted due to too many filterCache 
allocations that are never discarded by the G1.

Our hope is to use Java 13 with the new ZGC, which is specifically designed for 
large heap-sizes, and supposedly would handle and dispose larger allocations. 
The Solr release notes claim that Solr 7.6 builds are tested with Java 11 / 12 
/ 13 (pre-release).
Does anyone use Java 13 in production and have experience with the new ZGC and 
large heap sizes / large document sets of more than 150M documents per shard?

> Some background information and reference to the possible root-cause, 
> described by Shawn Heisey in Solr 1.4 documentation >

Our current setup is as follows:

1.   All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
Solr 7.6

2.   Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
DocValues ON

3.   The only “hot” and frequently used cache is filterCache, configured 
with the default value of 256 entries. If we increase the setting to 512 or 
1024 entries, we are getting 4-5 times better hit-ratio, but the OOM crashes 
become too frequent.

4.   Regardless of the Java Heap size (we’ve tested with even larger heaps 
and VM sizing up to 384GB), all nodes that have approx. more than 120-130M 
documents crash with OOM under heavy load (hundreds of simultaneous searches 
with a variety of Filter Queries).

FilterCache is really frequently used and some of the BitSets are spanning 
across 80-90% of the Docset of each shard, so in many cases the FC entries 
become larger than 16MB. We believe we’ve pinpointed the problem to the G1 
Garbage Collector and the hard-coded limit for "-XX:G1HeapRegionSize", which 
allows setting a maximum of 32MB, regardless if it is auto-calculated or set 
manually in the JVM startup options. The JVM memory allocation algorithm tracks 
every memory allocation request and if the request exceeds 50% of 
G1HeapRegionSize, it is considered humongous allocation (he-he, extremely large 
allocation in 2019?!?), so it is not scanned and evaluated during standard 
garbage collection cycles. Unused humongous allocations are basically freed 
only during Full Garbage Collection cycles, which are never really invoked by 
the G1 garbage collector, before it is too late and the JVM crashes with OOM.

Now we are rebalancing the cluster to have up to 100-120M  documents per shard, 
following an ancient, but probably still valid, limitation suggested in Solr 
1.4 documentation by Shawn 
Heisey: “If you 
have an index with about 100 million documents in it, you'll want to use a 
region size of 32MB, which is the maximum possible size. Because of this 
limitation of the G1 collector, we recommend always keeping a Solr index below 
a maxDoc value of around 100 to 120 million.”

Cheers,
Vassil Velichkov