Hardware-related issue

2020-06-01 Thread Rudenko, Artur
Hi Guys,

We were planning on using 7 physical servers for our Solr nodes, each with 64 vCPUs 
at 2 GHz and 128 GB RAM, but due to some constraints we had to use a virtual 
environment that does not have the same number of CPUs. We were advised to use 
fewer CPUs with a higher clock speed.
Should we be looking at L1-L3 cache metrics? Do we need to increase RAM/IOPS 
proportionally to fill the "gap" left by the smaller number of CPU cores?
Any particular suggestions?

Artur Rudenko


This electronic message may contain proprietary and confidential information of 
Verint Systems Inc., its affiliates and/or subsidiaries. The information is 
intended to be for the use of the individual(s) or entity(ies) named above. If 
you are not the intended recipient (or authorized to receive this e-mail for 
the intended recipient), you may not use, copy, disclose or distribute to 
anyone this message or any information contained in this message. If you have 
received this electronic message in error, please notify us by replying to this 
e-mail.


RE: Filtering large amount of values

2020-05-17 Thread Rudenko, Artur
Hi Mikhail,

Thank you for the help; with your suggestion we actually managed to improve the 
results.

We now fetch and store the docValues in this method instead of inside the 
collect() method:

@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);
    sortedDocValues = DocValues.getSorted(context.reader(), FileFilterPostQuery.this.metaField);
}

We see a big improvement. Is this the most efficient way?
Since it's a post filter, we have to return false from the getCache() method. Is 
there a way to implement it with caching?

Thanks,
Artur Rudenko

-Original Message-
From: Mikhail Khludnev 
Sent: Thursday, May 14, 2020 2:57 PM
To: solr-user 
Subject: Re: Filtering large amount of values

Hi, Artur.

Please don't tell me that you obtain docValues for every doc? That is deadly slow; 
see https://issues.apache.org/jira/browse/LUCENE-9328 for a related problem.
Make sure you obtain them once per segment, when the leaf reader is injected.
Recently some new method(s) were added for {!terms}; I wonder if any of them 
might solve the problem.

On Thu, May 14, 2020 at 2:36 PM Rudenko, Artur 
wrote:

> Hi,
> We have a requirement of implementing a boolean filter with up to 500k
> values.
>
> We took the approach of post filter.
>
> Our environment has 7 servers of 128gb ram and 64cpus each server. We
> have 20-40m very large documents. Each solr instance has 64 shards
> with 2 replicas and JVM memory xms and xmx set to 31GB.
>
> We are seeing that using a single post filter with 1000 values on 20m
> documents takes about 4.5 seconds.
>
> Logic in our collect method:
> numericDocValues =
> reader.getNumericDocValues(FileFilterPostQuery.this.metaField);
>
> if (numericDocValues != null &&
> numericDocValues.advanceExact(docNumber)) {
> longVal = numericDocValues.longValue();
> } else {
> return;
> }
>
> if (numericValuesSet.contains(longVal)) {
> super.collect(docNumber);
> }
>
>
> Is it the best we can get?
>
>
> Thanks,
> Artur Rudenko
>
>


--
Sincerely yours
Mikhail Khludnev




Filtering large amount of values

2020-05-14 Thread Rudenko, Artur
Hi,
We have a requirement to implement a Boolean filter with up to 500k values.

We took the approach of post filter.

Our environment has 7 servers with 128 GB RAM and 64 CPUs each. We have 
20-40m very large documents. Each Solr instance has 64 shards with 2 replicas, 
and the JVM heap (Xms and Xmx) is set to 31 GB.

We are seeing that a single post filter with 1000 values on 20m documents takes 
about 4.5 seconds.

Logic in our collect method:

numericDocValues = reader.getNumericDocValues(FileFilterPostQuery.this.metaField);

if (numericDocValues != null && numericDocValues.advanceExact(docNumber)) {
    longVal = numericDocValues.longValue();
} else {
    return;
}

if (numericValuesSet.contains(longVal)) {
    super.collect(docNumber);
}
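The contains() check itself also matters at this scale: loading the up-to-500k filter values into a hash set once, when the query is constructed, keeps the per-document test O(1) on average. A plain-Java sketch of that lookup pattern (class and variable names are hypothetical, not taken from the actual FileFilterPostQuery):

```java
import java.util.HashSet;
import java.util.Set;

public class SetMembershipSketch {
    public static void main(String[] args) {
        // Build the filter set once, at query-construction time.
        Set<Long> numericValuesSet = new HashSet<>();
        for (long v = 0; v < 500_000; v++) {
            numericValuesSet.add(v);
        }

        // The per-document check is then a single hash lookup.
        long matched = 0;
        for (long docValue = 250_000; docValue < 750_000; docValue++) {
            if (numericValuesSet.contains(docValue)) {
                matched++;
            }
        }
        System.out.println(matched); // 250000: half of the probed values fall inside the set
    }
}
```

At 500k boxed Long values, a primitive-long set from a collections library would cut memory and GC pressure further; this sketch sticks to the JDK for portability.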


Is it the best we can get?


Thanks,
Artur Rudenko




RE: Possible performance bug - JSON facet - numBuckets:true

2020-03-15 Thread Rudenko, Artur
Update:

I started working on a fix for this issue, and I found that the result for 
"numBuckets" in the original implementation is not accurate:

Query using my fix for limit -1:

{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":31,
"params":{
  "q":"*:*",
  "json.facet":"{\"Chart_01_Bins\":{type:terms, field:date, mincount:1, 
limit:-1, numBuckets:true, missing:false, refine:true }}",
  "rows":"0"}},
  "response":{"numFound":170500,"start":0,"maxScore":1.0,"docs":[]
  },
  "facets":{
"count":170500,
"Chart_01":{
  "numBuckets":2660,
  "buckets":[{
  "val":"2019-01-16T15:17:03Z",
  "count":749},
{
  "val":"2019-01-23T21:46:44Z",
  "count":742},
{
  "val":"2019-01-04T11:06:22Z",
  "count":603},
{
  "val":"2019-01-08T01:08:58Z",
  "count":484},
 .
 .
 .
 .
{
  "val":"2019-01-26T06:30:33Z",
  "count":3}]}}}


Query with a high limit that should include all buckets, based on the current 
Solr implementation:
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":29,
"params":{
  "q":"*:*",
  "json.facet":"{\"Chart_01_Bins\":{type:terms, field:date, mincount:1, 
limit:5000, numBuckets:true, missing:false, refine:true }}",
  "rows":"0"}},
  "response":{"numFound":170500,"start":0,"maxScore":1.0,"docs":[]
  },
  "facets":{
"count":170500,
"Chart_01_Bins":{
  "numBuckets":2671,
  "buckets":[{
  "val":"2019-01-16T15:17:03Z",
  "count":749},
{
  "val":"2019-01-23T21:46:44Z",
  "count":742},
{
  "val":"2019-01-04T11:06:22Z",
  "count":603},
{
  "val":"2019-01-08T01:08:58Z",
  "count":484},
 .
 .
 .
 .
  "val":"2019-01-26T06:30:33Z",
  "count":3}]}}}

There are 2660 buckets (which is the result of my fix), while the original Solr 
implementation claims there are 2671 buckets (11 more).
The results of both queries were compared with a diff tool, and apart from QTime, 
the different limit value and the numBuckets value, they were identical (I decided 
not to paste all of the buckets in the response, but both returned the same 2660 
buckets, not 2671).
I also could not find in the docs that "numBuckets" is an estimate. For low 
cardinality fields, the result was accurate.

Is this the expected behavior?


Artur Rudenko

-Original Message-
From: Mikhail Khludnev 
Sent: Tuesday, March 10, 2020 8:46 AM
To: solr-user 
Subject: Re: Possible performance bug - JSON facet - numBuckets:true

Hello, Artur.

Thanks for your interest.
Perhaps we can amend the doc to mention this effect. In the long term it can be 
optimized by adding a proper condition. Both patches are welcome.

On Wed, Feb 12, 2020 at 10:48 PM Rudenko, Artur 
wrote:

> Hello everyone,
> I am currently investigating a performance issue in our environment
> and it looks like we found a performance bug.
> Our environment:
> 20M large PARENT documents and 800M nested small CHILD documents.
> The system inserts about 400K PARENT documents and 16M CHILD documents
> per day. (Currently we stopped the calls insertion to investigate the
> performance issue) This is a solr cloud 8.3 environment with 7 servers
> (64 VCPU 128 GB RAM each, 24GB allocated to Solr) with single
> collection (32 shards and replication factor 2).
>
> The below query runs in about 14-16 seconds (we have to use limit:-1
> due to a business case - cardinality is 1K values).
>
> fq=channel:345133
> &fq=content_type:PARENT
> &fq=Meta_is_organizationIds:(344996998 344594999 34501 a total
> of 562 int values)
> &q=*:*
> &json.facet={
> "Chart_01_Bins":{
> type:terms,
> field:groupIds,
> mincount:1,
> limit:-1,
> numBuckets:true,
> missing:false,
>   

RE: Possible performance bug - JSON facet - numBuckets:true

2020-03-08 Thread Rudenko, Artur
Guys?

Artur Rudenko

-Original Message-
From: Rudenko, Artur 
Sent: Saturday, February 15, 2020 12:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Possible performance bug - JSON facet - numBuckets:true

Promoting my question


Thanks,
Artur Rudenko

From: Rudenko, Artur
Sent: Wednesday, February 12, 2020 9:48 PM
To: solr-user@lucene.apache.org
Subject: Possible performance bug - JSON facet - numBuckets:true

Hello everyone,
I am currently investigating a performance issue in our environment, and it 
looks like we found a performance bug.
Our environment:
20M large PARENT documents and 800M nested small CHILD documents.
The system inserts about 400K PARENT documents and 16M CHILD documents per day. 
(We have currently stopped call insertion to investigate the performance issue.) 
This is a SolrCloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) and a single collection (32 shards, replication 
factor 2).

The query below runs in about 14-16 seconds (we have to use limit:-1 due to a 
business case; the cardinality is about 1K values).

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true,
facet:{

min_score_avg:"avg(min_score)",

max_score_avg:"avg(max_score)",

avg_score_avg:"avg(avg_score)"
}
},
"Chart_01_FIELD_NOT_EXISTS":{
type:query,
q:"-groupIds:[* TO *]",
facet:{
min_score_avg:"avg(min_score)",
max_score_avg:"avg(max_score)",
avg_score_avg:"avg(avg_score)"
}
}
}
&rows=0

Also, when the facet is simplified, it takes about 4-6 seconds

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 total of int 562 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true
}
}
&rows=0

Schema relevant fields:

[field and field-type definitions were stripped by the mailing-list archive]

I noticed that when we set numBuckets:false, the result returns faster (1.5-3.5 
seconds less), which looks like a performance bug: with limit:-1, all buckets are 
returned anyway, so spending significant additional time just to compute the 
number of buckets doesn't seem right.

Any thoughts?


Thanks
Artur Rudenko




RE: Slow queries and facets

2020-02-15 Thread Rudenko, Artur
Promoting my question

Artur Rudenko

-Original Message-
From: Rudenko, Artur 
Sent: Wednesday, February 12, 2020 10:33 PM
To: solr-user@lucene.apache.org
Subject: Slow queries and facets

Hello everyone,
I am currently investigating a performance issue in our environment:
20M large PARENT documents and 800M nested small CHILD documents.
The system inserts about 400K PARENT documents and 16M CHILD documents per day. 
(We have currently stopped call insertion to investigate the performance issue.) 
This is a SolrCloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) and a single collection (32 shards and replication 
factor 2).

We are experiencing generally slow query and facet times (about 4-7 seconds). The 
query below runs in about 14-16 seconds (we have to use limit:-1 due to a 
business case; the cardinality is about 1K values).

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true,
facet:{

min_score_avg:"avg(min_score)",

max_score_avg:"avg(max_score)",

avg_score_avg:"avg(avg_score)"
}
},
"Chart_01_FIELD_NOT_EXISTS":{
type:query,
q:"-groupIds:[* TO *]",
facet:{
min_score_avg:"avg(min_score)",
max_score_avg:"avg(max_score)",
avg_score_avg:"avg(avg_score)"
}
}
}
&rows=0

Also, when the facet is simplified, it takes about 4-6 seconds

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true
}
}
&rows=0

Schema relevant fields:

[field and field-type definitions were stripped by the mailing-list archive]

Any suggestions on how to proceed with the investigation?

Right now we are trying to figure out whether using a single shard on each 
machine will help.
Artur Rudenko
Analytics Developer
Customer Engagement Solutions, VERINT
T +972.74.747.2536 | M +972.52.425.4686





RE: Possible performance bug - JSON facet - numBuckets:true

2020-02-15 Thread Rudenko, Artur
Promoting my question


Thanks,
Artur Rudenko

From: Rudenko, Artur
Sent: Wednesday, February 12, 2020 9:48 PM
To: solr-user@lucene.apache.org
Subject: Possible performance bug - JSON facet - numBuckets:true

Hello everyone,
I am currently investigating a performance issue in our environment, and it 
looks like we found a performance bug.
Our environment:
20M large PARENT documents and 800M nested small CHILD documents.
The system inserts about 400K PARENT documents and 16M CHILD documents per day. 
(We have currently stopped call insertion to investigate the performance issue.) 
This is a SolrCloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) and a single collection (32 shards and replication 
factor 2).

The query below runs in about 14-16 seconds (we have to use limit:-1 due to a 
business case; the cardinality is about 1K values).

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true,
facet:{

min_score_avg:"avg(min_score)",

max_score_avg:"avg(max_score)",

avg_score_avg:"avg(avg_score)"
}
},
"Chart_01_FIELD_NOT_EXISTS":{
type:query,
q:"-groupIds:[* TO *]",
facet:{
min_score_avg:"avg(min_score)",
max_score_avg:"avg(max_score)",
avg_score_avg:"avg(avg_score)"
}
}
}
&rows=0

Also, when the facet is simplified, it takes about 4-6 seconds

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true
}
}
&rows=0

Schema relevant fields:

[field and field-type definitions were stripped by the mailing-list archive]

I noticed that when we set numBuckets:false, the result returns faster (1.5-3.5 
seconds less), which looks like a performance bug: with limit:-1, all buckets are 
returned anyway, so spending significant additional time just to compute the 
number of buckets doesn't seem right.

Any thoughts?


Thanks
Artur Rudenko




Slow queries and facets

2020-02-12 Thread Rudenko, Artur
Hello everyone,
I am currently investigating a performance issue in our environment:
20M large PARENT documents and 800M nested small CHILD documents.
The system inserts about 400K PARENT documents and 16M CHILD documents per day. 
(We have currently stopped call insertion to investigate the performance issue.) 
This is a SolrCloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) and a single collection (32 shards and replication 
factor 2).

We are experiencing generally slow query and facet times (about 4-7 seconds). The 
query below runs in about 14-16 seconds (we have to use limit:-1 due to a 
business case; the cardinality is about 1K values).

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true,
facet:{

min_score_avg:"avg(min_score)",

max_score_avg:"avg(max_score)",

avg_score_avg:"avg(avg_score)"
}
},
"Chart_01_FIELD_NOT_EXISTS":{
type:query,
q:"-groupIds:[* TO *]",
facet:{
min_score_avg:"avg(min_score)",
max_score_avg:"avg(max_score)",
avg_score_avg:"avg(avg_score)"
}
}
}
&rows=0

Also, when the facet is simplified, it takes about 4-6 seconds

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true
}
}
&rows=0

Schema relevant fields:

[field and field-type definitions were stripped by the mailing-list archive]

Any suggestions on how to proceed with the investigation?

Right now we are trying to figure out whether using a single shard on each 
machine will help.
Artur Rudenko
Analytics Developer
Customer Engagement Solutions, VERINT
T +972.74.747.2536 | M +972.52.425.4686





Possible performance bug - JSON facet - numBuckets:true

2020-02-12 Thread Rudenko, Artur
Hello everyone,
I am currently investigating a performance issue in our environment, and it 
looks like we found a performance bug.
Our environment:
20M large PARENT documents and 800M nested small CHILD documents.
The system inserts about 400K PARENT documents and 16M CHILD documents per day. 
(We have currently stopped call insertion to investigate the performance issue.) 
This is a SolrCloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) and a single collection (32 shards and replication 
factor 2).

The query below runs in about 14-16 seconds (we have to use limit:-1 due to a 
business case; the cardinality is about 1K values).

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true,
facet:{

min_score_avg:"avg(min_score)",

max_score_avg:"avg(max_score)",

avg_score_avg:"avg(avg_score)"
}
},
"Chart_01_FIELD_NOT_EXISTS":{
type:query,
q:"-groupIds:[* TO *]",
facet:{
min_score_avg:"avg(min_score)",
max_score_avg:"avg(max_score)",
avg_score_avg:"avg(avg_score)"
}
}
}
&rows=0

Also, when the facet is simplified, it takes about 4-6 seconds

fq=channel:345133
&fq=content_type:PARENT
&fq=Meta_is_organizationIds:(344996998 344594999 34501 a total of 562 int 
values)
&q=*:*
&json.facet={
"Chart_01_Bins":{
type:terms,
field:groupIds,
mincount:1,
limit:-1,
numBuckets:true,
missing:false,
refine:true
}
}
&rows=0

Schema relevant fields:

[field and field-type definitions were stripped by the mailing-list archive]

I noticed that when we set numBuckets:false, the result returns faster (1.5-3.5 
seconds less), which looks like a performance bug: with limit:-1, all buckets are 
returned anyway, so spending significant additional time just to compute the 
number of buckets doesn't seem right.

Any thoughts?


Thanks
Artur Rudenko




RE: Possible performance issue in my environment setup

2020-02-11 Thread Rudenko, Artur
Thanks for helping, I will keep investigating.

Just to note: we did stop indexing, and we did not see any significant changes.

Artur Rudenko
Analytics Developer
Customer Engagement Solutions, VERINT
T +972.74.747.2536 | M +972.52.425.4686

-Original Message-
From: Erick Erickson 
Sent: Tuesday, February 11, 2020 4:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Possible performance issue in my environment setup

My first bit of advice would be to fix your autocommit intervals. There's not 
much point in having openSearcher set to true _and_ having your soft commit 
times also set; all a soft commit does is open a searcher, and your autocommit 
already does that.

I’d also reduce the time for autoCommit. You’re _probably_ being saved by the 
maxDoc entry,

The fix here is to set openSearcher=false in autoCommit and reduce the time, and 
let soft commits handle opening searchers. Here's more than you want to know 
about how all this works:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
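In solrconfig.xml terms, that advice amounts to something like the following fragment (the interval values here are illustrative placeholders, not tuned recommendations for this workload):

```xml
<!-- Hard commit: flush and fsync the index regularly, but do NOT open a
     new searcher; this keeps the transaction log bounded cheaply. -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: the only thing that opens searchers, i.e. what controls
     when newly indexed documents become visible. -->
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>
```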

Given your observation that you see a new searcher being opened 65K times, my 
bet is that you’re somehow committing far, far too often. What’s the rate of 
opening new searchers? Do those 65K entries span an hour? 10 days? Either 
you’re sending 50K docs very frequently or your client is sending commits.

So here’s what I’d do as a quick-n-dirty triage of where to look first:

- first turn off indexing. Does your query performance improve? If so, consider 
autowarming and tuning your commit interval.

- next, add &debug=timing to some of your queries. That’ll tell you if a 
particular _component_ is taking a long time, something like faceting say.

- If nothing jumps out, throw a profiler at Solr to see where it's spending 
its time.

Best,
Erick

> On Feb 11, 2020, at 6:17 AM, Rudenko, Artur  wrote:
>
> I am currently investigating a performance issue in our environment (20M 
> large PARENT documents and 800M nested small CHILD documents). The system 
> inserts about 400K PARENT documents and 16M CHILD documents per day.
> This is a solr cloud 8.3 environment with 7 servers (64 VCPU 128 GB RAM each, 
> 24GB allocated to Solr) with single collection (32 shards and replication 
> factor 2).
>
> Solr config related info :
>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:360}</maxTime>
>   <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
>   <openSearcher>true</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:30}</maxTime>
> </autoSoftCommit>
>
> I found in the solr log the following log line:
>
> [2020-02-10T00:01:00.522] INFO [qtp1686100174-100525]
> org.apache.solr.search.SolrIndexSearcher Opening
> [Searcher@37c9205b[0_shard29_replica_n112] realtime]
>
> From a log with 100K records, the above log record appears 65K times.
>
> We are experiencing extremely slow query time while the indexing time is fast 
> and sufficient.
>
> Is this a possible direction to keep investigating? If so, any advice?
>
>
> Thanks,
> Artur Rudenko
>
>





Possible performance issue in my environment setup

2020-02-11 Thread Rudenko, Artur
I am currently investigating a performance issue in our environment (20M 
large PARENT documents and 800M nested small CHILD documents). The system 
inserts about 400K PARENT documents and 16M CHILD documents per day.
This is a solr cloud 8.3 environment with 7 servers (64 VCPU 128 GB RAM each, 
24GB allocated to Solr) with single collection (32 shards and replication 
factor 2).

Solr config related info :


<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:360}</maxTime>
  <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:30}</maxTime>
</autoSoftCommit>

I found in the solr log the following log line:

[2020-02-10T00:01:00.522] INFO [qtp1686100174-100525] 
org.apache.solr.search.SolrIndexSearcher Opening 
[Searcher@37c9205b[0_shard29_replica_n112] realtime]

From a log with 100K records, the above log record appears 65K times.

We are experiencing extremely slow query times, while indexing is fast and 
sufficient.

Is this a possible direction to keep investigating? If so, any advice?


Thanks,
Artur Rudenko




Solr facet response strange behaviour

2020-01-27 Thread Rudenko, Artur
I'm trying to parse a facet response, but sometimes the count comes back as a 
Long and sometimes as an Integer (on different environments). The error is:
"java.lang.ClassCastException: java.lang.Integer cannot be cast to 
java.lang.Long"

Can you please explain why this happens? Why is it not consistent?

I know the workaround of using the Number class and its longValue() method, but 
I want to understand the root cause before using this workaround.
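For what it's worth, the defensive pattern is exactly the workaround you mention: read every numeric value from the response through Number, since the response parser is free to choose the narrowest type that fits (small counts may arrive as Integer, larger ones as Long, and different transports/parsers choose differently). A minimal illustration (values are hypothetical):

```java
public class FacetCountSketch {
    // Reads a facet count that may arrive as either Integer or Long.
    static long asLong(Object count) {
        return ((Number) count).longValue();
    }

    public static void main(String[] args) {
        Object small = 42;             // small counts may deserialize as Integer
        Object large = 5_000_000_000L; // counts beyond Integer range must be Long
        System.out.println(asLong(small));
        System.out.println(asLong(large));
    }
}
```

A blind cast like (Long) count works only until a response happens to carry an Integer, which is why the failure shows up on some environments and not others.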

Artur Rudenko





Type of auto suggest feature

2019-11-24 Thread Rudenko, Artur
Hi,
I am quite new to Solr, and I am interested in implementing a kind of term 
auto-suggest (not auto-complete) feature based on the user's query.
The user builds a query (on multiple fields), and I am trying to help him 
refine it by suggesting additional terms based on his current query.
The suggestions should contain synonyms and different word forms (query: close; 
results: closed, closing) and also some other "interesting" (hard to define what 
interesting is) terms and phrases based on that search.

The queries are performed on a text field of about 1000 words, over document 
sets of about 20-50M.

So far I have come up with a solution that uses the Suggester component over the 
1000-word text field (a copy field), and I'm trying to find out how to add more 
"interesting" terms and phrases based on the text field. (The configuration 
itself was stripped by the mailing-list archive.)
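Since the actual configuration did not survive the archive, here is the generic shape of a Suggester setup over a copy field (the component, dictionary, field, and handler names are placeholders, not the original config):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">termSuggester</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">text_suggest</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">termSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```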

Thanks,
Artur Rudenko


