Solrcloud create collection ignores createNodeSet parameter

2020-10-27 Thread Webster Homer
We have a SolrCloud set up with 2 nodes and 1 ZooKeeper, running Solr 7.7.2. 
This cloud is used for development purposes. Collections are sharded across the 
2 nodes.

Recently we noticed that one of the main collections we use had both replicas 
running on the same node. Normally we don't see collections created where the 
replicas run on the same node.

I tried to create a new version of the collection, forcing it to use both nodes. 
However, that doesn't work; both replicas are still created on the same node:
/solr/admin/collections?action=CREATE&name=sial-catalog-product-20201027&collection.configName=sial-catalog-product-20200808&numShards=2&replicationFactor=1&maxShardsPerNode=1&createNodeSet=uc1a-ecomdev-msc02:8983_solr,uc1a-ecomdev-msc01:8983_solr
The call returns this:
{
  "responseHeader": {
    "status": 0,
    "QTime": 4659
  },
  "success": {
    "uc1a-ecomdev-msc01:8983_solr": {
      "responseHeader": {
        "status": 0,
        "QTime": 3900
      },
      "core": "sial-catalog-product-20201027_shard2_replica_n2"
    },
    "uc1a-ecomdev-msc01:8983_solr": {
      "responseHeader": {
        "status": 0,
        "QTime": 4012
      },
      "core": "sial-catalog-product-20201027_shard1_replica_n1"
    }
  }
}

Both replicas are created on the same node. Why is this happening?

How do we force the replicas to be placed on different nodes?
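
For comparison, a hedged sketch of the same CREATE call issued via curl so that every parameter is explicitly URL-encoded; host and collection names are taken from above, and the main thing to verify is that the comma-separated createNodeSet value arrives intact:

# Sketch only; parameter values copied from the call above.
curl "http://uc1a-ecomdev-msc01:8983/solr/admin/collections" \
  --data-urlencode 'action=CREATE' \
  --data-urlencode 'name=sial-catalog-product-20201027' \
  --data-urlencode 'collection.configName=sial-catalog-product-20200808' \
  --data-urlencode 'numShards=2' \
  --data-urlencode 'replicationFactor=1' \
  --data-urlencode 'maxShardsPerNode=1' \
  --data-urlencode 'createNodeSet=uc1a-ecomdev-msc02:8983_solr,uc1a-ecomdev-msc01:8983_solr'

Rule-based replica placement (e.g. rule=shard:*,replica:<2,node:*) is another way to forbid two replicas of one shard on a single node, though whether it helps in this case is untested.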





RE: Odd Solr zkcli script behavior

2020-08-27 Thread Webster Homer
Never mind, I figured out my problem.

-Original Message-
From: Webster Homer 
Sent: Thursday, August 27, 2020 10:29 AM
To: solr-user@lucene.apache.org
Subject: Odd Solr zkcli script behavior

I am using Solr 7.7.2 SolrCloud.

We version our collection and config set names with dates. I have two 
collections, sial-catalog-product-20200711 and sial-catalog-product-20200808. A 
developer uploaded a configuration file to the 20200711 version that was not 
checked into our source control, and I wanted to retrieve it from ZooKeeper, as 
we cannot find that version anywhere else. So I tried the zkcli.sh shell script.

It always throws an exception when trying to access 
sial-catalog-product-20200711, but not when trying to access 
sial-catalog-product-20200808:

INFO  - 2020-08-27 10:26:36.283; org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
Exception in thread "main" org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /solr/configs/sial-catalog-product-20200711/_schema_model-store.json
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1221)
	at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:358)
	at org.apache.solr.common.cloud.SolrZkClient$$Lambda$6/1384010761.execute(Unknown Source)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:358)
	at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:331)

I can see both collections and config sets in the Solr Admin console, and I can 
download the file from sial-catalog-product-20200808 with no problem. As far as 
I can tell both config sets and collections are accessible in the cloud; the 
only difference is that we have an alias pointing to the newer one, which is 
current, but the zkcli script does not use the alias.

I tried both the getfile and downconfig commands and the behavior is consistent: 
I can always get to the later one, but the 20200711 version gives the 
NoNodeException. What is going on here?

A general comment: we use a ZooKeeper chroot, but the zkcli command doesn't seem 
to care whether I pass the root on the zkhost argument or not. I also noticed 
that the zkcli command is poorly documented.
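
For the record, a hedged sketch of the two zkcli.sh invocations, assuming a standard install layout and a /solr chroot; zk1:2181 is a placeholder host:

# fetch a single file from a configset
./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/solr \
  -cmd getfile /configs/sial-catalog-product-20200711/_schema_model-store.json model-store.json

# download the whole configset to a local directory
./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/solr \
  -cmd downconfig -confname sial-catalog-product-20200711 -confdir ./conf-20200711

Whether the /configs path must include the chroot prefix depends on whether the chroot was passed in -zkhost, which may be why passing the root seems to make no difference.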





Odd Solr zkcli script behavior

2020-08-27 Thread Webster Homer
I am using Solr 7.7.2 SolrCloud.

We version our collection and config set names with dates. I have two 
collections, sial-catalog-product-20200711 and sial-catalog-product-20200808. A 
developer uploaded a configuration file to the 20200711 version that was not 
checked into our source control, and I wanted to retrieve it from ZooKeeper, as 
we cannot find that version anywhere else. So I tried the zkcli.sh shell script.

It always throws an exception when trying to access 
sial-catalog-product-20200711, but not when trying to access 
sial-catalog-product-20200808:
INFO  - 2020-08-27 10:26:36.283; org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
Exception in thread "main" org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /solr/configs/sial-catalog-product-20200711/_schema_model-store.json
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1221)
	at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:358)
	at org.apache.solr.common.cloud.SolrZkClient$$Lambda$6/1384010761.execute(Unknown Source)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:358)
	at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:331)

I can see both collections and config sets in the Solr Admin console, and I can 
download the file from sial-catalog-product-20200808 with no problem. As far as 
I can tell both config sets and collections are accessible in the cloud; the 
only difference is that we have an alias pointing to the newer one, which is 
current, but the zkcli script does not use the alias.

I tried both the getfile and downconfig commands and the behavior is consistent: 
I can always get to the later one, but the 20200711 version gives the 
NoNodeException. What is going on here?

A general comment: we use a ZooKeeper chroot, but the zkcli command doesn't seem 
to care whether I pass the root on the zkhost argument or not. I also noticed 
that the zkcli command is poorly documented.





RE: How to measure search performance

2020-07-23 Thread Webster Homer
I forgot to mention that the fields being used in the function query are indexed 
fields. They are mostly text fields, which cannot have docValues.

-Original Message-
From: Webster Homer 
Sent: Thursday, July 23, 2020 2:07 PM
To: solr-user@lucene.apache.org
Subject: RE: How to measure search performance

Hi Erick,

This is an example of a pseudo field:

wdim_pno_:if(gt(query({!edismax qf=searchmv_pno v=$q}),0),1,0)

I get your point that it would only be applied to the results returned and not 
to all the results. The intent is to be able to identify which of the fields 
matched the search. Our business people are keen to know, for internal reasons.

I have not done a lot of function queries like this; does using edismax make 
them less performant? My tests have a lot of variability, but I do see an effect 
on QTime from adding these; it is hard to quantify, but it could be as much 
as 10%.

Thank you for your quick response.
Webster

-Original Message-
From: Erick Erickson 
Sent: Thursday, July 23, 2020 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: How to measure search performance

This isn’t usually a cause for concern. Clearing the caches doesn’t necessarily 
clear the OS caches for instance. I think you’re already aware that Lucene uses 
MMapDirectory, meaning the index pages are mapped to OS memory space. Whether 
those pages are actually _in_ the OS physical memory or not is anyone’s guess 
so depending on when they’re needed they might have to be read from disk. This 
is entirely independent of Solr’s caches, and could come into play even if you 
restarted Solr.

Then there’s your function queries for the pseudo fields. This is read from the 
docValues sections of the index. Once again the relevant parts of the index may 
or may not be in the OS memory.

So comparing individual queries is “fraught” with uncertainties. I suppose you 
could reboot the machines each time ;) I’ve only ever had luck averaging a 
bunch of unique queries when trying to measure perf differences.

Do note that function queries for pseudo fields is not something I’d expect to 
add much overhead at all. The reason is that they’re only called for the top N 
docs that you’re returning, not part of the search at all. Consider a function 
query involved in scoring. That one must be called for every document that 
matches. But a function query for a pseudo field is only called for the docs 
returned in the packet, i.e. the “rows” parameter.

Best,
Erick
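
To make that concrete, a hedged sketch of a request carrying such a pseudo field in fl; the collection name and query term are placeholders, and the qf field comes from the thread:

curl "http://localhost:8983/solr/my-collection/select" \
  --data-urlencode 'q=acetone' \
  --data-urlencode 'defType=edismax' \
  --data-urlencode 'qf=searchmv_pno' \
  --data-urlencode 'rows=10' \
  --data-urlencode 'fl=id,score,wdim_pno_:if(gt(query({!edismax qf=searchmv_pno v=$q}),0),1,0)'

The if(gt(query(...),0),1,0) function runs only for the 10 rows returned, which is why the overhead stays small.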

> On Jul 23, 2020, at 11:49 AM, Webster Homer 
>  wrote:
>
> I'm trying to determine the overhead of adding some pseudo fields to one of 
> our standard searches. The pseudo fields are simply function queries to 
> report if certain fields matched the query or not. I had thought that I could 
> run the search without the change and then re-run the searches with the 
> fields added.
> I had assumed that the QTime in the query response would be a good metric to 
> use when comparing the performance of the two search queries. However I see 
> that the QTime for a query can vary by more than 10%. When testing I cleared 
> the query cache between tests. Usually the QTime would be within a few 
> milliseconds of each other, however in some cases there was a 10X or more 
> difference between them.
> Even cached queries vary in their QTime, though much less.
>
> I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
> replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.
>
> I am concerned that these discrepancies indicate that our system is not tuned 
> well enough.
> Should I expect that a query's QTime really is a measure of the query's 
> inherent performance? Is there a better way to measure query performance?




RE: How to measure search performance

2020-07-23 Thread Webster Homer
Hi Erick,

This is an example of a pseudo field: wdim_pno_:if(gt(query({!edismax 
qf=searchmv_pno v=$q}),0),1,0)
I get your point that it would only be applied to the results returned and not 
to all the results. The intent is to be able to identify which of the fields 
matched the search. Our business people are keen to know, for internal reasons.

I have not done a lot of function queries like this; does using edismax make 
them less performant? My tests have a lot of variability, but I do see an effect 
on QTime from adding these; it is hard to quantify, but it could be as much 
as 10%.

Thank you for your quick response.
Webster

-Original Message-
From: Erick Erickson 
Sent: Thursday, July 23, 2020 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: How to measure search performance

This isn’t usually a cause for concern. Clearing the caches doesn’t necessarily 
clear the OS caches for instance. I think you’re already aware that Lucene uses 
MMapDirectory, meaning the index pages are mapped to OS memory space. Whether 
those pages are actually _in_ the OS physical memory or not is anyone’s guess 
so depending on when they’re needed they might have to be read from disk. This 
is entirely independent of Solr’s caches, and could come into play even if you 
restarted Solr.

Then there’s your function queries for the pseudo fields. This is read from the 
docValues sections of the index. Once again the relevant parts of the index may 
or may not be in the OS memory.

So comparing individual queries is “fraught” with uncertainties. I suppose you 
could reboot the machines each time ;) I’ve only ever had luck averaging a 
bunch of unique queries when trying to measure perf differences.

Do note that function queries for pseudo fields is not something I’d expect to 
add much overhead at all. The reason is that they’re only called for the top N 
docs that you’re returning, not part of the search at all. Consider a function 
query involved in scoring. That one must be called for every document that 
matches. But a function query for a pseudo field is only called for the docs 
returned in the packet, i.e. the “rows” parameter.

Best,
Erick

> On Jul 23, 2020, at 11:49 AM, Webster Homer 
>  wrote:
>
> I'm trying to determine the overhead of adding some pseudo fields to one of 
> our standard searches. The pseudo fields are simply function queries to 
> report if certain fields matched the query or not. I had thought that I could 
> run the search without the change and then re-run the searches with the 
> fields added.
> I had assumed that the QTime in the query response would be a good metric to 
> use when comparing the performance of the two search queries. However I see 
> that the QTime for a query can vary by more than 10%. When testing I cleared 
> the query cache between tests. Usually the QTime would be within a few 
> milliseconds of each other, however in some cases there was a 10X or more 
> difference between them.
> Even cached queries vary in their QTime, though much less.
>
> I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
> replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.
>
> I am concerned that these discrepancies indicate that our system is not tuned 
> well enough.
> Should I expect that a query's QTime really is a measure of the query's 
> inherent performance? Is there a better way to measure query performance?




How to measure search performance

2020-07-23 Thread Webster Homer
I'm trying to determine the overhead of adding some pseudo fields to one of our 
standard searches. The pseudo fields are simply function queries to report if 
certain fields matched the query or not. I had thought that I could run the 
search without the change and then re-run the searches with the fields added.
I had assumed that the QTime in the query response would be a good metric to 
use when comparing the performance of the two search queries. However, I see 
that the QTime for a query can vary by more than 10%. When testing, I cleared 
the query cache between tests. Usually the QTimes would be within a few 
milliseconds of each other; however, in some cases there was a 10X or more 
difference between them.
Even cached queries vary in their QTime, though much less.

I am running Solr 7.7.2 in a SolrCloud configuration with 2 shards and 2 
replicas/shard. Our nodes have 32GB of memory and 16GB of heap allocated to Solr.

I am concerned that these discrepancies indicate that our system is not tuned 
well enough.
Should I expect that a query's QTime really is a measure of the query's 
inherent performance? Is there a better way to measure query performance?
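
One approach, per Erick's advice earlier in the thread, is to average QTime over a batch of unique queries. A rough sketch, with queries.txt (one term per line) and the collection name as placeholders; requires curl and jq:

total=0; n=0
while read -r q; do
  t=$(curl -s "http://localhost:8983/solr/my-collection/select" \
        --data-urlencode "q=$q" --data-urlencode 'rows=10' \
      | jq '.responseHeader.QTime')
  total=$((total + t)); n=$((n + 1))
done < queries.txt
echo "mean QTime: $((total / n)) ms over $n queries"

Running it once with and once without the pseudo fields in fl, and comparing the means, should be more reliable than comparing individual QTimes.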







Grouping and Learning To Rank (SOLR-8776)

2020-06-15 Thread Webster Homer
My company is very interested in using Learning To Rank in our product search. 
The problem we face is that our product search groups its results and that does 
not work with LTR.
https://issues.apache.org/jira/browse/SOLR-8776

Is there any traction to getting the SOLR-8776 patch into the main branch? 
It seems like this would be useful to a lot of people.





RE: eDismax query syntax question

2020-06-15 Thread Webster Homer
Markus,
Thanks for the reference, but that doesn't answer my question. If - is a 
special character, it's not consistently special. In my example "3-DIMETHYL" 
behaves quite differently than ")-PYRIMIDINE". If I escape the closing 
parenthesis, the following minus no longer behaves specially. The referenced 
article does not even mention parentheses, yet escaping the closing parenthesis 
changes the behavior of the following "-". In "3-DIMETHYL" the minus is not special.

These all fix the problem:
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
1,3-DIMETHYL-5-\(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE

Only the minus following the parenthesis is treated as a NOT.
Are parentheses special? They're not mentioned in the eDismax documentation.

-Original Message-
From: Markus Jelsma 
Sent: Saturday, June 13, 2020 4:57 AM
To: solr-user@lucene.apache.org
Subject: RE: eDismax query syntax question

Hello,

These are special characters; if you don't need them, you must escape them.

See top of the article:
https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html

Markus




-Original message-
> From:Webster Homer 
> Sent: Friday 12th June 2020 22:09
> To: solr-user@lucene.apache.org
> Subject: eDismax query syntax question
>
> Recently we found strange behavior in a query. We use eDismax as the query 
> parser.
>
> This is the query term:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
>
> It should hit one document in our index. It does not. However, if you use the 
> Dismax query parser it does match the record.
>
> The problem seems to involve the parenthesis and the dashes. If you
> escape the dash after the parenthesis it matches
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
>
> I thought that eDismax and Dismax escaped all lucene special characters 
> before passing the query to lucene. Although I also remember reading that + 
> and - can have special significance in a query if preceded with white space. 
> I can find very little documentation on either query parser in how they work.
>
> Is this expected behavior or is this a bug? If expected, where can I find 
> documentation?




eDismax query syntax question

2020-06-12 Thread Webster Homer
Recently we found strange behavior in a query. We use eDismax as the query 
parser.

This is the query term:
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE

It should hit one document in our index. It does not. However, if you use the 
Dismax query parser it does match the record.

The problem seems to involve the parenthesis and the dashes. If you escape the 
dash after the parenthesis it matches
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE

I thought that eDismax and Dismax escaped all Lucene special characters before 
passing the query to Lucene, although I also remember reading that + and - can 
have special significance in a query when preceded by whitespace. I can find 
very little documentation on how either query parser works.

Is this expected behavior or is this a bug? If expected, where can I find 
documentation?
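
Until that's clarified, a hedged workaround sketch: send the query as form data so the commas survive, and backslash-escape the troublesome characters by hand; the collection name and qf field are placeholders:

curl "http://localhost:8983/solr/my-collection/select" \
  --data-urlencode 'q=1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE' \
  --data-urlencode 'defType=edismax' \
  --data-urlencode 'qf=search_field'

SolrJ clients can escape every Lucene special character programmatically with ClientUtils.escapeQueryChars(String) rather than doing it by hand.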





RE: Why Did It Match?

2020-05-28 Thread Webster Homer
Thank you.

The problem is that Endeca just provided this information. The website users 
see how each search result matched the query.
For example, this is displayed for a hit:
1 Product Result

|  Match Criteria: Material, Product Number

The business users will wonder why we cannot provide this information with the 
new system.

-Original Message-
From: Erick Erickson 
Sent: Thursday, May 28, 2020 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Why Did It Match?

Yes, debug=explain is expensive. Expensive in the sense that I’d never add it 
to every query. But if your business users are trying to understand why query X 
came back the way it did by examining individual queries, then I wouldn’t worry.

You can easily see how expensive it is in your situation by looking at the 
timings returned. Debug is just a component just like facet etc and the time it 
takes is listed separately in the timings section of debug output…

Best,
Erick

> On May 28, 2020, at 4:52 PM, Webster Homer  
> wrote:
>
> My concern was that I thought that explain is resource heavy, and was only 
> used for debugging queries.
>
> -Original Message-
> From: Doug Turnbull 
> Sent: Thursday, May 21, 2020 4:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Why Did It Match?
>
> Is your concern that the Solr explain functionality is slower than Endecas?
> Or harder to understand/interpret?
>
> If the latter, I might recommend http://splainer.io as one solution
>
> On Thu, May 21, 2020 at 4:52 PM Webster Homer < 
> webster.ho...@milliporesigma.com> wrote:
>
>> My company is working on a new website. The old/current site is
>> powered by Endeca. The site under development is powered by Solr
>> (currently 7.7.2)
>>
>> Out of the box, Endeca provides the capability to show how a query
>> was matched in the search. The business users like this
>> functionality, in solr this functionality is an expensive debug
>> option. Is there another way to get this information from a query?
>>
>> Webster Homer
>>
>
>
> --
> Doug Turnbull | CTO | OpenSource Connections <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>; Contributor: AI Powered Search <http://aipoweredsearch.com>
> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
>
>




RE: Why Did It Match?

2020-05-28 Thread Webster Homer
My concern was that I thought that explain is resource heavy, and was only used 
for debugging queries.

-Original Message-
From: Doug Turnbull 
Sent: Thursday, May 21, 2020 4:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Why Did It Match?

Is your concern that the Solr explain functionality is slower than Endecas?
Or harder to understand/interpret?

If the latter, I might recommend http://splainer.io as one solution

On Thu, May 21, 2020 at 4:52 PM Webster Homer < 
webster.ho...@milliporesigma.com> wrote:

> My company is working on a new website. The old/current site is
> powered by Endeca. The site under development is powered by Solr
> (currently 7.7.2)
>
> Out of the box, Endeca provides the capability to show how a query was
> matched in the search. The business users like this functionality, in
> solr this functionality is an expensive debug option. Is there another
> way to get this information from a query?
>
> Webster Homer
>
>
>


--
Doug Turnbull | CTO | OpenSource Connections <http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>; Contributor: AI Powered Search <http://aipoweredsearch.com>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.




Why Did It Match?

2020-05-21 Thread Webster Homer
My company is working on a new website. The old/current site is powered by 
Endeca. The site under development is powered by Solr (currently 7.7.2)

Out of the box, Endeca provides the capability to show how a query was matched 
in the search. The business users like this functionality; in Solr it is an 
expensive debug option. Is there another way to get this information from a 
query?

Webster Homer
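
One middle ground, sketched here with a placeholder collection and query: request only the per-document explain information instead of full debug output:

curl "http://localhost:8983/solr/my-collection/select" \
  --data-urlencode 'q=acetone' \
  --data-urlencode 'debug=results' \
  --data-urlencode 'debug.explain.structured=true'

debug=results restricts the extra work to explaining the documents actually returned; adding debug=timing as well shows what the debug component itself costs, per Erick's note above.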





Solrcloud Garbage Collection Suspension linked across nodes?

2020-05-04 Thread Webster Homer
My company has several SolrCloud environments. In our most active cloud we are 
seeing outages that are related to GC pauses. We have about 10 collections, of 
which 4 get a lot of traffic. The cloud consists of 4 nodes, each with 6 
processors and an 11GB heap (25GB physical memory).

I notice that the 4 nodes seem to do their garbage collection at almost the 
same time. That seems strange to me. I would expect them to be more staggered.

This morning we had a GC pause that caused problems. During that time our 
application service was reporting "No live SolrServers available to handle this 
request".

Between 3:55 and 3:56 AM all 4 nodes were having some amount of garbage 
collection pauses; for 2 of the nodes it was minor, for one it was 50%. For 3 
nodes it lasted until 3:57, but the node with the worst impact didn't recover 
until 4 AM.

How is it that all 4 nodes were in lock step doing GC? If they all do GC at the 
same time, it defeats the purpose of having redundant cloud servers. We 
switched from CMS to G1GC just this weekend.

At this point in time we also saw that traffic to Solr was not well 
distributed. The application calls Solr using CloudSolrClient, which I thought 
did its own load balancing. We saw 10X more traffic going to one Solr node than 
to all the others; then we saw it start hitting another node. All Solr queries 
come from our application.

During this period of time I saw only 1 error message in the solr log:
ERROR (zkConnectionManagerCallback-8-thread-1) [   ] o.a.s.c.ZkController There 
was a problem finding the leader in zk:org.apache.solr.common.SolrException: 
Could not get leader props

We are currently using Solr 7.7.2. Our GC tuning:
GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=250 \
-XX:+ParallelRefProcEnabled"
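
A hedged aside: per-node GC logs would confirm whether the pauses really are in lock step. On Java 8, which Solr 7.x typically runs on, something like this in solr.in.sh should work; the log path is a placeholder:

GC_LOG_OPTS="-XX:+PrintGCDetails \
-XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime \
-Xloggc:/var/solr/logs/solr_gc.log"

Lining up the PrintGCApplicationStoppedTime entries from all 4 nodes by timestamp would show how correlated the pauses are.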






Learning To Rank with Group Queries

2020-04-14 Thread Webster Homer
Hi,
My company is looking at using the Learning to Rank. However, our main searches 
do grouping. There is an old Jira from 2016 about how these don't work together.
https://issues.apache.org/jira/browse/SOLR-8776
It doesn't look like this has moved much since then. When will we be able to 
re-rank grouped queries? From the Jira it seems that it is mostly patched. We 
use SolrCloud and group on a field.

Did these changes ever fix the pagination issues mentioned in the Jira?

We are currently using Solr 7.7.2 but expect to move to 8.* in the next few 
months.

Thanks,
Webster





Schema Browser API

2020-04-09 Thread Webster Homer
I was just looking at the Schema Browser for one of our collections. It's 
pretty handy. I was thinking that it would be useful to create a tool that 
would produce a report about which fields are indexed, have docValues, are 
multiValued, etc.

Has someone built such a tool? I want it to aid in estimating memory 
requirements for our collections.

I'm currently running Solr 7.7.2.
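
No ready-made tool known here, but a hedged sketch of one built on the Schema API; the collection name is a placeholder and jq is assumed:

curl -s "http://localhost:8983/solr/my-collection/schema/fields?showDefaults=true" \
  | jq -r '.fields[] | [.name, .type, (.indexed // false), (.docValues // false), (.multiValued // false)] | map(tostring) | @tsv'

showDefaults=true asks Solr to fill in the effective values inherited from each field type, which is what matters when estimating memory.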





Upgrading Solrcloud indexes from 7.2 to 8.4.1

2020-03-06 Thread Webster Homer
We are looking at upgrading our SolrCloud instances from 7.2 to the most recent 
version of Solr, 8.4.1 at this time. The last time we upgraded across a major 
Solr release we were able to upgrade the index files to the newer version, 
which prevented us from having an outage. Subsequently we've reindexed all our 
collections. However, the Solr documentation for 8.4.1 states that we need to 
be at Solr 7.3 or later to run the index upgrade.
https://lucene.apache.org/solr/guide/8_4/solr-upgrade-notes.html

So could we upgrade to 7.7, then move to 8.4.1 and run the index upgrade script 
just once? I guess I'm confused about the 7.2 -> 8.* issue; is it data related?

Regards,
Webster
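
For background, the upgrade tool is Lucene's IndexUpgrader, which rewrites every segment in the current version's format and can only read indexes one major version back. A hedged sketch of invoking it, with jar paths and the index directory as placeholders:

java -cp lucene-core-7.7.2.jar:lucene-backward-codecs-7.7.2.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits \
  /var/solr/data/mycollection_shard1_replica_n1/data/index

Whether a single run under 8.4.1 satisfies the 7.3-or-later requirement for a 7.2-born index is exactly the open question; reindexing from source is the unambiguous path.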





RE: Solr Admin Console hangs on Chrome

2020-01-14 Thread Webster Homer
My experience is that the size of the query doesn't seem to matter; it just has 
to be run in Chrome. Moreover, this used to work fine in Chrome too; Chrome's 
behavior with the admin console changed, my data hasn't. I don't see problems 
in Firefox.

-Original Message-
From: Erick Erickson 
Sent: Tuesday, January 14, 2020 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Admin Console hangs on Chrome

I’ve also seen the browser hang when returning large result sets even when 
typing the query in the address bar, bypassing the admin UI entirely.

Browsers are simply not built to display large amounts of data, and Jan’s 
comments about bypassing the admin UI’s built-in processing may help, but it’ll 
just kick the can down the road.

“Don’t do that” is my best advice, work with smaller result sets in the admin 
UI. Or the browser in general.

Best,
Erick

> On Jan 14, 2020, at 7:12 AM, Jan Høydahl  wrote:
>
> This is the build-in JSON formatting in the Query panel, it is so slow when 
> requesting huge JSON. I believe we have some JS code that fetch the result 
> JSON in AJAX and then formats it in a pretty way, also trying to filter out 
> XSS traps etc. Not sure if Chrome’s native JSON renderer is being used here 
> though.
>
> Normally I click the URL and do those queries directly in the browser address 
> bar instead of inside the Admin UI.
>
> Jan
>
>> 14. jan. 2020 kl. 10:43 skrev Mel Mason :
>>
>> Good questions. I've been having similar problems for a while, for me, the 
>> UI in general is frozen, including the navigation buttons and query text 
>> boxes. Delay depends on the size of the json - if I do a request for 1000 
>> rows with just 1 field each, it's a permanent 5s delay on scrolling and the 
>> text boxes and navigation buttons start working after a bit. If it's larger 
>> - e.g. 1 rows with 1 field each, or 10 rows with lots of large fields - 
>> then it just freezes indefinitely as far as I can tell. Smaller queries - 
>> e.g. 100 rows with 1 field each work fine.
>>
>> I don't see any errors in the console. The problems start around the time it 
>> starts chunking the json response, which may just be coincidence, but I 
>> think there are a few differences in how Chrome processes chunked responses. 
>> Although the response has a content-type of application/json set, which 
>> should avoid those problems.
>>
>> Firefox handles all of these queries without any problem.
>>
>> On 14/01/2020 09:01, Jan Høydahl wrote:
>>> How long delay do you see? Is it only for query panel or for the UI in 
>>> general?
>>> A query for *:* is not necessarily a simple query, it depends on how many 
>>> and large fields you have etc. Try a query with fl=id or fl=title and see 
>>> if that helps.
>>>
>>> Jan
>>>
>>>> 13. jan. 2020 kl. 22:29 skrev Webster Homer 
>>>> :
>>>>
>>>> I still see this issue with Chrome and the admin console. I am
>>>> using Solr 7.3
>>>>
>>>> In the Chrome console I see an error:  "style.css:1 Failed to load 
>>>> resource: the server responded with a status of 404 (Not Found)"
>>>>
>>>> This used to work.
>>>>
>>>> It is unusably slow, even with a simple query like *:*
>>>>
>>>> -Original Message-
>>>> From: Jan Høydahl 
>>>> Sent: Thursday, December 12, 2019 1:45 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Solr Admin Console hangs on Chrome
>>>>
>>>> I have seen slowness when the result is a very large json but not for 
>>>> ordinary queries. How long delay do you see? Is it only for query panel or 
>>>> for the UI in general?
>>>>
>>>> Jan Høydahl
>>>>
>>>>> 11. des. 2019 kl. 16:07 skrev Alexandre Rafalovitch :
>>>>>
>>>>> Check for popup and other tracker blockers. It is possible one of
>>>>> the resources has a similar name and triggers blocking. There was
>>>>> a thread in early October with a similar discussion, but apart
>>>>> from the blockers idea nothing else was discovered at the time.
>>>>>
>>>>> An easy way would be to create a new Chrome profile without any
>>>>> add-ons and try accessing Solr that way. This would differentiate
>>>>> "Chrome vs Firefox" and "Chrome vs Chrome plugins".
>>>>>
>>>>> Regards,
>>>>> Alex.
>>>>>

RE: Solr Admin Console hangs on Chrome

2020-01-13 Thread Webster Homer
I still see this issue with Chrome and the admin console. I am using Solr 7.3

In the Chrome console I see an error:  "style.css:1 Failed to load resource: 
the server responded with a status of 404 (Not Found)"

This used to work.

It is unusably slow, even with a simple query like *:*

-Original Message-
From: Jan Høydahl 
Sent: Thursday, December 12, 2019 1:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Admin Console hangs on Chrome

I have seen slowness when the result is a very large json but not for ordinary 
queries. How long delay do you see? Is it only for query panel or for the UI in 
general?

Jan Høydahl

> 11. des. 2019 kl. 16:07 skrev Alexandre Rafalovitch :
>
> Check for popup and other tracker blockers. It is possible one of the
> resources has a similar name and triggers blocking. There was a thread
> in early October with a similar discussion, but apart from the
> blockers idea nothing else was discovered at the time.
>
> An easy way would be to create a new Chrome profile without any
> add-ons and try accessing Solr that way. This would differentiate
> "Chrome vs Firefox" and "Chrome vs Chrome plugins".
>
> Regards,
>   Alex.
>
>> On Wed, 11 Dec 2019 at 07:50, A Adel  wrote:
>>
>> Hi - could you provide more details, such as Solr and browser network
>> logs when using Chrome / other browsers?
>>
>>> On Tue, Dec 10, 2019 at 5:48 PM Joel Bernstein  wrote:
>>>
>>> Did a recent change to Chrome cause this?
>>>
>>> In Solr 8x, I'm not seeing slowness with Chrome on Mac.
>>>
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>>
>>> On Tue, Dec 10, 2019 at 8:26 PM SAGAR INGALE
>>> 
>>> wrote:
>>>
>>>> I am also facing the same issue for v6.4.0
>>>>
>>>> On Wed, 11 Dec, 2019, 5:37 AM Joel Bernstein, 
>>> wrote:
>>>>
>>>>> What version of Solr?
>>>>>
>>>>>
>>>>>
>>>>> Joel Bernstein
>>>>> http://joelsolr.blogspot.com/
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2019 at 5:58 PM Arnold Bronley <
>>> arnoldbron...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am also facing similar issue. I have also switched to other
>>> browsers
>>>> to
>>>>>> solve this issue.
>>>>>>
>>>>>> On Tue, Dec 10, 2019 at 2:22 PM Webster Homer <
>>>>>> webster.ho...@milliporesigma.com> wrote:
>>>>>>
>>>>>>> It seems like the Solr Admin console has become slow when you
>>>>>>> use
>>> it
>>>> on
>>>>>>> the chrome browser. If I go to the query tab and execute a
>>>>>>> query,
>>>> even
>>>>>> the
>>>>>>> default *:* after that the browser window becomes very slow.
>>>>>>> I'm using chrome Version 78.0.3904.108 (Official Build) (64-bit)
>>>>>>> on
>>>>>> Windows
>>>>>>>
>>>>>>> The work around is to use Firefox
>>>>>>>

Solr Admin Console hangs on Chrome

2019-12-10 Thread Webster Homer
It seems like the Solr Admin console has become slow when you use it in the 
Chrome browser. If I go to the query tab and execute a query, even the default 
*:*, the browser window becomes very slow afterwards.
I'm using Chrome Version 78.0.3904.108 (Official Build) (64-bit) on Windows.

The workaround is to use Firefox.
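
A hedged example of the direct-URL approach suggested earlier in the thread: hit the select handler from the address bar or curl with a small row count and field list, which keeps the response small enough for any browser (collection name is a placeholder):

http://localhost:8983/solr/my-collection/select?q=*:*&rows=10&fl=id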





RE: fq pfloat_field:* returns no documents, tfloat:* does

2019-11-21 Thread Webster Homer
Thank you. Why don't point fields get loaded by the Schema Browser's "Load Term 
Info" button?


-Original Message-
From: Tomás Fernández Löbbe 
Sent: Wednesday, November 20, 2019 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: fq pfloat_field:* returns no documents, tfloat:* does

Hi Webster,
> The fq  facet_melting_point:*
"Point" numeric fields don't support that syntax currently, and the way to 
retrieve "docs with any value in field foo" is "foo:[* TO *]". See
https://issues.apache.org/jira/browse/SOLR-11746
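A minimal SolrJ sketch of the working existence check (host and collection names are placeholders). As for the Schema Browser question above: "Load Term Info" walks the term dictionary, and Point fields are indexed as BKD trees rather than terms, so there is nothing for it to load.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PointFieldExistenceQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/my-collection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            // works for both Trie and Point fields:
            q.addFilterQuery("facet_melting_point:[* TO *]");
            // q.addFilterQuery("facet_melting_point:*"); // matches nothing on a Point field
            QueryResponse rsp = solr.query(q);
            System.out.println("docs with a value: " + rsp.getResults().getNumFound());
        }
    }
}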


On Wed, Nov 20, 2019 at 2:21 PM Webster Homer < 
webster.ho...@milliporesigma.com> wrote:

> The fq   facet_melting_point:*
> Returns 0 rows. However the field clearly has data in it, why does
> this query return rows where there is data
>
> I am trying to update our solr schemas to use the point fields instead
> of the trie fields.
>
> We have a number of pfloat fields. These fields are indexed and I can
> facet on them
>
> This is a typical definition
> <field name="facet_melting_point" type="pfloat" indexed="true" stored="true" required="false" multiValued="true" docValues="true"/>
>
> Another odd behavior is that when I use the Schema Browser the "Load
> Term Info" loads no data.
>
> I am using Solr 7.2


fq pfloat_field:* returns no documents, tfloat:* does

2019-11-20 Thread Webster Homer
The fq   facet_melting_point:*
Returns 0 rows. However, the field clearly has data in it; why doesn't this
query return the rows where there is data?

I am trying to update our solr schemas to use the point fields instead of the 
trie fields.

We have a number of pfloat fields. These fields are indexed and I can facet on 
them

This is a typical definition
<field name="facet_melting_point" type="pfloat" indexed="true" stored="true" required="false" multiValued="true" docValues="true"/>

Another odd behavior is that when I use the Schema Browser the "Load Term Info" 
loads no data.

I am using Solr 7.2


Filtering point fields filters everything.

2019-11-06 Thread Webster Homer
My company has been using solr for searching our product catalog. We migrated 
the data from Solr 6.6 to Solr 7.2. I am investigating the changes needed to 
migrate to Solr 8.*. Our current schema has a number of fields using the trie 
data types which are deprecated in 7 and gone in 8. I went through the schema 
and changed the trie fields to their point equivalent.
For example we have these field types and fields defined:
[field type and field definitions stripped in archiving]

These last two were converted from the older types, they were originally 
defined as:
[original trie definitions stripped in archiving]

In the process of the update I changed the version of the schema


And the lucene match version
<luceneMatchVersion>7.2.0</luceneMatchVersion>

After making these changes I created a new collection and used our ETL to load 
it. We saw no errors during the data load.

The problem I see  is that if I try to filter on facet_fwght I get no results
"fq":"facet_fwght:[100 TO 200]", returns no documents, nor does
facet_fwght:*

Even more bizarre when I used the Admin Console schema browser, it sees the 
fields but when I try to load the term info for any point field, nothing loads.

On the other hand, I can facet on facet_fwght, I just cannot filter on it. I 
couldn't get values for index_date either even though every record has it set 
with the default of NOW

So what am I doing wrong with the point fields? I expected to be able to do 
just about everything with the point fields I could do with the deprecated trie 
fields.

Regards,
Webster Homer


RE: tlogs are not deleted

2019-10-23 Thread Webster Homer
Tlogs will accumulate if you have buffers "enabled". Make sure that you 
explicitly disable buffering from the cdcr endpoint
https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#disablebuffer
Make sure that they're disabled on both the source and targets

I believe that sometimes buffers get enabled on their own. We added monitoring 
of CDCR to check for the buffer setting
This endpoint shows you the status
https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#cdcr-status-example

I don't understand the use case for enabling  buffers, or why it is enabled by 
default.
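
A sketch of how that check can be scripted (host and collection name are placeholders, and the substring test is naive; a real monitor should parse the JSON). Run it against the source and every target:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CdcrBufferCheck {

    static String get(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String cdcr = "http://localhost:8983/solr/my-collection/cdcr";
        String status = get(cdcr + "?action=STATUS&wt=json");
        System.out.println(status);
        // if the STATUS body reports the buffer as enabled, disable it
        if (status.contains("enabled")) {
            System.out.println(get(cdcr + "?action=DISABLEBUFFER&wt=json"));
        }
    }
}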

-Original Message-
From: Erick Erickson 
Sent: Wednesday, October 23, 2019 7:23 AM
To: solr-user@lucene.apache.org
Subject: Re: tlogs are not deleted

My first guess is that your CDCR setup isn’t running. CDCR uses tlogs as a 
queueing mechanism. If CDCR can’t send docs to the target collection, they’ll 
accumulate forever.

Best,
Erick

> On Oct 22, 2019, at 7:48 PM, Woo Choi  wrote:
>
> Hi,
>
> We are using solr 7.7 cloud with CDCR(every collection has 3 replicas,
> 1 shard).
>
> In solrconfig.xml,
>
> tlog configuration is super simple like : 
>
> There is also daily data import and commit is called after data import
> every time.
>
> Indexing works fine, but the problem is that the number of tlogs keeps
> growing.
>
> According to the documentation here
> (https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html),
> I expected that at most 10 tlogs would remain (the default value of
> maxNumLogsToKeep=10).
>
> However I still have a bunch of tlogs - the oldest one is Sep 6..!
>
> I did an experiment by running a data import with the commit option from
> the Solr admin UI, but none of the tlogs were deleted.
>
> tlog.002.1643995079881261056
> tlog.018.1645444642733293568
> tlog.034.1646803619099443200
> tlog.003.1644085718240198656
> tlog.019.1645535304072822784
> tlog.035.1646894195509559296
> tlog.004.1644176284537847808
> tlog.020.1645625651261079552
> tlog.036.1646984623121498112
> tlog.005.1644357373324689408
> tlog.021.1645625651316654083
> tlog.037.1647076244416626688
> tlog.006.167899616018432
> tlog.022.1645716477747134464
> tlog.038.1647165801017376768
> tlog.007.1644538486210953216
> tlog.023.1645806853961023488
> tlog.039.1647165801042542594
> tlog.008.1644629084296183808
> tlog.024.1645897663703416832
> tlog.040.1647256590865137664
> tlog.009.1644719895268556800
> tlog.025.1645988248838733824
> tlog.041.1647347172490870784
> tlog.010.1644810493331767296
> tlog.026.1646078905702940672
> tlog.042.1647437758859313152
> tlog.011.1644901113324896256
> tlog.027.1646169478772293632
> tlog.043.1647528345005457408
> tlog.012.1645031030684385280
> tlog.028.1646259838395613184
> tlog.044.1647618793025830912
> tlog.013.164503103008545
> tlog.029.1646350429145006080
> tlog.045.1647709579019026432
> tlog.014.1645082080252526592
> tlog.030.1646441456502571008
> tlog.046.1647890587519549440
> tlog.015.1645172929206419456
> tlog.031.1646531802044563456
> tlog.047.1647981403286011904
> tlog.016.1645263488829882368
> tlog.032.16466061568
> tlog.048.1648071989042085888
> tlog.017.1645353861842468864
> tlog.033.1646712822719053824
> tlog.049.1648135546466205696
>
> Did I miss something in the solrconfig file?
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: json.facet throws ClassCastException

2019-10-04 Thread Webster Homer
Sometimes it comes back in the reply:

java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
  at org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:78)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:269)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
  at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
  at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
  at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
  at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
  at org.eclipse.jetty.server.Server.handle(Server.java:534)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
  at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
  at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
  at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
  at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
  at java.lang.Thread.run(Thread.java:748)
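
The cast happens in FacetModule.prepare, which expects the parsed json.facet parameter to be a JSON object (a Map). A likely culprit, though this is an assumption since the full request isn't shown: the value prod:{type:terms,field:product,mincount:1,limit:8} has no enclosing braces, so it reaches the facet module as a plain string. Wrapping it in braces makes it a map of facet name to facet spec. A minimal SolrJ sketch of the wrapped form:

import org.apache.solr.client.solrj.SolrQuery;

public class JsonFacetExample {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // the outer braces are what lets FacetModule see a Map, not a String
        q.set("json.facet",
              "{prod:{type:terms,field:product,mincount:1,limit:8}}");
        return q;
    }
}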

-Original Message-
From: Mikhail Khludnev 
Sent: Friday, October 04, 2019 2:28 PM
To: solr-user 
Subject: Re: json.facet throws ClassCastException

Hello, Webster.

Have you managed to capture stacktrace?

On Fri, Oct 4, 2019 at 8:24 PM Webster Homer < 
webster.ho...@milliporesigma.com> wrote:

> I'm trying to understand what is wrong with my query or collection.
>
> I have a functioning solr schema and collection. I'm running Solr 7.2
>
> When I run with a facet.field it works, but if I change it to use a
> json.facet it throws a class cast exception.
>
> json.facet=prod:{type:terms,field:product,mincount:1,limit:8}
>
> java.lang.String cannot be cast to java.util.Map
>
> The product field is defined as
> 
>
> And lowercase is defined as:
>  positionIncrementGap="100">
>   
> 
> 
>   
> 
>
> I don't have enough information to understand what its complaining about.
>
> Thanks

json.facet throws ClassCastException

2019-10-04 Thread Webster Homer
I'm trying to understand what is wrong with my query or collection.

I have a functioning solr schema and collection. I'm running Solr 7.2

When I run with a facet.field it works, but if I change it to use a json.facet 
it throws a class cast exception.

json.facet=prod:{type:terms,field:product,mincount:1,limit:8}

java.lang.String cannot be cast to java.util.Map

The product field is defined as
[field definition stripped in archiving]

And lowercase is defined as:
[fieldType definition stripped in archiving]


I don't have enough information to understand what it's complaining about.

Thanks


RE: Strange regex behavior in solr.PatternReplaceCharFilterFactory

2019-09-27 Thread Webster Homer
I had some examples already, but I wrote a unit test, and Solr is not handling
\p{L} correctly.

I also saw some vague discussion in Oracle's documentation around \p{L}.

@Test
public void testUnicodeLetter() {
    Pattern pattern = Pattern.compile("[\\p{L}\\p{M}\\p{Digit}]+");

    // CJK ideographs are letters, so they match \p{L}
    Matcher match = pattern.matcher("乙醇");
    assertTrue(match.matches());

    // lowercase ASCII letters match
    match = pattern.matcher("aaa");
    assertTrue(match.matches());

    // uppercase ASCII letters also match plain Java \p{L}; this assertion
    // passes, so if Solr's stripping of A-Z reflected real \p{L} semantics
    // it would have to fail
    match = pattern.matcher("AAA");
    assertTrue(match.matches());
}
My dev Java is jdk1.8.0_162, which isn't very current...
So this could be a Java version issue, or Solr is doing something more.
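
One thing the unit test above cannot catch, offered as a hypothesis to check rather than a confirmed diagnosis: the uppercase letters may be eaten by the first char filter, not the \p{L} one. In the pattern ([\.,;:-_]) the hyphen is unescaped, so inside the character class :-_ is a range, and in ASCII that range (':' 0x3A through '_' 0x5F) contains A-Z. If so, capitals are replaced with spaces before the \p{L} filter ever sees them. A small check in plain Java:

import java.util.regex.Pattern;

public class RangeCheck {
    public static void main(String[] args) {
        // as written in the schema: ":-_" is a character range covering A-Z
        Pattern suspect = Pattern.compile("[\\.,;:-_]");
        // hyphen escaped, so it is a literal '-': uppercase is not matched
        Pattern fixed = Pattern.compile("[\\.,;:\\-_]");
        System.out.println(suspect.matcher("E").find()); // true
        System.out.println(fixed.matcher("E").find());   // false
    }
}

If that is the cause, escaping the hyphen (or moving it to the end of the class) in the first PatternReplaceCharFilterFactory should let \p{L} behave just as the online testers predicted.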

-Original Message-
From: Erick Erickson 
Sent: Friday, September 27, 2019 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Solr’s pattern replace _is_  Java’s. See PatternReplaceCharFilter. You’ll see:

private final Pattern pattern;

and later:
final Matcher m = pattern.matcher(input);

That said, there’s some manipulation after that, so there’s always room for 
issues. But I’d try just a standard Java program with your regex to verify 
rather than online sources.

Best,
Erick

> On Sep 27, 2019, at 2:24 PM, Jörn Franke  wrote:
>
> Check the log files on the collection reload.
> About your regex: check a web page that checks Java regexes - there can be 
> subtle differences between Java, JavaScript, php etc.
> Then it could be that your original text is not UTF-8 encoded, but Windows or 
> similar.
> Check also if you have special characters in the text (line breaks, tabs 
> etc.).
>
>> Am 27.09.2019 um 16:42 schrieb Webster Homer 
>> :
>>
>> I forgot to mention that I'm using Solr 7.2. I also found that if
>> instead of \p{L} I use the long form \p{Letter} then when I reload
>> the collection after updating the schema, Solr will not load the
>> collection. I think that Solr's regex support is not standard  Java 8
>>
>> -Original Message-
>> From: Webster Homer 
>> Sent: Friday, September 27, 2019 9:09 AM
>> To: solr-user@lucene.apache.org
>> Subject: Strange regex behavior in
>> solr.PatternReplaceCharFilterFactory
>>
>> I am developing a new version of a fieldtype that we’ve been using for 
>> several years. This fieldtype is to be used as a part of an autocomplete 
>> code. The original version handled standard ascii characters well, but I 
>> wanted it to be able to handle any Unicode letter, not just A-Za-z but Greek 
>> and Chinese as well. The analysis chain is supposed to remove any character 
>> that is not a letter, digit or space.
>> I settled on this fieldType. The main changes from the old version is that I 
>> moved the character removal from a PatternReplaceFilterFactory call to a 
>> PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
>> filter factories handle this regex:
>> ([^\p{L}\p{M}\p{Digit} ])
>> Here is the fieldtype
>>  > positionIncrementGap="100">
>> 
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>> pattern="([\.,;:-_])" replacement=" "/>
>>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>
>>
>>> words="lang/stopwords_en.txt"/>
>> > minGramSize="1"/>
>>  
>> 
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>> pattern="([\.,;:-_])" replacement=" "/>
>>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>
>>
>>> words="lang/stopwords_en.txt"/>
>>> pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>>
>>   
>>
>> The problem I’m seeing is that the call:
>>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>
>> Strips out letters that match A-Z  It will leave digits, lowercase
>> letters and Chinese characters. I tested my regex with a couple of
>> online regex testers and it works. It seems that only the
>> solr.PatternReplaceCharFilterFactory has this behavior. Here is what
>> I see in the Analyzer Using this test term: 12水3-23-ER1:abc
>> After the PRCF I see this: 12水323 1 abc The “ER” is removed. I think
>> this is a bug, or am I doing something wrong.
>> I used this link as the source for my regex:
>> https://www.regular-expressions.info/unicode.html

RE: Strange regex behavior in solr.PatternReplaceCharFilterFactory

2019-09-27 Thread Webster Homer
I forgot to mention that I'm using Solr 7.2. I also found that if instead of 
\p{L} I use the long form \p{Letter} then when I reload the collection after 
updating the schema, Solr will not load the collection. I think that Solr's 
regex support is not standard Java 8.

-Original Message-
From: Webster Homer 
Sent: Friday, September 27, 2019 9:09 AM
To: solr-user@lucene.apache.org
Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory

I am developing a new version of a fieldtype that we’ve been using for several 
years. This fieldtype is to be used as a part of an autocomplete code. The 
original version handled standard ascii characters well, but I wanted it to be 
able to handle any Unicode letter, not just A-Za-z but Greek and Chinese as 
well. The analysis chain is supposed to remove any character that is not a 
letter, digit or space.
I settled on this fieldType. The main change from the old version is that I 
moved the character removal from a PatternReplaceFilterFactory call to a 
PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
filter factories handle this regex:
([^\p{L}\p{M}\p{Digit} ])
Here is the fieldtype
[fieldType definition stripped in archiving; the surviving charFilter attributes are quoted in the replies above]


The problem I’m seeing is that the call:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />

Strips out letters that match A-Z. It will leave digits, lowercase letters and
Chinese characters. I tested my regex with a couple of online regex testers and
it works. It seems that only the solr.PatternReplaceCharFilterFactory has this
behavior. Here is what I see in the Analyzer, using this test term: 
12水3-23-ER1:abc
After the PRCF I see this: 12水323 1 abc
The “ER” is removed. I think this is a bug, or am I doing something wrong.
I used this link as the source for my regex: 
https://www.regular-expressions.info/unicode.html
It seems that Solr is treating \p{L} as matching lower case ascii characters, 
but is correct for other Unicode characters. For letters in the A-Z range it is 
behaving as if the regex was \p{Ll}. I tried explicitly adding \p{Lu} in and it 
made no difference capital letters were still stripped.



Strange regex behavior in solr.PatternReplaceCharFilterFactory

2019-09-27 Thread Webster Homer
I am developing a new version of a fieldtype that we’ve been using for several 
years. This fieldtype is to be used as a part of an autocomplete code. The 
original version handled standard ascii characters well, but I wanted it to be 
able to handle any Unicode letter, not just A-Za-z but Greek and Chinese as 
well. The analysis chain is supposed to remove any character that is not a 
letter, digit or space.
I settled on this fieldType. The main change from the old version is that I 
moved the character removal from a PatternReplaceFilterFactory call to a 
PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
filter factories handle this regex:
([^\p{L}\p{M}\p{Digit} ])
Here is the fieldtype
[fieldType definition stripped in archiving; the surviving charFilter attributes are quoted in the replies above]


The problem I’m seeing is that the call:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />

Strips out letters that match A-Z. It will leave digits, lowercase letters and 
Chinese characters. I tested my regex with a couple of online regex testers and 
it works. It seems that only the solr.PatternReplaceCharFilterFactory has this 
behavior. Here is what I see in the Analyzer
Using this test term: 12水3-23-ER1:abc
After the PRCF I see this: 12水323 1 abc
The “ER” is removed. I think this is a bug, or am I doing something wrong.
I used this link as the source for my regex: 
https://www.regular-expressions.info/unicode.html
It seems that Solr is treating \p{L} as matching lower case ascii characters, 
but is correct for other Unicode characters. For letters in the A-Z range it is 
behaving as if the regex was \p{Ll}. I tried explicitly adding \p{Lu} in and it 
made no difference capital letters were still stripped.



RE: CDCR tlog corruption leads to infinite loop

2019-09-11 Thread Webster Homer
We also see an accumulation of tlog files on the target solrs. One of our 
production clouds crashed due to too many open files
2019-09-11 15:59:39.570 ERROR (qtp1355531311-81540) 
[c:bioreliance-catalog-testarticle-20190713 s:shard2 r:core_node8 
x:bioreliance-catalog-testarticle-20190713_shard2_replica_n6] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: 
java.io.FileNotFoundException: 
/var/solr/data/bioreliance-catalog-testarticle-20190713_shard2_replica_n6/data/tlog/tlog.0005307.1642472809370222592
 (Too many open files)

We found 9106 open files. 

This is our update request handler

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:6}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:3000}</maxTime>
  </autoSoftCommit>
</updateHandler>

solr.autoSoftCommit.maxTime is set to 3000
solr.autoCommit.maxTime is set to 6

-Original Message-
From: Webster Homer  
Sent: Monday, September 09, 2019 4:17 PM
To: solr-user@lucene.apache.org
Subject: CDCR tlog corruption leads to infinite loop

We are running Solr 7.2.0

Our configuration has several collections that are loaded into a solr cloud 
which is set to replicate using CDCR to 3 different solrclouds. All of our 
target collections have 2 shards with two replicas per shard. Our source 
collection has 2 shards, and 1 replica per shard.

Frequently we start to see errors where the target collections are out of date, 
and the cdcr action=errors endpoint shows large numbers of errors For example:
{"responseHeader": {
"status": 0,
"QTime": 0},
"errors": [
"uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
["sial-catalog-product-20190824",
[
"consecutiveErrors",
700357,
"bad_request",
0,
"internal",
700357,
"last",
[
"2019-09-09T19:17:57.453Z",
"internal",
"2019-09-09T19:17:56.949Z",
"internal",
"2019-09-09T19:17:56.448Z"
,"internal",...

We have found that one or more tlogs have become corrupt. It appears that CDCR
keeps trying to send data, but cannot read the data from the tlog, and then it
retries forever.
How does this happen? It happens frequently, on roughly a weekly basis, and is
difficult to troubleshoot. Today we had it happen with one of our collections.
Here is the listing for the tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache  44K Sep  9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep  6 19:44 
tlog.766.1643975309914013696
-rw-r--r-- 1 apache apache  35M Sep  6 19:43 
tlog.765.1643975245907886080
-rw-r--r-- 1 apache apache  30M Sep  6 19:42 
tlog.764.1643975182924120064
-rw-r--r-- 1 apache apache  37M Sep  6 19:41 
tlog.763.1643975118316109824
-rw-r--r-- 1 apache apache  19M Sep  6 19:40 
tlog.762.1643975053918863360
-rw-r--r-- 1 apache apache  21M Sep  6 19:39 
tlog.761.1643974989726089216
-rw-r--r-- 1 apache apache  21M Sep  6 19:38 
tlog.760.1643974926010417152
-rw-r--r-- 1 apache apache  29M Sep  6 19:37 
tlog.759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep  6 19:10 
tlog.758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 
tlog.757.1643885009483857920
-rw-r--r-- 1 apache apache  27M Sep  5 19:48 
tlog.756.1643884946565103616
-rw-r--r-- 1 apache apache  35M Sep  5 19:47 
tlog.755.1643884877912735744
-rw-r--r-- 1 apache apache  30M Sep  5 19:46 
tlog.754.1643884812724862976
-rw-r--r-- 1 apache apache  25M Sep  5 19:45 
tlog.753.1643884748976685056
-rw-r--r-- 1 apache apache  18M Sep  5 19:44 
tlog.752.1643884685794738176
-rw-r--r-- 1 apache apache  21M Sep  5 19:43 
tlog.751.1643884621330382848
-rw-r--r-- 1 apache apache  16M Sep  5 19:42 
tlog.750.1643884558054064128
-rw-r--r-- 1 apache apache  26M Sep  5 19:41 
tlog.749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep  5 19:12 
tlog.748.1643882681969147904
-rw-r--r-- 1 apache apache  31M Sep  4 19:56 
tlog.747.1643794877229563904
-rw-r--r-- 1 apache apache  31M Sep  4 19:55 
tlog.746.1643794813706829824
-rw-r--r-- 1 apache apache  30M Sep  4 19:54 
tlog.745.1643794749615767552
-rw-r--r-- 1 apache apache  22M Sep  4 19:53 
tlog.744.1643794686253465600
-rw-r--r-- 1 apache apache  18M Sep  4 19:52 
tlog.743.1643794622319689728
-rw-r--r-- 1 apache apache  21M Sep  4 19:51 
tlog.742.1643794558055612416
-rw-r--r-- 1 apache apache  15M Sep  4 19:50 
tlog.741.1643794493330161664
-rw-r--r-- 1 apache apache  26M Sep  4 19:49 
tlog.740.1643794428790308864
-rw-r--r-- 1 apache apache  11M Sep  4 14:58 tlog.737.1643701398824550400

CDCR tlog corruption leads to infinite loop

2019-09-09 Thread Webster Homer
We are running Solr 7.2.0

Our configuration has several collections that are loaded into a solr cloud 
which is set to replicate using CDCR to 3 different solrclouds. All of our 
target collections have 2 shards with two replicas per shard. Our source 
collection has 2 shards, and 1 replica per shard.

Frequently we start to see errors where the target collections are out of date, 
and the cdcr action=errors endpoint shows large numbers of errors
For example:
{"responseHeader": {
"status": 0,
"QTime": 0},
"errors": [
"uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
["sial-catalog-product-20190824",
[
"consecutiveErrors",
700357,
"bad_request",
0,
"internal",
700357,
"last",
[
"2019-09-09T19:17:57.453Z",
"internal",
"2019-09-09T19:17:56.949Z",
"internal",
"2019-09-09T19:17:56.448Z"
,"internal",...

We have found that one or more tlogs have become corrupt. It appears that CDCR
keeps trying to send data, but cannot read the data from the tlog, and then it
retries forever.
How does this happen? It happens frequently, on roughly a weekly basis, and is
difficult to troubleshoot.
Today we had it happen with one of our collections. Here is the listing for the
tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache  44K Sep  9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep  6 19:44 
tlog.766.1643975309914013696
-rw-r--r-- 1 apache apache  35M Sep  6 19:43 
tlog.765.1643975245907886080
-rw-r--r-- 1 apache apache  30M Sep  6 19:42 
tlog.764.1643975182924120064
-rw-r--r-- 1 apache apache  37M Sep  6 19:41 
tlog.763.1643975118316109824
-rw-r--r-- 1 apache apache  19M Sep  6 19:40 
tlog.762.1643975053918863360
-rw-r--r-- 1 apache apache  21M Sep  6 19:39 
tlog.761.1643974989726089216
-rw-r--r-- 1 apache apache  21M Sep  6 19:38 
tlog.760.1643974926010417152
-rw-r--r-- 1 apache apache  29M Sep  6 19:37 
tlog.759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep  6 19:10 
tlog.758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 
tlog.757.1643885009483857920
-rw-r--r-- 1 apache apache  27M Sep  5 19:48 
tlog.756.1643884946565103616
-rw-r--r-- 1 apache apache  35M Sep  5 19:47 
tlog.755.1643884877912735744
-rw-r--r-- 1 apache apache  30M Sep  5 19:46 
tlog.754.1643884812724862976
-rw-r--r-- 1 apache apache  25M Sep  5 19:45 
tlog.753.1643884748976685056
-rw-r--r-- 1 apache apache  18M Sep  5 19:44 
tlog.752.1643884685794738176
-rw-r--r-- 1 apache apache  21M Sep  5 19:43 
tlog.751.1643884621330382848
-rw-r--r-- 1 apache apache  16M Sep  5 19:42 
tlog.750.1643884558054064128
-rw-r--r-- 1 apache apache  26M Sep  5 19:41 
tlog.749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep  5 19:12 
tlog.748.1643882681969147904
-rw-r--r-- 1 apache apache  31M Sep  4 19:56 
tlog.747.1643794877229563904
-rw-r--r-- 1 apache apache  31M Sep  4 19:55 
tlog.746.1643794813706829824
-rw-r--r-- 1 apache apache  30M Sep  4 19:54 
tlog.745.1643794749615767552
-rw-r--r-- 1 apache apache  22M Sep  4 19:53 
tlog.744.1643794686253465600
-rw-r--r-- 1 apache apache  18M Sep  4 19:52 
tlog.743.1643794622319689728
-rw-r--r-- 1 apache apache  21M Sep  4 19:51 
tlog.742.1643794558055612416
-rw-r--r-- 1 apache apache  15M Sep  4 19:50 
tlog.741.1643794493330161664
-rw-r--r-- 1 apache apache  26M Sep  4 19:49 
tlog.740.1643794428790308864
-rw-r--r-- 1 apache apache  11M Sep  4 14:58 
tlog.737.1643701398824550400
drwxr-xr-x 5 apache apache   53 Aug 21 06:30 ..
[apache@dfw-pauth-msc01 tlog]$ ls -alht 
tlog.757.1643885009483857920
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 
tlog.757.1643885009483857920
$ date
Mon Sep  9 14:27:31 CDT 2019
$ pwd
/var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog

CDCR started replicating after we deleted the oldest tlog file and restarted 
CDCR
tlog.737.1643701398824550400

About the same time I found a number of errors in the solr logs like this:
2019-09-04 19:58:01.393 ERROR 
(recoveryExecutor-162-thread-1-processing-n:dfw-pauth-msc01:8983_solr 
x:sial-catalog-product-20190824_shard1_replica_n1 s:shard1 
c:sial-catalog-product-20190824 r:core_node3) [c:sial-catalog-product-20190824 
s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] 
o.a.s.u.UpdateLog java.lang.ClassCastException

This was the most common error at the time, I saw it for all of our collections
2019-09-04 19:57:46.572 ERROR (qtp1355531311-20) 
[c:sial-catalog-product-20190824 s:shard1 r:core_node3 
x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.h.RequestHandlerBase 

RE: boost parameter produces garbage hits

2019-04-18 Thread Webster Homer
I looked at boost a bit more. The number of results remains the same whether the
boost parameter is present or not. When it is present, a hit that matches the
boost query is ranked as I expect; a hit that does not match ends up in the
results with 0 relevancy, which is completely unexpected.
It does appear that bq does what I want, but the behavior of boost seems like a
bug. We use boost elsewhere and it works as we want, though that use case does
not involve the query function.
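
For what it's worth, the zero scores are consistent with plain function-query arithmetic rather than garbage matching (an observation about query() semantics, not a confirmed diagnosis): query(subquery, 0) returns its second argument, 0, for every document the subquery does not match, and a multiplicative boost then multiplies those documents' scores by 0, so they still count as hits but sort with no relevancy. The outer product(..., 1) changes nothing. A sketch that keeps the same boost query but floors non-matching documents at 1 instead of 0 (field names as in the original post; the empty boost= local param is kept from the original to stop the inner edismax from inheriting the outer boost and recursing):

import org.apache.solr.client.solrj.SolrQuery;

public class ExactNameBoost {
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        // default of 1 instead of 0: docs that miss the name fields keep
        // their base score instead of being multiplied down to 0
        q.set("boost",
              "query({!edismax qf='search_en_p_pri_name_min"
            + " search_en_root_name_min' v=$q boost=},1)");
        return q;
    }
}

For the additive form, the usual trick is parameter indirection, e.g. bf=query($qq) with qq={!edismax qf=... v=$q boost=}, since the bf function parser tends to choke on inline local params.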

-Original Message-
From: Webster Homer  
Sent: Thursday, April 18, 2019 12:16 PM
To: solr-user@lucene.apache.org
Subject: boost parameter produces garbage hits

Hi,

I am trying to understand how the boost (and bq) parameters are supposed to 
work.
My application searches our product schema and returns the best matches. To 
enable an exactish match on product name we created fields that are minimally 
tokenized (keyword tokenizer/lowercase). Now I want the search to boost results 
that match on those fields. I thought that either the boost or bq parameter 
would work. I found very few good examples of the boost parameter used on a 
query. A lot of permutations resulted in errors such as this:
org.apache.solr.search.SyntaxError: Infinite Recursion detected parsing query 
'ethyl alcohol'

I am using Solr 7.2 and the eDismax query parser.
I have gotten boost to work, sort of, it really changes the query results in a 
bad way. I'm sure that I'm doing something wrong. Here is an example of my 
boost parameter boost=product(query({!edismax qf="search_en_p_pri_name_min 
search_en_root_name_min" v=$q boost=}, 0),1)

When I search for "ethyl alcohol", products named "ethyl alcohol" come first,
which is what I want. We have a range of ethyl alcohol products. Normally I
expect to see "ethyl alcohol, pure" and "ethyl alcohol, denatured" after the
initial "ethyl alcohol", and I see this without the boost. With the boost I get
"ethyl alcohol" with a score of 3.87201088E8. The second hit is "Brilliant
Cresyl blue" with a score of 0. All subsequent hits also have a score of 0.

Why are there any matches returned with a score of 0? Why are these hits with a 
0 score being returned at all? Especially when more relevant matches are not 
being returned? I suspect that there is something wrong with my boost function, 
but it looks right. However if I take it and instead submit the function shown 
above as a bf parameter I get a syntax error:
bf=product(query({!edismax qf="search_en_p_pri_name_min 
search_en_root_name_min" v=$q bf=}),1)
org.apache.solr.search.SyntaxError: Expected identifier at pos 23 
str='product(query({!edismax'"

>From the documentation I expected that the bf and boost parameters only 
>differed as to how the result was boosted with boost being multiplicative and 
>the bf being additive, but I cannot find an equivalent which actually works 
>with the bf parameter.

The bq parameter doesn't throw an error, but it doesn't seem to have any effect 
in how the results are ordered.

What am I doing wrong? Why does the boost parameter return garbage hits with 0 
score? What would work as a bf parameter function?



boost parameter produces garbage hits

2019-04-18 Thread Webster Homer
Hi,

I am trying to understand how the boost (and bq) parameters are supposed to 
work.
My application searches our product schema and returns the best matches. To 
enable an exactish match on product name we created fields that are minimally 
tokenized (keyword tokenizer/lowercase). Now I want the search to boost results 
that match on those fields. I thought that either the boost or bq parameter 
would work. I found very few good examples of the boost parameter used on a 
query. A lot of permutations resulted in errors such as this:
org.apache.solr.search.SyntaxError: Infinite Recursion detected parsing query 
'ethyl alcohol'

I am using Solr 7.2 and the eDismax query parser.
I have gotten boost to work, sort of, it really changes the query results in a 
bad way. I'm sure that I'm doing something wrong. Here is an example of my 
boost parameter
boost=product(query({!edismax qf="search_en_p_pri_name_min 
search_en_root_name_min" v=$q boost=}, 0),1)

When I search for "ethyl alcohol", products named "ethyl alcohol" come first,
which is what I want. We have a range of ethyl alcohol products. Normally I
expect to see "ethyl alcohol, pure" and "ethyl alcohol, denatured" after the
initial "ethyl alcohol", and I see this without the boost. With the boost I get
"ethyl alcohol" with a score of 3.87201088E8. The second hit is "Brilliant
Cresyl blue" with a score of 0. All subsequent hits also have a score of 0.

Why are there any matches returned with a score of 0? Why are these hits with a 
0 score being returned at all? Especially when more relevant matches are not 
being returned? I suspect that there is something wrong with my boost function, 
but it looks right. However if I take it and instead submit the function shown 
above as a bf parameter I get a syntax error:
bf=product(query({!edismax qf="search_en_p_pri_name_min 
search_en_root_name_min" v=$q bf=}),1)
org.apache.solr.search.SyntaxError: Expected identifier at pos 23 
str='product(query({!edismax'"

>From the documentation I expected that the bf and boost parameters only 
>differed as to how the result was boosted with boost being multiplicative and 
>the bf being additive, but I cannot find an equivalent which actually works 
>with the bf parameter.

The bq parameter doesn't throw an error, but it doesn't seem to have any effect 
in how the results are ordered.

What am I doing wrong? Why does the boost parameter return garbage hits with 0 
score? What would work as a bf parameter function?



CloudSolrClient Question

2019-03-01 Thread Webster Homer
I am using the CloudSolrClient Solrj api for querying solr cloud collections. 
For the most part it works well. However we recently experienced a series of 
outages where our production cloud became unavailable. All the nodes were down. 
That's a separate topic... The client application tried to launch searches but 
always experienced a SolrServerException that there were no live nodes 
available. After a few hundred such exceptions, the application ran out of 
memory and failed when trying to allocate a thread... I'm not sure where the 
resources are being leaked in exception handling. Is there a way to ask the
CloudSolrClient if there are enough replicas to execute the search?

I'm using Solr 7.2
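
Not an answer to the leak itself, but a sketch of a guard that can sit in front of queries so the application backs off instead of hammering a dead cluster (class name is hypothetical; the client would be built once with the real ZK ensemble string and reused):

import java.util.Set;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ClusterGuard {
    // CloudSolrClient is heavyweight: build one and reuse it for the
    // lifetime of the application rather than per request.
    public static boolean hasLiveNodes(CloudSolrClient client) {
        client.connect();
        Set<String> live = client.getZkStateReader()
                                 .getClusterState()
                                 .getLiveNodes();
        return !live.isEmpty();
    }
}

A caller that sees false can sleep with backoff instead of querying, which also bounds the rate at which exceptions (and whatever they leak) are produced. A finer-grained check could walk getClusterState().getCollection(name) and require at least one active replica per shard.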


Help needed with Solrcloud error messages

2019-02-04 Thread Webster Homer
We have a number of collections in a Solrcloud.

The cloud has 2 shards each with 2 replicas, 4 nodes. On one of the nodes I am 
seeing a lot of errors in the log like this:
2019-02-04 20:27:11.831 ERROR (qtp1595212853-88527) [c:sial-catalog-product 
s:shard1 r:core_node4 x:sial-catalog-product_shard1_replica2] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error reading 
document with docId 417762
2019-02-04 20:29:49.779 ERROR (qtp1595212853-87296) [c:sial-catalog-product 
s:shard1 r:core_node4 x:sial-catalog-product_shard1_replica2] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error reading 
document with docId 417676
2019-02-04 20:23:47.505 ERROR (qtp1595212853-87538) [c:sial-catalog-product 
s:shard1 r:core_node4 x:sial-catalog-product_shard1_replica2] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error reading 
document with docId 414871

There are many more than these three. What does this mean?

On the same node I also see problems with 2 other collections:
ehs-catalog-qmdoc_shard1_replica2: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error opening new searcher
sial-catalog-category-180721_shard2_replica_n4: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error opening new searcher

Yet another replica on this node is down

What could cause the error reading docId problems? Why is there a problem 
opening a new searcher on   2 unrelated collections which just happen to be on 
the same node? How do I go about diagnosing the problems?

We've been seeing a lot of problems with solrcloud.

We are on Solr 7.2




RE: Query kills Solrcloud

2019-01-02 Thread Webster Homer
We are still having serious problems with our solrcloud failing due to this 
problem.
The problem is clearly data related. 
How can I determine what documents are being searched? Is it possible to get 
Solr/lucene to output the docids being searched?

I believe that this is a lucene bug, but I need to narrow the focus to a 
smaller number of records, and I'm not certain how to do that efficiently. Are 
there debug parameters that could help?
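
One way to narrow it without Lucene-level tracing (a suggestion, not something from this thread): bisect the collection with a filter query on the uniqueKey range, run the pathological query against each half, and keep recursing into whichever half stays slow. QTime then localizes the expensive documents in a logarithmic number of runs. A sketch, where the field name id stands in for the real uniqueKey:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class Bisect {
    // time the problem query restricted to a slice of the id space,
    // e.g. idRange = "id:[a TO m]"
    public static int timedQTime(SolrClient solr, String collection,
                                 String problemQuery, String idRange) throws Exception {
        SolrQuery q = new SolrQuery(problemQuery);
        q.set("defType", "edismax");
        q.addFilterQuery(idRange);
        q.setRows(0);
        return solr.query(collection, q).getQTime();
    }
}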

-Original Message-
From: Webster Homer  
Sent: Thursday, December 20, 2018 3:45 PM
To: solr-user@lucene.apache.org
Subject: Query kills Solrcloud

We are experiencing almost nightly solr crashes due to Japanese queries. I’ve 
been able to determine that one of our field types seems to be a culprit. When 
I run a much reduced version of the query against our DEV solrcloud I see the 
memory usage jump from less than a gb to 5gb using only a single field in the 
query. The collection is fairly small ~411,000 documents of which only ~25,000 
have searchable Japanese fields. I have been able to simplify the query to run 
against a single Japanese field in the schema. The JVM memory jumps from less 
than a gig to close to 5 gb, and back down. The QTime is 36959 which seems high 
for 2500 documents. Indeed the single field that I’m using in my test case has 
2031 documents.

I extended the query to 5 fields and watch the memory usage in the Solr Console 
application. The memory usage goes to almost 6gb with a QTime of 100909. The 
Solrconsole shows connection errors, and when I look at the Cloud graph all the 
replicas on the node where I submitted the query are down. In dev the replicas 
eventually recover. In production, with the full query which has a lot more 
fields in the qf parameter, the solr cloud dies.
One example query term:
ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル

This is the field type that we have defined:
[fieldType definition stripped in archiving; per the surrounding discussion the analyzer chain includes solr.CJKBigramFilterFactory]

Why is searching even 1 field of this type so expensive?
I suspect that this is data related, as other queries return in far less than a 
second. What are good strategies for determining what documents are causing the 
problem? I’m new to debugging Solr so I could use some help. I’d like to reduce 
the number of records to a minimum to create a small dataset to reproduce the 
problem.
Right now our only option is to stop using this fieldtype, but it does improve 
the relevancy of searches that don’t cause Solr to crash.

It would be a great help if the Solrconsole would not timeout on these queries, 
is there a way to turn off the timeout?
We are running Solr 7.2


Query kills Solrcloud

2018-12-20 Thread Webster Homer
We are experiencing almost nightly solr crashes due to Japanese queries. I’ve 
been able to determine that one of our field types seems to be a culprit. When 
I run a much reduced version of the query against our DEV solrcloud I see the 
memory usage jump from less than a gb to 5gb using only a single field in the 
query. The collection is fairly small ~411,000 documents of which only ~25,000 
have searchable Japanese fields. I have been able to simplify the query to run 
against a single Japanese field in the schema. The JVM memory jumps from less 
than a gig to close to 5 gb, and back down. The QTime is 36959 which seems high 
for 2500 documents. Indeed the single field that I’m using in my test case has 
2031 documents.

I extended the query to 5 fields and watch the memory usage in the Solr Console 
application. The memory usage goes to almost 6gb with a QTime of 100909. The 
Solrconsole shows connection errors, and when I look at the Cloud graph all the 
replicas on the node where I submitted the query are down. In dev the replicas 
eventually recover. In production, with the full query which has a lot more 
fields in the qf parameter, the solr cloud dies.
One example query term:
ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル

This is the field type that we have defined:
[fieldType definition stripped in archiving; per the surrounding discussion the analyzer chain includes solr.CJKBigramFilterFactory]

Why is searching even 1 field of this type so expensive?
I suspect that this is data related, as other queries return in far less than a 
second. What are good strategies for determining what documents are causing the 
problem? I’m new to debugging Solr so I could use some help. I’d like to reduce 
the number of records to a minimum to create a small dataset to reproduce the 
problem.
Right now our only option is to stop using this fieldtype, but it does improve 
the relevancy of searches that don’t cause Solr to crash.

It would be a great help if the Solrconsole would not timeout on these queries, 
is there a way to turn off the timeout?
We are running Solr 7.2


Help CJK OOM Errors

2018-12-12 Thread Webster Homer
Recently we had a few Japanese queries that killed our production Solrcloud 
instance. Our schemas support multiple languages, with language specific search 
fields.

This query and similar ones caused OOM errors in Solr:
モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体

The query doesn’t match anything

We are running Solr 7.2 in Google cloud. The Solr cloud has 4 solr nodes (3 
zookeepers on their own nodes) holding 18 collections. The usage on most of the 
collections is currently fairly light. One of them gets a lot of traffic. This 
has 500,000 documents of which 25,000 contain some Japanese fields.
We did a lot of tests, but I think we used historical search data which tends 
to have short queries. A 44 character CJK string generates ~80 tokens
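
That count is consistent with the bigram arithmetic (a back-of-the-envelope sketch, assuming the analyzer also emits unigrams, since the exact chain was lost above): a run of N CJK characters yields N-1 bigrams, plus N unigrams when outputUnigrams is on, so 44 characters comes to roughly 87 terms. A Lucene snippet to count what the filter emits:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CjkTokenCount {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String field) {
                Tokenizer src = new StandardTokenizer();
                // bigram Han/Hiragana/Katakana/Hangul and also emit unigrams
                int flags = CJKBigramFilter.HAN | CJKBigramFilter.HIRAGANA
                          | CJKBigramFilter.KATAKANA | CJKBigramFilter.HANGUL;
                return new TokenStreamComponents(src,
                        new CJKBigramFilter(src, flags, true));
            }
        };
        int count = 0;
        try (TokenStream ts = analyzer.tokenStream("f",
                "モノクローナル抗ニコチン性アセチルコリンレセプター")) {
            ts.reset();
            while (ts.incrementToken()) count++;
            ts.end();
        }
        System.out.println(count + " tokens");
    }
}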

I ran the query against a single Japanese field and it took ~30 seconds to come 
back. Removing the ?? from it made no significant difference in performance.
I’ve run other Japanese queries of a similar length and they return in ~200 
msecs.

Our solr cloud usually performs quite well, but in this case it was horrible. 
The bigram filter creates a lot of tokens, but this seems to be a fairly 
standard approach for Chinese and Japanese searches.
How can I debug what is going on with this query?
How resource intensive will searches against these fields be?
How do we estimate the additional memory that seem to require?

We have about a dozen Japanese search fields. These all have this CJKBigram 
field type.
[fieldType definition stripped in archiving; per the surrounding discussion the analyzer chain includes solr.CJKBigramFilterFactory]



Query kills Solr

2018-12-11 Thread Webster Homer
Is there a way to get an approximate measure of the memory used by an indexed 
field(s). I’m looking into a problem with one of our Solr indexes. I have a 
Japanese query that causes the replicas to run out of memory when processing a 
query.
Also, is there a way to change or disable the timeout in the Solr Console? When 
I run this query there it always times out, and that is a real pain. I know 
that it will complete eventually.

I have this field type:
[fieldType definition stripped in archiving; per the surrounding discussion the analyzer chain includes solr.CJKBigramFilterFactory]
I have a number of fields of this type. The CJKBigramFilterFactory can generate 
a lot of tokens. I’m concerned that this combination is what is killing our 
Solr instances.
This is the query that is causing my problems:
モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体

We are using Solr 7.2 in a solrcloud



CDCR Replication sensitive to network problems

2018-12-07 Thread Webster Homer
We are using Solr 7.2. We have two solrclouds that are hosted on Google clouds. 
These are targets for an on Prem solr cloud where we run our ETL loads  and 
have CDCR replicate it to the Google clouds. This mostly works pretty well. 
However, networks can fail. When the network has a brief outage we frequently 
then see corrupted tlog files. Frequently we see 0 length tlog files or files 
that appear to be truncated. When this happens we see lots of cdcr errors. If 
there is a corrupt tlog, we delete it and things go back to normal.
The frequency of the errors is troubling. CDCR needs to be more robust with 
networking issues. I don't know how tlogs get corrupted in this scenario, but 
they obviously do.

Today we started seeing lots of CdcrReplicator errors but could not find a 
corrupt tlog. This is a trace from the logs
java.io.EOFException
at 
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
at 
org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:863)
at 
org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at 
org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:603)
at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:315)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at 
org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at 
org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:690)
at 
org.apache.solr.update.CdcrTransactionLog$CdcrLogReader.next(CdcrTransactionLog.java:304)
at 
org.apache.solr.update.CdcrUpdateLog$CdcrLogReader.next(CdcrUpdateLog.java:633)
at 
org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:77)
at 
org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Our admins restarted the source solr servers and that seems to have helped.


Solr on Java 11?

2018-11-30 Thread Webster Homer
My company is planning on upgrading our stack to use Java 11. What version of 
Solr is planned to be supported on Java 11?
We won't be doing this immediately, as several of our key components have not 
yet been ported to Java 11, but we want to plan for it.

Thanks,
Webster


RE: Negative CDCR Queue Size?

2018-11-06 Thread Webster Homer
I'm sorry, I should have included that. We are running Solr 7.2. We use CDCR for 
almost all of our collections. We have experienced several intermittent 
problems with CDCR; this one seems to be new, at least I hadn't seen it before.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 06, 2018 12:36 PM
To: solr-user 
Subject: Re: Negative CDCR Queue Size?

What version of Solr? CDCR has changed quite a bit in the 7x  code line so it's 
important to know the version.

On Tue, Nov 6, 2018 at 10:32 AM Webster Homer 
 wrote:
>
> Several times I have noticed that the CDCR action=QUEUES will return a 
> negative queueSize. When this happens we seem to be missing data in the 
> target collection. How can this happen? What does a negative Queue size mean? 
> The timestamp is an empty string.
>
> We have two targets for a source. One looks like this, with a negative 
> queue size
> queues": 
> ["uc1f-ecom-mzk01.sial.com:2181,uc1f-ecom-mzk02.sial.com:2181,uc1f-eco
> m-mzk03.sial.com:2181/solr",["ucb-catalog-material-180317",["queueSize
> ",-1,"lastTimestamp",""]],
>
> The other is healthy
> "ae1b-ecom-mzk01.sial.com:2181,ae1b-ecom-mzk02.sial.com:2181,ae1b-ecom
> -mzk03.sial.com:2181/solr",["ucb-catalog-material-180317",["queueSize"
> ,246980,"lastTimestamp","2018-11-06T16:21:53.265Z"]]
>
> We are not seeing CDCR errors.
>
> What could cause this behavior?


Negative CDCR Queue Size?

2018-11-06 Thread Webster Homer
Several times I have noticed that the CDCR action=QUEUES will return a negative 
queueSize. When this happens we seem to be missing data in the target 
collection. How can this happen? What does a negative Queue size mean? The 
timestamp is an empty string.

We have two targets for a source. One looks like this, with a negative queue 
size
queues": 
["uc1f-ecom-mzk01.sial.com:2181,uc1f-ecom-mzk02.sial.com:2181,uc1f-ecom-mzk03.sial.com:2181/solr",["ucb-catalog-material-180317",["queueSize",-1,"lastTimestamp",""]],

The other is healthy
"ae1b-ecom-mzk01.sial.com:2181,ae1b-ecom-mzk02.sial.com:2181,ae1b-ecom-mzk03.sial.com:2181/solr",["ucb-catalog-material-180317",["queueSize",246980,"lastTimestamp","2018-11-06T16:21:53.265Z"]]

We are not seeing CDCR errors.

What could cause this behavior?
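
(Not from the original mail: a minimal SolrJ sketch of polling the QUEUES 
action so a negative queueSize can be caught by monitoring. The ZooKeeper host 
and collection name are placeholders.)

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class CdcrQueueCheck {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("action", "QUEUES");
                // GET /cdcr?action=QUEUES against the named collection
                NamedList<Object> rsp = client.request(
                        new GenericSolrRequest(SolrRequest.METHOD.GET, "/cdcr", params),
                        "mycollection");
                System.out.println(rsp); // inspect queueSize / lastTimestamp per target
            }
        }
    }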


RE: Odd Scoring behavior

2018-10-31 Thread Webster Homer
The KeywordRepeat and RemoveDuplicates filters were added to support better 
wildcard matching. Removing the duplicates just removes those terms that 
weren't changed by stemming.

This seems like a subtle bug to me.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, October 30, 2018 4:55 PM
To: solr-user@lucene.apache.org
Subject: RE: Odd Scoring behavior

Hello Webster,

It smells like KeywordRepeat. In general it is not a problem if all terms are 
scored twice. But you also have RemoveDuplicates, and this means that in some 
cases a term in one field is scored twice but only once in the other field, and 
then you have a problem.

Due to lack of replies, in the end I chose to remove the RemoveDuplicates 
filter, so that everything is always scored twice. This 'solution' at least 
solved the general scoring problem of searching across many fields.

Thus far there is no real solution to this problem as far as I know.

Regards,
Markus

http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html

 
 
-Original message-
> From:Webster Homer 
> Sent: Tuesday 30th October 2018 22:34
> To: solr-user@lucene.apache.org
> Subject: Odd Scoring behavior
> 
> I noticed that sometimes query matches seem to get counted twice when they 
> are scored. This will happen if the fieldtype is being stemmed, and there is 
> a matching synonym.
> It seems that the score for the field is 2X higher than it should be. We see 
> this only when there is a matching synonym that has a stemmed term in it.
> 
> 
> We have this synonym defined:
> bsa, bovine serum albumin
> 
> We have this fieldtype (the archive stripped the XML tags; class names below
> are inferred from the surviving attributes, and "..." marks lost values):
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" />
>     <!-- several more index-time filters, including a stemmer, were lost -->
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" />
>     <filter class="..." synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <!-- several more query-time filters were lost -->
>   </analyzer>
> </fieldType>
> 
> Which is used as:
> <field name="..." type="..." indexed="true" stored="true" required="false"
> multiValued="false" />
> 
> When we query this field using the eDismax query parser the field, 
> search_en_root_name seems to contribute twice to the score for this query:
> bovine serum albumin
> 
> once for the base query, and once for the stemmed form of the query:
> bovin serum albumin
> 
> If we remove the synonym it will only be counted once. We only see this 
> behavior if part of the synonym can be stemmed. This seems odd and has the 
> effect of overpowering boosts on other fields.
> 
> The explain plan without synonym
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":44,
> "params":{
>   "mm":"2<-25%",
>   "fl":"searchmv_pno, search_en_p_pri_name [explain style=nl]",
>   "group.limit":"1",
>   "q.op":"OR",
>   "sort":"score desc,sort_en_name asc ,sort_ds asc,  search_pid asc",
>   "group.ngroups":"true",
>   "q":"bovine serum albumin",
>   "tie":".45",
>   "defType":"edismax",
>   "group.sort":"sort_ds asc, score desc",
>   "qf":"search_en_p_pri_name_min^7500
> search_en_root_name_min^12000 search_en_p_pri_name^3000
> search_pid^2500 searchmv_pno^2500 searchmv_cas_number^2500
> searchmv_p_skus^2500 search_lform_lc^2500  search_en_root_name^2500
> searchmv_en_s_pri_name^2500 searchmv_en_keywords^2500
> searchmv_lookahead_terms^2000 searchmv_user_term^2000
> searchmv_en_acronym^1500 searchmv_en_synonyms^1500
> searchmv_concat_sku^1000 search_concat_pno^1000
> searchmv_en_name_suf^1000 searchmv_component_cas^1000
> search_lform^1000 searchmv_pno_genr^500 search_concat_pno_genr^500
> searchmv_p_skus_genr^500 search_eform search_mol_form 
> searchmv_component_molform searchmv_en_descriptions searchmv_en_chem_comp 
> searchmv_en_attributes searchmv_en_page_title search_mdl_number 
> searchmv_xref_comparable_pno searchmv_xref_comparable_sku 
> searchmv_xref_equivalent_pno searchmv_xref_exact_pno searchmv_xref_exact_sku 
> searchmv_vendor_sku searchmv_material_number search_en_sortkey searchmv_rtecs 
> search_color_idx search_beilstein search_ecnumber search_egecnumber 
> search_femanumber searchmv_isbn",
>   "group.field":"id_s",
>   "_":"1540331449276",
>   "group":"true"}},
>   "grouped":{
> "id_s":{
>   "matches":4701,
>   "ngroups":4393,
>   "groups":[{
>   "groupValue":"bovineserumalbumin123459048468",
>   "doclist":{"numFound":57,"start":0,"docs":[
>   {
> "search_en_p_pri_name":"Bovine Serum Albumin",
> "searchmv_pno":["A2153"],
> "[explain]":{
>   "match":true,
>   "value":38145.117,
>   "description":"max plus 0.45 times others of:",
>   "details":[{
>   "match":true,
>   "value":10434.111,
>   "description":"sum of:",

Odd Scoring behavior

2018-10-30 Thread Webster Homer
I noticed that sometimes query matches seem to get counted twice when they are 
scored. This will happen if the fieldtype is being stemmed, and there is a 
matching synonym.
It seems that the score for the field is 2X higher than it should be. We see 
this only when there is a matching synonym that has a stemmed term in it.


We have this synonym defined:
bsa, bovine serum albumin

We have this fieldtype:
(the fieldType definition was stripped entirely by the mailing-list archive; 
the copy quoted in the reply above preserves the surviving attributes)

Which is used as:
(the field definition was also stripped)

When we query this field using the eDismax query parser the field, 
search_en_root_name seems to contribute twice to the score for this query:
bovine serum albumin

once for the base query, and once for the stemmed form of the query:
bovin serum albumin

If we remove the synonym it will only be counted once. We only see this 
behavior if part of the synonym can be stemmed. This seems odd and has the 
effect of overpowering boosts on other fields.
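
(A hedged sketch, not from the original mail: the analysis handler shows 
exactly which duplicated stemmed/unstemmed tokens the chain emits for this 
query. The field name comes from this thread; the URL is a placeholder, and 
the FieldAnalysisRequest usage assumes SolrJ 7.x.)

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.FieldAnalysisRequest;
    import org.apache.solr.client.solrj.response.FieldAnalysisResponse;

    public class AnalysisCheck {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) { // placeholder
                FieldAnalysisRequest req = new FieldAnalysisRequest();
                req.addFieldName("search_en_root_name");   // field from this thread
                req.setFieldValue("bovine serum albumin"); // index-time input
                req.setQuery("bovine serum albumin");      // query-time input
                FieldAnalysisResponse rsp = req.process(client);
                // prints each stage of the chain and the tokens it produced
                System.out.println(rsp.getFieldNameAnalysis("search_en_root_name"));
            }
        }
    }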

The explain plan without synonym
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":44,
"params":{
  "mm":"2<-25%",
  "fl":"searchmv_pno, search_en_p_pri_name [explain style=nl]",
  "group.limit":"1",
  "q.op":"OR",
  "sort":"score desc,sort_en_name asc ,sort_ds asc,  search_pid asc",
  "group.ngroups":"true",
  "q":"bovine serum albumin",
  "tie":".45",
  "defType":"edismax",
  "group.sort":"sort_ds asc, score desc",
  "qf":"search_en_p_pri_name_min^7500
search_en_root_name_min^12000 search_en_p_pri_name^3000
search_pid^2500 searchmv_pno^2500 searchmv_cas_number^2500
searchmv_p_skus^2500 search_lform_lc^2500  search_en_root_name^2500
searchmv_en_s_pri_name^2500 searchmv_en_keywords^2500
searchmv_lookahead_terms^2000 searchmv_user_term^2000
searchmv_en_acronym^1500 searchmv_en_synonyms^1500
searchmv_concat_sku^1000 search_concat_pno^1000
searchmv_en_name_suf^1000 searchmv_component_cas^1000
search_lform^1000 searchmv_pno_genr^500 search_concat_pno_genr^500
searchmv_p_skus_genr^500 search_eform search_mol_form 
searchmv_component_molform searchmv_en_descriptions searchmv_en_chem_comp 
searchmv_en_attributes searchmv_en_page_title search_mdl_number 
searchmv_xref_comparable_pno searchmv_xref_comparable_sku 
searchmv_xref_equivalent_pno searchmv_xref_exact_pno searchmv_xref_exact_sku 
searchmv_vendor_sku searchmv_material_number search_en_sortkey searchmv_rtecs 
search_color_idx search_beilstein search_ecnumber search_egecnumber 
search_femanumber searchmv_isbn",
  "group.field":"id_s",
  "_":"1540331449276",
  "group":"true"}},
  "grouped":{
"id_s":{
  "matches":4701,
  "ngroups":4393,
  "groups":[{
  "groupValue":"bovineserumalbumin123459048468",
  "doclist":{"numFound":57,"start":0,"docs":[
  {
"search_en_p_pri_name":"Bovine Serum Albumin",
"searchmv_pno":["A2153"],
"[explain]":{
  "match":true,
  "value":38145.117,
  "description":"max plus 0.45 times others of:",
  "details":[{
  "match":true,
  "value":10434.111,
  "description":"sum of:",
  "details":[{
  "match":true,
  "value":4042.5876,

"description":"weight(Synonym(search_en_root_name:bovin
search_en_root_name:bovine) in 20407) [SialBM25Similarity], result of:",
  "details":[{
  "match":true,
  "value":4042.5876,
  "description":"score(doc=20407,freq=2.0
= termFreq=2.0\n), product of:",
  "details":[{
  "match":true,
  "value":2500.0,
  "description":"boost"},
{
  "match":true,
  "value":1.0,
  "description":"idf, computed as
log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
  "details":[{
  "match":true,
  "value":204.0,
  "description":"docFreq"},
{
  "match":true,
  "value":365301.0,
  "description":"docCount"}]},
{
  "match":true,
  "value":1.617035,
  "description":"tfNorm, computed as (freq * 
(k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength /
avgFieldLength)) from:",

Solr admin timeout

2018-09-20 Thread Webster Homer
I have a fairly complex query which I'm trying to debug. The query works as 
long as I don't try to return the field: [explain style=nl]
Of course this is the data I'm really interested in. When I run it, the console 
acts busy, but then the screen clears, displaying no data. I suspect a timeout 
in the admin console. Is there any way to set the timeout to be longer?
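
(A workaround sketch, not from the original mail: run the slow query outside 
the admin UI with a long client-side socket timeout. The URL, timeouts, and 
query are placeholders.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class SlowExplainQuery {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection") // placeholder URL
                    .withConnectionTimeout(10_000)  // ms
                    .withSocketTimeout(600_000)     // ms, long enough for the slow query
                    .build()) {
                SolrQuery q = new SolrQuery("bovine serum albumin"); // placeholder query
                q.set("defType", "edismax");
                q.setFields("id", "[explain style=nl]"); // the field that times out in the UI
                System.out.println(client.query(q).getResults());
            }
        }
    }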

I've had issues with what I think is a timeout problem using the Streams 
interface as well.

Thanks


numdocs is different on replicas in a shard

2018-08-03 Thread Webster Homer
This morning I was told that there was something screwy with one of our
collections.
This collection has 2 shards and 2 replicas per shard. Each replica has a
different value for numDocs!
Datacenter #1
shard1_replica1  1513053
shard1_replica2  1512653
shard2_replica1  1512296
shard2_replica2  1512487

We have 2 copies of this collection that are populated via cdcr from a
common collection. Both copies show the same thing. They run in different
datacenters in the cloud
Datacenter #2
shard1_replica1  1513054
shard1_replica2  1512903
shard2_replica1  1512452
shard2_replica2  1512487

We are running Solr 7.2.0 in SolrCloud mode.
This collection is populated by CDCR, auto commits are enabled.

I don't see any errors in the logs. I manually sent a commit to the
collection, the above numbers are after the commit.

The source collection has only 2 replicas per shard
shard1_replica1   1513054
shard2_replica1   1512487


What could cause this? How can I address it? How do we prevent it from
happening again?
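
(A diagnostic sketch, not from the original mail: querying each core directly 
with distrib=false makes the per-replica counts easy to compare. The core URLs 
are placeholders.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ReplicaCountCheck {
        public static void main(String[] args) throws Exception {
            String[] coreUrls = {
                "http://host1:8983/solr/mycoll_shard1_replica1", // placeholder core URLs
                "http://host2:8983/solr/mycoll_shard1_replica2"
            };
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.set("distrib", "false"); // ask only the local core, not the whole collection
            for (String url : coreUrls) {
                try (HttpSolrClient core = new HttpSolrClient.Builder(url).build()) {
                    System.out.println(url + " -> " + core.query(q).getResults().getNumFound());
                }
            }
        }
    }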



Re: list of collections

2018-07-17 Thread Webster Homer
Use the SolrCloud Collections API:
https://lucene.apache.org/solr/guide/7_3/collections-api.html#list
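
A minimal SolrJ equivalent, assuming SolrJ 7.x's CollectionAdminRequest 
listCollections helper and placeholder ZooKeeper hosts:

    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class ListCollections {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) { // placeholders
                List<String> collections = CollectionAdminRequest.listCollections(client);
                collections.forEach(System.out::println);
            }
        }
    }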

On Tue, Jul 17, 2018 at 12:12 PM, Kudrettin Güleryüz 
wrote:

> Hi,
>
> What is the suggested way to get list of collections from a solr Cloud with
> a ZKhost?
>
> Thank you
>



Retrieving json.facet from a search

2018-06-28 Thread Webster Homer
I have a fairly large existing code base for querying Solr. It is
architected so that common code calls Solr and returns a SolrJ QueryResponse
object.

I'm currently using Solr 7.2 the code interacts with solr using the Solrj
client api

I have a need that would be very easily met by using the json.facet api.
The problem is that I don't see how to get the json.facet out of a
QueryResponse object.

There doesn't seem to be a lot of discussion online about this.
Is there a way to get the Json object out of the QueryResponse?
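
(As far as I can tell, SolrJ 7.x has no typed accessor for json.facet, but the 
raw output is present in the response NamedList under the "facets" key. A 
minimal sketch; the ZooKeeper host, collection, and facet definition are 
placeholders.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    public class JsonFacetExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181/solr").build()) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(0);
                q.set("json.facet", "{categories:{type:terms,field:category}}"); // placeholder
                QueryResponse rsp = client.query("mycollection", q);
                // pull the raw json.facet section out of the response
                NamedList<Object> facets = (NamedList<Object>) rsp.getResponse().get("facets");
                System.out.println(facets); // walk the NamedList for buckets and counts
            }
        }
    }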



CDCR sensitive to network failures

2018-05-18 Thread Webster Homer
Recently I encountered some problems with CDCR after we experienced network
problems, so I thought I'd share.

I'm using Solr 7.2.0.
We have 3 SolrCloud instances: we update one cloud and use CDCR to
forward updates to the two SolrClouds that are hosted in a public cloud.

Usually this works pretty well.
Recently we have experienced some serious but intermittent network issues.
When that occurs we find that we get tons of cdcr warnings:

CdcrReplicator  Failed to forward update request to target:
bioreliance-catalog-assay
with errors like ClassCastException, and/or NullpointerException etc...

Updates accumulate on the server and it has tons of errors in the
cdcr?action=errors
"2018-05-18T16:11:19.860Z","internal","2018-05-18T16:11:18.860Z","internal",
"2018-05-18T16:11:17.860Z","internal",
When I looked around on the source collection, I found tlog files like this:
-rw-r--r-- 1 apache apache 1376736 May 10 23:04
tlog.141.1600138985674375168
*-rw-r--r-- 1 apache apache   0 May 11 23:05
tlog.143.1600229645842644992*
*-rw-r--r-- 1 apache apache   65458 May 12 07:50
tlog.142.1600229582225539072*
-rw-r--r-- 1 apache apache 1355610 May 18 10:05
tlog.144.1600814785270644736
-rw-r--r-- 1 apache apache 1355610 May 18 10:16
tlog.145.1600815458585411584
-rw-r--r-- 1 apache apache 1355610 May 18 10:21
tlog.146.1600815785277652992
-rw-r--r-- 1 apache apache 1355610 May 18 10:29
tlog.147.1600816282070941696

Note the zero-length file and the truncated file
tlog.142.1600229582225539072

The solution is to delete these files. Once these files are removed, the
updates start flowing again.

These errors show up as warnings in the log; I would have expected them to
be errors. CDCR doesn't seem to be able to detect that the tlog is
corrupted.
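
(A small watchdog sketch along these lines, not from the original mail; it 
flags zero-length tlog files so they can be removed before they stall CDCR. 
The tlog path is a placeholder.)

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FindEmptyTlogs {
        public static void main(String[] args) throws IOException {
            Path tlogDir = Paths.get("/var/solr/data/mycore/data/tlog"); // placeholder path
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(tlogDir, "tlog.*")) {
                for (Path p : stream) {
                    if (Files.size(p) == 0) {
                        System.out.println("zero-length tlog: " + p);
                    }
                }
            }
        }
    }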

Hope this helps someone else. If there are better solutions, I'd like to
know



Re: the number of docs in each group depends on rows

2018-05-04 Thread Webster Homer
We do group queries with SolrCloud all the time. You must set up your
collection so that all values for the field you are grouping on are in the
same shard.
This can easily be done with the compositeId router. Basically you do this by
creating a unique id that prefixes your unique id with the value you group on,
separated by '!':
groupValue!uniqueId

See
https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting
for more details.

SolrCloud does limit you to grouping on fields; you cannot group on
function queries.
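
A minimal indexing sketch of the compositeId convention just described; the
collection name, field names, and values are placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class GroupedIndexing {
        // compositeId routing: everything before '!' is hashed to pick the shard,
        // so all documents that share a group value land on the same shard.
        static void indexGrouped(SolrClient client, String groupValue, String docId)
                throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", groupValue + "!" + docId); // e.g. "acme!12345"
            doc.addField("id_s", groupValue);             // the field you group on
            client.add("mycollection", doc);              // placeholder collection
        }
    }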

On Fri, May 4, 2018 at 6:37 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> Hello,
>
> I'm not sure 100% but I think that if you have multiple shards the number
> of docs matched in each group is *not* guarantee to be exact. Increasing
> the rows will increase the amount of partial information that each shard
> sends to the federator and make the number more precise.
>
> For exact counts you might need one shard OR  to make sure that all the
> documents in the same group are in the same shard by using document routing
> via composite keys [1].
>
> Thinking about that, it should be possible to fix grouping to compute the
> exact numbers on request...
>
> cheers,
> Diego
>
>
> [1] https://lucene.apache.org/solr/guide/6_6/shards-and-
> indexing-data-in-solrcloud.html#shards-and-indexing-data-in-solrcloud
>
>
> From: solr-user@lucene.apache.org At: 05/04/18 07:53:41To:
> solr-user@lucene.apache.org
> Subject: the number of docs in each group depends on rows
>
> Hi,
> We used Solr Cloud 7.1.0(3 nodes, 3 shards with 2 replicas). When we used
> group query, we found that the number of docs in each group depends on the
> rows number(group number).
>
> difference:
> 
>
> when the rows bigger then 5, the return docs are correct and stable, for
> the
> rest, the number of docs is smaller than the actual result.
>
> Could you please explain why and give me some suggestion about how to
> decide
> the rows number?
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>
>



CDCR broken for Mixed Replica Collections

2018-04-25 Thread Webster Homer
I was looking at SOLR-12057

According to the comment on the ticket, CDCR can not work when a collection
has PULL Replicas. That seems like a MAJOR limitation to CDCR and PULL
Replicas. Is this likely to be addressed in the future?
CDCR currently is broken for TLOG replicas too.

https://issues.apache.org/jira/browse/SOLR-12057?focusedCommentId=16391558=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16391558

Thanks



Re: Collection out of disk space, commit problem

2018-04-02 Thread Webster Homer
Erick,

Thanks. Normally our dev environment does not use CDCR, except when we're
doing active development on it. As it happens, the collection in question
was one we used to test CDCR. Or rather its configuration was, as the
specific collection has been deleted and created many times. Even though we
had CDCR turned off, it seems that buffers got set to "enabled", which seems
to be the default, and it is a really bad default!
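
(For reference, a hedged SolrJ sketch of disabling the buffer explicitly via
the CDCR API's DISABLEBUFFER action; the ZooKeeper host and collection name
are placeholders.)

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class DisableCdcrBuffer {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181/solr").build()) {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("action", "DISABLEBUFFER");
                // POST /cdcr?action=DISABLEBUFFER against the named collection
                System.out.println(client.request(
                        new GenericSolrRequest(SolrRequest.METHOD.POST, "/cdcr", params),
                        "mycollection"));
            }
        }
    }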

Because it's dev and we don't do cdcr there, I might not have thought to
look at that. So thank you for that

Web

On Mon, Apr 2, 2018 at 1:10 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Webster:
>
> Do you by any chance have CDCR configured? If so, insure that
> buffering is disabled. Buffering was intended to be enabled
> _temporarily_ during, say, a maintenance window and was conceived
> before the bootstrapping capability was added to CDCR.
>
> But I don't recall your other e-mails mention CDCR so I mention this
> on the off chance...
>
> Best,
> Erick
>
> On Mon, Apr 2, 2018 at 10:35 AM, Webster Homer <webster.ho...@sial.com>
> wrote:
> > Over the weekend one of our Dev solrcloud ran out of disk space.
> Examining
> > the problem we found one collection that had 2 months of uncommitted tlog
> > files. Unfortunately the solr logs rolled over and so I cannot see the
> > commit behavior during the last time data was loaded to it.
> >
> > The solrconfig.xml has both autoCommit and autoSoftCommit enabled.
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:6}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > solr.autoCommit.maxTime is set to 6
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> > </autoSoftCommit>
> > solr.autoSoftCommit.maxTime is set to 3000
> >
> > I found tlog files dated to Feb. 27. There is an automated job that
> reloads
> > the data once a week. It looks like no commits occurred from Feb 27
> onward.
> > Once the disk got full solr got very unhappy.
> >
> > This solrcloud has 2 shards and one replica per shard.
> >
> > We have a second development solrcloud which has the same collections
> with
> > identical configurations except that these collections have 2 shards and
> 2
> > replicas per shard. That one doesn't seem to have the tlog files
> > accumulating.
> >
> > I have long suspected that autoCommit is not reliable, and this seems to
> > indicate that it is not.
> >
> > We have several collections that share the same configuration, and have
> > similar ETL jobs loading them. This is the second time that this
> particular
> > collection has had this  problem.
> >
>



Collection out of disk space, commit problem

2018-04-02 Thread Webster Homer
Over the weekend one of our dev solrclouds ran out of disk space. Examining
the problem, we found one collection that had 2 months of uncommitted tlog
files. Unfortunately the Solr logs rolled over, so I cannot see the
commit behavior during the last time data was loaded to it.

The solrconfig.xml has both autoCommit and autoSoftCommit enabled:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:6}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

solr.autoCommit.maxTime is set to 6

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
</autoSoftCommit>

solr.autoSoftCommit.maxTime is set to 3000

I found tlog files dated to Feb. 27. There is an automated job that reloads
the data once a week. It looks like no commits occurred from Feb 27 onward.
Once the disk got full, Solr got very unhappy.

This solrcloud has 2 shards and one replica per shard.

We have a second development solrcloud which has the same collections with
identical configurations except that these collections have 2 shards and 2
replicas per shard. That one doesn't seem to have the tlog files
accumulating.

I have long suspected that autoCommit is not reliable, and this seems to
indicate that it is not.

We have several collections that share the same configuration, and have
similar ETL jobs loading them. This is the second time that this particular
collection has had this problem.



Re: Solr 7.2 cannot see all running nodes

2018-03-29 Thread Webster Homer
This Zookeeper ensemble doesn't look right.
>
> ./bin/solr start -cloud -s /usr/local/bin/solr-7.2.1/server/solr/node1/ -p
> 8983 -z zk0-esohad,zk1-esohad,zk3-esohad:2181 -m 8g


Shouldn't the zookeeper ensemble be specified as:
  zk0-esohad:2181,zk1-esohad:2181,zk3-esohad:2181

You should put the ZooKeeper port on each node in the comma-separated list.
I don't know if this is your problem, but I think your Solr nodes will only
be connecting to one ZooKeeper.

On Thu, Mar 29, 2018 at 10:56 AM, Walter Underwood 
wrote:

> I had that problem. Very annoying and we probably should require special
> flag to use localhost.
>
> We need to start solr like this:
>
> ./solr start -c -h `hostname`
>
> If anybody ever forgets, we get a 127.0.0.1 node that shows down in
> cluster status. No idea how to get rid of that.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Mar 29, 2018, at 7:46 AM, Shawn Heisey  wrote:
> >
> > On 3/29/2018 8:25 AM, Abhi Basu wrote:
> >> "Operation create caused
> >> exception:":"org.apache.solr.common.SolrException:org.
> apache.solr.common.SolrException:
> >> Cannot create collection ems-collection. Value of maxShardsPerNode is 1,
> >> and the number of nodes currently live or live and part of your
> >
> > I'm betting that all your nodes are registering themselves with the same
> name, and that name is probably either 127.0.0.1 or 127.1.1.0 -- an address
> on the loopback interface.
> >
> > Usually this problem (on an OS other than Windows, at least) is caused
> by an incorrect /etc/hosts file that maps your hostname to a  loopback
> address instead of a real address.
> >
> > You can override the value that SolrCloud uses to register itself into
> zookeeper so it doesn't depend on the OS configuration.  In solr.in.sh, I
> think this is the SOLR_HOST variable, which gets translated into -Dhost=XXX
> on the java commandline.  It can also be configured in solr.xml.
> >
> > Thanks,
> > Shawn
> >
>
>



Re: Why are cursor mark queries recommended over regular start, rows combination?

2018-03-26 Thread Webster Homer
Shawn,
Thanks. It's been a while now, but we did find issues with both cursorMark
and start/rows; the effect was much more obvious with cursorMark.
We were able to address this by switching to TLOG replicas, which give
consistent results. It's nice to know that the cursorMark problems were
related to relevancy retrieval order.

We found one major drawback with TLOG replicas: CDCR is broken for TLOG
replicas. There is a Jira on this, and it is being addressed. NRT may have a
use case, but I think that reproducible, correct results should trump
performance every time. We use Solr as a search engine; we almost always
want to retrieve results in order of relevancy.

I think that we will phase out the use of NRT replicas in favor of TLOG
replicas.
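
(A minimal SolrJ sketch of creating such a collection, assuming your SolrJ
version has the createCollection overload that takes NRT/TLOG/PULL replica
counts; the collection name, configset name, and ZooKeeper host are
placeholders.)

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateTlogCollection {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181/solr").build()) {
                // 2 shards; per shard: 0 NRT, 2 TLOG, 0 PULL replicas
                CollectionAdminRequest
                        .createCollection("mycollection", "myconfigset", 2, 0, 2, 0)
                        .process(client);
            }
        }
    }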

On Fri, Mar 23, 2018 at 7:04 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/23/2018 3:47 PM, Webster Homer wrote:
> > Just FYI I had a project recently where I tried to use cursorMark in
> > Solrcloud and solr 7.2.0 and it was very unreliable. It couldn't even
> > return consistent numberFound values. I posted about it in this forum.
> > Using the start and rows arguments in SolrQuery did work reliably so I
> > abandoned cursorMark as just too buggy
> >
> > I had originally wanted to try using streaming expressions, but they
> don't
> > return results ordered by relevancy, a major limitation for a search
> > engine, in my opinion.
>
> The problems that can affect cursorMark are also problems when using
> start/rows pagination.
>
> You've mentioned relevancy ordering, so I think this is what you're
> running into:
>
> Trying to use relevancy ranking on SolrCloud with NRT replicas can break
> pagination.  The problem happens both with cursorMark and start/rows.
> NRT replicas in a SolrCloud index can have different numbers of deleted
> documents.  Even though deleted documents do not appear in search
> results, they ARE still part of the index, and can affect scoring.
> Since SolrCloud load balances requests across replicas, page 1 may use
> different replicas than page 2, and end up with different scoring, which
> can affect the order of results and change which page number they end up
> on.  Using TLOG or PULL replicas (available since 7.0) usually fixes
> that problem, because different replicas are 100% identical with those
> replica types.
>
> Changing the index in the middle of trying to page through results can
> also cause issues with pagination.
>
> Thanks,
> Shawn
>
>



Re: solrj question

2018-03-26 Thread Webster Homer
You may say that the String in the constructor is "meant to be query
syntax", but nothing in the Javadoc says anything about the expected syntax.
Since there is also a method to set the query, it seemed reasonable to
expect that it would take the output of the toString method (or some other
serialization method):
https://lucene.apache.org/solr/6_6_0/solr-solrj/org/apache/solr/client/solrj/SolrQuery.html#SolrQuery-java.lang.String-

So how would a user play back logged queries? This seems like an important
use case. I can parse the toString output, but it seems like the constructor
should be able to take it. Failing that, I don't see any methods to
serialize and deserialize the query.
Being able to write the complete query to a log is important, but we also
want to be able to read the log back and submit the query to Solr. Being able
to play back the logs allows us to troubleshoot search issues on our site. It
also provides a way to create load tests.

Yes I can and am going to create this functionality, it's not that
complicated, but I don't think it's unreasonable to think that the existing
API should handle it.

Thanks,


On Fri, Mar 23, 2018 at 6:44 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/23/2018 3:24 PM, Webster Homer wrote:
> > I see this in the output:
> > Lexical error at line 1, column 1759.  Encountered:  after :
> > "/select?defType=edismax=0=25&...
> > It has basically the entire solr query which it obviously couldn't parse.
> >
> > solrQuery = new SolrQuery(log.getQuery());
>
> This isn't going to work.  The string in the constructor is expected to
> be query syntax -- so something like this:
>
> company:Google AND (city:"San Jose" OR state:WA)
>
> It has no idea what to do with a URL path and parameters.
>
> > Is there something I'm doing wrong, or is it that the SolrQuery class
> > cannot really take its toString output to make itself? Does it have a
> > different serialization method that could be used?
>
> I don't think there's any expectation that an object's toString() output
> can be used as input for anything.  This is the javadoc for
> Object.toString():
>
> https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#toString--
>
> The emphasis there is on human readability.  It is not intended for
> deserialization.  You *could* be looking at a toString() output like
> SolrQuery@1d44bcfa instead of something you can actually read.
>
> For the incomplete string shown in the error message you mentioned, you
> could do:
>
> SolrQuery q = new SolrQuery();
> q.setRequestHandler("/select");
> // The default handler is /select, so
> // the above is actually not necessary.
>
> q.set("defType", "edismax");
> q.set("start", "0");
> q.set("rows","25");
> // sugar method: q.setStart(0);
> // sugar method: q.setRows(25);
>
> Thanks,
> Shawn
>
>



Re: Why are cursor mark queries recommended over regular start, rows combination?

2018-03-23 Thread Webster Homer
Just FYI, I had a project recently where I tried to use cursorMark in
SolrCloud with Solr 7.2.0, and it was very unreliable. It couldn't even
return consistent numFound values. I posted about it in this forum.
Using the start and rows arguments in SolrQuery did work reliably, so I
abandoned cursorMark as just too buggy.

I had originally wanted to try using streaming expressions, but they don't
return results ordered by relevancy, a major limitation for a search
engine, in my opinion.

On Tue, Mar 20, 2018 at 11:47 AM, Jason Gerlowski 
wrote:

> > I can take a stab at this if someone can point me how to update the
> documentation.
>
>
> Hey SG,
>
> Please do, that'd be awesome.
>
> Thanks to some work done by Cassandra Targett a release or two ago,
> the Solr Ref Guide documentation now lives in the same codebase as the
> Solr/Lucene code itself, and the process for updating it is the same
> as suggesting a change to the code:
>
>
> 1. Open a JIRA issue detailing the improvement you'd like to make
> 2. Find the relevant ref guide pages to update, making the changes
> you're proposing.
> 3. Upload a patch to your JIRA and ask for someone to take a look.
> (You can tag me on this issue if you'd like).
>
>
> Some more specific links you might find helpful:
>
> - JIRA: https://issues.apache.org/jira/projects/SOLR/issues
> - Pointers on JIRA conventions, creating patches:
> https://wiki.apache.org/solr/HowToContribute
> - Root directory for the Solr Ref-Guide code:
> https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
> - https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html
>
> Best,
>
> Jason
>
> On Wed, Mar 14, 2018 at 2:53 PM, Erick Erickson 
> wrote:
> > I'm pretty sure you can use Streaming Expressions to get all the rows
> > back from a sharded collection without chewing up lots of memory.
> >
> > Try:
> > search(collection,
> >  q="id:*",
> >  fl="id",
> >  sort="id asc",
> > qt="/export")
> >
> > on a sharded SolrCloud installation, I believe you'll get all the rows
> back.
> >
> > NOTE:
> > 1> Some while ago you couldn't _stop_ the stream part way through.
> > down in the SolrJ world you could read from a stream for a while and
> > call close on it but that would just spin in the background until it
> > reached EOF. Search the JIRA list if you need (can't find the JIRA
> > right now, 6.6 IIRC is OK and, of course, 7.3).
> >
> > This shouldn't chew up memory since the streams are sorted, so what
> > you get in the response is the ordered set of tuples.
> >
> > Some of the join streams _do_ have to hold all the results in memory,
> > so look at the docs if you wind up using those.
> >
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 14, 2018 at 9:20 AM, S G  wrote:
> >> Thanks everybody. This is lot of good information.
> >> And we should try to update this in the documentation too to help users
> >> make the right choice.
> >> I can take a stab at this if someone can point me how to update the
> >> documentation.
> >>
> >> Thanks
> >> SG
> >>
> >>
> >> On Tue, Mar 13, 2018 at 2:04 PM, Chris Hostetter <
> hossman_luc...@fucit.org>
> >> wrote:
> >>
> >>>
> >>> : > 3) Lastly, it is not clear the role of export handler. It seems
> that
> >>> the
> >>> : > export handler would also have to do exactly the same kind of
> thing as
> >>> : > start=0 and rows=1000,000. And that again means bad performance.
> >>>
> >>> : <3> First, streaming requests can only return docValues="true"
> >>> : fields.Second, most streaming operations require sorting on something
> >>> : besides score. Within those constraints, streaming will be _much_
> >>> : faster and more efficient than cursorMark. Without tuning I saw 200K
> >>> : rows/second returned for streaming, the bottleneck will be the speed
> >>> : that the client can read from the network. First of all you only
> >>> : execute one query rather than one query per N rows. Second, in the
> >>> : cursorMark case, to return a document you and assuming that any field
> >>> : you return is docValues=false
> >>>
> >>> Just to clarify, there is big difference between the /export handler
> >>> and "streaming expressions"
> >>>
> >>> Unless something has changed drasticly in the past few releases, the
> >>> /export handler does *NOT* support exporting a full *collection* in
> solr
> >>> cloud -- it only operates on an individual core (aka: shard/replica).
> >>>
> >>> Streaming expressions is a feature that does work in Cloud mode, and
> can
> >>> make calls to the /export handler on a replica of each shard in order
> to
> >>> process the data of an entire collection -- but when doing so it has to
> >>> aggregate the *ALL* the results from every shard in memory on the
> >>> coordinating node -- meaning that (in addition to the docvalues caveat)
> >>> streaming expressions requires you to "spend" a lot of ram usage on one
> >>> node as 

solrj question

2018-03-23 Thread Webster Homer
I am working on a program to play back queries from a log file. It seemed
straightforward: the log has the Solr query written to it via the
SolrQuery.toString method, and the SolrQuery class has a constructor which
takes a string. It does instantiate a SolrQuery object; however, when I try
to actually use it in a search, Solr throws an error that it is not able to
parse the query.

I see this in the output:
Lexical error at line 1, column 1759.  Encountered:  after :
"/select?defType=edismax=0=25&...
It has basically the entire solr query which it obviously couldn't parse.

solrQuery = new SolrQuery(log.getQuery());

the log.getQuery method just returns the query that was written to the log
with the toString() method

Is there something I'm doing wrong, or is it that the SolrQuery class
cannot really take its toString output to make itself? Does it have a
different serialization method that could be used?

This would be very useful functionality.
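
(A minimal sketch of such a parser: my own code, not part of the SolrJ API.
It assumes the logged string looks like "/select?name=value&..." with
URL-encoded parameter values.)

    import java.net.URLDecoder;
    import org.apache.solr.client.solrj.SolrQuery;

    public class LoggedQueryParser {
        // rebuilds a SolrQuery from a logged toString() such as
        // "/select?defType=edismax&start=0&rows=25&q=..."
        public static SolrQuery fromLoggedQuery(String logged) throws Exception {
            SolrQuery q = new SolrQuery();
            int qm = logged.indexOf('?');
            if (qm < 0) return q;
            q.setRequestHandler(logged.substring(0, qm)); // e.g. "/select"
            for (String pair : logged.substring(qm + 1).split("&")) {
                int eq = pair.indexOf('=');
                String name = eq >= 0 ? pair.substring(0, eq) : pair;
                String value = eq >= 0
                        ? URLDecoder.decode(pair.substring(eq + 1), "UTF-8") : "";
                q.add(name, value); // SolrQuery extends ModifiableSolrParams
            }
            return q;
        }
    }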



Re: solrcloud Auto-commit doesn't seem reliable

2018-03-23 Thread Webster Homer
It's been a while since I had time to look further into this. I'll have to
go back through logs, which I need to get retrieved by an admin.

On Fri, Mar 23, 2018 at 8:45 AM, Amrit Sarkar <sarkaramr...@gmail.com>
wrote:

> Elaine,
>
> When you say commits are not working, do you mean the Solr logs are not
> printing "commit" messages, or that documents are not appearing when you
> search?
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Thu, Mar 22, 2018 at 4:05 AM, Elaine Cario <etca...@gmail.com> wrote:
>
> > I'm just catching up on reading solr emails, so forgive me for being late
> > to this dance
> >
> > I've just gone through a project to enable CDCR on our Solr, and I also
> > experienced a small period of time where the commits on the source server
> > just seemed to stop.  This was during a period of intense experimentation
> > where I was mucking around with configurations, turning CDCR on/off, etc.
> > At some point the commits stopped occurring, and it drove me nuts for a
> > couple of days - tried everything - restarting Solr, reloading, turned
> > buffering on, turned buffering off, etc.  I finally threw up my hands and
> > rebooted the server out of desperation (it was a physical Linux box).
> > Commits worked fine after that.  I don't know what caused the commits to
> > stop, and why re-booting (and not just restarting Solr) caused them to
> work
> > fine.
> >
> > Wondering if you ever found a solution to your situation?
> >
> >
> >
> > On Fri, Feb 16, 2018 at 2:44 PM, Webster Homer <webster.ho...@sial.com>
> > wrote:
> >
> > > I meant to get back to this sooner.
> > >
> > > When I say I issued a commit I do issue it as
> > collection/update?commit=true
> > >
> > > The soft commit interval is set to 3000, but I don't have a problem
> with
> > > soft commits ( I think). I was responding
> > >
> > > I am concerned that some hard commits don't seem to happen, but I think
> > > many commits do occur. I'd like suggestions on how to diagnose this,
> and
> > > perhaps an idea of where to look. Typically I believe that issues like
> > this
> > > are from our configuration.
> > >
> > > Our indexing job is pretty simple, we send blocks of JSON to
> > > /update/json. We have either re-index the whole collection,
> > or
> > > just apply updates. Typically we reindex the data once a week and
> delete
> > > any records that are older than the last full index. This does lead to
> a
> > > fair number of deleted records in the index especially if commits fail.
> > > Most of our collections are not large between 2 and 3 million records.
> > >
> > > The collections are hosted in google cloud
> > >
> > > On Mon, Feb 12, 2018 at 5:00 PM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > > > bq: But if 3 seconds is aggressive what would be a  good value for
> soft
> > > > commit?
> > > >
> > > > The usual answer is "as long as you can stand". All top-level caches
> > are
> > > > invalidated, autowarming is done etc. on each soft commit. That can
> be
> > a
> > > > lot of
> > > > work and if your users are comfortable with docs not showing up for,
> > > > say, 10 minutes
> > > > then use 10 minutes. As always "it depends" here, the point is not to
> > > > do unnecessary
> > > > work if possible.
> > > >
> > > > bq: If a commit doesn't happen how would there ever be an index merge
> > > > that would remove the deleted documents.
> > > >
> > > > Right, it wouldn't. It's a little more subtle than that though.
> > > > Segments on various
> > > > replicas will contain different docs, thus the term/doc statistics
> can
> > be
> > > > a bit
> > > > different between multiple replicas. None of the stats will change
> > > > until the commit
> > > > though. You might try turning no distributed doc/term stats though.
> > > >
> > > > Your comments about PULL or TLOG replicas are well taken. However,
> even
> > > > those
> > > > won't be absolutely in sync since they'll replicate from the m

Re: Solr SynonymGraphFilterFactory error on import

2018-03-06 Thread Webster Homer
You probably want to call solr.FlattenGraphFilterFactory after the call
to WordDelimiterGraphFilterFactory; I put it at the end of the index-time
chain.

Also, there is an issue with calling more than one graph filter in an
analysis chain, so you may need to remove one of them. I think there is a
Jira about that.

Personally, I prefer to do synonyms at query time.


On Mon, Mar 5, 2018 at 3:56 AM, damian.pawski  wrote:

> After upgrading to Solr 7.2, the import started to log errors for some
> documents.
>
> Field that returns errors:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="..."
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
>             preserveOriginal="1"/>
>     <!-- remaining index-time filters were stripped by the archive -->
>   </analyzer>
>   <analyzer type="query">
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <!-- remaining query-time filters were stripped by the archive -->
>   </analyzer>
> </fieldType>
> (the archive stripped the XML tags; class names are inferred from the
> surviving attributes and the rest of this thread)
> 
> During the import, the below error is returned for some of the records:
>
> org.apache.solr.common.SolrException: Exception writing document id X
> to
> the index; possible analysis error: startOffset must be non-negative, and
> endOffset must be >= startOffset, and offsets must not go backwards
> startOffset=2874,endOffset=2878,lastStartOffset=2879 for field 'X'
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(
> DirectUpdateHandler2.java:226)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(
> RunUpdateProcessorFactory.java:67)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(
> UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(
> DistributedUpdateProcessor.java:936)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(
> DistributedUpdateProcessor.java:616)
> at
> org.apache.solr.update.processor.LogUpdateProcessorFactory$
> LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> at org.apache.solr.handler.dataimport.SolrWriter.upload(
> SolrWriter.java:80)
>
>
> It is related to the:
> <filter class="solr.SynonymGraphFilterFactory" synonyms="..."
> ignoreCase="true" expand="true"/>
>
> If I remove this it works fine, previously we were using:
> <filter class="solr.SynonymFilterFactory" synonyms="..."
> ignoreCase="true" expand="true"/>
>
> and it was working fine, but SynonymFilterFactory is no longer
> supported on Solr 7.x; it has been replaced with
> SynonymGraphFilterFactory, and I have added FlattenGraphFilterFactory as
> suggested.
>
> I am not sure why Solr returns those errors.
>
> Thank you in advance for suggestions.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>



Re: Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-06 Thread Webster Homer
It seems that this is a bug in Solr:
https://issues.apache.org/jira/browse/SOLR-12057

Hopefully it can be addressed soon!

On Mon, Mar 5, 2018 at 4:14 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> I noticed that the cdcr action=queues returns different results for the
> target clouds. One target says that the updateLogSynchronizer is
> stopped, the other says started. Why? What does that mean? We don't
> explicitly set that anywhere.
>
>
> {"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize":
> 0,"tlogTotalCount": 0,"updateLogSynchronizer": "stopped"}
>
> and the other
>
> {"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize":
> 22254206389,"tlogTotalCount": 2,"updateLogSynchronizer": "started"}
>
> The source is as follows:
> {
> "responseHeader": {
> "status": 0,
> "QTime": 5
> },
> "queues": [
> "xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,xxx-mzk03.
> sial.com:2181/solr",
> [
> "b2b-catalog-material-180124T",
> [
> "queueSize",
> 0,
> "lastTimestamp",
> "2018-02-28T18:34:39.704Z"
> ]
> ],
> "yyy-mzk01.sial.com:2181,yyy-mzk02.sial.com:2181,yyy-mzk03.
> sial.com:2181/solr",
> [
> "b2b-catalog-material-180124T",
> [
> "queueSize",
> 0,
> "lastTimestamp",
> "2018-02-28T18:34:39.704Z"
> ]
> ]
> ],
> "tlogTotalSize": 1970848,
> "tlogTotalCount": 1,
> "updateLogSynchronizer": "stopped"
> }
>
>
> On Fri, Mar 2, 2018 at 5:05 PM, Webster Homer <webster.ho...@sial.com>
> wrote:
>
>> It looks like the data is getting to the target servers. I see tlog files
>> with the right timestamps. Looking at the timestamps on the documents in
>> the collection, none of the data appears to have been loaded.
>> In the solr.log I see lots of /cdcr messages: action=LASTPROCESSEDVERSION,
>> action=COLLECTIONCHECKPOINT, and action=SHARDCHECKPOINT.
>>
>> No errors.
>>
>> autoCommit is set to 6. I tried sending a commit explicitly; no
>> difference. CDCR is uploading data, but no new data appears in the
>> collection.
>>
>> On Fri, Mar 2, 2018 at 1:39 PM, Webster Homer <webster.ho...@sial.com>
>> wrote:
>>
>>> We have been having strange behavior with CDCR on Solr 7.2.0.
>>>
>>> We have a number of replicas which have identical schemas. We found that
>>> TLOG replicas give much more consistent search results.
>>>
>>> We created a collection using TLOG replicas in our QA clouds.
>>> We have a locally hosted solrcloud with 2 nodes, all our collections
>>> have 2 shards. We use CDCR to replicate the collections from this
>>> environment to 2 data centers hosted in Google cloud. This seems to work
>>> fairly well for our collections with NRT replicas. However the new TLOG
>>> collection has problems.
>>>
>>> The google cloud solrclusters have 4 nodes each (3 separate Zookeepers).
>>> 2 shards per collection with 2 replicas per shard.
>>>
>>> We never see data show up in the cloud collections, but we do see tlog
>>> files show up on the cloud servers. I can see that all of the servers have
>>> cdcr started, buffers are disabled.
>>> The cdcr source configuration is:
>>>
>>> "requestHandler":{"/cdcr":{
>>>   "name":"/cdcr",
>>>   "class":"solr.CdcrRequestHandler",
>>>   "replica":[
>>> {
>>>   "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,xx
>>> x-mzk03.sial.com:2181/solr",
>>>   "source":"b2b-catalog-material-180124T",
>>>   "target":"b2b-catalog-material-180124T"},
>>> {
>>>   "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
>>> -mzk03.sial.com:2181/solr",
>>>   "source":"b2b-catalog-material-180124T",
>>>   "target":"b2b-catalog-material-180124T"}],
>>>   "replicator":{
>>> "threadPoolSize":4,
>>> "schedule":500,
>>> "batchSize":250},
>>>   "updateLogSynchronizer":{"sche

Re: Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-05 Thread Webster Homer
I noticed that the cdcr action=queues returns different results for the
target clouds. One target says that the updateLogSynchronizer is stopped,
the other says started. Why? What does that mean? We don't explicitly set
that anywhere.


{"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize": 0,
"tlogTotalCount": 0,"updateLogSynchronizer": "stopped"}

and the other

{"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize":
22254206389,"tlogTotalCount": 2,"updateLogSynchronizer": "started"}

The source is as follows:
{
"responseHeader": {
"status": 0,
"QTime": 5
},
"queues": [
"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,
xxx-mzk03.sial.com:2181/solr",
[
"b2b-catalog-material-180124T",
[
"queueSize",
0,
"lastTimestamp",
"2018-02-28T18:34:39.704Z"
]
],
"yyy-mzk01.sial.com:2181,yyy-mzk02.sial.com:2181,
yyy-mzk03.sial.com:2181/solr",
[
"b2b-catalog-material-180124T",
[
"queueSize",
0,
"lastTimestamp",
"2018-02-28T18:34:39.704Z"
]
]
],
"tlogTotalSize": 1970848,
"tlogTotalCount": 1,
"updateLogSynchronizer": "stopped"
}
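
For reference, the queue stats above come from the CDCR API on each cluster,
via a request along these lines (host illustrative):

http://<host>:8983/solr/b2b-catalog-material-180124T/cdcr?action=QUEUES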


On Fri, Mar 2, 2018 at 5:05 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> It looks like the data is getting to the target servers. I see tlog files
> with the right timestamps. Looking at the timestamps on the documents in
> the collection, none of the data appears to have been loaded.
> In the solr.log I see lots of /cdcr messages: action=LASTPROCESSEDVERSION,
> action=COLLECTIONCHECKPOINT, and action=SHARDCHECKPOINT.
>
> No errors.
>
> autoCommit is set to 6. I tried sending a commit explicitly; no
> difference. CDCR is uploading data, but no new data appears in the
> collection.
>
> On Fri, Mar 2, 2018 at 1:39 PM, Webster Homer <webster.ho...@sial.com>
> wrote:
>
>> We have been having strange behavior with CDCR on Solr 7.2.0.
>>
>> We have a number of replicas which have identical schemas. We found that
>> TLOG replicas give much more consistent search results.
>>
>> We created a collection using TLOG replicas in our QA clouds.
>> We have a locally hosted solrcloud with 2 nodes, all our collections have
>> 2 shards. We use CDCR to replicate the collections from this environment to
>> 2 data centers hosted in Google cloud. This seems to work fairly well for
>> our collections with NRT replicas. However the new TLOG collection has
>> problems.
>>
>> The google cloud solrclusters have 4 nodes each (3 separate Zookeepers).
>> 2 shards per collection with 2 replicas per shard.
>>
>> We never see data show up in the cloud collections, but we do see tlog
>> files show up on the cloud servers. I can see that all of the servers have
>> cdcr started, buffers are disabled.
>> The cdcr source configuration is:
>>
>> "requestHandler":{"/cdcr":{
>>   "name":"/cdcr",
>>   "class":"solr.CdcrRequestHandler",
>>   "replica":[
>> {
>>   "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,xx
>> x-mzk03.sial.com:2181/solr",
>>   "source":"b2b-catalog-material-180124T",
>>   "target":"b2b-catalog-material-180124T"},
>> {
>>   "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
>> -mzk03.sial.com:2181/solr",
>>   "source":"b2b-catalog-material-180124T",
>>   "target":"b2b-catalog-material-180124T"}],
>>   "replicator":{
>> "threadPoolSize":4,
>> "schedule":500,
>> "batchSize":250},
>>   "updateLogSynchronizer":{"schedule":6
>>
>> The target configurations in the 2 clouds are the same:
>> "requestHandler":{"/cdcr":{ "name":"/cdcr", "class":
>> "solr.CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}
>>
>> All of our collections have a timestamp field, index_date. In the source
>> collection all the records have a date of 2/28/2018 but the target
>> collections have a latest date of 1/26/2018
>>
>> I don't see cdcr errors in the logs, but we use logstash to search them,
>> and we're still perfecting that.

Re: Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-02 Thread Webster Homer
It looks like the data is getting to the target servers. I see tlog files
with the right timestamps. Looking at the timestamps on the documents in
the collection, none of the data appears to have been loaded.
In the solr.log I see lots of /cdcr messages: action=LASTPROCESSEDVERSION,
action=COLLECTIONCHECKPOINT, and action=SHARDCHECKPOINT.

No errors.

autoCommit is set to 6. I tried sending a commit explicitly; no
difference. CDCR is uploading data, but no new data appears in the
collection.

On Fri, Mar 2, 2018 at 1:39 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> We have been having strange behavior with CDCR on Solr 7.2.0.
>
> We have a number of replicas which have identical schemas. We found that
> TLOG replicas give much more consistent search results.
>
> We created a collection using TLOG replicas in our QA clouds.
> We have a locally hosted solrcloud with 2 nodes, all our collections have
> 2 shards. We use CDCR to replicate the collections from this environment to
> 2 data centers hosted in Google cloud. This seems to work fairly well for
> our collections with NRT replicas. However the new TLOG collection has
> problems.
>
> The google cloud solrclusters have 4 nodes each (3 separate Zookeepers). 2
> shards per collection with 2 replicas per shard.
>
> We never see data show up in the cloud collections, but we do see tlog
> files show up on the cloud servers. I can see that all of the servers have
> cdcr started, buffers are disabled.
> The cdcr source configuration is:
>
> "requestHandler":{"/cdcr":{
>   "name":"/cdcr",
>   "class":"solr.CdcrRequestHandler",
>   "replica":[
> {
>   "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,x
> xx-mzk03.sial.com:2181/solr",
>   "source":"b2b-catalog-material-180124T",
>   "target":"b2b-catalog-material-180124T"},
> {
>   "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
> -mzk03.sial.com:2181/solr",
>   "source":"b2b-catalog-material-180124T",
>   "target":"b2b-catalog-material-180124T"}],
>   "replicator":{
> "threadPoolSize":4,
> "schedule":500,
> "batchSize":250},
>   "updateLogSynchronizer":{"schedule":6
>
> The target configurations in the 2 clouds are the same:
> "requestHandler":{"/cdcr":{ "name":"/cdcr", "class":"solr.
> CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}
>
> All of our collections have a timestamp field, index_date. In the source
> collection all the records have a date of 2/28/2018 but the target
> collections have a latest date of 1/26/2018
>
> I don't see cdcr errors in the logs, but we use logstash to search them,
> and we're still perfecting that.
>
> We have a number of similar collections that behave correctly. This is the
> only collection that is a TLOG collection. It appears that CDCR doesn't
> support TLOG collections.
>
> This begins to look like a bug
>
>



Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-02 Thread Webster Homer
We have been having strange behavior with CDCR on Solr 7.2.0.

We have a number of replicas which have identical schemas. We found that
TLOG replicas give much more consistent search results.

We created a collection using TLOG replicas in our QA clouds.
We have a locally hosted solrcloud with 2 nodes, all our collections have 2
shards. We use CDCR to replicate the collections from this environment to 2
data centers hosted in Google cloud. This seems to work fairly well for our
collections with NRT replicas. However the new TLOG collection has problems.

The Google Cloud Solr clusters have 4 nodes each (3 separate ZooKeepers),
with 2 shards per collection and 2 replicas per shard.

We never see data show up in the cloud collections, but we do see tlog
files show up on the cloud servers. I can see that all of the servers have
cdcr started, buffers are disabled.
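
For context, starting CDCR and disabling the buffer are done through the
CDCR API, e.g. (hosts illustrative):

http://<source-host>:8983/solr/b2b-catalog-material-180124T/cdcr?action=START
http://<target-host>:8983/solr/b2b-catalog-material-180124T/cdcr?action=DISABLEBUFFER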
The cdcr source configuration is:

"requestHandler":{"/cdcr":{
  "name":"/cdcr",
  "class":"solr.CdcrRequestHandler",
  "replica":[
{
  "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,
xxx-mzk03.sial.com:2181/solr",
  "source":"b2b-catalog-material-180124T",
  "target":"b2b-catalog-material-180124T"},
{
  "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
-mzk03.sial.com:2181/solr",
  "source":"b2b-catalog-material-180124T",
  "target":"b2b-catalog-material-180124T"}],
  "replicator":{
"threadPoolSize":4,
"schedule":500,
"batchSize":250},
  "updateLogSynchronizer":{"schedule":6

The target configurations in the 2 clouds are the same:
"requestHandler":{"/cdcr":{ "name":"/cdcr", "class":
"solr.CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}

All of our collections have a timestamp field, index_date. In the source
collection all the records have a date of 2/28/2018 but the target
collections have a latest date of 1/26/2018

I don't see cdcr errors in the logs, but we use logstash to search them,
and we're still perfecting that.

We have a number of similar collections that behave correctly. This is the
only collection that is a TLOG collection. It appears that CDCR doesn't
support TLOG collections.

This begins to look like a bug



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
Becky,
This should have been its own question.

SolrCloud is different from standalone Solr: the configurations live in
ZooKeeper and the index is created under SOLR_HOME. You might want to
rethink your solution. What problem are you trying to solve with that
layout? Would it be solved by creating the Parent1 collection with 2 shards?

On Fri, Mar 2, 2018 at 10:56 AM, Becky Bonner <bbon...@teleflora.com> wrote:

> We are trying to set up one Solr server for several applications, each with
> a different collection.  Is there a way to have 2 collections under
> one folder and the URL be something like this:
> http://mysolrinstance.com/solr/myParent1/collection1
> http://mysolrinstance.com/solr/myParent1/collection2
> http://mysolrinstance.com/solr/myParent2
> http://mysolrinstance.com/solr/myParent3
>
>
> We organized it like that under the solr folder but the URLs to the
> collections do not include the "myParent1".
> This makes the names of my collections more confusing because you can't
> tell what application they belong to.  It wasn’t a problem until we had 2
> collections for one of the apps.
>
>
>
>
> -Original Message-
> From: Webster Homer [mailto:webster.ho...@sial.com]
> Sent: Friday, March 2, 2018 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: NRT replicas miss hits and return duplicate hits when paging
> solrcloud searches
>
> I am trying to test if enabling stats cache as suggested by Eric would
> also address this issue. I added this to my solrconfig.xml:
>
> <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>
>
> I executed queries and saw no differences. Then I re-indexed the data,
> again I saw no differences in behavior.
> Then I found SOLR-10952. It seems we need to disable the
> queryResultCache for the global stats cache to work.
> I've never disabled this before. I edited the solrconfig.xml, setting the
> sizes to 0. I'm not sure if this is how to disable the cache or not.
>
> <queryResultCache class="solr.LRUCache"
>                   size="0"
>                   initialSize="0"
>                   autowarmCount="0"/>
>
> I also set this:
>0
>
> Then uploaded the solrconfig.xml and reloaded the collection. It still made
> no difference. Do I need to restart solr for this to take effect?
> When I look in the admin console, the queryResultCache still seems to have
> the old settings.
>
> Does enabling statsCache require a solr restart too? Does enabling the
> statsCache require that the data be re-indexed? The documentation on this
> feature is skimpy.
> Is there a way to see if it's enabled in the Admin Console?
>
> On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer <webster.ho...@sial.com>
> wrote:
>
> > Emir,
> >
> > Using tlog replica types addresses my immediate problem.
> >
> > The secondary issue is that all of our searches show inconsistent
> results.
> > These are all normal paging use cases. We regularly test our
> > relevancy, and these differences creates confusion in the testers.
> > Moreover, we are migrating from Endeca which has very consistent results.
> >
> > I'm hoping that using the global stats cache will make the other
> > searches more stable. I think we will eventually move to favoring tlog
> > replicas. We have a couple of collections where NRT makes sense, but
> > those collections don't need to return data in relevancy order. I
> > think NRT should be considered a niche use case for a search engine,
> > tlog and pull replicas are a much better fit for a search engine
> > (imho)
> >
> > On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Webster,
> >> Since you are returning all hits, returning the last page is almost
> >> as heavy for Solr as returning all documents. Maybe you should
> >> consider just returning one large page and completely avoid this issue.
> >> I agree with you that this should be handled by Solr. ES solved this
> >> issue with “preference” search parameter where you can set session id
> >> as preference and it will stick to the same shards. I guess you could
> >> try similar thing on your own but that would require you to send list
> >> of shards as parameter for your search and balance it for different
> sessions.
> >>
> >> HTH,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >> > On 26 Feb 2018, at 21:03, Webster Homer <webster.ho...@sial.com>
> wrote:
> >> &

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
Thanks Shawn.

Commenting it out works to remove it. If I change the values, e.g. change
the 512 to 0, it does require a restart to take effect.

Tested using statsCache set to
org.apache.solr.search.stats.ExactSharedStatsCache,
with the queryResultCache disabled, and I still see the problem with NRT
replicas. So using TLOG replicas still looks like the best workaround for
the NRT issue.

On Fri, Mar 2, 2018 at 10:44 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/2/2018 9:28 AM, Webster Homer wrote:
>
>> I've never disabled this before. I edited the solrconfig.xml setting the
>> sizes to 0. I'm not sure if this is how to disable the cache or not.
>>
>> <queryResultCache class="solr.LRUCache"
>>                   size="0"
>>                   initialSize="0"
>>                   autowarmCount="0"/>
>>
>
> To completely disable a cache, either comment it out or remove it from the
> config.  I do not know whether setting the size to 0 will actually work or
> not.
>
> Does enabling statsCache require a solr restart too? Does enabling the
>> statsCache require that the data be re-indexed? The documentation on this
>> feature is skimpy.
>>
>
> Most changes to solrconfig.xml just require a reload.  I would expect any
> cache configurations to fall into that category.
>
> Is there a way to see if it's enabled in the Admin Console?
>>
>
> I don't know anything about the statsCache.  If you don't see it in the
> Plugins/Stats tab, that's probably something that was forgotten, and needs
> to be added to the admin UI.
>
> Thanks,
> Shawn
>
>



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
I am trying to test if enabling stats cache as suggested by Eric would also
address this issue. I added this to my solrconfig.xml:

<statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>

I executed queries and saw no differences. Then I re-indexed the data,
again I saw no differences in behavior.
Then I found SOLR-10952. It seems we need to disable the
queryResultCache for the global stats cache to work.
I've never disabled this before. I edited the solrconfig.xml, setting the
sizes to 0. I'm not sure if this is how to disable the cache or not.

<queryResultCache class="solr.LRUCache"
                  size="0"
                  initialSize="0"
                  autowarmCount="0"/>
I also set this:
   0

Then uploaded the solrconfig.xml and reloaded the collection. It still made
no difference. Do I need to restart solr for this to take effect?
When I look in the admin console, the queryResultCache still seems to have
the old settings.

Does enabling statsCache require a solr restart too? Does enabling the
statsCache require that the data be re-indexed? The documentation on this
feature is skimpy.
Is there a way to see if it's enabled in the Admin Console?

On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer <webster.ho...@sial.com>
wrote:

> Emir,
>
> Using tlog replica types addresses my immediate problem.
>
> The secondary issue is that all of our searches show inconsistent results.
> These are all normal paging use cases. We regularly test our relevancy, and
> these differences creates confusion in the testers. Moreover, we are
> migrating from Endeca which has very consistent results.
>
> I'm hoping that using the global stats cache will make the other searches
> more stable. I think we will eventually move to favoring tlog replicas. We
> have a couple of collections where NRT makes sense, but those collections
> don't need to return data in relevancy order. I think NRT should be
> considered a niche use case for a search engine, tlog and pull replicas are
> a much better fit for a search engine (imho)
>
> On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Webster,
>> Since you are returning all hits, returning the last page is almost as
>> heavy for Solr as returning all documents. Maybe you should consider just
>> returning one large page and completely avoid this issue.
>> I agree with you that this should be handled by Solr. ES solved this
>> issue with “preference” search parameter where you can set session id as
>> preference and it will stick to the same shards. I guess you could try
>> similar thing on your own but that would require you to send list of shards
>> as parameter for your search and balance it for different sessions.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 26 Feb 2018, at 21:03, Webster Homer <webster.ho...@sial.com> wrote:
>> >
>> > Erick,
>> >
>> > No we didn't look at that. I will add it to the list. We have  not seen
>> > performance issues with solr. We have much slower technologies in our
>> > stack. This project was to replace a system that was too slow.
>> >
>> > Thank you, I will look into it
>> >
>> > Webster
>> >
>> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Did you try enabling distributed IDF (statsCache)? See:
>> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
>> >>
>> >> It's may not totally fix the issue, but it's worth trying. It does
>> >> come with a performance penalty of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <
>> webster.ho...@sial.com>
>> >> wrote:
>> >>> Thanks Shawn, I had settled on this as a solution.
>> >>>
>> >>> All our use cases for Solr is to return results in order of relevancy
>> to
>> >>> the query, so having a deterministic sort would defeat that purpose.
>> >> Since
>> >>> we wanted to be able to return all the results for a query, I
>> originally
>> >>> looked at using the Streaming API, but that doesn't support returning
>> >>> results sorted by relevancy
>> >>>
>> >>> I disagree with you about NRT replicas though. They may function as
>> >>> designed, but since they cannot guarantee consistent results their
>> design
>> >>> is buggy, at least it is for a search engine.
>> >>>
>> >>>
>> >>> On Mon, 

Re: 7.2.1 ExactStatsCache seems no longer functioning

2018-03-02 Thread Webster Homer
Your problem seems a lot like an issue I see with Near Real Time (NRT)
replicas. I posted about it in this forum. I was told that a possible
solution was to use the Global Stats feature. I am looking at testing that
now.
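
For reference, enabling the feature is a one-line solrconfig.xml entry along
these lines (class name per the Solr distributed-requests documentation):

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>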

Have you tried using Tlog replicas? That fixed my issues with relevancy
differences between queries.

On Mon, Feb 19, 2018 at 9:41 AM, Markus Jelsma 
wrote:

> Hello,
>
> We're on 7.2.1 and rely on ExactStatsCache to work around the problem of
> not all nodes sharing the same maxDoc within a shard. But it doesn't work
> anymore!
>
> I've looked things up in Jira, but nothing so far. SOLR-10952 also isn't
> the cause: even with queryResultCache disabled, document scores don't
> match up, and the ordering of search results is not constant for the same
> query in consecutive searches.
>
> We see this on a local machine, just with default similarity and classic
> query parser.
>
> Any hints on what to do now?
>
> Many thanks,
> Markus
>



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-02-27 Thread Webster Homer
Emir,

Using tlog replica types addresses my immediate problem.

The secondary issue is that all of our searches show inconsistent results.
These are all normal paging use cases. We regularly test our relevancy, and
these differences create confusion among the testers. Moreover, we are
migrating from Endeca, which has very consistent results.

I'm hoping that using the global stats cache will make the other searches
more stable. I think we will eventually move to favoring tlog replicas. We
have a couple of collections where NRT makes sense, but those collections
don't need to return data in relevancy order. I think NRT should be
considered a niche use case for a search engine, tlog and pull replicas are
a much better fit for a search engine (imho)

On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Webster,
> Since you are returning all hits, returning the last page is almost as
> heavy for Solr as returning all documents. Maybe you should consider just
> returning one large page and completely avoid this issue.
> I agree with you that this should be handled by Solr. ES solved this issue
> with “preference” search parameter where you can set session id as
> preference and it will stick to the same shards. I guess you could try
> similar thing on your own but that would require you to send list of shards
> as parameter for your search and balance it for different sessions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 26 Feb 2018, at 21:03, Webster Homer <webster.ho...@sial.com> wrote:
> >
> > Erick,
> >
> > No we didn't look at that. I will add it to the list. We have  not seen
> > performance issues with solr. We have much slower technologies in our
> > stack. This project was to replace a system that was too slow.
> >
> > Thank you, I will look into it
> >
> > Webster
> >
> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> >> Did you try enabling distributed IDF (statsCache)? See:
> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
> >>
> >> It's may not totally fix the issue, but it's worth trying. It does
> >> come with a performance penalty of course.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <webster.ho...@sial.com
> >
> >> wrote:
> >>> Thanks Shawn, I had settled on this as a solution.
> >>>
> >>> All our use cases for Solr is to return results in order of relevancy
> to
> >>> the query, so having a deterministic sort would defeat that purpose.
> >> Since
> >>> we wanted to be able to return all the results for a query, I
> originally
> >>> looked at using the Streaming API, but that doesn't support returning
> >>> results sorted by relevancy
> >>>
> >>> I disagree with you about NRT replicas though. They may function as
> >>> designed, but since they cannot guarantee consistent results their
> design
> >>> is buggy, at least it is for a search engine.
> >>>
> >>>
> >>> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <apa...@elyograg.org>
> >> wrote:
> >>>
> >>>> On 2/26/2018 10:26 AM, Webster Homer wrote:
> >>>>> We need the results by relevancy so the application sorts the results
> >> by
> >>>>> score desc, and the unique id ascending as the tie breaker
> >>>>
> >>>> This is the reason for the discrepancy, and why the different replica
> >>>> types don't have the same issue.
> >>>>
> >>>> Each NRT replica can have different deleted documents than the others,
> >>>> just due to the way that NRT replicas work.  Deleted documents affect
> >>>> relevancy scoring.  When one replica has say 5000 deleted documents
> and
> >>>> another has 200, or has 5000 but they're different docs, a relevancy
> >>>> sort can end up different.  So when Solr goes to one replica for page
> 1
> >>>> and another for page 2 (which is expected due to SolrCloud's internal
> >>>> load balancing), you may end up with duplicate documents or documents
> >>>> missing.  Because deleted documents are not counted or returned,
> >>>> numFound will be consistent, as long as the index doesn't change
> between
> >>>

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-02-26 Thread Webster Homer
Erick,

No, we didn't look at that. I will add it to the list. We have not seen
performance issues with Solr. We have much slower technologies in our
stack. This project was to replace a system that was too slow.

Thank you, I will look into it

Webster

On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Did you try enabling distributed IDF (statsCache)? See:
> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
>
> It's may not totally fix the issue, but it's worth trying. It does
> come with a performance penalty of course.
>
> Best,
> Erick
>
> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <webster.ho...@sial.com>
> wrote:
> > Thanks Shawn, I had settled on this as a solution.
> >
> > All our use cases for Solr is to return results in order of relevancy to
> > the query, so having a deterministic sort would defeat that purpose.
> Since
> > we wanted to be able to return all the results for a query, I originally
> > looked at using the Streaming API, but that doesn't support returning
> > results sorted by relevancy
> >
> > I disagree with you about NRT replicas though. They may function as
> > designed, but since they cannot guarantee consistent results their design
> > is buggy, at least it is for a search engine.
> >
> >
> > On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <apa...@elyograg.org>
> wrote:
> >
> >> On 2/26/2018 10:26 AM, Webster Homer wrote:
> >> > We need the results by relevancy so the application sorts the results
> by
> >> > score desc, and the unique id ascending as the tie breaker
> >>
> >> This is the reason for the discrepancy, and why the different replica
> >> types don't have the same issue.
> >>
> >> Each NRT replica can have different deleted documents than the others,
> >> just due to the way that NRT replicas work.  Deleted documents affect
> >> relevancy scoring.  When one replica has say 5000 deleted documents and
> >> another has 200, or has 5000 but they're different docs, a relevancy
> >> sort can end up different.  So when Solr goes to one replica for page 1
> >> and another for page 2 (which is expected due to SolrCloud's internal
> >> load balancing), you may end up with duplicate documents or documents
> >> missing.  Because deleted documents are not counted or returned,
> >> numFound will be consistent, as long as the index doesn't change between
> >> the queries for pages.
> >>
> >> If you were using a deterministic sort rather than relevancy, this
> >> wouldn't be happening, because deleted documents have no influence on
> >> that kind of sort.
> >>
> >> With TLOG or PULL, the replicas are absolutely identical, so there is no
> >> difference, unless the index is changing as you page through the
> results.
> >>
> >> I think changing replica types is the only solution here.  NRT replicas
> >> are working as they were designed -- there's no bug, even though
> >> problems like this do sometimes turn up.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
> >


Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-02-26 Thread Webster Homer
Thanks Shawn, I had settled on this as a solution.

All our use cases for Solr are to return results in order of relevancy to
the query, so having a deterministic sort would defeat that purpose. Since
we wanted to be able to return all the results for a query, I originally
looked at using the Streaming API, but that doesn't support returning
results sorted by relevancy.

I disagree with you about NRT replicas though. They may function as
designed, but since they cannot guarantee consistent results their design
is buggy, at least it is for a search engine.


On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/26/2018 10:26 AM, Webster Homer wrote:
> > We need the results by relevancy so the application sorts the results by
> > score desc, and the unique id ascending as the tie breaker
>
> This is the reason for the discrepancy, and why the different replica
> types don't have the same issue.
>
> Each NRT replica can have different deleted documents than the others,
> just due to the way that NRT replicas work.  Deleted documents affect
> relevancy scoring.  When one replica has say 5000 deleted documents and
> another has 200, or has 5000 but they're different docs, a relevancy
> sort can end up different.  So when Solr goes to one replica for page 1
> and another for page 2 (which is expected due to SolrCloud's internal
> load balancing), you may end up with duplicate documents or documents
> missing.  Because deleted documents are not counted or returned,
> numFound will be consistent, as long as the index doesn't change between
> the queries for pages.
>
> If you were using a deterministic sort rather than relevancy, this
> wouldn't be happening, because deleted documents have no influence on
> that kind of sort.
>
> With TLOG or PULL, the replicas are absolutely identical, so there is no
> difference, unless the index is changing as you page through the results.
>
> I think changing replica types is the only solution here.  NRT replicas
> are working as they were designed -- there's no bug, even though
> problems like this do sometimes turn up.
>
> Thanks,
> Shawn
>
>



NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-02-26 Thread Webster Homer
I have an application which implements several different searches against a
solrcloud collection.
We are using Solr 7.2 and Solr 6.1

The collection b2b-catalog-material is created with the default Near Real
Time (NRT) replicas. The collection has 2 shards each with 2 replicas.

The application launches a search and pages through all of the results up
to a maximum, typically about 1000 results, and returns them to the caller.
It pages by the standard method of incrementing the start parameter by
rows until we retrieve the maximum we need or return all the hits.
Typically we set rows to 200.

If a search matches 2000 results, the app will call solr 10 times to
retrieve 200 results per call. This is configurable.

The documents in the collection are product skus but the searchable fields
are mostly product oriented, and we have between 2 and 500 skus per
product. There are about 2,463,442 documents in the collection.

We need the results by relevancy so the application sorts the results by
score desc, and the unique id ascending as the tie breaker
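
Concretely, the paging is a sequence of plain /select requests of this shape
(collection name, query, and unique-key field here are illustrative):

/solr/b2b-catalog-material/select?q=<query>&sort=score+desc,id+asc&rows=200&start=0
/solr/b2b-catalog-material/select?q=<query>&sort=score+desc,id+asc&rows=200&start=200
/solr/b2b-catalog-material/select?q=<query>&sort=score+desc,id+asc&rows=200&start=400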

We discovered that the application often returns duplicate records from a
search. I believe that this is due to the NRT replicas having slightly
different index data due to commit orders and different numbers of deleted
records. For many queries we see about 20 to 30 results duplicated. The
results from solr are sent to another system to retrieve pricing
information. This system is not yet fully populated so that out of a 1000
results we may return 350 or so. The problem is each time we called the
application with the same query we would see different results. I saw it
vary between 351 which was correct to 341 and 346. I believe that for each
"duplicate" found by the application, there is also a result that was
missed.

The numFound from the Solr query response does not vary.

This variability in the same query is unacceptable to the business. For
quite a while I thought it was in our code, or in the call to the other
system. However, we now know that it is Solr.

I created a simple test driver that calls solr and pages through the
results. It maintains a set of all the ids that we've encountered and it
will regularly find 20 or more duplicates depending upon the query.

Some observations:
The unique id is unique, it's used in other systems for this data.

If we do an optimize on the collection, the duplicates won't show up until
the next data load

I created a second collection that used the TLOG replica type, and we don't
see the problem even with repeated data loads.
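
For reference, a TLOG collection can be created through the Collections API
with something like this (names illustrative; tlogReplicas is the parameter
that selects the replica type):

/solr/admin/collections?action=CREATE&name=b2b-catalog-material-tlog&numShards=2&tlogReplicas=2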


The data in the collection is kept up to date by an ETL process that
completely reindexes the data once a week. That is how it will work in
production anyway; we reload it more frequently as we're testing the app.

My boss has lost all confidence in Solrcloud. It seems that it cannot find
the same data in subsequent searches. Returning consistent results from a
search is job #1 and solrcloud is failing at that.

It looks like using TLOG replicas seems to address the issue, it appears
that you cannot trust NRT replicas to return consistent results.

The scores for many searches are fairly flat with not a lot of variability
in them, which means that a small difference in a score can change the
order of results.

We found that upgrading to 7.2 on our production servers and using tlog
replicas worked. The alternative of optimizing after each load, while a
hack, does seem to address the problem too; however, determining when to
optimize would be difficult to automate, since we use CDCR to replicate the
data to a cloud environment and it's not easy to determine when the remote
collections are fully loaded.

The only other thing I can think of is tweaking the lucene merge algorithm
to better remove deleted documents from the index

Have others encountered this kind of inconsistency in solrcloud? I cannot
believe that we're the first to have encountered it.

How have you addressed it?

We have settled on using TLOG replicas as they provide consistent results
and don't return duplicate hits, which also means that there are no missing
hits.

Unless you need real time indexing, NRT replicas should be avoided in favor
of TLOG replicas or a mix of TLOG and PULL replicas.

I wrote a test program and verified that we actually have this issue with
all or our collections. We hadn't noticed it before because most of the
time the missing/duplicate results were 5 to 10 pages into the result set.


Re: solrcloud Auto-commit doesn't seem reliable

2018-02-16 Thread Webster Homer
I meant to get back to this sooner.

When I say I issued a commit, I do issue it as <collection>/update?commit=true.

The soft commit interval is set to 3000, but I don't have a problem with
soft commits (I think). I was responding

I am concerned that some hard commits don't seem to happen, but I think
many commits do occur. I'd like suggestions on how to diagnose this, and
perhaps an idea of where to look. Typically I believe that issues like this
are from our configuration.

Our indexing job is pretty simple: we send blocks of JSON to
<collection>/update/json. We either re-index the whole collection or
just apply updates. Typically we reindex the data once a week and delete
any records that are older than the last full index. This does lead to a
fair number of deleted records in the index, especially if commits fail.
Most of our collections are not large, between 2 and 3 million records.
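
That age-based cleanup is a delete-by-query against our timestamp field,
something along these lines (cutoff date illustrative), posted to
<collection>/update/json:

{"delete": {"query": "index_date:[* TO 2018-02-09T00:00:00Z]"}}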

The collections are hosted in google cloud

On Mon, Feb 12, 2018 at 5:00 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: But if 3 seconds is aggressive what would be a  good value for soft
> commit?
>
> The usual answer is "as long as you can stand". All top-level caches are
> invalidated, autowarming is done etc. on each soft commit. That can be a
> lot of
> work and if your users are comfortable with docs not showing up for,
> say, 10 minutes
> then use 10 minutes. As always "it depends" here, the point is not to
> do unnecessary
> work if possible.
>
> bq: If a commit doesn't happen how would there ever be an index merge
> that would remove the deleted documents.
>
> Right, it wouldn't. It's a little more subtle than that though.
> Segments on various
> replicas will contain different docs, thus the term/doc statistics can be
> a bit
> different between multiple replicas. None of the stats will change
> until the commit
> though. You might try turning on distributed doc/term stats, though.
>
> Your comments about PULL or TLOG replicas are well taken. However, even
> those
> won't be absolutely in sync since they'll replicate from the master at
> slightly
> different times and _could_ get slightly different segments _if_
> there's indexing
> going on. But let's say you stop indexing. After the next poll
> interval all the replicas
> will have identical characteristics and will score the docs the same.
>
> I don't have any signifiant wisdom to offer here, except this is really the
> first time I've heard of this behavior. About all I can imagine is
> that _somehow_
> the soft commit interval is -1. When you say you "issue a commit" I'm
> assuming
> it's via collection/update?commit=true or some such which issues a
> hard
> commit with openSearcher=true. And it's on a _collection_ basis, right?
>
> Sorry I can't be more help
> Erick
>
>
>
>
> On Mon, Feb 12, 2018 at 10:44 AM, Webster Homer <webster.ho...@sial.com>
> wrote:
> > Erick, I am aware of the CDCR buffering problem causing tlog retention,
> we
> > always turn buffering off in our cdcr configurations.
> >
> > My post was precipitated by seeing that we had uncommitted data in
> > collections > 24 hours after it was loaded. The collections I was looking
> > at are in our development environment, where we do not use CDCR. However
> > I'm pretty sure that I've seen situations in production where commits
> were
> > also long overdue.
> >
> > the "autoSoftcommit" was a typo. The soft commit logic seems to be fine,
> I
> > don't see an issue with data visibility. But if 3 seconds is aggressive
> > what would be a  good value for soft commit? We have a couple of
> > collections that are updated every minute although most of them are
> updated
> > much less frequently.
> >
> > My reason for raising this commit issue is that we see problems with the
> > relevancy of solrcloud searches, and the NRT replica type. Sometimes the
> > results flip where the best hit varies by what replica serviced the
> search.
> > This is hard to explain to management. Doing an optimized does address
> the
> > problem for a while. I try to avoid optimizing for the reasons you and
> Sean
> > list. If a commit doesn't happen how would there ever be an index merge
> > that would remove the deleted documents.
> >
> > The problem with deletes and relevancy don't seem to occur when we use
> TLOG
> > replicas, probably because they don't do their own indexing but get
> copies
> > from their leader. We are testing them now eventually we may abandon the
> > use of NRT replicas for most of our collections.
> >
> > I am quite concerned about this commit issue. What kinds of things would
> > 

Collections Fail to load after Solr Restart

2018-02-16 Thread Webster Homer
Yesterday I restarted a development solrcloud. After the cloud restarted, 2
collections failed to come back.

I see this in the log:
2018-02-16 15:31:16.684 ERROR
(coreLoadExecutor-6-thread-1-processing-n:ae1c-ecomdev-msc02:8983_solr) [
 ] o.a.s.c.CachingDirectoryFactory Error closing
directory:org.apache.solr.common.SolrException: Timeout waiting for all
directory ref counts to be released - gave up waiting on

Null Pointer exception after upgrading lucene index from 6.1 to 7.2

2018-02-12 Thread Webster Homer
We ran the org.apache.lucene.index.IndexUpgrader as part of upgrading from
6.1 to 7.2.0
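
For context, the upgrader is typically invoked per core index directory,
roughly like this (jar names and paths illustrative):

java -cp lucene-core-7.2.0.jar:lucene-backward-codecs-7.2.0.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits /var/solr/data/<core>/data/index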

After the upgrade, one of our collections threw a NullPointerException on a
query of *:*

We didn't observe errors in the logs. All of our other collections appear
to be fine.

Re-indexing the collection seems to have fixed the issue.

So be very careful with the upgrade tool; it is not robust.



Re: solrcloud Auto-commit doesn't seem reliable

2018-02-12 Thread Webster Homer
Erick, I am aware of the CDCR buffering problem causing tlog retention, we
always turn buffering off in our cdcr configurations.

My post was precipitated by seeing that we had uncommitted data in
collections > 24 hours after it was loaded. The collections I was looking
at are in our development environment, where we do not use CDCR. However
I'm pretty sure that I've seen situations in production where commits were
also long overdue.

the "autoSoftcommit" was a typo. The soft commit logic seems to be fine, I
don't see an issue with data visibility. But if 3 seconds is aggressive
what would be a  good value for soft commit? We have a couple of
collections that are updated every minute although most of them are updated
much less frequently.

My reason for raising this commit issue is that we see problems with the
relevancy of solrcloud searches, and the NRT replica type. Sometimes the
results flip, where the best hit varies by which replica serviced the search.
This is hard to explain to management. Doing an optimize does address the
problem for a while. I try to avoid optimizing for the reasons you and Sean
list. If a commit doesn't happen, how would there ever be an index merge
that would remove the deleted documents?

The problem with deletes and relevancy doesn't seem to occur when we use TLOG
replicas, probably because they don't do their own indexing but get copies
from their leader. We are testing them now; eventually we may abandon the
use of NRT replicas for most of our collections.

I am quite concerned about this commit issue. What kinds of things would
influence whether a commit occurs? One commonality for our systems is that
they are hosted in a Google cloud. We have a number of collections that
share configurations, but others that do not. I think commits do happen,
but I don't trust that autoCommit is reliable. What can we do to make it
reliable?

Most of our collections are reindexed weekly with partial updates applied
daily; that, at least, is what happens in production. Our development clouds
are not as regular.

Our solr startup script sets the following values:
-Dsolr.autoCommit.maxDocs=35000
-Dsolr.autoCommit.maxTime=6
-Dsolr.autoSoftCommit.maxTime=3000

I don't think we reference  solr.autoCommit.maxDocs in our solrconfig.xml
files.

here are our settings for autoCommit and autoSoftCommit

We had a lot of issues with missing commits when we didn't set
solr.autoCommit.maxTime

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:6}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
</autoSoftCommit>



On Fri, Feb 9, 2018 at 3:49 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/9/2018 9:29 AM, Webster Homer wrote:
>
>> A little more background. Our production Solrclouds are populated via
>> CDCR,
>> CDCR does not replicate commits, Commits to the target clouds happen via
>> autoCommit settings
>>
>> We see relevancy scores get inconsistent when there are too many deletes,
>> which seems to happen when hard commits don't happen.
>>
>> On Fri, Feb 9, 2018 at 10:25 AM, Webster Homer <webster.ho...@sial.com>
>> wrote:
>>
>> I we do have autoSoftcommit set to 3 seconds. It is NOT the visibility of
>>> the records that is my primary concern. I am concerned about is the
>>> accumulation of uncommitted tlog files and the larger number of deleted
>>> documents.
>>>
>>
> For the deleted documents:  Have you ever done an optimize on the
> collection?  If so, you're going to need to re-do the optimize regularly to
> keep deleted documents from growing out of control.  See this issue for a
> very technical discussion about it:
>
> https://issues.apache.org/jira/browse/LUCENE-7976
>
> Deleted documents probably aren't really related to what we've been
> discussing.  That shouldn't really be strongly affected by commit settings.
>
> -
>
> A 3 second autoSoftCommit is VERY aggressive.   If your soft commits are
> taking longer than 3 seconds to complete, which is often what happens, then
> that will lead to problems.  I wouldn't expect it to cause the kinds of
> problems you describe, though.  It would manifest as Solr working too hard,
> logging warnings or errors, and changes taking too long to show up.
>
> Assuming that the config for autoSoftCommit doesn't have the typo that
> Erick mentioned.
>
> 
>
> I have never used CDCR, so I know very little about it.  But I have seen
> reports on this mailing list saying that transaction logs never get deleted
> when CDCR is configured.
>
> Below is a link to a mailing list discussion related to CDCR not deleting
> transaction logs.  Looks like for it to work right a buffer needs to be
> disabled, and there may also be problems caused by not having a complete
> zkHost string in the CDCR config:
>
> http://lucene.47
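
For what it's worth, the buffer Shawn mentions can be toggled through the
CDCR API; a sketch, with the host omitted and the collection name
hypothetical:

    /solr/mycollection/cdcr?action=DISABLEBUFFER

Disabling the buffer (typically on the target cluster) allows the
accumulated tlogs to be cleaned up.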

Re: solrcloud Auto-commit doesn't seem reliable

2018-02-09 Thread Webster Homer
A little more background. Our production Solrclouds are populated via CDCR,
CDCR does not replicate commits, Commits to the target clouds happen via
autoCommit settings

We see relevancy scores get inconsistent when there are too many deletes
which seems to happen when hard commits don't happen.

On Fri, Feb 9, 2018 at 10:25 AM, Webster Homer <webster.ho...@sial.com>
wrote:

> We do have autoSoftCommit set to 3 seconds. It is NOT the visibility of
> the records that is my primary concern. What I am concerned about is the
> accumulation of uncommitted tlog files and the larger number of deleted
> documents.
>
> I am VERY familiar with the Solr documentation on this.
>
> On Fri, Feb 9, 2018 at 10:08 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 2/9/2018 8:44 AM, Webster Homer wrote:
>>
>>> I look at the latest timestamp on a record in the collection and see that
>>> it is over 24 hours old.
>>>
>>> I send a commit to the collection, and then see that the core is now
>>> current, and the segments are fewer. The commit worked
>>>
>>> This is the setting in solrconfig.xml
>>> <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>> <openSearcher>false</openSearcher> </autoCommit>
>>>
>>
>> As recommended, you have openSearcher set to false.
>>
>> This means that these commits are NEVER going to make changes visible.
>>
>> Don't go and change openSearcher to true.  It is STRONGLY recommended to
>> have openSearcher=false in your autoCommit settings.  The reason for this
>> configuration is that it prevents the transaction log from growing out of
>> control.  With openSearcher=false, those commits will be very fast.  This
>> is because it's opening the searcher that's slow, not the process of
>> writing data to disk.
>>
>> Here's the recommended reading on the subject:
>>
>> https://lucidworks.com/understanding-transaction-logs-softco
>> mmit-and-commit-in-sorlcloud/
>>
>> For change visibility, configure autoSoftCommit, probably with a
>> different interval than you have for autoCommit.  I would recommend a
>> longer interval.  Or include the commitWithin parameter on at least some of
>> your update requests.  Or send explicit commit requests, preferably as soft
>> commits.
>>
>> Thanks,
>> Shawn
>>
>
>



Re: solrcloud Auto-commit doesn't seem reliable

2018-02-09 Thread Webster Homer
We do have autoSoftCommit set to 3 seconds. It is NOT the visibility of
the records that is my primary concern. What I am concerned about is the
accumulation of uncommitted tlog files and the larger number of deleted
documents.

I am VERY familiar with the Solr documentation on this.

On Fri, Feb 9, 2018 at 10:08 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/9/2018 8:44 AM, Webster Homer wrote:
>
>> I look at the latest timestamp on a record in the collection and see that
>> it is over 24 hours old.
>>
>> I send a commit to the collection, and then see that the core is now
>> current, and the segments are fewer. The commit worked
>>
>> This is the setting in solrconfig.xml
>> <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>> <openSearcher>false</openSearcher> </autoCommit>
>>
>
> As recommended, you have openSearcher set to false.
>
> This means that these commits are NEVER going to make changes visible.
>
> Don't go and change openSearcher to true.  It is STRONGLY recommended to
> have openSearcher=false in your autoCommit settings.  The reason for this
> configuration is that it prevents the transaction log from growing out of
> control.  With openSearcher=false, those commits will be very fast.  This
> is because it's opening the searcher that's slow, not the process of
> writing data to disk.
>
> Here's the recommended reading on the subject:
>
> https://lucidworks.com/understanding-transaction-logs-
> softcommit-and-commit-in-sorlcloud/
>
> For change visibility, configure autoSoftCommit, probably with a different
> interval than you have for autoCommit.  I would recommend a longer
> interval.  Or include the commitWithin parameter on at least some of your
> update requests.  Or send explicit commit requests, preferably as soft
> commits.
>
> Thanks,
> Shawn
>
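
A minimal SolrJ sketch of the commitWithin approach suggested above
(collection and field values are hypothetical, and solrClient is an
already-built SolrClient):

    import org.apache.solr.common.SolrInputDocument;

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "example-1");
    // Ask Solr to commit this update within 30 seconds instead of relying
    // solely on the autoCommit/autoSoftCommit timers.
    solrClient.add("mycollection", doc, 30000 /* commitWithin, ms */);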



solrcloud Auto-commit doesn't seem reliable

2018-02-09 Thread Webster Homer
I have observed this behavior with several versions of solr (4.10, 6.1, and
now 7.2)

I look in the admin console at a core and see that it is not
"current". I also notice that there are lots of segments, etc.

I look at the latest timestamp on a record in the collection and see that
it is over 24 hours old.

I send a commit to the collection, and then see that the core is now
current, and the segments are fewer. The commit worked.


This is the setting in solrconfig.xml
<autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
<openSearcher>false</openSearcher> </autoCommit>

Our Solr startup sets solr.autoCommit:
-Dsolr.autoCommit.maxTime=60000

Yet it looks like we don't get commits regularly. This morning I saw
several collections that hadn't had a hard commit in more than 24 hours.

How can this be? I don't feel that we can rely on Solr's autocommit
capability, and this is disturbing.



Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

2018-02-06 Thread Webster Homer
I noticed that in some of the current example schemas that are shipped with
Solr, there is a fieldtype, text_en_splitting, that feeds the output
of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
this isn't supported, the example should probably be updated or removed.

On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe  wrote:

> Hi Александр,
>
> > On Feb 5, 2018, at 11:19 AM, Shawn Heisey  wrote:
> >
> > There should be no problem with using them together.
>
> I believe Shawn is wrong.
>
> From the SynonymGraphFilter javadocs (.../org/apache/lucene/analysis/synonym/SynonymGraphFilter.html):
>
> > NOTE: this cannot consume an incoming graph; results will be undefined.
>
> Unfortunately, the ref guide entry for Synonym Graph Filter <
> https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-
> graph-filter> doesn’t include a warning about this, but it should, like
> the warning on Word Delimiter Graph Filter <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
>
> > Note: although this filter produces correct token graphs, it cannot
> consume an input token graph correctly.
>
> (I’ve just committed a change to the ref guide source to add this also on
> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
> included in the ref guide for Solr 7.3.)
>
> In short, the combination of the two filters is not supported, because
> WDGF produces a token graph, which SGF cannot correctly interpret.
>
> Other filters also have this issue, see e.g. <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this issue has gotten some
> attention recently, and hopefully it will inspire fixes elsewhere.
>
> Patches welcome!
>
> --
> Steve
> www.lucidworks.com
>
>
> > On Feb 5, 2018, at 11:19 AM, Shawn Heisey  wrote:
> >
> > On 2/5/2018 3:55 AM, Александр Шестак wrote:
> >>
> >> Hi, I have a misunderstanding about the usage of SynonymGraphFilterFactory
> >> and WordDelimiterGraphFilterFactory. Can they be used together?
> >>
> >
> > There should be no problem with using them together.  But it is always
> > possible that the behavior will surprise you, while working 100% as
> > designed.
> >
> >> I have a solr type configured in the following way:
> >>
> >> <fieldType name="..." class="solr.TextField"
> >>            autoGeneratePhraseQueries="true">
> >>   <analyzer type="index">
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1"
> >>             splitOnNumerics="1"
> >>             catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>             preserveOriginal="1" protected="protwords_en.txt"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1"
> >>             splitOnNumerics="1"
> >>             catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>             preserveOriginal="1" protected="protwords_en.txt"/>
> >>     <filter class="solr.SynonymGraphFilterFactory"
> >>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >> So at query time it uses SynonymGraphFilterFactory after
> >> WordDelimiterGraphFilterFactory.
> >> Synonyms are configured in next way:
> >> b=>b,boron
> >> 2=>ii,2
> >>
> >> The query in the Solr analysis tool looks like this. It shows that terms
> >> after SGF have positions 3 and 4. Is that correct? I thought they should
> >> have positions 1 and 2.
> >>
> >
> > What matters is the *relative* positions.  The exact position number
> > doesn't matter much.  Something new that the Graph implementations use
> > is the position length.  That feature is necessary for multi-term
> > synonyms to function correctly in phrase queries.
> >
> > In your analysis screenshot, WDGF creates three tokens.  The two tokens
> > created by splitting the input are at positions 1 and 2, which I think
> > is 100% as expected.  It also sets the positionLength of the first term
> > to 2, probably because it has split that term into 2 additional terms.
> >
> > Then the SGF takes those last two terms and expands them.  Each of the
> > synonyms is at the same position as the original term, and the relative
> > positions of the two synonym pairs have not changed -- the second one is
> > still one higher than the first.  I think the reason that SGF moves the
> > positions two higher is because the positionLength on the "b2" term is
> > 2, previously set by WDGF.  Someone with more knowledge about the Graph
> > implementations may have to speak up as to whether this behavior is
> correct.
> >
> > Because the relative positions of the split terms don't change when SGF
> > runs, I think this is probably working as designed.
> >
> > Thanks,
> > Shawn
>
>


Re: cdcr replication of new collection doesn't replicate

2018-02-01 Thread Webster Homer
It looks like CDCR is entirely broken in 7.2.0.
We have been using CDCR to replicate data from our on-prem systems to
solrclouds hosted in Google Cloud.
We used the Lucene index upgrade tool to do an in-place upgrade of the
indexes in all our systems.
In at least one case we deleted all the rows from a collection. The delete
did propagate to the clouds.

Then we loaded data into that collection. Only half the data is available
in the search. Using the Solr console I see that the index segments show no
data. All of the search results come from tlog files.

The only time CDCR has been reliable has been the delete. Otherwise it
doesn't seem to work very well.

On Fri, Jan 26, 2018 at 1:29 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> We have just upgraded our QA solr clouds to 7.2.0
> We have 3 solr clouds. collections in the first cloud replicate to the
> other 2
>
> For existing collections which we upgraded in place using the lucene index
> upgrade tool seem to behave correctly data written to collections in the
> first environment replicates to the other 2
>
> We created a new collection that has 2 shards, each with 2 replicas. The new
> collection uses tlog replicas instead of NRT replicas.
>
> We configured CDCR similarly to other collections so that writes to the
> first are sent to the other 2 clouds. However, we never see data appear in
> the target collections.
> We do see tlog files appear, and I can see cdcr update messages in the
> logs, but none of the cores ever get any data in them. So the tlogs
> accumulate but are never loaded into the target collections
>
> This doesn't seem correct.
>
> I'm at a loss as to what to do next. We will probably copy the index files
> from the one collection to the other two collections directly, but
> shouldn't cdcr be sending the data?
>
> Does cdcr work with tlog replicas?
>



cdcr replication of new collection doesn't replicate

2018-01-26 Thread Webster Homer
We have just upgraded our QA solr clouds to 7.2.0
We have 3 solr clouds. collections in the first cloud replicate to the
other 2

For existing collections which we upgraded in place using the lucene index
upgrade tool seem to behave correctly data written to collections in the
first environment replicates to the other 2

We created a new collection that has 2 shards, each with 2 replicas. The new
collection uses tlog replicas instead of NRT replicas.

We configured CDCR similarly to other collections so that writes to the
first are sent to the other 2 clouds. However, we never see data appear in
the target collections.
We do see tlog files appear, and I can see cdcr update messages in the
logs, but none of the cores ever get any data in them. So the tlogs
accumulate but are never loaded into the target collections

This doesn't seem correct.

I'm at a loss as to what to do next. We will probably copy the index files
from the one collection to the other two collections directly, but
shouldn't cdcr be sending the data?

Does cdcr work with tlog replicas?



What are solr index.properties files

2018-01-24 Thread Webster Homer
While upgrading our QA solr 6.1 solrclouds to Solr 7.2.0 I discovered that
some of our index folders for a replica had directory names like
index.20170830071504690

These replicas also had a file index.properties which indicates which index
directory is current.

We don't see this configuration in our Dev solrclouds.

Why would we have these folders in solrcloud indexes?

What configuration controls this behavior?

Is it normal for solrcloud?



Re: Strange Alias behavior

2018-01-24 Thread Webster Homer
I don't like that this behavior is not documented.
It appears from this that aliases are recursive (sort of) and that isn't
documented.

On Wed, Jan 24, 2018 at 6:38 AM, alessandro.benedetti 
wrote:

> b2b-catalog-material-etl -> b2b-catalog-material
> b2b-catalog-material -> b2b-catalog-material-180117
>
> and we do a data load to b2b-catalog-material-etl
>
> We see data being added to both b2b-catalog-material and
> b2b-catalog-material-180117 -> *here you wanted to index just into
> b2b-catalog-material-180117, I assume*
>
> when I delete the alias b2b-catalog-material then the data stopped loading
> into the collection b2b-catalog-material-180117  -> *this makes sense as
> you
> deleted the alias, so the data will just go to the b2b-catalog-material
> collection.*
> Why haven't you deleted the old collection instead? What was the purpose of
> deleting the alias?
>
> To wrap it up, what is it that you don't like?
> Is it this bit: "We see data being added to both b2b-catalog-material and
> b2b-catalog-material-180117"?
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>



Re: Strange Alias behavior

2018-01-23 Thread Webster Homer
Thanks, Erick.
The main reason that we assigned the alias this way was consistency:
consistency in communicating with the administrators who actually maintain
the solr instances in our QA and production clouds. They know very little
about solr. It was useful to have all the interfaces in place. They will be
responsible for the processes that set and delete aliases.

Our intent is to migrate away from the situation where the alias and
collection name could ever be the same.
When we only used the alias for reading there was no problem.

The behavior I didn't expect was that when a different alias
b2b-catalog-material-etl was pointing at the collection
b2b-catalog-material we saw data being added to it, but also to another
collection b2b-catalog-material-180117T where the alias
b2b-catalog-material was pointing.

We will be deleting any alias whose name is the same as a collection name
until we have replaced those collections.

On Sat, Jan 20, 2018 at 12:01 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> SOLR-11488 is there so we formalize what we intend here. For instance,
> at one point we discovered that you could have an alias pointing to
> collection1,collection2 then delete collection2 say. Solr was happy in
> that configuration but it made no sense. See SOLR-11218.
>
> So I don't know what the eventual resolution will be...
>
> In terms of why one would want an alias and collection to have the
> same name, a common recommendation is to completely re-index when
> making schema changes. By being able to create an alias with the same
> name as a collection, you can do that reindexing and atomically switch
> over without affecting the rest of your code or having any service
> interruption. So it looks like this:
>
> create a new_collection and index to it
> create an alias old_collection->new_collection
> delete old_collection
>
> and there's no service interruption. At very best if you couldn't
> create an alias with the same name as your collection, you'd have to
>
> create new_collection and index to it.
> Shut down all your apps
> delete old_collection
> create alias old_collection->new_collection
> bring all your apps back up.
>
> or
> create new_collection and index to it
> create an alias->new_collection
> rewrite all your apps to use the alias
> when they were all re-written, then delete old_collection.
>
> So it is convenient I think. We haven't moved forward on SOLR-11488
> yet. SOLR-11218 beefed up some testing also so we don't inadvertently
> break things.
>
> Best,
> Erick
>
>
>
>
> On Fri, Jan 19, 2018 at 3:06 PM, Webster Homer <webster.ho...@sial.com>
> wrote:
> > It seems like a useful feature, especially for migrating from standalone
> to
> > solrcloud, at least if the precedence of alias to collection is defined
> and
> > enforced.
> >
> > On Fri, Jan 19, 2018 at 5:01 PM, Shawn Heisey <apa...@elyograg.org>
> wrote:
> >
> >> On 1/19/2018 3:53 PM, Webster Homer wrote:
> >>
> >>> I created the alias with an existing collection name because our code
> base
> >>> which was created with stand alone solr was a pain to change. I did
> test
> >>> that the alias took precedence over the collection, when I did a
> search.
> >>>
> >>
> >> The ability to create aliases and collections with the same name is
> viewed
> >> as a bug by some, and probably will be removed in a future version.
> >>
> >> https://issues.apache.org/jira/browse/SOLR-11488
> >>
> >> It doesn't really make sense to have an alias with the same name as a
> >> collection, and the behavior is probably undefined.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
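
A minimal SolrJ sketch of the swap workflow Erick describes above
(collection names are borrowed from this thread; the configset name,
shard/replica counts, and client are assumptions):

    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    // 1. Create a new collection and reindex into it.
    CollectionAdminRequest
        .createCollection("b2b-catalog-material-180117", "b2b-catalog-material", 2, 2)
        .process(client);
    // ... reindex into b2b-catalog-material-180117 ...
    // 2. Atomically point the old name at the new collection.
    CollectionAdminRequest
        .createAlias("b2b-catalog-material", "b2b-catalog-material-180117")
        .process(client);
    // 3. Drop the old collection once nothing references it directly.

Note that step 2 is exactly the alias/collection name collision whose
behavior SOLR-11488 discusses.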

Re: Strange Alias behavior

2018-01-19 Thread Webster Homer
It seems like a useful feature, especially for migrating from standalone to
solrcloud, at least if the precedence of alias to collection is defined and
enforced.

On Fri, Jan 19, 2018 at 5:01 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/19/2018 3:53 PM, Webster Homer wrote:
>
>> I created the alias with an existing collection name because our code base
>> which was created with stand alone solr was a pain to change. I did test
>> that the alias took precedence over the collection, when I did a search.
>>
>
> The ability to create aliases and collections with the same name is viewed
> as a bug by some, and probably will be removed in a future version.
>
> https://issues.apache.org/jira/browse/SOLR-11488
>
> It doesn't really make sense to have an alias with the same name as a
> collection, and the behavior is probably undefined.
>
> Thanks,
> Shawn
>



Re: Strange Alias behavior

2018-01-19 Thread Webster Homer
I created the alias with an existing collection name because our code base
which was created with stand alone solr was a pain to change. I did test
that the alias took precedence over the collection, when I did a search.

On Fri, Jan 19, 2018 at 4:22 PM, Wenjie Zhang (Jack) <
wenjiezhang2...@gmail.com> wrote:

> Why would you create an alias with an existing collection name?
>
> Sent from my iPhone
>
> > On Jan 19, 2018, at 14:14, Webster Homer <webster.ho...@sial.com> wrote:
> >
> > I just discovered some odd behavior with aliases.
> >
> > We are in the process of converting over to use aliases in solrcloud. We
> > have a number of collections that applications have referenced the
> > collections from when we used standalone solr. So we created alias names
> to
> > match the name that the java applications already used.
> >
> > We still have collections that have the name of the alias.
> >
> > We also decided to create new aliases for use in our ETL process.
> > I have 3 collections that have the same configset which is named
> > b2b-catalog-material
> > collection 1: b2b-catalog-material
> > collection 2: b2b-catalog-material-180117
> > collection 3: b2b-catalog-material-180117T
> >
> > When the alias, b2b-catalog-material-etl is pointed at
> b2b-catalog-material
> > and the alias b2b-catalog-material is pointed to
> b2b-catalog-material-180117
> >
> > and we do a data load to b2b-catalog-material-etl
> >
> > We see data being added to both b2b-catalog-material and
> > b2b-catalog-material-180117
> >
> > when I delete the alias b2b-catalog-material then the data stopped
> loading
> > into the collection b2b-catalog-material-180117
> >
> >
> > So it seems that alias resolution is somewhat recursive. I'm surprised
> that
> > both collections were being updated.
> >
> > Is this the intended behavior for aliases? I don't remember seeing this
> > documented.
> > This was on a solrcloud running solr 7.2
> >
> > I haven't checked this in Solr 7.2 but when I created a new collection
> and
> > then pointed the alias to it and did a search no data was returned
> because
> > there was none to return. So this indicates to me that aliases behave
> > differently if we're writing to them or reading from them.
> >
>



Re: Preserve order during indexing

2018-01-19 Thread Webster Homer
db order isn't generally defined, unless you are using an explicit "order
by" on your select. Default behavior would vary by database type and even
release of the database. You can index the fields that you would "order by"
in the db, and sort on those fields in solr.
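
A minimal SolrJ sketch of that (the ordering field name is hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("*:*");
    // Sort on an indexed copy of the database's ordering column.
    q.setSort(SolrQuery.SortClause.asc("db_row_order"));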

On Thu, Jan 18, 2018 at 10:17 PM, jagdish vasani 
wrote:

> Hi Ashish,
> I think it's not possible; solr creates an inverted index. But you can get
> documents back in sorted order: give sort=<field> asc/desc.
>
> Thanks,
> JagdishVasani
> On 19-Jan-2018 9:22 am, "Aashish Agarwal"  wrote:
>
> > Hi,
> >
> > I need to index documents in solr so that they are stored in the same
> > order as present in the database, i.e. *:* gives results in db order. Is
> > it possible?
> >
> > Thanks,
> > Aashish
> >
>



Strange Alias behavior

2018-01-19 Thread Webster Homer
I just discovered some odd behavior with aliases.

We are in the process of converting over to use aliases in solrcloud. We
have a number of collections that applications have referenced since we
used standalone solr. So we created alias names to
match the name that the java applications already used.

We still have collections that have the name of the alias.

We also decided to create new aliases for use in our ETL process.
I have 3 collections that have the same configset which is named
b2b-catalog-material
collection 1: b2b-catalog-material
collection 2: b2b-catalog-material-180117
collection 3: b2b-catalog-material-180117T

When the alias, b2b-catalog-material-etl is pointed at b2b-catalog-material
and the alias b2b-catalog-material is pointed to b2b-catalog-material-180117

and we do a data load to b2b-catalog-material-etl

We see data being added to both b2b-catalog-material and
b2b-catalog-material-180117

when I delete the alias b2b-catalog-material then the data stopped loading
into the collection b2b-catalog-material-180117


So it seems that alias resolution is somewhat recursive. I'm surprised that
both collections were being updated.

Is this the intended behavior for aliases? I don't remember seeing this
documented.
This was on a solrcloud running solr 7.2

I haven't checked this in Solr 7.2, but when I created a new collection and
then pointed the alias to it and did a search, no data was returned because
there was none to return. So this indicates to me that aliases behave
differently if we're writing to them or reading from them.



Re: cursorMark and Solrcloud

2018-01-16 Thread Webster Homer
count is queryResponse.getResults().getNumFound().

The code stops when the cursorMark is equal to the nextCursorMark, so how
can it exceed the numFound?
Setting the sort order to just the unique id makes the code work.

I would try to create an example case, but I'm under a deadline and have to
get this working, and I found that the normal start/rows iteration seems to
work, if less efficiently.

On Tue, Jan 16, 2018 at 4:15 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> Sorry, solr_returned is the total count of the documents retrieved from the
> queryResponse. So if I ask for 200 rows at a time, it will be incremented by
> all 200.
>
> numberRetrieved += queryResponse.getResults().size();
>
> Where queryResponse is a solrj QueryResponse
>
> On Mon, Jan 15, 2018 at 6:11 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 1/15/2018 12:52 PM, Webster Homer wrote:
>>
>>> When I don't have score in the sort, the solr_returned and count are the
>>> same
>>>
>>
>> I don't know what "solr_returned" means.  I haven't encountered that
>> before, and nothing useful turns up in a google search.
>>
>> If you're getting different numFound values for the same query and the
>> index hasn't changed, there are two possible causes that I know of.  One is
>> replicas out of sync as already described, the other is having documents
>> with the same uniqueKey value in more than one shard.  If the count is
>> always the same with one sort, then I am leaning towards the latter cause.
>>
>> Which router does your collection use?  If it's implicit, how are you
>> deciding which shard gets which document?  If it's compositeId, have you
>> changed your hash ranges without deleting everything and building the index
>> again?
>>
>> Thanks,
>> Shawn
>>
>>
>



Re: cursorMark and Solrcloud

2018-01-16 Thread Webster Homer
Sorry, solr_returned is the total count of the documents retrieved from the
queryResponse. So if I ask for 200 rows at a time, it will be incremented by
all 200.

numberRetrieved += queryResponse.getResults().size();

Where queryResponse is a solrj QueryResponse

On Mon, Jan 15, 2018 at 6:11 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/15/2018 12:52 PM, Webster Homer wrote:
>
>> When I don't have score in the sort, the solr_returned and count are the
>> same
>>
>
> I don't know what "solr_returned" means.  I haven't encountered that
> before, and nothing useful turns up in a google search.
>
> If you're getting different numFound values for the same query and the
> index hasn't changed, there are two possible causes that I know of.  One is
> replicas out of sync as already described, the other is having documents
> with the same uniqueKey value in more than one shard.  If the count is
> always the same with one sort, then I am leaning towards the latter cause.
>
> Which router does your collection use?  If it's implicit, how are you
> deciding which shard gets which document?  If it's compositeId, have you
> changed your hash ranges without deleting everything and building the index
> again?
>
> Thanks,
> Shawn
>
>



Re: cursorMark and Solrcloud

2018-01-15 Thread Webster Homer
When I don't have score in the sort, the solr_returned and count are the
same

On Mon, Jan 15, 2018 at 1:50 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> The problem is that the cursor mark query returns different numbers of
> documents each time it is called when the collection has multiple replicas
> per shard.
>
> I meant collection. The same collection is on different clouds. The
> collection in cloud 1 has 2 shards with 1 replica per shard. In the
> second cloud the collection has 2 shards with 2 replicas per shard.
>
> The same query using cursorMark against the second cloud returns different
> numbers of documents. It appears that each replica returns a slightly
> different number of documents. when run against cloud #1 it always returns
> the same documents.
> Here is a little bit from my debug statements.
> count is the number found; solr_returned is a counter for all the
> documents actually returned over all the calls to the cursor mark. Why are
> they different?
> Each of these represent a search against our collection.
>
> "count": 1382,
> "solr_returned": 1281,
>
> "count": 1382,
> "solr_returned": 1366,
>
> "count": 1382,
> "solr_returned": 1225,
>
> "count": 1382,
> "solr_returned": 1397,
>
>
> Taking score out of the sort, cloud #2 will return consistent result sets.
>
>
>
> On Mon, Jan 15, 2018 at 1:28 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 1/15/2018 11:56 AM, Webster Homer wrote:
>>
>>> I have noticed strange behavior using cursorMark for deep paging in an
>>> application. We use solrcloud for searching. We have several clouds for
>>> development. For our development systems we have two different clouds.
>>> One
>>> cloud has 2 shards with 1 replica per shard. All of our other clouds are
>>> set up with 2 shards and 2 replicas per shard.
>>>
>>
>> A cloud doesn't get set up with shards and replicas.  A collection does.
>> One SolrCloud cluster can contain many collections.
>>
>> When you say "cloud" are you referring to a collection, or are you
>> referring to a set of servers running ZooKeeper and Solr? The latter is
>> what I would expect cloud to mean.
>>
>> When I run against the first cloud, I always get consistent results for
>>> the
>>> same query. That is not the case with the second cloud. Some queries
>>> return
>>> different numbers of results each time it's called. In the code I return
>>> the number found from solr, and I count the number of results for all
>>> iterations against the cursor mark. Sometimes it returns more rows than
>>> the
>>> numFound and sometimes less.
>>>
>>> I figured that the problem was in my code or in the data to make it
>>> easier
>>> to find the problem I changed the sort to just be the unique id from the
>>> schema. The problem went away.
>>>
>>> 1. The Number Found from solr was always the same
>>> 2. It worked when there was only 1 replica per shard
>>> 3. From debug statements it appears to return different total counts from
>>> different replicas. When there were 2 replicas per shard I saw 4
>>> different
>>> values being returned.
>>> 4. Not sorting on score, and only on the unique id provides consistent
>>> results.
>>>
>>
>> When you have multiple replicas, each replica may have different numbers
>> of deleted documents.  Deleted documents will almost always affect
>> scoring.  Because SolrCloud load balances across replicas, one page of your
>> cursorMark query can be served by a different replica than the next one, so
>> the order of results can differ.
>>
>> When sorting by unique ID, deleted documents will not affect sort order.
>> When there is only one replica, then sorting by score will always produce
>> the same order, unless the index gets modified.
>>
>> Thanks,
>> Shawn
>>
>>
>



Re: cursorMark and Solrcloud

2018-01-15 Thread Webster Homer
The problem is that the cursor mark query returns different numbers of
documents each time it is called when the collection has multiple replicas
per shard.

I meant collection. The same collection is on different clouds. The
collection in cloud 1 has 2 shards with 1 replica per shard. In the
second cloud the collection has 2 shards with 2 replicas per shard.

The same query using cursorMark against the second cloud returns different
numbers of documents. It appears that each replica returns a slightly
different number of documents. when run against cloud #1 it always returns
the same documents.
Here is a little bit from my debug statements.
count is the number found; solr_returned is a counter for all the
documents actually returned over all the calls to the cursor mark. Why are
they different?
Each of these represent a search against our collection.

"count": 1382,
"solr_returned": 1281,

"count": 1382,
"solr_returned": 1366,

"count": 1382,
"solr_returned": 1225,

"count": 1382,
"solr_returned": 1397,


Taking score out of the sort, cloud #2 will return consistent result sets.



On Mon, Jan 15, 2018 at 1:28 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/15/2018 11:56 AM, Webster Homer wrote:
>
>> I have noticed strange behavior using cursorMark for deep paging in an
>> application. We use solrcloud for searching. We have several clouds for
>> development. For our development systems we have two different clouds. One
>> cloud has 2 shards with 1 replica per shard. All of our other clouds are
>> set up with 2 shards and 2 replicas per shard.
>>
>
> A cloud doesn't get set up with shards and replicas.  A collection does.
> One SolrCloud cluster can contain many collections.
>
> When you say "cloud" are you referring to a collection, or are you
> referring to a set of servers running ZooKeeper and Solr? The latter is
> what I would expect cloud to mean.
>
> When I run against the first cloud, I always get consistent results for the
>> same query. That is not the case with the second cloud. Some queries
>> return
>> different numbers of results each time it's called. In the code I return
>> the number found from solr, and I count the number of results for all
>> iterations against the cursor mark. Sometimes it returns more rows than
>> the
>> numFound and sometimes less.
>>
>> I figured that the problem was in my code or in the data to make it easier
>> to find the problem I changed the sort to just be the unique id from the
>> schema. The problem went away.
>>
>> 1. The Number Found from solr was always the same
>> 2. It worked when there was only 1 replica per shard
>> 3. From debug statements it appears to return different total counts from
>> different replicas. When there were 2 replicas per shard I saw 4 different
>> values being returned.
>> 4. Not sorting on score, and only on the unique id provides consistent
>> results.
>>
>
> When you have multiple replicas, each replica may have different numbers
> of deleted documents.  Deleted documents will almost always affect
> scoring.  Because SolrCloud load balances across replicas, one page of your
> cursorMark query can be served by a different replica than the next one, so
> the order of results can differ.
>
> When sorting by unique ID, deleted documents will not affect sort order.
> When there is only one replica, then sorting by score will always produce
> the same order, unless the index gets modified.
>
> Thanks,
> Shawn
>
>



cursorMark and Solrcloud

2018-01-15 Thread Webster Homer
I have noticed strange behavior using cursorMark for deep paging in an
application. We use solrcloud for searching. We have several clouds for
development. For our development systems we have two different clouds. One
cloud has 2 shards with 1 replica per shard. All of our other clouds are
set up with 2 shards and 2 replicas per shard.

The application sorts the data by score descending, and the schema's unique
id ascending. According to the documentation, cursor mark requires that the
tie breaker be the schema's unique id.

When I run against the first cloud, I always get consistent results for the
same query. That is not the case with the second cloud. Some queries return
different numbers of results each time they are run. In the code I return
the number found from solr, and I count the number of results for all
iterations against the cursor mark. Sometimes it returns more rows than the
numFound and sometimes fewer.

I figured that the problem was in my code or in the data to make it easier
to find the problem I changed the sort to just be the unique id from the
schema. The problem went away.

1. The Number Found from solr was always the same
2. It worked when there was only 1 replica per shard
3. From debug statements it appears to return different total counts from
different replicas. When there were 2 replicas per shard I saw 4 different
values being returned.
4. Not sorting on score, and only on the unique id provides consistent
results.

So it appears that we should not include score in the sort when using
cursor mark and solrcloud.

We use solrj and CloudSolrClient. We are currently using the Solr 6.2 solrj
client with Solr 7.2 in our dev environment. We are in the process of
moving completely to 7.2.

Is this a known issue with cursormark and solrcloud?
For debugging purposes, can I determine which solr node cloudSolrClient
is using for a particular query?

I have not yet created a standalone test case for the issue, I'm still not
100% convinced that it is solrcloud, but it certainly looks like it is.

Thanks,
Webster
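
For reference, a minimal SolrJ sketch of the cursorMark loop described above
(collection name and client setup are assumed; the sort fields are the ones
from this thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(200);
    q.setSort(SolrQuery.SortClause.desc("score"));
    q.addSort(SolrQuery.SortClause.asc("id_material")); // uniqueKey tie breaker

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query("sial-catalog-product", q);
        // ... process rsp.getResults() ...
        String next = rsp.getNextCursorMark();
        done = cursorMark.equals(next);
        cursorMark = next;
    }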



Re: CDCR configuration in solrconfig

2017-12-18 Thread Webster Homer
We also have the same configurations used in different environments. We
upload the configset to zookeeper and use the Config API to overlay
environment specific settings in the solrconfig.xml. We have avoided having
collections share the same configsets, basically for this reason.

If CDCR supported aliases (SOLR-10679) this would be even easier.

So I suggest using the config API to configure CDCR in each of your
environments.
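
A sketch of that overlay (host, collection, and value are hypothetical); the
Config API's set-property command writes the override into
configoverlay.json without touching the shared solrconfig.xml:

    curl -X POST 'http://localhost:8983/solr/mycollection/config' \
         -H 'Content-Type: application/json' \
         -d '{"set-property": {"updateHandler.autoCommit.maxTime": 60000}}'

An environment-specific /cdcr handler can be registered the same way with
the Config API's add-requesthandler command.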

On Mon, Dec 18, 2017 at 1:12 PM, Erick Erickson 
wrote:

> CDCR doesn't do this yet but WDYT about an option where the
> target collection was _assumed_ to be the same as the source?
>
> You're right, SOLR-8389 (and associated) should address this
> but I don't know what the progress is on that. Seems like
> a reasonable default in any case.
>
> Erick
>
> On Mon, Dec 18, 2017 at 9:29 AM, Elaine Cario  wrote:
> > We've recently been exploring options for disaster recovery, and took a
> > look at CDCR for our SolrCloud(s).  It seems to meet our needs, but we've
> > stumbled into a couple of issues with configuration.
> >
> > The first issue is that currently CDCR is configured as a request handler
> > in solrconfig, but because we will use the same SolrConfig for
> collections
> > in different environments (e.g. development, qa, production), the config
> > will not always be deployed in an environment that has CDCR. As a last
> > resort, we are thinking we can drop back to an old-school xml include,
> and
> > configure different includes for different environments.  This isn't
> > particularly elegant, but workable. Wondering if anyone has done it some
> > other way?
> >
> > The 2nd issue I haven't found a work-around for is the collection name
> > mapping within the cdcr request handler configuration.  For some of our
> > applications, we "share" the same Solr config with many collections.
> When
> > deploying, we just "upconfig" to ZK, and either create a new collection
> > against that same config (config name != collection name).  I'm not sure
> > with the collection name "baked into" the config how I would manage that,
> > except to switch to using a dedicated config for each collection.
> >
> > SOLR-8389 looks like it might solve some of these issues, or at least
> make
> > them easier to manage.  Is this on the roadmap at all?
> >
> > Any ideas would be appreciated.  Thanks!
>



Re: Deep Paging with cursorMark throws error

2017-11-20 Thread Webster Homer
As I suspected this was a bug in my code. We use KIE Drools to configure
our queries, and there was a conflict between two rules.

On Mon, Nov 20, 2017 at 4:09 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> I am developing an application that uses cursorMark deep paging. It's a
> Java client using the solrj client.
>
> Currently the client is created with the Solr 6.2 solrj jars, but the
> test server is a Solr 7.1 server.
>
> I am getting this error:
> Error from server at http://XX:8983/solr/sial-catalog-product: Cursor
> functionality requires a sort containing a uniqueKey field tie breaker
>
> But the sort does have the field that is marked as unique in the schema.
>
> sort=score desc,id_material asc
>
> <uniqueKey>id_material</uniqueKey>
>
> Does the sort need to be on just the unique field?
>



Deep Paging with cursorMark throws error

2017-11-20 Thread Webster Homer
I am developing an application that uses cursorMark deep paging. It's a
Java client using the solrj client.

Currently the client is created with the Solr 6.2 solrj jars, but the
test server is a Solr 7.1 server.

I am getting this error:
Error from server at http://XX:8983/solr/sial-catalog-product: Cursor
functionality requires a sort containing a uniqueKey field tie breaker

But the sort does have the field that is marked as unique in the schema.

sort=score desc,id_material asc

<uniqueKey>id_material</uniqueKey>

Does the sort need to be on just the unique field?
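
For reference, the sort does not have to be on just the unique field;
the cursor only requires that the uniqueKey field appear as the final
tie-breaker. A small SolrJ sketch of a valid cursor sort, assuming
id_material is the uniqueKey:

// import org.apache.solr.client.solrj.SolrQuery;
// import org.apache.solr.common.params.CursorMarkParams;
SolrQuery query = new SolrQuery("*:*");
query.setRows(100);
query.addSort("score", SolrQuery.ORDER.desc);      // primary sort
query.addSort("id_material", SolrQuery.ORDER.asc); // required uniqueKey tie-breaker
query.set(CursorMarkParams.CURSOR_MARK_PARAM,
        CursorMarkParams.CURSOR_MARK_START);

If the request that actually reaches Solr drops or rewrites the
tie-breaker clause along the way, this exact error is raised.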



Re: Solr 7 and int, long, float field types

2017-11-16 Thread Webster Homer
Oh, sorry, I missed that they were defined as Trie fields. For some
reason I thought that they were Java classes.


On Thu, Nov 16, 2017 at 4:23 PM, Webster Homer <webster.ho...@sial.com>
wrote:

> I am converting a schema from 6 to 7 and in the process I removed the Trie
> field types and replaced them with Point field types.
>
> My schema also had fields defined as "int" and "long". These seem to have
> been removed as well, but I don't remember seeing that documented.
>
> In my original schema the _version_ field was a long.
>
> I see that in the new example schema files _version_ is a plong.
>
> I guess I missed where this was documented.
>
>
>



Solr 7 and int, long, float field types

2017-11-16 Thread Webster Homer
I am converting a schema from 6 to 7 and in the process I removed the Trie
field types and replaced them with Point field types.

My schema also had fields defined as "int" and "long". These seem to have
been removed as well, but I don't remember seeing that documented.

In my original schema the _version_ field was a long.

I see that in the new example schema files _version_ is a plong.

I guess I missed where this was documented.
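
If the schema is managed, one way to make the swap is through the
Schema API rather than by editing the file. A sketch, assuming an
existing SolrClient named client and a collection named my-collection;
note that changing a field's type means the collection must be
reindexed:

// import java.util.LinkedHashMap;
// import java.util.Map;
// import org.apache.solr.client.solrj.request.schema.FieldTypeDefinition;
// import org.apache.solr.client.solrj.request.schema.SchemaRequest;

// Define the Point-based replacement for the removed Trie "long" type
Map<String, Object> typeAttrs = new LinkedHashMap<>();
typeAttrs.put("name", "plong");
typeAttrs.put("class", "solr.LongPointField");
typeAttrs.put("docValues", true); // Point fields need docValues for sorting/faceting
FieldTypeDefinition plong = new FieldTypeDefinition();
plong.setAttributes(typeAttrs);
new SchemaRequest.AddFieldType(plong).process(client, "my-collection");

// Repoint _version_ at the new type, matching the Solr 7 example schemas
Map<String, Object> versionField = new LinkedHashMap<>();
versionField.put("name", "_version_");
versionField.put("type", "plong");
versionField.put("indexed", false);
versionField.put("stored", false);
new SchemaRequest.ReplaceField(versionField).process(client, "my-collection");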



Re: Admin Console Question

2017-11-15 Thread Webster Homer
In the solr.in.sh script I do see this:
# Set the thread stack size
SOLR_OPTS="$SOLR_OPTS -Xss256k"

I don't remember ever changing this, but it's only there once.

I can't find a reference to +UseGCLogFileRotation at all.

I don't see anyplace where we set either of these twice.

We were migrating from Solr 6.2.0, if that makes any difference.

On Wed, Nov 15, 2017 at 12:55 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 11/15/2017 8:40 AM, Webster Homer wrote:
> > I do see errors in both Consoles. I see more errors on the ones that
> don't
> > display Args
> > Here are the errors that only show up when Args doesn't:
> > Error: [ngRepeat:dupes] Duplicates in a repeater are not allowed. Use
> > 'track by' expression to specify unique keys. Repeater: arg in
> > commandLineArgs, Duplicate key: string:-XX:+UseGCLogFileRotation,
> Duplicate
> > value: -XX:+UseGCLogFileRotation
> 
> > angular.js:11617 Error: [ngRepeat:dupes] Duplicates in a repeater are not
> > allowed. Use 'track by' expression to specify unique keys. Repeater: arg
> in
> > commandLineArgs, Duplicate key: string:-Xss256k, Duplicate value:
> -Xss256k
> >
>
> This was the clue I needed.
>
> I added this line to the end of solr.in.cmd (I'm doing this testing on
> Windows):
>
> set SOLR_OPTS=%SOLR_OPTS% -Xss256k
>
> With that change and a Solr restart, the Args information disappeared
> from the admin UI.
>
> Somewhere, likely in your include script, you have defined custom
> arguments that have duplicated the -Xss256k and
> -XX:+UseGCLogFileRotation arguments that are included by default.  There
> may be other duplicates, but those are the ones that were included in
> the error information you shared.  If you adjust the startup
> configuration so that there are no duplicate commandline arguments, then
> restart Solr, it should display.
>
> This does mean that Solr has a bug in the admin UI, but it's one that
> you can work around by removing duplicate arguments.  The angular code
> used for the argument display cannot handle duplicate entries.  Here's
> the issue I created for the problem:
>
> https://issues.apache.org/jira/browse/SOLR-11645
>
> There's a patch attached to the issue that fixes the problem for me, and
> some instructions for fixing up a binary download with that change
> rather than a source checkout.
>
> Thanks,
> Shawn
>
>


