Re: Replica is going into recovery in Solr 6.1.0

2020-02-13 Thread vishal patel
Total memory of the server is 256 GB, and the following applications run on it:
Application1          50 GB
Application2          30 GB
Application3           8 GB
Application4           2 GB
Solr shard1           64 GB
Solr shard2 replica   64 GB

Note: Solr shard2 and the shard1 replica run on another server. One Solr instance 
normally has a constant memory usage of 35 to 40 GB, so we keep the heap at 64 GB. 
We are using NRT.

How big are your indexes on disk? - [shard1: 115 GB, shard2 replica: 96 GB] 
[shard1 replica: 114 GB, shard2: 100 GB]
How many docs per replica?  - Approx. 30,959,714 docs
How many replicas per host? - One server has one shard and one replica.

Regards,
Vishal

Sent from Outlook

From: Erick Erickson 
Sent: Friday, February 14, 2020 4:00 AM
To: solr-user@lucene.apache.org 
Subject: Re: Replica is going into recovery in Solr 6.1.0

What Walter said. Also, you _must_ leave quite a bit of free RAM for the OS due 
to Lucene using MMapDirectory space, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Basically until you can get your GC pauses under control, you’ll have an 
unstable collection.

How big are your indexes on disk? How many docs per replica? How many replicas 
per host?

Best,
Erick

> On Feb 13, 2020, at 5:16 PM, Walter Underwood  wrote:
>
> You have a 64GB heap. That is extremely unusual. You can only do that if the 
> instance has 80 GB or more of RAM. If you don’t have enough RAM, the JVM will 
> start using swap space and cause extremely long GC pauses.
>
> How much RAM do you have?
>
> How did you choose these GC settings?
>
> We have been using these settings with Java 8 in prod for three years with no 
> GC problems.
>
> SOLR_HEAP=8g
> # Use G1 GC  -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
>
> If you don’t have a very, very good reason for your GC settings, use these 
> instead.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Feb 12, 2020, at 10:47 PM, vishal patel  
>> wrote:
>>
>> My configuration:
>>
>> -XX:+AggressiveOpts -XX:ConcGCThreads=12 -XX:G1HeapRegionSize=33554432 
>> -XX:G1ReservePercent=20 -XX:InitialHeapSize=68719476736 
>> -XX:InitiatingHeapOccupancyPercent=10 -XX:+ManagementServer 
>> -XX:MaxHeapSize=68719476736 -XX:ParallelGCThreads=36 
>> -XX:+ParallelRefProcEnabled -XX:PrintFLSStatistics=1 -XX:+PrintGC 
>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps 
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC 
>> -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseG1GC 
>> -XX:-UseLargePages -XX:-UseLargePagesIndividualAllocation 
>> -XX:+UseStringDeduplication
>>
>> Sent from Outlook
>> 
>> From: Rajdeep Sahoo 
>> Sent: Thursday, February 13, 2020 10:03 AM
>> To: solr-user@lucene.apache.org 
>> Subject: Re: Replica is going into recovery in Solr 6.1.0
>>
>> What is your memory configuration
>>
>> On Thu, 13 Feb, 2020, 9:46 AM vishal patel, 
>> wrote:
>>
>>> Is there anyone looking at this?
>>>
>>> Sent from Outlook
>>> 
>>> From: vishal patel 
>>> Sent: Wednesday, February 12, 2020 3:45 PM
>>> To: solr-user@lucene.apache.org 
>>> Subject: Replica is going into recovery in Solr 6.1.0
>>>
>>> I am using Solr 6.1.0, Java 8, and G1 GC in production. We have 2 shards and
>>> each shard has 1 replica. Suddenly one replica goes into recovery mode and
>>> requests become slow in production. I have analyzed that the maximum minor GC
>>> pause time was 1 min 6 sec 800 ms at that moment, and that there were multiple
>>> minor GC pauses.
>>>
>>> My logs :
>>>
>>> https://drive.google.com/file/d/158z3nzLsnHGouyRnXgfzCjwD4iadgKSp/view?usp=sharing
>>>
>>> https://drive.google.com/file/d/1E4jyffvIWVJB7EeEMXBXyqaK2ZfAA8kk/view?usp=sharing
>>>
>>> I do not know why the long GC pause happened. Heavy searching and indexing is
>>> performed on our platform.
>>> Do long GC pauses happen due to searching or indexing?
>>> If the GC pause is long, why does the replica go into recovery? Can we set
>>> the waiting time of an update request?
>>> What is the minimum GC pause time that triggers recovery mode?
>>>
>>> Is this useful for my problem? :
>>> https://issues.apache.org/jira/browse/SOLR-9310
>>>
>>> Regards,
>>> Vishal Patel
>>>
>>> Sent from Outlook
>>>
>



Re: offline Solr index creation

2020-02-13 Thread Erick Erickson
Indexing rates scale pretty linearly with the number of shards, so one
way to increase throughput is to simply create a collection with
more shards. For the initial bulk-indexing operations, you can 
go with a 1-replica-per-shard scenario then ADDREPLICA if you need
to build things out.
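
For illustration, that sequence might look something like this with the Collections
API (the host, collection name, and shard count below are placeholders, not
recommendations):

    # bulk-indexing phase: many shards, a single replica each
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=8"

    # once the bulk load is done, build out redundancy shard by shard
    curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1"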

However… that may leave you with more shards than you really want, but
that’s usually not an impediment.

The MapReduceIndexerTool uses something called the embedded solr server,
so it’s really using Solr under the covers.

All that said, I’m not yet convinced you need to go there. How are you
sure that you’re really driving Solr hard? Are you pegging all the CPUs on
all your Solr nodes while indexing? Very often I see “slow indexing” be the
result of the collection process not being able to feed Solr docs fast
enough. So here’s a couple of things to look at:

1> are your CPUs on the Solr nodes running flat out? If not, you need to
work on your ingestion process. Perhaps parallelize it on the client side
so you have multiple threads throwing docs at Solr. 

2>  Comment out the bit in your SolrJ program where you call
CloudSolrClient.add(doclist). If that doesn’t change the rate you can
process your docs, then you’re spending all your time on the client
side.
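
To make 1> and 2> concrete, here's a minimal SolrJ sketch -- not your code, just an
illustration assuming SolrJ 7+/8, a placeholder ZooKeeper address and collection
name, and synthetic documents standing in for your real JSON lines:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BulkIndexSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper host and collection name -- adjust for your cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          client.setDefaultCollection("mycollection");

          // Point 1: several client-side threads feeding Solr in parallel.
          ExecutorService pool = Executors.newFixedThreadPool(8);
          for (int t = 0; t < 8; t++) {
            final int threadId = t;
            pool.submit(() -> {
              List<SolrInputDocument> batch = new ArrayList<>();
              // Stand-in loop; in real life each thread reads its own slice of the JSON lines.
              for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", threadId + "-" + i);
                doc.addField("body_t", "one record built from one input line");
                batch.add(doc);
                if (batch.size() == 1000) {
                  // Point 2: comment out the next line. If the overall rate doesn't change,
                  // the bottleneck is the client-side collection process, not Solr.
                  client.add(batch);
                  batch.clear();
                }
              }
              if (!batch.isEmpty()) {
                client.add(batch);
              }
              return null;
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.DAYS);
          client.commit(); // one explicit commit at the very end, not per batch
        }
      }
    }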

Also, a sanity check: you're not committing after every batch or anything
else like that, right? Speaking of autocommit, I'd set autoCommit in my solrconfig
to every, say, 60 seconds with openSearcher=true and leave it at that until it's
proven you need something different.
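
In solrconfig.xml terms (inside the <updateHandler> section) that suggestion would
look roughly like this -- 60 seconds is just the figure from the sentence above:

    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
      <openSearcher>true</openSearcher>
    </autoCommit>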

You also haven’t told us about your topology. How many shards? How many
machines? I pretty much guarantee you won’t be able to fit all that data on a
single shard...

Best,
Erick

> On Feb 13, 2020, at 8:17 PM, vivek chaurasiya  wrote:
> 
> Hi there,
> 
> We are using AWS EMR as our big data processing cluster. We have like 3TB
> of text files where each line denotes a json record which I want to be
> indexed into Solr.
> 
> I have tried this by batching them and pushing them to the Solr index using
> the SolrJ client, but I feel that's really slow.
> 
> My question is two-fold:
> 
> 1. Is there a ready-to-use tool which can be used to create a Solr index
> offline and store it in, say, S3 or somewhere?
> 2. If an offline Solr index file is possible per (1), how can I push it to a
> live Solr cluster?
> 
> 
> I found this tool:
> https://docs.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html
> 
> but it's really cumbersome to use, and it looks like at the time of creating the
> offline index you need to put in shard/schema information.
> 
> Some suggestions would be greatly appreciated.
> 
> -Vivek



offline Solr index creation

2020-02-13 Thread vivek chaurasiya
Hi there,

We are using AWS EMR as our big data processing cluster. We have like 3TB
of text files where each line denotes a json record which I want to be
indexed into Solr.

I have tried this by batching them and pushing them to the Solr index using
the SolrJ client, but I feel that's really slow.

My question is two-fold:

1. Is there a ready-to-use tool which can be used to create a Solr index
offline and store it in, say, S3 or somewhere?
2. If an offline Solr index file is possible per (1), how can I push it to a
live Solr cluster?


I found this tool:
https://docs.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html

but it's really cumbersome to use, and it looks like at the time of creating the
offline index you need to put in shard/schema information.

Some suggestions would be greatly appreciated.

-Vivek


Re: Replica is going into recovery in Solr 6.1.0

2020-02-13 Thread Erick Erickson
What Walter said. Also, you _must_ leave quite a bit of free RAM for the OS due 
to Lucene using MMapDirectory space, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Basically until you can get your GC pauses under control, you’ll have an 
unstable collection.

How big are your indexes on disk? How many docs per replica? How many replicas 
per host?

Best,
Erick

> On Feb 13, 2020, at 5:16 PM, Walter Underwood  wrote:
> 
> You have a 64GB heap. That is extremely unusual. You can only do that if the 
> instance has 80 GB or more of RAM. If you don’t have enough RAM, the JVM will 
> start using swap space and cause extremely long GC pauses.
> 
> How much RAM do you have?
> 
> How did you choose these GC settings?
> 
> We have been using these settings with Java 8 in prod for three years with no 
> GC problems.
> 
> SOLR_HEAP=8g
> # Use G1 GC  -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> If you don’t have a very, very good reason for your GC settings, use these 
> instead.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 12, 2020, at 10:47 PM, vishal patel  
>> wrote:
>> 
>> My configuration:
>> 
>> -XX:+AggressiveOpts -XX:ConcGCThreads=12 -XX:G1HeapRegionSize=33554432 
>> -XX:G1ReservePercent=20 -XX:InitialHeapSize=68719476736 
>> -XX:InitiatingHeapOccupancyPercent=10 -XX:+ManagementServer 
>> -XX:MaxHeapSize=68719476736 -XX:ParallelGCThreads=36 
>> -XX:+ParallelRefProcEnabled -XX:PrintFLSStatistics=1 -XX:+PrintGC 
>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps 
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC 
>> -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseG1GC 
>> -XX:-UseLargePages -XX:-UseLargePagesIndividualAllocation 
>> -XX:+UseStringDeduplication
>> 
>> Sent from Outlook
>> 
>> From: Rajdeep Sahoo 
>> Sent: Thursday, February 13, 2020 10:03 AM
>> To: solr-user@lucene.apache.org 
>> Subject: Re: Replica is going into recovery in Solr 6.1.0
>> 
>> What is your memory configuration
>> 
>> On Thu, 13 Feb, 2020, 9:46 AM vishal patel, 
>> wrote:
>> 
>>> Is there anyone looking at this?
>>> 
>>> Sent from Outlook
>>> 
>>> From: vishal patel 
>>> Sent: Wednesday, February 12, 2020 3:45 PM
>>> To: solr-user@lucene.apache.org 
>>> Subject: Replica is going into recovery in Solr 6.1.0
>>> 
>>> I am using Solr 6.1.0, Java 8, and G1 GC in production. We have 2 shards and
>>> each shard has 1 replica. Suddenly one replica goes into recovery mode and
>>> requests become slow in production. I have analyzed that the maximum minor GC
>>> pause time was 1 min 6 sec 800 ms at that moment, and that there were multiple
>>> minor GC pauses.
>>> 
>>> My logs :
>>> 
>>> https://drive.google.com/file/d/158z3nzLsnHGouyRnXgfzCjwD4iadgKSp/view?usp=sharing
>>> 
>>> https://drive.google.com/file/d/1E4jyffvIWVJB7EeEMXBXyqaK2ZfAA8kk/view?usp=sharing
>>> 
>>> I do not know why the long GC pause happened. Heavy searching and indexing is
>>> performed on our platform.
>>> Do long GC pauses happen due to searching or indexing?
>>> If the GC pause is long, why does the replica go into recovery? Can we set
>>> the waiting time of an update request?
>>> What is the minimum GC pause time that triggers recovery mode?
>>> 
>>> Is this useful for my problem? :
>>> https://issues.apache.org/jira/browse/SOLR-9310
>>> 
>>> Regards,
>>> Vishal Patel
>>> 
>>> Sent from Outlook
>>> 
> 



Re: StatelessScriptUpdateProcessorFactory causing OOM errors?

2020-02-13 Thread Erick Erickson
Robert:

My concern with fixing this by adding memory is that it may just be kicking the can 
down the road. Assuming there really is some leak, eventually the leaked objects will 
accumulate and you'll hit another OOM. If that were the case, I'd expect even a 
cursory look at your memory usage to show it just keeps increasing over time as your 
script is utilized. When I looked at your script, I didn't see anything obvious...

Now, all that said if you bump the memory and it stays in some channel maybe 
you were just running close to your limits before and got “lucky”.

Here's another possibility:

- your commit interval is too long. While I constantly find them set too short, 
it’s also possible to set them to be too long. To support Real-time-get, Solr 
needs to keep pointers in to the TLOGs for all documents that have been added 
since the last searcher was opened. I can’t really make this square with 
switching from a jar to a script, but…

You’d probably need to enable the OOM killer script and enable heap-dump-on-oom 
to really get to the bottom of this, or maybe just take a heap dump after a 
while when you’re indexing docs. 
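
For reference, the heap-dump side of that is plain HotSpot flags (the dump path
below is a placeholder), and a one-off dump can be grabbed with jmap:

    # in solr.in.sh, or however you pass JVM options to Solr
    SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"

    # or take a dump by hand part-way through the re-index
    jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>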

Best,
Erick

> On Feb 13, 2020, at 2:45 PM, Jörn Franke  wrote:
> 
> I also had issues with this factory when creating atomic updates inside it. They 
> worked, but searchers were never closed; new ones were opened and stayed open, 
> with all the issues related to that. Maybe one needs to look into that in more 
> detail. However - it is a script in the end, so it could always be a bug in your 
> script as well.
> 
>> Am 13.02.2020 um 19:21 schrieb Haschart, Robert J (rh9ec) 
>> :
>> 
>> Erick,
>> 
>> Sorry I didn't see this response, for some reason solr-users has stopped 
>> being delivered to my mail box.
>> 
>> The script that adds a field based on the value(s) in some other field 
>> doesn't add a large number of different fields to the index.
>> The pool_f field only has a total of 11 different values, and except for 
>> some rare cases, any given record only has a single value in that field, and 
>> those rare cases will have two values.
>> 
>> I had previously implemented the same functionality by making a small jar 
>> file containing a customized version of TemplateUpdateProcessorFactory  that 
>> could generate different field names, but since I needed another bit of 
>> functionality in the Update Chain I decided to port the original 
>> functionality to a script  since the "development overhead" of adding a 
>> script is less than adding in multiple additional custom 
>> UpdateProcessorFactory objects.
>> 
>> I had been running Solr with the memory flag "-m 8G" and it had been running 
>> fine with that setting for at least a year, even recently when the customized 
>> Java version of TemplateUpdateProcessorFactory was being invoked to perform 
>> essentially the same processing step.
>> 
>> However, when I tried to accomplish the same thing via JavaScript through 
>> StatelessScriptUpdateProcessorFactory and started a re-index, it would die 
>> after about 1 million records were indexed. And since it is merely my (massive) 
>> development machine, there are close to zero searches coming through while the 
>> re-index is happening.
>> 
>> I've managed to work around the issue on my dev box by upping the memory for 
>> Solr to 16G, and haven't had an OOM since doing that, but I'm hesitant to push 
>> these changes to our AWS-hosted production instances since running out of 
>> memory and terminating there would be more of an issue.
>> 
>> -Bob
>> 
>> 
>> 
>> 
>>   From: Erick Erickson 
>>   Subject: Re: StatelessScriptUpdateProcessorFactory causing OOM errors?
>>   Date: Thu, 6 Feb 2020 09:18:41 -0500
>> 
>>   How many fields do you wind up having? It looks on a quick glance like
>>   it depends on the values of fields. While I’ve seen Solr/Lucene handle
>>   indexes with over 1M different fields, it’s unsatisfactory.
>> 
>>   What I’m wondering is if you are adding a zillion different fields to your
>>   docs as time passes and eventually the structures that are needed to
>>   maintain your field mappings are blowing up memory.
>> 
>>   If that’s that case, you need an alternative design because your
>>   performance will be unacceptable.
>> 
>>   May be off base, if so we can dig further.
>> 
>>   Best,
>>   Erick
>> 
>>> On Feb 5, 2020, at 3:41 PM, Haschart, Robert J (rh9ec)  
>>> wrote:
>>> 
>>> StatelessScriptUpdateProcessorFactory
>> 
>> 
>> 
>> 



Re: Replica is going into recovery in Solr 6.1.0

2020-02-13 Thread Walter Underwood
You have a 64GB heap. That is extremely unusual. You can only do that if the 
instance has 80 GB or more of RAM. If you don’t have enough RAM, the JVM will 
start using swap space and cause extremely long GC pauses.

How much RAM do you have?

How did you choose these GC settings?

We have been using these settings with Java 8 in prod for three years with no 
GC problems.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

If you don’t have a very, very good reason for your GC settings, use these 
instead.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 12, 2020, at 10:47 PM, vishal patel  
> wrote:
> 
> My configuration:
> 
> -XX:+AggressiveOpts -XX:ConcGCThreads=12 -XX:G1HeapRegionSize=33554432 
> -XX:G1ReservePercent=20 -XX:InitialHeapSize=68719476736 
> -XX:InitiatingHeapOccupancyPercent=10 -XX:+ManagementServer 
> -XX:MaxHeapSize=68719476736 -XX:ParallelGCThreads=36 
> -XX:+ParallelRefProcEnabled -XX:PrintFLSStatistics=1 -XX:+PrintGC 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution 
> -XX:ThreadStackSize=256 -XX:+UseG1GC -XX:-UseLargePages 
> -XX:-UseLargePagesIndividualAllocation -XX:+UseStringDeduplication
> 
> Sent from Outlook
> 
> From: Rajdeep Sahoo 
> Sent: Thursday, February 13, 2020 10:03 AM
> To: solr-user@lucene.apache.org 
> Subject: Re: Replica is going into recovery in Solr 6.1.0
> 
> What is your memory configuration
> 
> On Thu, 13 Feb, 2020, 9:46 AM vishal patel, 
> wrote:
> 
>> Is there anyone looking at this?
>> 
>> Sent from Outlook
>> 
>> From: vishal patel 
>> Sent: Wednesday, February 12, 2020 3:45 PM
>> To: solr-user@lucene.apache.org 
>> Subject: Replica is going into recovery in Solr 6.1.0
>> 
>> I am using Solr 6.1.0, Java 8, and G1 GC in production. We have 2 shards and
>> each shard has 1 replica. Suddenly one replica goes into recovery mode and
>> requests become slow in production. I have analyzed that the maximum minor GC
>> pause time was 1 min 6 sec 800 ms at that moment, and that there were multiple
>> minor GC pauses.
>> 
>> My logs :
>> 
>> https://drive.google.com/file/d/158z3nzLsnHGouyRnXgfzCjwD4iadgKSp/view?usp=sharing
>> 
>> https://drive.google.com/file/d/1E4jyffvIWVJB7EeEMXBXyqaK2ZfAA8kk/view?usp=sharing
>> 
>> I do not know why the long GC pause happened. Heavy searching and indexing is
>> performed on our platform.
>> Do long GC pauses happen due to searching or indexing?
>> If the GC pause is long, why does the replica go into recovery? Can we set
>> the waiting time of an update request?
>> What is the minimum GC pause time that triggers recovery mode?
>> 
>> Is this useful for my problem? :
>> https://issues.apache.org/jira/browse/SOLR-9310
>> 
>> Regards,
>> Vishal Patel
>> 
>> Sent from Outlook
>> 



Re: Adding replica to a shard with only down replicas

2020-02-13 Thread Erick Erickson
Adding a new replica won’t do you much good. Since there’s
no leader, it won’t (well, shouldn’t) sync the index.

Did you try the collections API FORCELEADER? It was put in as
a last resort for this kind of situation.
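
For reference, the call is a Collections API request along these lines (host is a
placeholder; collection and shard names taken from the cluster state below):

    curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=SCHN&shard=shard1"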

Best,
Erick

> On Feb 13, 2020, at 3:22 PM, tedsolr  wrote:
> 
> Solr 5.5.4. I have a collection with a single shard and two replicas. Both
> are reporting down. No shard leader exists. Each replica is on a different
> node. Should it be safe to attempt an ADDREPLICA command? Since there's no
> leader I don't know if that will work. This is the cluster state for the
> collection:
> 
> "SCHN":{
>"shards":{"shard1":{
>"range":"8000-7fff",
>"state":"active",
>"replicas":{
>  "core_node6":{
>"core":"SCHN_shard1_replica5",
>"base_url":"http://:8983/solr;,
>"node_name":":8983_solr",
>"state":"down"},
>  "core_node5":{
>"core":"SCHN_shard1_replica2",
>"base_url":"http://8983/solr;,
>"node_name":":8983_solr",
>"state":"down",
>"replicationFactor":"2",
>"router":{"name":"compositeId"},
>"maxShardsPerNode":"1",
>"autoAddReplicas":"false",
>"znodeVersion":1127,
>"configName":"default"},
> 
> The logs show repeated errors for: ERROR
> org.apache.solr.common.SolrException Error while trying to recover.
> core=SCHN_shard1_replica5:org.apache.solr.common.SolrException: No
> registered leader was found after waiting for 4000ms , collection: SCHN
> slice: shard1
> 
> I've already tried bringing the nodes down and then back up.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Zookeeper upgrade required with Solr upgrade?

2020-02-13 Thread Erick Erickson
Yeah, 3.4.x upgrades were pretty straightforward.

The 3.5.5 upgrade was trickier, but the problems were in the
admin UI. The admin UI uses several “4 letter words” to do its
ZooKeeper reporting, and that required explicit permissions, but IIRC
that all only affected the admin UI reporting about Zookeeper. There
were a _lot_ of Solr changes, but that was mostly cosmetic and mostly
in the test code.
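
For anyone hitting that on ZooKeeper 3.5.x: the permissions in question are the
four-letter-word whitelist in zoo.cfg, along these lines (treat the exact command
list as an assumption -- mntr, conf and ruok are the ones the admin UI's ZooKeeper
status page is commonly documented to need):

    # zoo.cfg on each ZooKeeper node
    4lw.commands.whitelist=mntr,conf,ruok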

Best,
Erick

> On Feb 13, 2020, at 4:07 PM, Rahul Goswami  wrote:
> 
> Thanks Eric. Also, thanks for that little pointer about the JIRA. Just to
> make sure I also checked for the upgrade JIRAs for other two intermediate
> Zookeeper versions 3.4.11 and 3.4.13 between Solr 7.2.1 and Solr 7.7.2 and
> they didn't seem to contain any Solr code changes either.
> 
> On Thu, Feb 13, 2020 at 9:26 AM Erick Erickson 
> wrote:
> 
>> That should be OK. There were no code changes necessary for that upgrade.
>> see SOLR-13363
>> 
>>> On Feb 12, 2020, at 5:34 PM, Rahul Goswami 
>> wrote:
>>> 
>>> Hello,
>>> We are running a SolrCloud (7.2.1) cluster and upgrading to Solr 7.7.2.
>> We
>>> run a separate multi node zookeeper ensemble which currently runs
>>> Zookeeper 3.4.10.
>>> Is it also required to upgrade Zookeeper (to  3.4.14 as per change.txt
>> for
>>> Solr 7.7.2) along with Solr ?
>>> 
>>> I tried a few basic updates requests for a 2 node SolrCloud cluster with
>>> the older (3.4.10) zookeeper and it seemed to work fine. But just want to
>>> know if there are any caveats I should be aware of.
>>> 
>>> Thanks,
>>> Rahul
>> 
>> 



Re: Zookeeper upgrade required with Solr upgrade?

2020-02-13 Thread Rahul Goswami
Thanks Eric. Also, thanks for that little pointer about the JIRA. Just to
make sure I also checked for the upgrade JIRAs for other two intermediate
Zookeeper versions 3.4.11 and 3.4.13 between Solr 7.2.1 and Solr 7.7.2 and
they didn't seem to contain any Solr code changes either.

On Thu, Feb 13, 2020 at 9:26 AM Erick Erickson 
wrote:

> That should be OK. There were no code changes necessary for that upgrade.
> see SOLR-13363
>
> > On Feb 12, 2020, at 5:34 PM, Rahul Goswami 
> wrote:
> >
> > Hello,
> > We are running a SolrCloud (7.2.1) cluster and upgrading to Solr 7.7.2.
> We
> > run a separate multi node zookeeper ensemble which currently runs
> > Zookeeper 3.4.10.
> > Is it also required to upgrade Zookeeper (to  3.4.14 as per change.txt
> for
> > Solr 7.7.2) along with Solr ?
> >
> > I tried a few basic updates requests for a 2 node SolrCloud cluster with
> > the older (3.4.10) zookeeper and it seemed to work fine. But just want to
> > know if there are any caveats I should be aware of.
> >
> > Thanks,
> > Rahul
>
>


Adding replica to a shard with only down replicas

2020-02-13 Thread tedsolr
Solr 5.5.4. I have a collection with a single shard and two replicas. Both
are reporting down. No shard leader exists. Each replica is on a different
node. Should it be safe to attempt an ADDREPLICA command? Since there's no
leader I don't know if that will work. This is the cluster state for the
collection:

"SCHN":{
"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node6":{
"core":"SCHN_shard1_replica5",
"base_url":"http://:8983/solr;,
"node_name":":8983_solr",
"state":"down"},
  "core_node5":{
"core":"SCHN_shard1_replica2",
"base_url":"http://8983/solr;,
"node_name":":8983_solr",
"state":"down",
"replicationFactor":"2",
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":1127,
"configName":"default"},

The logs show repeated errors for: ERROR
org.apache.solr.common.SolrException Error while trying to recover.
core=SCHN_shard1_replica5:org.apache.solr.common.SolrException: No
registered leader was found after waiting for 4000ms , collection: SCHN
slice: shard1

I've already tried bringing the nodes down and then back up.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: StatelessScriptUpdateProcessorFactory causing OOM errors?

2020-02-13 Thread Jörn Franke
I also had issues with this factory when creating atomic updates inside it. They 
worked, but searchers were never closed; new ones were opened and stayed open, with 
all the issues related to that. Maybe one needs to look into that in more detail. 
However - it is a script in the end, so it could always be a bug in your script as 
well.

> Am 13.02.2020 um 19:21 schrieb Haschart, Robert J (rh9ec) 
> :
> 
> Erick,
> 
> Sorry I didn't see this response, for some reason solr-users has stopped 
> being delivered to my mail box.
> 
> The script that adds a field based on the value(s) in some other field 
> doesn't add a large number of different fields to the index.
> The pool_f field only has a total of 11 different values, and except for some 
> rare cases, any given record only has a single value in that field, and those 
> rare cases will have two values.
> 
> I had previously implemented the same functionality by making a small jar 
> file containing a customized version of TemplateUpdateProcessorFactory  that 
> could generate different field names, but since I needed another bit of 
> functionality in the Update Chain I decided to port the original 
> functionality to a script  since the "development overhead" of adding a 
> script is less than adding in multiple additional custom 
> UpdateProcessorFactory objects.
> 
> I had been running Solr with the memory flag "-m 8G" and it had been running 
> fine with that setting for at least a year, even recently when the customized 
> Java version of TemplateUpdateProcessorFactory was being invoked to perform 
> essentially the same processing step.
> 
> However, when I tried to accomplish the same thing via JavaScript through 
> StatelessScriptUpdateProcessorFactory and started a re-index, it would die 
> after about 1 million records were indexed. And since it is merely my (massive) 
> development machine, there are close to zero searches coming through while the 
> re-index is happening.
> 
> I've managed to work around the issue on my dev box by upping the memory for 
> Solr to 16G, and haven't had an OOM since doing that, but I'm hesitant to push 
> these changes to our AWS-hosted production instances since running out of 
> memory and terminating there would be more of an issue.
> 
> -Bob
> 
> 
> 
> 
>From: Erick Erickson 
>Subject: Re: StatelessScriptUpdateProcessorFactory causing OOM errors?
>Date: Thu, 6 Feb 2020 09:18:41 -0500
> 
>How many fields do you wind up having? It looks on a quick glance like
>it depends on the values of fields. While I’ve seen Solr/Lucene handle
>indexes with over 1M different fields, it’s unsatisfactory.
> 
>What I’m wondering is if you are adding a zillion different fields to your
>docs as time passes and eventually the structures that are needed to
>maintain your field mappings are blowing up memory.
> 
>If that’s that case, you need an alternative design because your
>performance will be unacceptable.
> 
>May be off base, if so we can dig further.
> 
>Best,
>Erick
> 
>> On Feb 5, 2020, at 3:41 PM, Haschart, Robert J (rh9ec)  
>> wrote:
>> 
>> StatelessScriptUpdateProcessorFactory
> 
> 
> 
> 


Re: StatelessScriptUpdateProcessorFactory causing OOM errors?

2020-02-13 Thread Haschart, Robert J (rh9ec)
Erick,

Sorry I didn't see this response, for some reason solr-users has stopped being 
delivered to my mail box.

The script that adds a field based on the value(s) in some other field doesn't 
add a large number of different fields to the index.
The pool_f field only has a total of 11 different values, and except for some 
rare cases, any given record only has a single value in that field, and those 
rare cases will have two values.

I had previously implemented the same functionality by making a small jar file 
containing a customized version of TemplateUpdateProcessorFactory  that could 
generate different field names, but since I needed another bit of functionality 
in the Update Chain I decided to port the original functionality to a script  
since the "development overhead" of adding a script is less than adding in 
multiple additional custom UpdateProcessorFactory objects.

I had been running Solr with the memory flag "-m 8G" and it had been running fine 
with that setting for at least a year, even recently when the customized Java 
version of TemplateUpdateProcessorFactory was being invoked to perform essentially 
the same processing step.

However, when I tried to accomplish the same thing via JavaScript through 
StatelessScriptUpdateProcessorFactory and started a re-index, it would die after 
about 1 million records were indexed. And since it is merely my (massive) 
development machine, there are close to zero searches coming through while the 
re-index is happening.

I've managed to work around the issue on my dev box by upping the memory for Solr 
to 16G, and haven't had an OOM since doing that, but I'm hesitant to push these 
changes to our AWS-hosted production instances since running out of memory and 
terminating there would be more of an issue.

-Bob




From: Erick Erickson 
Subject: Re: StatelessScriptUpdateProcessorFactory causing OOM errors?
Date: Thu, 6 Feb 2020 09:18:41 -0500

How many fields do you wind up having? It looks on a quick glance like
it depends on the values of fields. While I’ve seen Solr/Lucene handle
indexes with over 1M different fields, it’s unsatisfactory.

What I’m wondering is if you are adding a zillion different fields to your
docs as time passes and eventually the structures that are needed to
maintain your field mappings are blowing up memory.

If that’s that case, you need an alternative design because your
performance will be unacceptable.

May be off base, if so we can dig further.

Best,
Erick

> On Feb 5, 2020, at 3:41 PM, Haschart, Robert J (rh9ec) 
 wrote:
>
> StatelessScriptUpdateProcessorFactory






Re: Bug? Documents not visible after sucessful commit - chaos testing

2020-02-13 Thread Chris Hostetter


: We think this is a bug (silently dropping commits even if the client
: requested "waitForSearcher"), or at least a missing feature (commits beging
: the only UpdateRequests not reporting the achieved RF), which should be
: worth a JIRA Ticket.

Thanks for your analysis Michael -- I agree something better should be 
done here, and have filed SOLR-14262 for subsequent discussion...

https://issues.apache.org/jira/browse/SOLR-14262

I believe the reason the local commit is ignored during replay is to 
ensure a consistent view of the index -- if the tlog being 
replayed contains COMMIT1,A,B,C,COMMIT2,D,... we should never open a new 
searcher containing just A or just A+B w/o C if a COMMIT3 comes along 
during replay -- but agree with you 100% that either commit should support 
'rf' making it obvious that this commit didn't succeed (which would also 
be important & helpful if the node was still down when the client sends 
the commit) ... *AND* ... we should consider making the commit block until 
replay is finished.

...BUT... there are probably other nuances I don't understand ... 
hopefully other folks more familiar with the current implementation will 
chime in on the jira.




-Hoss
http://www.lucidworks.com/


Re: wildcards match end-of-word?

2020-02-13 Thread Walter Underwood
Remove the stopword and stemmer filters from your schema and reindex.

Removing stopwords means you can never match “vitamin a”.

Stemming interferes with wildcard matches. Either stem or do wildcards on a 
field, not both.

Also, what do your users expect to get with wildcard matches? Those are a slow 
and imprecise way to search. There is almost always a better way.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 13, 2020, at 1:03 AM, Sotiris Fragkiskos  wrote:
> 
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always
> had the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an
> external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the term,
> even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
> Am I doing something very wrong??
> 
> thanks again!
> Sotiri
> 
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
> wrote:
> 
>> Steve:
>> 
>> You _really_ want to get acquainted with the admin UI/Analysis page ;).
>> Choose a core/collection and you should see the choice. It shows you
>> exactly what transformations your data goes through. If you hover over the
>> light gray pairs of letters, you’ll get a tooltip showing you what part of
>> your analysis chain is responsible for a particular change. I un-check the
>> “verbose” box 95% of the time BTW.
>> 
>> The critical bit is that what comes out of the end of the analysis pipe
>> are the tokens that are actually _in_ the index. From there, problems like
>> this make more sense.
>> 
>> My bet is that, as Walter says, you have a stemmer in the analysis chain
>> and the actual token in the index is “kinas” so of course “kinase*” won’t
>> be found. By adding OR kinase to the query, that token is stemmed to
>> “kinas” and matches.
>> 
>> Also, adding debug=query to your URL will show you what the query looks
>> like after parsing and analysis, also a major tool for figuring out what’s
>> really happening.
>> 
>> Wildcards are not stemmed, which can lead to surprising results. There’s
>> no perfect answer here. Let’s claim wildcards _were_ stemmed. Then you’d
>> have to try to explain why “running*” returned a doc with only “run” or
>> “runner” or “runs” or... in it, but searching for “runnin*” did not due the
>> stemmer not recognizing it as a stemmable word.
>> 
>> Finally, one of my personal hot buttons is wildcards in general. They’re
>> very often over-used because people are used to simple search capabilities.
>> Something about “if your only tool is a hammer, every problem looks like a
>> nail”. That gets into training users too though...
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
>> sfisc...@pennmedicine.upenn.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> I am a solr newbie.  I was surprised to discover that a search for
>> kinase* returned fewer results than kinase.
>>> 
>>> Then I read the wildcard documentation<
>> https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
>> and saw why.  kinase* will not match the word "kinase".
>>> 
>>> Our end-users won't expect this behavior.  Presumably the solution would
>> be for them (actually us, on their behalf), to use kinase* OR kinase.
>>> 
>>> But that is kind of a hack.
>>> 
>>> Is there a way we can configure solr to have wildcards match on
>> end-of-word?
>>> 
>>> Thanks,
>>> Steve
>> 
>> 



Re: [External] wildcards match end-of-word?

2020-02-13 Thread Jan Høydahl
Be aware that if you search a field with stemming, then the index will only contain 
the stems, i.e. cars and caring may both be indexed as «car», and when you do a 
wildcard search, all analysis is skipped, so you are only targeting the exact tokens 
that happen to be in that field. Thus a search for «ca*s», «c*ing» or «cars*» will 
not match, but «car*» and even «c*r» will match both these words, which would be 
surprising, right? So if wildcard search is a key feature, you'd better provide a 
copyField with a fieldType in your schema that does not do stemming - probably only 
StandardTokenizer and LowercaseFilter. Then use that field for your wildcard queries 
instead of the generic stemmed field.
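
A minimal sketch of that idea in schema terms (field and type names here are made
up):

    <fieldType name="text_wildcard" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="title_wild" type="text_wildcard" indexed="true" stored="false"/>
    <copyField source="title" dest="title_wild"/>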

Jan

> On 13 Feb 2020, at 13:52, Fischer, Stephen 
> wrote:
> 
> Folks,
> 
> I am seeing very strange (bad) wildcard behavior (solr 8).  
> 
> "kinase" finds hits as expected.  
> 
> "kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like 
> "kinase," and "kinase-" but not "kinase"
> 
> I have done the analysis as Erick suggested (thanks!) but it is not helping 
> me understand why we'd have this problem.
> 
> I have put together 12 screenshots from the Solr web UI that show in detail:
> - the queries I ran to get the results above
> - various analyses trying to understand why
> - the schema for the fieldType in question
> 
> https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing
> 
> thanks,
> steve
> 
> -Original Message-
> From: Sotiris Fragkiskos  
> Sent: Thursday, February 13, 2020 4:03 AM
> To: solr-user@lucene.apache.org
> Subject: [External] Re: wildcards match end-of-word?
> 
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always had 
> the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an 
> external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the term, 
> even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF) Am I 
> doing something very wrong??
> 
> thanks again!
> Sotiri
> 
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
> wrote:
> 
>> Steve:
>> 
>> You _really_ want to get acquainted with the admin UI/Analysis page ;).
>> Choose a core/collection and you should see the choice. It shows you 
>> exactly what transformations your data goes through. If you hover over 
>> the light gray pairs of letters, you’ll get a tooltip showing you what 
>> part of your analysis chain is responsible for a particular change. I 
>> un-check the “verbose” box 95% of the time BTW.
>> 
>> The critical bit is that what comes out of the end of the analysis 
>> pipe are the tokens that are actually _in_ the index. From there, 
>> problems like this make more sense.
>> 
>> My bet is that, as Walter says, you have a stemmer in the analysis 
>> chain and the actual token in the index is “kinas” so of course 
>> “kinase*” won’t be found. By adding OR kinase to the query, that token 
>> is stemmed to “kinas” and matches.
>> 
> > Also, adding debug=query to your URL will show you what the query 
>> looks like after parsing and analysis, also a major tool for figuring 
>> out what’s really happening.
>> 
>> Wildcards are not stemmed, which can lead to surprising results. 
>> There’s no perfect answer here. Let’s claim wildcards _were_ stemmed. 
>> Then you’d have to try to explain why “running*” returned a doc with 
>> only “run” or “runner” or “runs” or... in it, but searching for 
>> “runnin*” did not due the stemmer not recognizing it as a stemmable word.
>> 
>> Finally, one of my personal hot buttons is wildcards in general. 
>> They’re very often over-used because people are used to simple search 
>> capabilities.
>> Something about “if your only tool is a hammer, every problem looks 
>> like a nail”. That gets into training users too though...
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
>> sfisc...@pennmedicine.upenn.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> I am a solr newbie.  I was surprised to discover that a search for
>> kinase* returned fewer results than kinase.
>>> 
>>> Then I read the wildcard documentation<
>> https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.htm
>> l#TheStandardQueryParser-WildcardSearches>,
>> and saw why.  kinase* will not match the word "kinase".
>>> 
>>> Our end-users won't expect this behavior.  Presumably the solution 
>>> would
>> be for them (actually us, on their behalf), to use kinase* OR kinase.
>>> 
>>> But that is kind of a hack.
>>> 
>>> Is there a way we can configure solr to have wildcards match on
>> end-of-word?
>>> 
>>> Thanks,
>>> Steve
>> 
>> 



Re: Zookeeper upgrade required with Solr upgrade?

2020-02-13 Thread Erick Erickson
That should be OK. There were no code changes necessary for that upgrade. see 
SOLR-13363

> On Feb 12, 2020, at 5:34 PM, Rahul Goswami  wrote:
> 
> Hello,
> We are running a SolrCloud (7.2.1) cluster and upgrading to Solr 7.7.2. We
> run a separate multi node zookeeper ensemble which currently runs
> Zookeeper 3.4.10.
> Is it also required to upgrade Zookeeper (to  3.4.14 as per change.txt for
> Solr 7.7.2) along with Solr ?
> 
> I tried a few basic updates requests for a 2 node SolrCloud cluster with
> the older (3.4.10) zookeeper and it seemed to work fine. But just want to
> know if there are any caveats I should be aware of.
> 
> Thanks,
> Rahul



Re: Using MM efficiently to get right number of results

2020-02-13 Thread Erick Erickson
It can be basically anything you can do with a standard Solr query.
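
For instance, something along these lines (field names and weights are made up)
boosts by category and by brand, but only within the top 200 documents being
reranked:

    q=running shoes
    rq={!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=1.0}
    rqq=(category:X^2 OR brand:Y^3)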

> On Feb 13, 2020, at 9:09 AM, Nitin Arora  wrote:
> 
> Thanks Erick, a follow-up question for RerankQParser:
> How complex can the rerank query itself be? Can we add multiple boost
> factors based on different conditions - say, if category is X boost by 2,
> if brand is Y boost by 3, etc.?
> 
> On Mon, 10 Feb 2020 at 18:12, Erick Erickson 
> wrote:
> 
>> There isn’t really  an “industry standard”, since the reasons someone
>> wants this kind of behavior vary from situation to situation. That said,
>> Solr has RerankQParserPlugin that’s designed for this.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 10, 2020, at 4:23 AM, Nitin Arora  wrote:
>>> 
>>> I am looking for an efficient way for setting the MM(minimum should
>> match)
>>> parameter for my solr search queries. As we go from MM=100% to MM=0%, we
>>> move from lots of zero result queries on one hand to too many irrelevant
>>> results (which may then get boosted by other factors) on the other. I can
>>> think of multiple ways to approach this:
>>> 1) Try decreasing mm from 100% to 90% to 80% in a staggered manner till
>> you
>>> have just the right number of results. Does not sound very efficient
>> though.
>>> 2) Use a low value of MM, say 0%, and then pick only the top 200 results
>> to
>>> apply other boost factors to. Would not allow to use bf, bq, boost within
>>> SOLR.
>>> 
>>> My question is, What is the standard industry practice in this matter.
>> How
>>> do you go about ensuring that your search returns *just the right* number
>>> of results so that you can use other boost functions on the relevant set
>> of
>>> results.
>>> 
>>> Thanks in advance
>>> Nitin
>> 
>> 



Re: Using MM efficiently to get right number of results

2020-02-13 Thread Nitin Arora
Thanks Erick, a follow-up question for RerankQParser:
How complex can the rerank query itself be? Can we add multiple boost
factors based on different conditions - say, if category is X boost by 2,
if brand is Y boost by 3, etc.?

On Mon, 10 Feb 2020 at 18:12, Erick Erickson 
wrote:

> There isn’t really  an “industry standard”, since the reasons someone
> wants this kind of behavior vary from situation to situation. That said,
> Solr has RerankQParserPlugin that’s designed for this.
>
> Best,
> Erick
>
> > On Feb 10, 2020, at 4:23 AM, Nitin Arora  wrote:
> >
> > I am looking for an efficient way for setting the MM(minimum should
> match)
> > parameter for my solr search queries. As we go from MM=100% to MM=0%, we
> > move from lots of zero result queries on one hand to too many irrelevant
> > results (which may then get boosted by other factors) on the other. I can
> > think of multiple ways to approach this:
> > 1) Try decreasing mm from 100% to 90% to 80% in a staggered manner till
> you
> > have just the right number of results. Does not sound very efficient
> though.
> > 2) Use a low value of MM, say 0%, and then pick only the top 200 results
> to
> > apply other boost factors to. Would not allow to use bf, bq, boost within
> > SOLR.
> >
> > My question is, What is the standard industry practice in this matter.
> How
> > do you go about ensuring that your search returns *just the right* number
> > of results so that you can use other boost functions on the relevant set
> of
> > results.
> >
> > Thanks in advance
> > Nitin
>
>


RE: [External] Re: wildcards match end-of-word?

2020-02-13 Thread Fischer, Stephen
Also, if helpful, here is our solrconfig.xml
 
https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf/solrconfig.xml

Thanks again, from a Solr Newbie,
steve

-Original Message-
From: Fischer, Stephen  
Sent: Thursday, February 13, 2020 7:52 AM
To: solr-user@lucene.apache.org
Subject: RE: [External] Re: wildcards match end-of-word?

Folks,

I am seeing very strange (bad) wildcard behavior (solr 8).  

"kinase" finds hits as expected.  

"kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like 
"kinase," and "kinase-" but not "kinase"

I have done the analysis as Erick suggested (thanks!) but it is not helping me 
understand why we'd have this problem.

I have put together 12 screenshots from the Solr web UI that show in detail:
- the queries I ran to get the results above
- various analyses trying to understand why
- the schema for the fieldType in question

https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing

thanks,
steve

-Original Message-
From: Sotiris Fragkiskos 
Sent: Thursday, February 13, 2020 4:03 AM
To: solr-user@lucene.apache.org
Subject: [External] Re: wildcards match end-of-word?

Hi Erick,
thanks very much for this information, it was immensely useful, I always had 
the same question!
I'm now seeing the Analysis page and finally I don't have to rely on an 
external online stemmer to see what solr *probably* stemmed the term to!!
But I still can't make the asterisk and question mark work inside the term, 
even in the earlier parts of it.
e.g. tr?ining
I would expect it to match train. But it doesn't.
PSF at the end just shows t | ain
every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF) Am I 
doing something very wrong??

thanks again!
Sotiri

On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
wrote:

> Steve:
>
> You _really_ want to get acquainted with the admin UI/Analysis page ;).
> Choose a core/collection and you should see the choice. It shows you 
> exactly what transformations your data goes through. If you hover over 
> the light gray pairs of letters, you’ll get a tooltip showing you what 
> part of your analysis chain is responsible for a particular change. I 
> un-check the “verbose” box 95% of the time BTW.
>
> The critical bit is that what comes out of the end of the analysis 
> pipe are the tokens that are actually _in_ the index. From there, 
> problems like this make more sense.
>
> My bet is that, as Walter says, you have a stemmer in the analysis 
> chain and the actual token in the index is “kinas” so of course 
> “kinase*” won’t be found. By adding OR kinase to the query, that token 
> is stemmed to “kinas” and matches.
>
> Also, adding debug=query to your URL will show you what the query 
> looks like after parsing and analysis, also a major tool for figuring 
> out what’s really happening.
>
> Wildcards are not stemmed, which can lead to surprising results. 
> There’s no perfect answer here. Let’s claim wildcards _were_ stemmed. 
> Then you’d have to try to explain why “running*” returned a doc with 
> only “run” or “runner” or “runs” or... in it, but searching for 
> “runnin*” did not due the stemmer not recognizing it as a stemmable word.
>
> Finally, one of my personal hot buttons is wildcards in general. 
> They’re very often over-used because people are used to simple search 
> capabilities.
> Something about “if your only tool is a hammer, every problem looks 
> like a nail”. That gets into training users too though...
>
> Best,
> Erick
>
> > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
> sfisc...@pennmedicine.upenn.edu> wrote:
> >
> > Hi,
> >
> > I am a solr newbie.  I was surprised to discover that a search for
> kinase* returned fewer results than kinase.
> >
> > Then I read the wildcard documentation<
> https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.htm
> l#TheStandardQueryParser-WildcardSearches>,
> and saw why.  kinase* will not match the word "kinase".
> >
> > Our end-users won't expect this behavior.  Presumably the solution 
> > would
> be for them (actually us, on their behalf), to use kinase* OR kinase.
> >
> > But that is kind of a hack.
> >
> > Is there a way we can configure solr to have wildcards match on
> end-of-word?
> >
> > Thanks,
> > Steve
>
>


Re: [External] Re: wildcards match end-of-word?

2020-02-13 Thread Sotiris Fragkiskos
Hi,
I could be wrong, but I'm starting to think that it has to do with the
fieldType. In our case, wildcards don't seem to work at all with text_en
types, but they do work with string types.

On Thu, Feb 13, 2020 at 1:52 PM Fischer, Stephen <
sfisc...@pennmedicine.upenn.edu> wrote:

> Folks,
>
> I am seeing very strange (bad) wildcard behavior (solr 8).
>
> "kinase" finds hits as expected.
>
> "kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like
> "kinase," and "kinase-" but not "kinase"
>
> I have done the analysis as Erick suggested (thanks!) but it is not
> helping me understand why we'd have this problem.
>
> I have put together 12 screenshots from the Solr web UI that show in
> detail:
> - the queries I ran to get the results above
> - various analyses trying to understand why
> - the schema for the fieldType in question
>
>
> https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing
>
> thanks,
> steve
>
> -Original Message-
> From: Sotiris Fragkiskos 
> Sent: Thursday, February 13, 2020 4:03 AM
> To: solr-user@lucene.apache.org
> Subject: [External] Re: wildcards match end-of-word?
>
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always
> had the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an
> external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the
> term, even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
> Am I doing something very wrong??
>
> thanks again!
> Sotiri
>
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
> wrote:
>
> > Steve:
> >
> > You _really_ want to get acquainted with the admin UI/Analysis page ;).
> > Choose a core/collection and you should see the choice. It shows you
> > exactly what transformations your data goes through. If you hover over
> > the light gray pairs of letters, you’ll get a tooltip showing you what
> > part of your analysis chain is responsible for a particular change. I
> > un-check the “verbose” box 95% of the time BTW.
> >
> > The critical bit is that what comes out of the end of the analysis
> > pipe are the tokens that are actually _in_ the index. From there,
> > problems like this make more sense.
> >
> > My bet is that, as Walter says, you have a stemmer in the analysis
> > chain and the actual token in the index is “kinas” so of course
> > “kinase*” won’t be found. By adding OR kinase to the query, that token
> > is stemmed to “kinas” and matches.
> >
> > Also, adding debug=query to your URL will show you what the query
> > looks like after parsing and analysis, also a major tool for figuring
> > out what’s really happening.
> >
> > Wildcards are not stemmed, which can lead to surprising results.
> > There’s no perfect answer here. Let’s claim wildcards _were_ stemmed.
> > Then you’d have to try to explain why “running*” returned a doc with
> > only “run” or “runner” or “runs” or... in it, but searching for
> > “runnin*” did not due the stemmer not recognizing it as a stemmable word.
> >
> > Finally, one of my personal hot buttons is wildcards in general.
> > They’re very often over-used because people are used to simple search
> capabilities.
> > Something about “if your only tool is a hammer, every problem looks
> > like a nail”. That gets into training users too though...
> >
> > Best,
> > Erick
> >
> > > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
> > sfisc...@pennmedicine.upenn.edu> wrote:
> > >
> > > Hi,
> > >
> > > I am a solr newbie.  I was surprised to discover that a search for
> > kinase* returned fewer results than kinase.
> > >
> > > Then I read the wildcard documentation<
> > https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.htm
> > l#TheStandardQueryParser-WildcardSearches>,
> > and saw why.  kinase* will not match the word "kinase".
> > >
> > > Our end-users won't expect this behavior.  Presumably the solution
> > > would
> > be for them (actually us, on their behalf), to use kinase* OR kinase.
> > >
> > > But that is kind of a hack.
> > >
> > > Is there a way we can configure solr to have wildcards match on
> > end-of-word?
> > >
> > > Thanks,
> > > Steve
> >
> >
>


RE: [External] Re: wildcards match end-of-word?

2020-02-13 Thread Fischer, Stephen
Folks,

I am seeing very strange (bad) wildcard behavior (solr 8).  

"kinase" finds hits as expected.  

"kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like 
"kinase," and "kinase-" but not "kinase"

I have done the analysis as Erick suggested (thanks!) but it is not helping me 
understand why we'd have this problem.

I have put together 12 screenshots from the Solr web UI that show in detail:
- the queries I ran to get the results above
- various analyses trying to understand why
- the schema for the fieldType in question

https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing

thanks,
steve

-Original Message-
From: Sotiris Fragkiskos  
Sent: Thursday, February 13, 2020 4:03 AM
To: solr-user@lucene.apache.org
Subject: [External] Re: wildcards match end-of-word?

Hi Erick,
thanks very much for this information, it was immensely useful, I always had 
the same question!
I'm now seeing the Analysis page and finally I don't have to rely on an 
external online stemmer to see what solr *probably* stemmed the term to!!
But I still can't make the asterisk and question mark work inside the term, 
even in the earlier parts of it.
e.g. tr?ining
I would expect it to match train. But it doesn't.
PSF at the end just shows t | ain
every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF) Am I 
doing something very wrong??

thanks again!
Sotiri

On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
wrote:

> Steve:
>
> You _really_ want to get acquainted with the admin UI/Analysis page ;).
> Choose a core/collection and you should see the choice. It shows you 
> exactly what transformations your data goes through. If you hover over 
> the light gray pairs of letters, you’ll get a tooltip showing you what 
> part of your analysis chain is responsible for a particular change. I 
> un-check the “verbose” box 95% of the time BTW.
>
> The critical bit is that what comes out of the end of the analysis 
> pipe are the tokens that are actually _in_ the index. From there, 
> problems like this make more sense.
>
> My bet is that, as Walter says, you have a stemmer in the analysis 
> chain and the actual token in the index is “kinas” so of course 
> “kinase*” won’t be found. By adding OR kinase to the query, that token 
> is stemmed to “kinas” and matches.
>
> Also, adding debug=query to your URL will show you what the query 
> looks like after parsing and analysis, also a major tool for figuring 
> out what’s really happening.
>
> Wildcards are not stemmed, which can lead to surprising results. 
> There’s no perfect answer here. Let’s claim wildcards _were_ stemmed. 
> Then you’d have to try to explain why “running*” returned a doc with 
> only “run” or “runner” or “runs” or... in it, but searching for 
> “runnin*” did not due the stemmer not recognizing it as a stemmable word.
>
> Finally, one of my personal hot buttons is wildcards in general. 
> They’re very often over-used because people are used to simple search 
> capabilities.
> Something about “if your only tool is a hammer, every problem looks 
> like a nail”. That gets into training users too though...
>
> Best,
> Erick
>
> > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
> sfisc...@pennmedicine.upenn.edu> wrote:
> >
> > Hi,
> >
> > I am a solr newbie.  I was surprised to discover that a search for
> kinase* returned fewer results than kinase.
> >
> > Then I read the wildcard documentation<
> https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
> and saw why.  kinase* will not match the word "kinase".
> >
> > Our end-users won't expect this behavior.  Presumably the solution 
> > would
> be for them (actually us, on their behalf), to use kinase* OR kinase.
> >
> > But that is kind of a hack.
> >
> > Is there a way we can configure solr to have wildcards match on
> end-of-word?
> >
> > Thanks,
> > Steve
>
>
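
To make the debug=query suggestion above concrete, here is a minimal sketch (collection and field names are placeholders, not from this thread): comparing the parsed query for a plain term and for a wildcard term shows that only the former goes through the analysis chain.

# Plain term: analyzed, so a stemmer would rewrite kinase -> kinas
curl -s "http://localhost:8983/solr/mycollection/select?q=text_field:kinase&rows=0&debug=query" | jq '.debug.parsedquery'

# Wildcard term: not analyzed, so it must match the raw indexed tokens
curl -s "http://localhost:8983/solr/mycollection/select?q=text_field:kinase*&rows=0&debug=query" | jq '.debug.parsedquery'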


Re: Async RELOADCOLLECTION never completes

2020-02-13 Thread Karl Stoney
When performing a rolling restart we see:

09:43:31.890 
[OverseerThreadFactory-42-thread-5-processing-n:solr-5.search-solr.prod.k8.atcloud.io:80_solr]
 ERROR org.apache.solr.cloud.OverseerTaskProcessor - 
:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
= Session expired for /overseer/collection-map-failure

Which I find interesting, as everything (resource-wise) is very healthy.
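
Since the exception is about the Overseer losing its ZooKeeper session, it may be worth checking the Overseer state and its queue sizes right after the restart. A rough sketch against the same endpoint (exact response field names can vary slightly by version):

# Which node is the Overseer, and how big are its work queues?
curl -s "http://solr.search-solr.prod.k8.atcloud.io/solr/admin/collections?action=OVERSEERSTATUS" | jq '{leader, overseer_queue_size, overseer_work_queue_size}'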

On 13/02/2020, 09:34, "Karl Stoney"  
wrote:

Hi,
We’re periodically seeing an ASYNC task to RELOADCOLLECTION never complete; 
it’s just permanently “running”:

❯ curl -s 
http://solr.search-solr.prod.k8.atcloud.io/solr/admin/collections\?action\=REQUESTSTATUS\&requestid\=1581585716 | jq .
{
  "responseHeader": {
"status": 0,
"QTime": 2
  },
  "status": {
"state": "running",
"msg": "found [1581585716] in running tasks"
  }
}

The collection appears to have been reloaded fine (from the gui, it’s using 
the right config), so we’re a bit baffled.

The only way I’ve found to clear this up is a rolling restart of Solr.

Solr 8.4.1

Any ideas?




Async RELOADCOLLECTION never completes

2020-02-13 Thread Karl Stoney
Hi,
We’re periodically seeing an ASYNC task to RELOADCOLLECTION never complete; 
it’s just permanently “running”:

❯ curl -s 
http://solr.search-solr.prod.k8.atcloud.io/solr/admin/collections\?action\=REQUESTSTATUS\&requestid\=1581585716
 | jq .
{
  "responseHeader": {
"status": 0,
"QTime": 2
  },
  "status": {
"state": "running",
"msg": "found [1581585716] in running tasks"
  }
}

The collection appears to have been reloaded fine (from the gui, it’s using the 
right config), so we’re a bit baffled.

The only way I’ve found to clear this up is a rolling restart of Solr.

Solr 8.4.1

Any ideas?
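
Not mentioned above, but worth noting: the Collections API also has a DELETESTATUS action for clearing stored async responses. It is intended for completed or failed tasks, so it may not dislodge one that is genuinely stuck in "running", but it can save a restart when the stored status is simply stale. A sketch:

# Clear the stored status for one async request id
curl -s "http://solr.search-solr.prod.k8.atcloud.io/solr/admin/collections?action=DELETESTATUS&requestid=1581585716" | jq .

# Or flush all stored completed/failed async responses
curl -s "http://solr.search-solr.prod.k8.atcloud.io/solr/admin/collections?action=DELETESTATUS&flush=true" | jq .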


Re: wildcards match end-of-word?

2020-02-13 Thread Sotiris Fragkiskos
Hi Erick,
thanks very much for this information, it was immensely useful, I always
had the same question!
I'm now seeing the Analysis page and finally I don't have to rely on an
external online stemmer to see what solr *probably* stemmed the term to!!
But I still can't make the asterisk and question mark work inside the term,
even in the earlier parts of it.
e.g. tr?ining
I would expect it to match train. But it doesn't.
PSF at the end just shows t | ain
every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
Am I doing something very wrong??

thanks again!
Sotiri
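
For what it's worth, the same reasoning applies here (the collection and field type names below are placeholders): index-time analysis turns "training" into "train", while the wildcard term "tr?ining" is not analyzed at all, so it is compared against the five-letter token "train" and can never match. The field analysis handler shows the index-time side directly:

# See what the index-time chain does to "training" for a given field type
curl -s "http://localhost:8983/solr/mycollection/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=training" | jq .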

On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson 
wrote:

> Steve:
>
> You _really_ want to get acquainted with the admin UI/Analysis page ;).
> Choose a core/collection and you should see the choice. It shows you
> exactly what transformations your data goes through. If you hover over the
> light gray pairs of letters, you’ll get a tooltip showing you what part of
> your analysis chain is responsible for a particular change. I un-check the
> “verbose” box 95% of the time BTW.
>
> The critical bit is that what comes out of the end of the analysis pipe
> are the tokens that are actually _in_ the index. From there, problems like
> this make more sense.
>
> My bet is that, as Walter says, you have a stemmer in the analysis chain
> and the actual token in the index is “kinas” so of course “kinase*” won’t
> be found. By adding OR kinase to the query, that token is stemmed to
> “kinas” and matches.
>
> Also, adding &debug=query to your URL will show you what the query looks
> like after parsing and analysis, also a major tool for figuring out what’s
> really happening.
>
> Wildcards are not stemmed, which can lead to surprising results. There’s
> no perfect answer here. Let’s claim wildcards _were_ stemmed. Then you’d
> have to try to explain why “running*” returned a doc with only “run” or
> “runner” or “runs” or... in it, but searching for “runnin*” did not due the
> stemmer not recognizing it as a stemmable word.
>
> Finally, one of my personal hot buttons is wildcards in general. They’re
> very often over-used because people are used to simple search capabilities.
> Something about “if your only tool is a hammer, every problem looks like a
> nail”. That gets into training users too though...
>
> Best,
> Erick
>
> > On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <
> sfisc...@pennmedicine.upenn.edu> wrote:
> >
> > Hi,
> >
> > I am a solr newbie.  I was surprised to discover that a search for
> kinase* returned fewer results than kinase.
> >
> > Then I read the wildcard documentation<
> https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>,
> and saw why.  kinase* will not match the word "kinase".
> >
> > Our end-users won't expect this behavior.  Presumably the solution would
> be for them (actually us, on their behalf), to use kinase* OR kinase.
> >
> > But that is kind of a hack.
> >
> > Is there a way we can configure solr to have wildcards match on
> end-of-word?
> >
> > Thanks,
> > Steve
>
>
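
One mitigation that is not spelled out in this thread, but is the usual answer to "wildcards are not stemmed": keep a second, unstemmed copy of the field and query both. A rough sketch via the Schema API, with every name made up for illustration and assuming the stock text_general type (which has no stemmer):

# Add an unstemmed sibling field and copy the source field into it
curl -s -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycollection/schema" -d '{
    "add-field":      { "name": "text_exactish", "type": "text_general", "indexed": true, "stored": false },
    "add-copy-field": { "source": "text_field", "dest": "text_exactish" }
  }'

After reindexing the affected documents, a query like text_field:kinase OR text_exactish:kinase* then matches both the stemmed and the literal forms.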


Would changing the schema version from 1.5 to 1.6 require a reindex

2020-02-13 Thread Karl Stoney
Hey,
I’m going to bump our schema version from 1.5 to 1.6 to get the implicit 
useDocValuesAsStored=true. Would this require a reindex?

Thanks
Karl
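
As far as I know, useDocValuesAsStored only controls whether docValues that are already on disk get returned at query time, so flipping the schema version should not by itself force a reindex. A quick sketch to confirm what a field's effective properties end up being (collection and field names are placeholders):

# Show a field's definition including properties inherited from its type and schema defaults
curl -s "http://localhost:8983/solr/mycollection/schema/fields/myfield?showDefaults=true" | jq .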


Re: Mongolian language in Solr

2020-02-13 Thread Charlie Hull

Hi,

There's no Mongolian stemmer in Snowball, the stemmer project Lucene 
uses. I found one paper discussing how one might lemmatize Mongolian:

https://www.researchgate.net/publication/220229332_A_lemmatization_method_for_Mongolian_and_its_application_to_indexing_for_information_retrieval
https://dl.acm.org/doi/10.1016/j.ipm.2009.01.008
but no actual code. Of course, you could use Snowball to build your own 
stemmer. https://snowballstem.org/


I did have more success finding Mongolian stopwords 
https://github.com/elastic/elasticsearch/issues/40434 - someone over in 
Elasticsearch land seems to have the same problem as you do.
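
If it helps, here is a rough sketch of wiring such a stopword list into a basic field type through the Schema API. This assumes you have already copied the list into the configset as lang/stopwords_mn.txt, and the type and collection names are made up:

# Minimal Mongolian text type: standard tokenizer, lowercase, stopwords, no stemming
curl -s -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycollection/schema" -d '{
    "add-field-type": {
      "name": "text_mn",
      "class": "solr.TextField",
      "analyzer": {
        "tokenizer": { "class": "solr.StandardTokenizerFactory" },
        "filters": [
          { "class": "solr.LowerCaseFilterFactory" },
          { "class": "solr.StopFilterFactory", "ignoreCase": "true", "words": "lang/stopwords_mn.txt" }
        ]
      }
    }
  }'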


Best

Charlie

On 12/02/2020 11:41, Samir Joshi wrote:

Hi,

Is it possible to get Mongolian language support in Solr indexing?

Regards,

Samir Joshi

VFS GLOBAL
EST. 2001 | Partnering Governments. Providing Solutions.

10th Floor, Tower A, Urmi Estate, 95, Ganpatrao Kadam Marg, Lower Parel (W), 
Mumbai 400 013, India
Mob: +91 9987550070 | sami...@vfsglobal.com | 
www.vfsglobal.com







--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com