Re: Get first value in a multivalued field

2021-03-04 Thread Walter Underwood
You can copy the field to another field, then use the 
FirstFieldValueUpdateProcessorFactory to limit that field to the first value. 
At least, that seems to be what that URP does. I have not used it.

https://solr.apache.org/guide/8_8/update-request-processors.html
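
A sketch of what that could look like in solrconfig.xml — the chain name and
field names are made up, and CloneFieldUpdateProcessorFactory is one way to do
the "copy the field" step; treat this as an illustration, not a tested config:

<updateRequestProcessorChain name="first-author">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">authors</str>
    <str name="dest">first_author</str>
  </processor>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">first_author</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>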

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 4, 2021, at 11:42 AM, ufuk yılmaz  wrote:
> 
> Hi,
> 
> Is it possible in any way to get the first value in a multivalued field? 
> Using function queries, streaming expressions or any other way without 
> reindexing? (Stream decorators have array(), but no way to get a value at a 
> specific index?)
> 
> Another one, is it possible to match a regex to a text field and extract only 
> the matching part?
> 
> I tried very hard for this too but couldn’t find a way.
> 
> --ufuk
> 
> Sent from Mail for Windows 10
> 



Re: defragmentation can improve performance on SATA class 10 disk ~10000 rpm ?

2021-02-22 Thread Walter Underwood
True, but Windows does cache files. It has been a couple of decades since I ran 
search on Windows, but Ultraseek got large gains from setting some sort of 
system property to make it act like a file server and give file caching equal 
priority with program caching.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 22, 2021, at 9:22 AM, dmitri maziuk  wrote:
> 
> On 2021-02-22 11:18 AM, Shawn Heisey wrote:
> 
>> The OS automatically uses unallocated memory to cache data on the disk.   
>> Because memory is far faster than any disk, even SSD, it performs better.
> 
> Depends on the OS. From "defragmenting solrdata folder" I suspect the OP is 
> on Windows, whose filesystems and memory management do not always work the 
> way the Unix textbook says.
> 
> Dima



Re: defragmentation can improve performance on SATA class 10 disk ~10000 rpm ?

2021-02-22 Thread Walter Underwood
A forced merge might improve speed 20%. Going from spinning disk to SSD
will improve speed 20X or more. Don’t waste your time even thinking about
forced merges.

You need to get SSDs.

The even bigger speedup is to get enough RAM that the OS can keep the 
Solr index files in file system buffers. Check how much space is used by
your indexes, then make sure that there is that much available RAM that
is not used by the OS or Solr JVM.
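
A quick way to sanity-check this (the paths are examples — point du at wherever
your index directories actually live):

  du -sh /var/solr/data/*/data/index    # on-disk size of each core's index
  free -h                               # "available" column ~ RAM usable for page cache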

Some people make the mistake of giving a huge heap to the JVM, thinking
this will improve caching. This almost always makes things worse, by 
using RAM that could be used for caching files. 8GB of heap is usually enough.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 21, 2021, at 11:52 PM, Danilo Tomasoni  wrote:
> 
> Hello all,
> we are running a solr instance with around 41 MLN documents on a SATA class 
> 10 disk with around 10.000 rpm.
> We are experiencing very slow query responses (on the order of hours) with 
> an average of 205 segments.
> We made a test with a normal PC and an SSD disk, and there the same Solr 
> instance with the same data and the same number of segments was around 45 
> times faster.
> Force optimize was also tried to improve performance, but it was very 
> slow, so we abandoned it.
> 
> Since we still don't have enterprise server ssd disks, we are now wondering 
> if in the meanwhile defragmenting the solrdata folder can help.
> The idea is that due to many updates, each segment file is fragmented across 
> different physical blocks.
> Put another way, each segment file is non-contiguous on disk, and this can 
> slow down the Solr response.
> 
> What do you suggest?
> Is this somewhat equivalent to force-optimize or it can be faster?
> 
> Thank you.
> Danilo
> 
> Danilo Tomasoni
> 
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
> 



Re: Why Solr questions on stackoverflow get very few views and answers, if at all?

2021-02-12 Thread Walter Underwood
Many questions have responses as comments, but no actual answers. One frequent 
contributor doesn’t understand how StackOverflow works, so he’s posting answers 
as comments. He’s also carrying on conversations instead of crafting a useful, 
complete answer.

I just answered a few. Mostly with “don’t use stop words” and “Solr is not a 
database”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 12, 2021, at 3:03 AM, Charlie Hull  
> wrote:
> 
> I've answered a few in my time, but my experience is that if you do so you 
> then get emailed a whole load more questions some of which aren't even 
> relevant to Solr! Also, quite a few of them are 'here is 3 pages of code 
> please debug it for me no I won't tell the actual error I got'.
> 
> This is the best place to come,  also there's the IRC channel, the new Slack 
> gateway to this at https://s.apache.org/solr-slack and in our own Relevance 
> Slack at http://opensourceconnections.com/slack there's a #solr channel (as 
> well as many others on search & relevance topics).
> 
> Solr is 'hot' (but not as hot as Elasticsearch), and search is still a niche 
> business overall.
> 
> HTH
> 
> Cheers
> 
> Charlie
> 
> On 12/02/2021 10:37, ufuk yılmaz wrote:
>> Is it because the main place for q is this mailing list, or somewhere else 
>> that I don’t know?
>> 
>> Or Solr isn’t ‘hot’ as some other topics?
>> 
>> Sent from Mail for Windows 10
>> 
>> 
> 
> -- 
> Charlie Hull - Managing Consultant at OpenSource Connections Limited 
> 
> Founding member of The Search Network <https://thesearchnetwork.com/> and 
> co-author of Searching the Enterprise 
> <https://opensourceconnections.com/about-us/books-resources/>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828



Shards and circuit breakers

2021-02-03 Thread Walter Underwood
Should circuit breakers only kill external search requests and not 
cluster-internal requests to shards?

Circuit breakers can kill any request, whether it is a client request from 
outside the cluster or an internal distributed request to a shard. Killing a 
portion of a distributed request will affect the main request. I am not sure whether a 
503 from a shard kills the whole request or causes partial results, but either way it 
isn’t good.

We run with 8 shards. If a circuit breaker is killing 10% of requests on each 
host, that will hit 57% of all external requests (0.9^8 ≈ 0.43, so only 43% of 
external requests get through all eight shards untouched). That seems 
like “overkill” to me. If it only kills external requests, then 10% means 10%.

Killing only external requests requires that external requests go roughly 
equally to all hosts in the cluster, or at least all NRT or PULL replicas.
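
For reference, the solrconfig.xml syntax in the 8.7-era ref guide looks roughly 
like this — the threshold values here are only placeholders:

<circuitBreaker class="solr.CircuitBreakerManager" enabled="true">
  <str name="memEnabled">true</str>
  <str name="memThreshold">75</str>
  <str name="cpuEnabled">true</str>
  <str name="cpuThreshold">75</str>
</circuitBreaker>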

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Events on updating documents

2021-01-21 Thread Walter Underwood
Solr is not a database. I strongly recommend that you NOT use it as a data 
store. You will lose data.

Solr does not have transactions. Don’t think of a Solr “commit” as a database 
commit. It is a command to start indexing the queued updates. It does not even 
attempt to meet ACID properties.

Redesign your system to use a database as a data store.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 20, 2021, at 11:49 PM, haris.k...@vnc.biz wrote:
> 
> Hello,
> 
> We at VNC are using Solr for search and as a data store. We have a use-case 
> in which we want to hit a REST endpoint whenever documents are inserted, 
> updated, or deleted in Solr with the documents under consideration as well. 
> When exploring the Solr documentation, we found Event Listeners 
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-EventListeners>
>  with postCommit and postOptimize events. We have configured Solr to do 
> soft-commits every second and hard-commits every ten minutes to keep 
> real-time indexing intact. With that in mind the questions are:
> 
> Do we get the documents updated in the postCommit event? (Not able to find 
> any examples)
> Are there other events that are triggered when a doc is updated, deleted, or 
> inserted like those we have in an RDBMS?
> Is there a postSoftCommit event as well? (not mentioned in official docs)
> 
> Mit freundlichen Grüssen / Kind regards
> 
> Muhammad Haris Khan
> 
> VNC - Virtual Network Consult
> 
> -- Solr Ingenieur --



Re: different score from different replica of same shard

2021-01-13 Thread Walter Underwood
Yes, check performance before turning on the stats cache in prod.

When we tested the LRUStatsCache in 6.6.2, searches were 11X slower.

It should be possible to do distributed IDF with little extra overhead.
Infoseek was doing that in 1995 and the patent on the technique has
expired.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
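
The change Markus describes below is a one-line, top-level element in 
solrconfig.xml — this is the standard form, but verify it against your 
version's ref guide:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>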

> On Jan 13, 2021, at 6:31 AM, Markus Jelsma  wrote:
> 
> Hallo Bernd,
> 
> I see the different replica types in the 7.1 [1] manual but not in the 6.6.
> ExactStatsCache should work in 6.6, just add it to solrconfig.xml, not the
> request handler [2]. It will slow down searches due to added overhead.
> 
> Regards,
> Markus
> 
> [1]
> https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
> [2] https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
> 
> Op wo 13 jan. 2021 om 15:11 schreef Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de>:
> 
>> Hello Markus,
>> 
>> thanks a lot.
>> Is TLOG also for SOLR 6.6.6 or only 8.x and up?
>> 
>> I will first try ExactStatsCache.
>> Should be added as invariant to request handler, right?
>> 
>> Comparing the replica index directories they have different size and
>> the index version and generation is different. Also Max Doc.
>> But Num Docs is the same.
>> 
>> Regards,
>> Bernd
>> 
>> 
>> Am 13.01.21 um 14:54 schrieb Markus Jelsma:
>>> Hello Bernd,
>>> 
>>> This is normal for NRT replicas, because the way segments are merged and
>>> deletes are removed is not synchronized between replicas. In that case
>>> counts for TF and IDF and norms become slightly different.
>>> 
>>> You can either use ExactStatsCache that fetches counts for terms before
>>> scoring, so that all replica's use the same counts. Or change the replica
>>> types to TLOG. With TLOG segments are fetched from the leader and thus
>>> identical.
>>> 
>>> Regards,
>>> Markus
>>> 
>>> Op wo 13 jan. 2021 om 14:45 schreef Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de>:
>>> 
>>>> Hello list,
>>>> 
>>>> a question for better understanding scoring of a shard in a cloud.
>>>> 
>>>> I see different scores from different replicas of the same shard.
>>>> Is this normal and if yes, why?
>>>> 
>>>> My understanding until now was that replicas are always the same within
>> a
>>>> shard
>>>> and the same query to each replica within a shard gives always the same
>>>> score.
>>>> 
>>>> Can someone help me to understand this?
>>>> 
>>>> Regards
>>>> Bernd
>>>> 
>>> 
>> 



Re: Apache Solr in High Availability Primary and Secondary node.

2021-01-11 Thread Walter Underwood
Use a load balancer. We’re in AWS, so we use an AWS ALB.

If you don’t have a failure-tolerant load balancer implementation, the site has 
bigger problems than search.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 11, 2021, at 10:15 AM, Dmitri Maziuk  wrote:
> 
> On 1/11/2021 11:25 AM, Walter Underwood wrote:
>> There are all sorts of problems with the primary/secondary approach. How do 
>> you know
>> the secondary is working? How do you deal with cold caches on the secondary 
>> when it
>> suddenly gets lots of load?
>> Instead, size the cluster with the number of hosts you need, then add one. 
>> Send traffic
>> to all of them. If any of them goes down, you have the capacity to handle 
>> the traffic.
>> This is called “N+1 provisioning”.
> 
> Where do you send your Solr queries? If you have an HTTP server at an IP 
> address that answers them, that's a single point of failure unless you put it 
> on a heartbeat-managed cluster IP. (I tend to prefer ucarp to pacemaker for that, as 
> the latter is bloated and too cumbersome for simple active/passive setups, 
> but that's OT.)
> 
> Dima



Re: Apache Solr in High Availability Primary and Secondary node.

2021-01-11 Thread Walter Underwood
There are all sorts of problems with the primary/secondary approach. How do you 
know
the secondary is working? How do you deal with cold caches on the secondary 
when it
suddenly gets lots of load?

Instead, size the cluster with the number of hosts you need, then add one. Send 
traffic
to all of them. If any of them goes down, you have the capacity to handle the 
traffic.
This is called “N+1 provisioning”.

This was our rule at Netflix a dozen years ago, running Solr 1.3. I do it the 
same way
today with large sharded clusters, one extra per shard. 
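
As a sketch of that "send traffic to all of them" setup, an N+1 pool behind a 
plain round-robin load balancer might look like this (nginx syntax, hostnames 
hypothetical; an AWS ALB or any other balancer works the same way):

upstream solr_cluster {
    # capacity sized for three hosts, plus one spare
    server solr1.example.com:8983;
    server solr2.example.com:8983;
    server solr3.example.com:8983;
    server solr4.example.com:8983;   # the "+1"
}
server {
    listen 80;
    location /solr/ {
        proxy_pass http://solr_cluster;
    }
}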

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 11, 2021, at 2:41 AM, DAVID MARTIN NIETO  wrote:
> 
> I believe Solr doesn't have this configuration; you need a load balancer 
> configured that way for this.
> 
> Kind regards.
> 
> 
> 
> De: Kaushal Shriyan 
> Enviado: lunes, 11 de enero de 2021 11:32
> Para: solr-user@lucene.apache.org 
> Asunto: Apache Solr in High Availability Primary and Secondary node.
> 
> Hi,
> 
> We are running Apache Solr 8.7.0 search service on CentOS Linux release
> 7.9.2009 (Core).
> 
> Is there a way to set up the Solr search service in High Availability Mode
> in the Primary and Secondary node? For example, if the primary node is down
> secondary node will take care of the service.
> 
> Best Regards,
> 
> Kaushal



Re: Sending compressed (gzip) UpdateRequest with SolrJ

2021-01-08 Thread Walter Underwood
Years ago, working on the Ultraseek spider, we did a bunch of tests on 
compressed HTTP.
I expected it to be a big win, but the results were really inconclusive. 
Sometimes it was faster,
sometimes it was slower. We left it turned off.

It is an absolute win for serving already-compressed static content with Apache 
or whatever.
For dynamic content, it will increase some amount of delay as stuff is 
compressed before
sending. If the content already fits in one or two packets, it is just extra 
overhead. For really
large data, it helps with transmission time, but the processing time for large 
data probably
overwhelms the network time.
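
If the servlet container is configured to inflate request bodies (see the 
GzipHandler discussion below), sending a pre-compressed update from the command 
line would look roughly like this — the URL, core name, and file are illustrative:

  gzip -k docs.json
  curl 'http://localhost:8983/solr/mycore/update?commit=true' \
       -H 'Content-Type: application/json' \
       -H 'Content-Encoding: gzip' \
       --data-binary @docs.json.gz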

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 8, 2021, at 12:01 AM, Gael Jourdan-Weil 
>  wrote:
> 
> You're right Matthew.
> 
> Jetty supports it for responses but for requests it doesn't seem to be the 
> default.
> However I found a configuration not documented that needs to be set in the 
> GzipHandler for it to work: inflateBufferSize.
> 
> For SolrJ it still hacky to send gzip requests, maybe easier to use a regular 
> http call..
> 
> ---
> 
> De : matthew sporleder 
> Envoyé : jeudi 7 janvier 2021 16:43
> À : solr-user@lucene.apache.org 
> Objet : Re: Sending compressed (gzip) UpdateRequest with SolrJ 
>  
> jetty supports http gzip and I've added it to solr before in my own
> installs (and submitted patches to do so by default to solr) but I
> don't know about the handling for solrj.
> 
> IME compression helps a little, sometimes a lot, and never hurts.
> Even the admin interface benefits a lot from regular old http gzip
> 
> On Thu, Jan 7, 2021 at 8:03 AM Gael Jourdan-Weil
>  wrote:
>> 
>> Answering to myself on this one.
>> 
>> Solr uses Jetty 9.x, which does not support compressed requests by itself, 
>> meaning the application behind Jetty (that is, Solr) has to decompress them 
>> itself, which is not the case for now.
>> Thus even without using SolrJ, sending XML compressed with GZIP to Solr (with 
>> cURL for instance) is not possible for now.
>> 
>> Seems quite surprising to me though.
>> 
>> -
>> 
>> Hello,
>> 
>> I was wondering if someone ever had the need to send compressed (gzip) 
>> update requests (adding/deleting documents), especially using SolrJ.
>> 
>> Somehow I expected it to be done by default, but didn't find any 
>> documentation about it and when looking at the code it seems there is no 
>> option to do it. Or is javabin compressed by default?
>> - 
>> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/BinaryRequestWriter.java#L49
>> - 
>> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/request/RequestWriter.java#L55
>>  (if not using Javabin)
>> - 
>> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L587
>> 
>> By the way, is there any documentation about javabin? I could only find one 
>> on the "old wiki".
>> 
>> Thanks,
>> Gaël



Missing processor in example update request processor chain?

2020-12-28 Thread Walter Underwood
The documentation says that the default update request processor chain invokes 
LogUpdateProcessorFactory, DistributedUpdateProcessorFactory, then 
RunUpdateProcessorFactory.

The example immediately below that, also in the default solrconfig.xml file, is 
missing DistributedUpdateProcessorFactory. Is that a documentation bug or am I 
missing something?

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

https://lucene.apache.org/solr/guide/8_7/update-request-processors.html 
<https://lucene.apache.org/solr/guide/8_7/update-request-processors.html>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: CPU and memory circuit breaker documentation issues

2020-12-18 Thread Walter Underwood
Thanks. I’m already familiar with adoc. 
https://issues.apache.org/jira/browse/SOLR-15056 
<https://issues.apache.org/jira/browse/SOLR-15056>

Now I need to brush up on How To Contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 18, 2020, at 12:23 PM, Anshum Gupta  wrote:
> 
> Hi Walter,
> 
> Thanks for taking this up.
> 
> You can file a PR for the documentation change too as our docs are now a
> part of the repo. Here's where you can find the docs:
> https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
> 
> 
> On Fri, Dec 18, 2020 at 9:26 AM Walter Underwood 
> wrote:
> 
>> Looking at the code, the CPU circuit breaker is unusable.
>> 
>> This actually does use Unix load average
>> (operatingSystemMXBean.getSystemLoadAverage()). That is a terrible idea.
>> Interpreting the load average requires knowing the number of CPUs on a
>> system. If I have 16 CPUs, I would probably set the limit at 16, with one
>> process waiting for each CPU.
>> 
>> Unfortunately, this implementation limits the thresholds to 0.5 to 0.95,
>> because the implementer thought they were getting a CPU usage value, I
>> guess. So the whole thing doesn’t work right.
>> 
>> I’ll file a bug and submit a patch to use
>> OperatingSystemMXBean.getSystemCPULoad(). How do I fix the documentation?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Dec 16, 2020, at 10:41 AM, Walter Underwood 
>> wrote:
>>> 
>>> In https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html <
>> https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html>
>>> 
>>> URL to Wikipedia is broken, but that doesn’t matter, because that
>> article is about a different metric. The Unix “load average” is the length
>> of the run queue, the number of processes or threads waiting to run. That
>> can go much, much higher than 1.0. In a high load system, I’ve seen it at
>> 2X the number of CPUs or higher.
>>> 
>>> Remove that link, it is misleading.
>>> 
>>> The page should list the JMX metrics that are used for this. I’m
>> guessing this uses OperatingSystemMXBean.getSystemCPULoad(). That metric
>> goes from 0.0 to 1.0.
>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html
>> <
>> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html
>>> 
>>> 
>>> I can see where the “load average” and “getSystemCPULoad” names cause
>> confusion, but this should be correct in the documents.
>>> 
>>> Which metric is used for the memory threshold? My best guess is that the
>> percentage is calculated from the MemoryUsage object returned by
>> MemoryMXBean.getHeapMemoryUsage().
>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html
>> <
>> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html
>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html
>> <
>> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html
>>> 
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>> 
>> 
> 
> -- 
> Anshum Gupta



Re: CPU and memory circuit breaker documentation issues

2020-12-18 Thread Walter Underwood
Looking at the code, the CPU circuit breaker is unusable.

This actually does use Unix load average 
(operatingSystemMXBean.getSystemLoadAverage()). That is a terrible idea. 
Interpreting the load average requires knowing the number of CPUs on a system. 
If I have 16 CPUs, I would probably set the limit at 16, with one process 
waiting for each CPU.

Unfortunately, this implementation limits the thresholds to 0.5 to 0.95, 
because the implementer thought they were getting a CPU usage value, I guess. 
So the whole thing doesn’t work right.

I’ll file a bug and submit a patch to use 
OperatingSystemMXBean.getSystemCPULoad(). How do I fix the documentation?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 16, 2020, at 10:41 AM, Walter Underwood  wrote:
> 
> In https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html 
> <https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html>
> 
> URL to Wikipedia is broken, but that doesn’t matter, because that article is 
> about a different metric. The Unix “load average” is the length of the run 
> queue, the number of processes or threads waiting to run. That can go much, 
> much higher than 1.0. In a high load system, I’ve seen it at 2X the number of 
> CPUs or higher.
> 
> Remove that link, it is misleading.
> 
> The page should list the JMX metrics that are used for this. I’m guessing 
> this uses OperatingSystemMXBean.getSystemCPULoad(). That metric goes from 0.0 
> to 1.0.
> 
> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html
>  
> <https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html>
> 
> I can see where the “load average” and “getSystemCPULoad” names cause 
> confusion, but this should be correct in the documents.
> 
> Which metric is used for the memory threshold? My best guess is that the 
> percentage is calculated from the MemoryUsage object returned by 
> MemoryMXBean.getHeapMemoryUsage().
> 
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html
>  
> <https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html>
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html
>  
> <https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 



Re: Best example solrconfig.xml?

2020-12-16 Thread Walter Underwood
That sample solrconfig.xml includes <jmx/>, but the 7.0 release notes say that 
is no longer supported. Should that be removed from the config?

"<jmx> element in solrconfig.xml is no longer supported. Equivalent 
functionality can be configured in solr.xml using <metrics><reporter> 
element and SolrJmxReporter implementation. Limited back-compatibility is 
element and SolrJmxReporter implementation. Limited back-compatibility is 
offered by automatically adding a default instance of SolrJmxReporter if it's 
missing, AND when a local MBean server is found (which can be activated either 
via ENABLE_REMOTE_JMX_OPTS in solr.in.sh or via system properties, eg. 
-Dcom.sun.management.jmxremote). This default instance exports all Solr metrics 
from all registries as hierarchical MBeans. This behavior can be also disabled 
by specifying a SolrJmxReporter configuration with a boolean init arg "enabled" 
set to "false". For a more fine-grained control users should explicitly specify 
at least one SolrJmxReporter configuration.”

https://lucene.apache.org/solr/8_7_0/changes/Changes.html#v7.0.0.upgrading_from_solr_6.x

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 15, 2020, at 7:36 PM, Walter Underwood  wrote:
> 
> Thanks. Yeah, already enabled the ClassicIndexSchemaFactory.
> 
> Nice tip about uninvertible=false.
> 
> The circuit breakers look really useful. I was ready to front each server 
> with nginx and let it do the limiting. I’ve now seen both Netflix and Chegg 
> search clusters take out the entire site because they got into a stable 
> congested state. People just don’t believe that will happen until they see it.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 
>> On Dec 15, 2020, at 6:31 PM, Erick Erickson > <mailto:erickerick...@gmail.com>> wrote:
>> 
>> I’d start with that config set, making sure that “schemaless” is disabled.
>> 
>> Do be aware that some of the defaults have changed, although the big change 
>> for docValues was there in 6.0.
>> 
>> One thing you might want to do is set uninvertible=false in your schema. 
>> That’ll cause Solr to barf if you, say, sort, facet, group on a field that 
>> does _not_ have docValues=true. I suspect this will cause no surprises for 
>> you, but it’s kind of a nice backstop to keep from having surprises in terms 
>> of heap size…
>> 
>> Best,
>> Erick
>> 
>>> On Dec 15, 2020, at 6:56 PM, Walter Underwood >> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> We’re moving from 6.6 to 8.7 and I’m thinking of starting with an 8.7 
>>> solrconfig.xml and porting our changes into it.
>>> 
>>> Is this the best one to start with?
>>> 
>>> solr/server/solr/configsets/_default/conf/solrconfig.xml
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>> 
> 



CPU and memory circuit breaker documentation issues

2020-12-16 Thread Walter Underwood
In https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html 
<https://lucene.apache.org/solr/guide/8_7/circuit-breakers.html>

URL to Wikipedia is broken, but that doesn’t matter, because that article is 
about a different metric. The Unix “load average” is the length of the run 
queue, the number of processes or threads waiting to run. That can go much, 
much higher than 1.0. In a high load system, I’ve seen it at 2X the number of 
CPUs or higher.

Remove that link, it is misleading.

The page should list the JMX metrics that are used for this. I’m guessing this 
uses OperatingSystemMXBean.getSystemCPULoad(). That metric goes from 0.0 to 1.0.

https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html
 
<https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html>

I can see where the “load average” and “getSystemCPULoad” names cause 
confusion, but this should be correct in the documents.

Which metric is used for the memory threshold? My best guess is that the 
percentage is calculated from the MemoryUsage object returned by 
MemoryMXBean.getHeapMemoryUsage().

https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html
 
<https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html>
https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html 
<https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryUsage.html>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Best example solrconfig.xml?

2020-12-15 Thread Walter Underwood
Thanks. Yeah, already enabled the ClassicIndexSchemaFactory.

Nice tip about uninvertible=false.

The circuit breakers look really useful. I was ready to front each server with 
nginx and let it do the limiting. I’ve now seen both Netflix and Chegg search 
clusters take out the entire site because they got into a stable congested 
state. People just don’t believe that will happen until they see it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
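
Erick's uninvertible tip below would look something like this in the schema — 
the field and type names are hypothetical:

<fieldType name="string" class="solr.StrField" docValues="true" uninvertible="false" sortMissingLast="true"/>
<field name="category" type="string" indexed="true" stored="true"/>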

> On Dec 15, 2020, at 6:31 PM, Erick Erickson  wrote:
> 
> I’d start with that config set, making sure that “schemaless” is disabled.
> 
> Do be aware that some of the defaults have changed, although the big change 
> for docValues was there in 6.0.
> 
> One thing you might want to do is set uninvertible=false in your schema. 
> That’ll cause Solr to barf if you, say, sort, facet, group on a field that 
> does _not_ have docValues=true. I suspect this will cause no surprises for 
> you, but it’s kind of a nice backstop to keep from having surprises in terms 
> of heap size…
> 
> Best,
> Erick
> 
>> On Dec 15, 2020, at 6:56 PM, Walter Underwood  wrote:
>> 
>> We’re moving from 6.6 to 8.7 and I’m thinking of starting with an 8.7 
>> solrconfig.xml and porting our changes into it.
>> 
>> Is this the best one to start with?
>> 
>> solr/server/solr/configsets/_default/conf/solrconfig.xml
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 



Best example solrconfig.xml?

2020-12-15 Thread Walter Underwood
We’re moving from 6.6 to 8.7 and I’m thinking of starting with an 8.7 
solrconfig.xml and porting our changes into it.

Is this the best one to start with?

solr/server/solr/configsets/_default/conf/solrconfig.xml

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Vulnerabilities in SOLR 8.6.2

2020-12-11 Thread Walter Underwood
1. There is no Solr support team. This is a mailing list of volunteers using 
the software.
2. I do not recommend running Solr in a Docker container for production.
3. Please review the Solr Jira for security issues. If you believe that there 
are security vulnerabilities that need to be fixed, open a Jira issue.

https://issues.apache.org/jira/projects/SOLR/issues/SOLR-14792?filter=allopenissues

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 11, 2020, at 8:50 AM, Narayanan, Lakshmi 
>  wrote:
> 
> Can anyone please advise?
> Who else should be notified to get some guidance on this please??
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>  
>  
> From: Narayanan, Lakshmi <lakshmi.naraya...@mmc.com> 
> Sent: Friday, November 13, 2020 11:21 AM
> To: solr-user@lucene.apache.org
> Subject: FW: Vulnerabilities in SOLR 8.6.2
>  
> This is my 5th attempt in the last 60 days
> Is there anyone looking at these mails?
> Does anyone care?? :(
>  
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>  
>  
> From: Narayanan, Lakshmi <lakshmi.naraya...@mmc.com> 
> Sent: Thursday, October 22, 2020 1:06 PM
> To: solr-user@lucene.apache.org
> Subject: FW: Vulnerabilities in SOLR 8.6.2
>  
> This is my 4th attempt to contact
> Please advise, if there is a build that fixes these vulnerabilities
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>  
>  
> From: Narayanan, Lakshmi <lakshmi.naraya...@mmc.com> 
> Sent: Sunday, October 18, 2020 4:01 PM
> To: solr-user@lucene.apache.org
> Subject: FW: Vulnerabilities in SOLR 8.6.2
>  
> SOLR-User Support team
> Is there anyone who can answer my question or can point to someone who can 
> help
> I have not had any response for the past 3 weeks !?
> Please advise
>  
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>  
>  
> From: Narayanan, Lakshmi <lakshmi.naraya...@mmc.com> 
> Sent: Sunday, October 04, 2020 2:11 PM
> To: solr-user@lucene.apache.org
> Cc: Chattopadhyay, Salil <salil.chattopadh...@mmc.com>; Mutnuri, Vishnu D 
> <vishnu.d.mutn...@mmc.com>; Pathak, Omkar <omkar.pat...@mmc.com>; 
> Shenouda, Nasir B <nasir.b.sheno...@mmc.com>
> Subject: RE: Vulnerabilities in SOLR 8.6.2
>  
> Hello Solr-User Support team
> Please advise or provide further guidance on the request below
>  
> Thank you!
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>  
>  
> From: Narayanan, Lakshmi <lakshmi.naraya...@mmc.com> 
> Sent: Monday, September 28, 2020 1:52 PM
> To: solr-user@lucene.apache.org
> Cc: Chattopadhyay, Salil <salil.chattopadh...@mmc.com>; Mutnuri, Vishnu D 
> <vishnu.d.mutn...@mmc.com>; Pathak, Omkar <omkar.pat...@mmc.com>; 
> Shenouda, Nasir B <nasir.b.sheno...@mmc.com>
> Subject: Vulnerabilities in SOLR 8.6.2
> Importance: High
>  
> Hello Solr-User Support team
> We have installed the SOLR 8.6.2 package into docker container in our DEV 
> environment. Prior to using it, our security team scanned the docker image 
> using SysDig and found a lot of Critical/High/Medium vulnerabilities. The 
> full list is in the attached spreadsheet
>  
> Scan Summary
> 30 STOPS, 190 WARNS, 188 Vulnerabilities
>  
> Please advise or point us to how/where to get a package that has been patched 
> for the Critical/High/Medium vulnerabilities in the attached spreadsheet
> Your help will be gratefully received
>  
>  
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284

Re: SolrCloud crashing due to memory error - 'Cannot allocate memory' (errno=12)

2020-12-10 Thread Walter Underwood
How much RAM do you have on those machines? That message says you ran out.

32 GB is a HUGE heap. Unless you have a specific need for that, run with a 8 GB
heap and see how that works. 
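
On a standard install the heap is set in solr.in.sh; a sketch of the change, 
with the size as a starting point rather than a rule:

  # solr.in.sh (often /etc/default/solr.in.sh -- location varies by install)
  SOLR_HEAP="8g"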

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 10, 2020, at 7:55 PM, Altamirano, Emmanuel 
>  wrote:
> 
> Hello,
>  
> We have a SolrCloud (8.6) with 3 servers with the same characteristics and 
> configuration. We assigned 32GB of heap memory to each, and after a short 
> period of sending 40 concurrent requests to the SolrCloud through a load 
> balancer, we get the following error, which shuts down each Solr server 
> and ZooKeeper:
>  
> OpenJDK 64-Bit Server VM warning: Failed to reserve large pages memory 
> req_addr: 0x bytes: 536870912 (errno = 12).
> OpenJDK 64-Bit Server VM warning: Attempt to deallocate stack guard pages 
> failed.
> OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7edd4d9da000, 
> 12288, 0) failed; error='Cannot allocate memory' (errno=12)
>  
>  
> 20201201 10:43:29.495 [ERROR] {qtp2051853139-23369} [c:express s:shard1 
> r:core_node6 x:express_shard1_replica_n4] 
> [org.apache.solr.handler.RequestHandlerBase, 148] | 
> org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are 
> disabled.
> at 
> org.apache.solr.update.processor.DistributedZkUpdateProcessor.zkCheck(DistributedZkUpdateProcessor.java:1245)
> at 
> org.apache.solr.update.processor.DistributedZkUpdateProcessor.setupRequest(DistributedZkUpdateProcessor.java:582)
> at 
> org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:239)
>  
> 
>  
> We have one collection with one shard, almost 400 million documents 
> (~334GB).
>  
> $ sysctl vm.nr_hugepages
> vm.nr_hugepages = 32768
> $ sysctl vm.max_map_count
> vm.max_map_count = 131072
>  
> /etc/security/limits.conf
>  
> * - core unlimited
> * - data unlimited
> * - priority unlimited
> * - fsize unlimited
> * - sigpending 513928
> * - memlock unlimited
> * - nofile 131072
> * - msgqueue 819200
> * - rtprio 0
> * - stack 8192
> * - cpu unlimited
> * - rss unlimited #virtual memory unlimited
> * - locks unlimited
> * soft nproc 65536
> * hard nproc 65536
> * - nofile 131072
>  
>  
>  
> /etc/sysctl.conf
>  
> vm.nr_hugepages =  32768
> vm.max_map_count = 131072
>  
>  
> Could you please provide me some advice to fix this error?
>  
> Thanks,
>  
> Emmanuel Altamirano



Re: is there a way to trigger a notification when a document is deleted in solr

2020-12-07 Thread Walter Underwood
That wouldn’t help, because that is a feature request to know when the space is
recovered after documents are deleted. 

I’d look at what shows up in the logs when the delete happens. From that info,
you could configure a log follower to notify. If your logs go to a log 
database, that probably supports queries that send notifications.

The original feature request could be satisfied the same way.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 7, 2020, at 6:22 AM, Pushkar Mishra  wrote:
> 
> Hi All
> https://issues.apache.org/jira/browse/SOLR-13609, was this fixed ever ?
> 
> Regards
> 
> On Mon, Dec 7, 2020 at 6:32 PM Pushkar Mishra  wrote:
> 
>> Hi All,
>> 
>> Is there a way to trigger a notification when a document is deleted in
>> Solr? Or maybe when the auto purge of deleted documents completes in Solr?
>> 
>> Thanks
>> 
>> --
>> Pushkar Kumar Mishra
>> "Reactions are always instinctive whereas responses are always well
>> thought of... So start responding rather than reacting in life"
>> 
>> 
> 
> -- 
> Pushkar Kumar Mishra
> "Reactions are always instinctive whereas responses are always well thought
> of... So start responding rather than reacting in life"



Re: Solr8.7 - How to optimize my index ?

2020-12-01 Thread Walter Underwood
Even better DO NOT OPTIMIZE.

Just let Solr manage the indexes automatically.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 1, 2020, at 11:31 AM, Info MatheoSoftware  
> wrote:
> 
> Hi All,
> 
> 
> 
> I found the solution, I must do :
> 
> curl 'http://xxx:8983/solr/my_core/update?optimize=true&commit=true'
> 
> 
> 
> It works fine
> 
> 
> 
> Thanks,
> 
> Bruno
> 
> 
> 
> 
> 
> 
> 
> From: Matheo Software [mailto:i...@matheo-software.com]
> Sent: Tuesday, December 1, 2020 13:28
> To: solr-user@lucene.apache.org
> Subject: Solr8.7 - How to optimize my index ?
> 
> 
> 
> Hi All,
> 
> 
> 
> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
> 
> 
> 
> So I decide to use the command line:
> 
> curl http://xxx:8983/solr/my_core/update?optimize=true
> 
> 
> 
> My collection my_core exists of course.
> 
> 
> 
> The answer of the command line is:
> 
> {
> 
>  "responseHeader":{
> 
>"status":0,
> 
>"QTime":18}
> 
> }
> 
> 
> 
> But nothing change.
> 
> I always have 38M deleted docs in my collection and directory size no change
> like with solr5.4.
> 
> The size of the collection stay always at : 466.33Go
> 
> 
> 
> Could you tell me how can I purge deleted docs ?
> 
> 
> 
> Cordialement, Best Regards
> 
> Bruno Mannina
> 
> <http://www.matheo-software.com> www.matheo-software.com
> 
> <http://www.patent-pulse.com> www.patent-pulse.com
> 
> Tél. +33 0 970 738 743
> 
> Mob. +33 0 634 421 817
> 



Re: data import handler deprecated?

2020-11-29 Thread Walter Underwood
I recommend building an outboard loader, like I did a dozen years ago for
Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
program, though it reads from a JSONL file, not a database.

Run a loop fetching records from a database. Put each record into a synchronized
(thread-safe) queue. Run multiple worker threads, each pulling records from the
queue, batching them up, and sending them to Solr. For maximum indexing speed
(at the expense of query performance), count the number of CPUs per shard leader
and run two worker threads per CPU.

Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
documents, depending on the content.

With this setup, your database will probably be your bottleneck. I’ve had this
index a million (small) documents per minute to a multi-shard cluster, from a 
JSONL
file on local disk.

Also, don’t worry about finding the leaders and sending the right document to
the right shard. I just throw the batches at the load balancer and let Solr 
figure
it out. That is super simple and amazingly fast.

If you are doing big batches, building a dumb ETL system with JSONL files in 
Amazon S3 has some real advantages. It allows loading prod data into a test
cluster for load benchmarks, for example. Also good for disaster recovery, just
load the recent batches from S3. Want to know exactly which documents were
in the index in October? Look at the batches in S3.
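
A minimal sketch of that queue-and-workers approach (not the actual program 
mentioned above; the URL, batch size, and JSONL input are illustrative):

import json
import queue
import threading
import requests

SOLR_URL = "http://localhost:8983/solr/mycollection/update"  # point at your load balancer
BATCH_SIZE = 500      # tune so batches land around 10k-50k bytes
NUM_WORKERS = 8       # roughly two per CPU on a shard leader

work = queue.Queue(maxsize=10000)

def producer(path):
    # Read one JSON document per line and queue it.
    with open(path) as f:
        for line in f:
            work.put(json.loads(line))
    for _ in range(NUM_WORKERS):   # poison pills so workers exit
        work.put(None)

def send(batch):
    # POST a JSON array of documents to the update handler.
    r = requests.post(SOLR_URL, json=batch, params={"commitWithin": "60000"})
    r.raise_for_status()

def worker():
    batch = []
    while True:
        doc = work.get()
        if doc is None:
            break
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            send(batch)
            batch = []
    if batch:
        send(batch)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
producer("documents.jsonl")
for t in threads:
    t.join()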

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
> 
> I went through the same stages of grief that you are about to start
> but (luckily?) my core dataset grew some weird cousins and we ended up
> writing our own indexer to join them all together/do partial
> updates/other stuff beyond DIH.  It's not difficult to upload docs but
> is definitely slower so far.  I think there is a bit of a 'clean core'
> focus going on in solr-land right now and DIH is easy(!) but it's also
> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
> etc) so anyway try to be happy that you are aware of it now.
> 
> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>> 
>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>> 
>>> ...  The bottom of
>>> that github page isn't hopeful however :)
>> 
>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>> JAR" :)
>> 
>> It's a more general question though: what is the path forward for users
>> with data in two places? Hope that a community-maintained plugin
>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>> roll our own delta-updates logic? Or are we to choose one datastore and
>> drop the other?
>> 
>> Dima



Re: Query generation is different for search terms with and without "-"

2020-11-25 Thread Walter Underwood
Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are 
working with
a fixed vocabulary (movie titles, product names), that can work just fine.

babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man
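
Wired up, that is just a synonyms file plus a synonym graph filter in the query 
analyzer — a sketch, with the surrounding field type omitted:

# synonyms.txt
babysitter, baby-sitter, baby sitter
spiderman, spider-man, spider man

<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>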

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 25, 2020, at 9:26 AM, Erick Erickson  wrote:
> 
> Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:
> 
> *FilterFactory are _not_ what you want in this case, they are applied to 
> individual tokens after parsing
> 
> *CharFiterFactory are invoked on the entire input to the field, although I 
> can’t say for certain that even that’s early enough.
> 
> There are two other options to consider:
> StatelessScriptUpdateProcessor
> FieldMutatingUpdateProcessor
> 
> Stateless... is probably easiest…
> 
> Best,
> ERick
> 
>> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez 
>>  wrote:
>> 
>> Are there any good workarounds/parameters we can use to fix this so it
>> doesn't have to be solved client side?
>> 
>> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder 
>> wrote:
>> 
>>> Is the normal/standard solution here to regex remove the '-'s and
>>> combine them into a single token?
>>> 
>>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson 
>>> wrote:
>>>> 
>>>> This is a common point of confusion. There are two phases for creating a
>>> query,
>>>> query _parsing_ first, then the analysis chain for the parsed result.
>>>> 
>>>> So what e-dismax sees in the two cases is:
>>>> 
>>>> Name_enUS:“high tech” -> two tokens, since there are two of them pf2
>>> comes into play.
>>>> 
>>>> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply,
>>> splitting it on the hyphen comes later.
>>>> 
>>>> It’s especially confusing since the field analysis then breaks up
>>> “high-tech” into two tokens that
>>>> look the same as “high tech” in the debug response, just without the
>>> phrase query.
>>>> 
>>>> Name_enUS:high
>>>> Name_enUS:tech
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <
>>> samuel.gutier...@iherb.com.INVALID> wrote:
>>>>> 
>>>>> I am troubleshooting an issue with ranking for search terms that
>>> contain a
>>>>> "-" vs the same query that does not contain the dash e.g. "high-tech"
>>> vs
>>>>> "high tech". The field that I am querying is using the standard
>>> tokenizer,
>>>>> so I would expect that the underlying lucene query should be the same
>>> for
>>>>> both versions of the query, however when printing the debug, it appears
>>>>> they are generated differently. I know "-" must be escaped as it has
>>>>> special meaning in lucene, however escaping does not fix the problem.
>>> It
>>>>> appears that with the "-" present, the pf2 edismax parameter is not
>>>>> respected and omitted from the final query. We use sow=false as we have
>>>>> multiterm synonyms and need to ensure they are included in the final
>>> lucene
>>>>> query. My expectation is that the final underlying lucene query should
>>> be
>>>>> based on the output  of the field analyzer, however after briefly
>>> looking
>>>>> at the code for ExtendedDismaxQParser, it appears that there is some
>>> string
>>>>> processing happening outside of the analysis step which causes the
>>>>> unexpected lucene query.
>>>>> 
>>>>> 
>>>>> Solr Debug for "high tech":
>>>>> 
>>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>>> (Name_enUS:"high tech"~4)~0.4",
>>>>> 
>>>>> 
>>>>> Solr Debug for "high-tec

Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Walter Underwood
By far the simplest solution is to leave stopwords in the index. That also 
improves
relevance, because it becomes possible to search for “vitamin a” or “to be or 
not to be”.

Stopword removal was a performance and disk-space hack from the 1960s. It is no 
longer needed. We were keeping stopwords in the index at Infoseek, back in 1996.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> 
> Hi all,
> 
> Okay, I've been doing more research about this problem and from what I
> understand, phrase queries + stopwords are known to have some difficulties
> working together in some circumstances.
> 
> E.g.,
> https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> https://issues.apache.org/jira/browse/SOLR-6468
> 
> I was thinking about workarounds, but each solution I've attempted doesn't
> quite work.
> 
> Therefore, maybe one possible solution is to take a step back and
> preprocess index/query data going to Solr, something like:
> 
> String wordsForSolr = removeStopWordsFrom("This is pretend index or query
> data")
> // wordsForSolr = "pretend index query data"
> 
> Off the top of my head, this will by-pass position issues.
> 
> I will give this a go, but was wondering whether this is something others
> have done?
> 
> Best wishes,
> Edd
> 
> 
> Edward Turner
> 
> 
> On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:
> 
>> Hi all,
>> 
>> We are experiencing some unexpected behaviour for phrase queries which we
>> believe might be related to the FlattenGraphFilterFactory and stopwords.
>> 
>> Brief description: when performing a phrase query
>> "Molecular cloning and evolution of the" => we get expected hits
>> "Molecular cloning and evolution of the genes" => we get no hits
>> (unexpected behaviour)
>> 
>> I think it's worthwhile adding the analyzers we use to help you see what
>> we're doing:
>>  Analyzers 
>> >   sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>   
>>  > pattern="[- /()]+" />
>>  > ignoreCase="true" />
>>  > preserveOriginal="false" />
>>  
>>  > generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>> splitOnNumerics="0" stemEnglishPossessive="1"
>> generateWordParts="1"
>> catenateNumbers="0" catenateWords="1" catenateAll="1" />
>>  
>>   
>>   
>>  > pattern="[- /()]+" />
>>  > ignoreCase="true" />
>>  > preserveOriginal="false" />
>>  
>>  > generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>> splitOnNumerics="0" stemEnglishPossessive="1"
>> generateWordParts="1"
>> catenateNumbers="0" catenateWords="0" catenateAll="0" />
>>   
>> 
>>  End of Analyzers 
>> 
>>  Stopwords 
>> We use the following stopwords:
>> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
>> of, on, or, such, that, the, their, then, there, these, they, this, to,
>> was, will, with, which
>>  End of Stopwords 
>> 
>>  Analysis Admin page output ---
>> ... And to see what's going on when we're indexing/querying, I created a
>> gist with an image of the (non-verbose) output of the analysis admin page
>> for, index data/query, "Molecular cloning and evolution of the genes":
>> 
>> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>> 
>> Hopefully this link works, and you can see that the resulting terms and
>> positions are identical until the FlattenGraphFilterFactory step in the
>> "index" phase.
>> 
>> Final stage of index analysis:
>> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>> 
>> Final stage of query analysis:
>> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>> 
>> The empty positions are because of stopwords (presumably)
>>  End of Analysis Admin page output ---
>> 
>> Main question:
>> Could someone explain why the FlattenGraphFilterFactory changes the
>> position of the "genes" token? From what we see, this happens after a,
>> "the" (but we've not checked exhaustively, and continue to test).
>> 
>> Perhaps, we are doing something wrong in our analysis setup?
>> 
>> Any help would be much appreciated -- getting phrase queries to work is an
>> important use-case of ours.
>> 
>> Kind regards and thank you in advance,
>> Edd
>> 
>> Edward Turner
>> 



Re: Solr tag cloud - words and counts

2020-11-03 Thread Walter Underwood
For a tag cloud, the anomalous words are what you want. If you choose the most 
common words, then every tag cloud will have the same words. It will look like:

the, be, to, it, of, and, a, in, that, have, I, it, for, not, on, with, ...

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 3, 2020, at 10:04 AM, uyilmaz  wrote:
> 
> 
> I have been trying to find a way to do this in Solr for a while. Perform a 
> query, and for a text_general field in the result set, find each term's # of 
> occurrences.
> 
> - I tried the Terms Component, it doesn't have the ability to restrict the 
> result set with a query.
> 
> - Tried faceting on the field, since it's a text_general field it doesn't 
> have docValues, plus cardinality is very high (millions of documents * tens 
> of words in each field), so it works but it's very slow and sometimes times 
> out.
> 
> - Tried significantTerms streaming expression, but it's logically not the 
> same with what I'm looking for. It gives the words occuring frequently in the 
> result set, but not occuring as frequently outside it. So it's better to find 
> out frequency anomalies rather than simply the counts.
> 
> Do you have any suggestions?
> 
> Regards
> 
> -- 
> uyilmaz 



Re: Avoiding duplicate entry for a multivalued field

2020-10-29 Thread Walter Underwood
Since you are already taking the performance hit of atomic updates, 
I doubt you’ll see any impact from field types or update request processors.
The extra cost of atomic updates will be much greater than indexing cost.
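
A sketch of what Dwane's suggestion below looks like in solrconfig.xml — the 
chain name and field name are made up:

<updateRequestProcessorChain name="uniq-multivalued">
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <str name="fieldName">my_multivalued_field</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>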

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 29, 2020, at 3:16 AM, Srinivas Kashyap 
>  wrote:
> 
> Thanks Dwane,
> 
> I have a doubt: according to the Javadoc, the duplicates still continue to 
> exist in the field. Maybe at query time the field returns only unique 
> values? Am I right with my assumption?
> 
> And also, what is the performance overhead of this UniqFields*Factory?
> 
> Thanks,
> Srinivas
> 
> From: Dwane Hall 
> Sent: 29 October 2020 14:33
> To: solr-user@lucene.apache.org
> Subject: Re: Avoiding duplicate entry for a multivalued field
> 
> Srinivas this is possible by adding an unique field update processor to the 
> update processor chain you are using to perform your updates (/update, 
> /update/json, /update/json/docs, .../a_custom_one)
> 
> The Java Documents explain its use nicely
> (https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html)
> or there are articles on Stack Overflow addressing this exact problem 
> (https://stackoverflow.com/questions/37005747/how-to-remove-duplicates-from-multivalued-fields-in-solr#37006655)
> 
> Thanks,
> 
> Dwane
> 
> From: Srinivas Kashyap <srini...@bamboorose.com.INVALID>
> Sent: Thursday, 29 October 2020 3:49 PM
> To: solr-user@lucene.apache.org
> Subject: Avoiding duplicate entry for a multivalued field
> 
> Hello,
> 
> Say, I have a schema field which is multivalued. Is there a way to maintain 
> distinct values for that field though I continue to add duplicate values 
> through atomic update via solrj?
> 
> Is there some property setting to have only unique values in a multi valued 
> fields?
> 
> Thanks,
> Srinivas
> 



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-28 Thread Walter Underwood
Double the heap.

All that CPU is the GC trying to free up space.
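
Since this is Windows, the heap is normally set in solr.in.cmd; a sketch (the 8g value is only an assumption, doubling the current 4g — watch the dips in the graph after the change):

REM solr.in.cmd -- fixed-size heap, min and max the same
set SOLR_JAVA_MEM=-Xms8g -Xmx8g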

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 28, 2020, at 6:29 AM, Jaan Arjasepp  wrote:
> 
> Hi all,
> 
> It's me again. Anyway, I did a little research, we tried different things, 
> and I have some questions to ask and some findings to share.
> 
> Well, after monitoring my system with VisualVM, I found that heap usage after 
> GC jumps between 0.5GB and 2.5GB, and the heap is 4GB for now, so it should 
> not be an issue anymore, right? But I will observe it a bit, as I guess it 
> might still rise.
> 
> The next thing we found, or are thinking about, is that writing to disk might 
> be an issue. We turned off the indexing and some other stuff, but I would say 
> it still did not save much.
> I also went through all the schema fields; there are not that many really. They 
> are all docValues=true. Also, I must say they are all automatically generated, so 
> no manual work there except one field, but that one also has docValues=true. 
> Just curious: if the field is not a string/text, can it be docValues=false, or is 
> it still better to have true? And as for uninversion, we are not using many 
> facets or other specific things in queries, just simple queries. 
> 
> Though I must say we are updating documents quite a lot, I am not sure why the 
> CPU usage is so high. The older version did not seem to use the CPU so 
> much...
> 
> I am running a bit out of ideas and hoping that this will continue to work, 
> but I don't like the CPU usage even overnight, when nobody uses it. We will 
> try to figure out the issue here and I hope I can ask more questions when in 
> doubt or out of ideas. Also I must admit, Solr is really new for me 
> personally.
> 
> Jaan
> 
> -Original Message-
> From: Walter Underwood  
> Sent: 27 October 2020 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
> 
> That first graph shows a JVM that does not have enough heap for the program 
> it is running. Look at the bottom of the dips. That is the amount of memory 
> still in use after a full GC.
> 
> You want those dips to drop to about half of the available heap, so I’d 
> immediately increase that heap to 4G. That might not be enough, so you’ll 
> need to to watch that graph after the increase.
> 
> I’ve been using 8G heaps with Solr since version 1.2. We run this config with 
> Java 8 on over 100 machines. We do not do any faceting, which can take more 
> memory.
> 
> SOLR_HEAP=8g
> # Use G1 GC  -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Oct 27, 2020, at 12:48 AM, Jaan Arjasepp  wrote:
>> 
>> Hello,
>> 
>> We have been using SOLR for quite some time. We used 6.0 and now we did a 
>> little upgrade to our system and servers and we started to use 8.6.1.
>> We use it on a Windows Server 2019.
>> Java version is 11
>> Basically using it in a default setting, except giving SOLR 2G of heap. It 
>> used 512, but it ran out of memory and stopped responding. Not sure if it 
>> was the issue. When older version, it managed fine with 512MB.
>> SOLR is not in a cloud mode, but in solo mode as we use it internally and it 
>> does not have too many request nor indexing actually.
>> Document sizes are not big, I guess. We only use one core.
>> Document stats are here:
>> Num Docs: 3627341
>> Max Doc: 4981019
>> Heap Memory Usage: 434400
>> Deleted Docs: 1353678
>> Version: 15999036
>> Segment Count: 30
>> 
>> The size of index is 2.66GB
>> 
>> While making the upgrade we had to modify one field and a bit of code that uses 
>> it. That's basically it. It works.
>> If more information about the background of the system is needed, I am happy to 
>> help.
>> 
>> 
>> But now to the issue I am having.
>> If SOLR is started, at first 40-60 minutes it works just fine. CPU is not 
>> high, heap usage seem normal. All is good, but then suddenly, the heap usage 
>> goes crazy, going up and down, up and down and CPU rises to 50-60% of the 
>> usage. Also I noticed over the weekend, when there are no writing usage, the 
>> CPU remains low and decent. I can try it this weekend again to see if and 
>> how this works out.
>> Also it seems to me, that after 4-5 days of working like this, it stops 
>> responding, but needs to be confirmed with more heap also.

Re: Tangent: old Solr versions

2020-10-28 Thread Walter Underwood
Chegg is running a 4.10.2 master/slave cluster for textbook search and several
other collections.

1. None of the features past 4.x are needed.
2. We depend on the extended edismax (SOLR-629).
3. Ain’t broke.

We are moving our Solr Cloud clusters to 8.x, even though there are no
features we need that aren’t in 6.6.2. Moving the Solr 4 cluster is way at
the bottom of the list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 28, 2020, at 5:37 AM, Mark H. Wood  wrote:
> 
> On Tue, Oct 27, 2020 at 04:25:54PM -0500, Mike Drob wrote:
>> Based on the questions that we've seen over the past month on this list,
>> there are still users with Solr on 6, 7, and 8. I suspect there are still
>> Solr 5 users out there too, although they don't appear to be asking for
>> help - likely they are in set it and forget it mode.
> 
> Oh, there are quite a few instances of Solr 4 out there as well.  Many
> of them will be moving to v7 or v8, probably starting in the next 6-12
> months.
> 
> -- 
> Mark H. Wood
> Lead Technology Analyst
> 
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-27 Thread Walter Underwood
That first graph shows a JVM that does not have enough heap for the 
program it is running. Look at the bottom of the dips. That is the amount
of memory still in use after a full GC.

You want those dips to drop to about half of the available heap, so I’d 
immediately increase that heap to 4G. That might not be enough, so 
you’ll need to to watch that graph after the increase.

I’ve been using 8G heaps with Solr since version 1.2. We run this config
with Java 8 on over 100 machines. We do not do any faceting, which
can take more memory.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 27, 2020, at 12:48 AM, Jaan Arjasepp  wrote:
> 
> Hello,
> 
> We have been using SOLR for quite some time. We used 6.0 and now we did a 
> little upgrade to our system and servers and we started to use 8.6.1.
> We use it on a Windows Server 2019.
> Java version is 11
> Basically using it in a default setting, except giving SOLR 2G of heap. It 
> used 512, but it ran out of memory and stopped responding. Not sure if it was 
> the issue. When older version, it managed fine with 512MB.
> SOLR is not in a cloud mode, but in solo mode as we use it internally and it 
> does not have too many request nor indexing actually.
> Document sizes are not big, I guess. We only use one core.
> Document stats are here:
> Num Docs: 3627341
> Max Doc: 4981019
> Heap Memory Usage: 434400
> Deleted Docs: 1353678
> Version: 15999036
> Segment Count: 30
> 
> The size of index is 2.66GB
> 
> While making the upgrade we had to modify one field and a bit of code that uses 
> it. That's basically it. It works.
> If more information about the background of the system is needed, I am happy to help.
> 
> 
> But now to the issue I am having.
> If SOLR is started, at first 40-60 minutes it works just fine. CPU is not 
> high, heap usage seem normal. All is good, but then suddenly, the heap usage 
> goes crazy, going up and down, up and down and CPU rises to 50-60% of the 
> usage. Also I noticed over the weekend, when there are no writing usage, the 
> CPU remains low and decent. I can try it this weekend again to see if and how 
> this works out.
> Also it seems to me, that after 4-5 days of working like this, it stops 
> responding, but needs to be confirmed with more heap also.
> 
> Heap memory usage via JMX and jconsole -> 
> https://drive.google.com/file/d/1Zo3B_xFsrrt-WRaxW-0A0QMXDNscXYih/view?usp=sharing
> As you can see, it starts of normal, but then goes crazy and it has been like 
> this over night.
> 
> This is overall monitoring graphs, as you can see CPU is working hard or 
> hardly working. -> 
> https://drive.google.com/file/d/1_Gtz-Bi7LUrj8UZvKfmNMr-8gF_lM2Ra/view?usp=sharing
> VM summary can be found here -> 
> https://drive.google.com/file/d/1FvdCz0N5pFG1fmX_5OQ2855MVkaL048w/view?usp=sharing
> And finally to have better and quick overview of the SOLR executing 
> parameters that I have -> 
> https://drive.google.com/file/d/10VCtYDxflJcvb1aOoxt0u3Nb5JzTjrAI/view?usp=sharing
> 
> If you can point me what I have to do to make it work, then I appreciate it a 
> lot.
> 
> Thank you in advance.
> 
> Best regards,
> Jaan
> 
> 



Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread Walter Underwood
Hmm. Fields used for faceting will also be used for filtering, which is a kind
of search. Are docValues OK for filtering? I expect they might be slow the
first time, then cached.
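
For reference, the split Erick describes below usually ends up looking something like this in the schema (field and type names are assumptions):

<!-- analyzed copy for searching, string copy with docValues for faceting and filtering -->
<field name="title"     type="text_general" indexed="true"  stored="true"  docValues="false"/>
<field name="title_str" type="string"       indexed="false" stored="false" docValues="true"/>
<copyField source="title" dest="title_str"/>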

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 19, 2020, at 11:15 AM, Erick Erickson  wrote:
> 
> uyilmaz:
> 
> Hmm, that _is_ confusing. And inaccurate.
> 
> In this context, it should read something like
> 
> The Text field should have indexed="true" docValues="false" if used for 
> searching but not faceting, and the String field should have indexed="false" 
> docValues="true" if used for faceting but not searching.
> 
> I’ll fix this, thanks for pointing this out.
> 
> Erick
> 
>> On Oct 19, 2020, at 1:42 PM, uyilmaz  wrote:
>> 
>> Thanks! This also contributed to my confusion:
>> 
>> https://lucene.apache.org/solr/guide/8_4/faceting.html#field-value-faceting-parameters
>> 
>> "If you want Solr to perform both analysis (for searching) and faceting on 
>> the full literal strings, use the copyField directive in your Schema to 
>> create two versions of the field: one Text and one String. Make sure both 
>> are indexed="true"."
>> 
>> On Mon, 19 Oct 2020 13:08:00 -0400
>> Alexandre Rafalovitch  wrote:
>> 
>>> I think this is all explained quite well in the Ref Guide:
>>> https://lucene.apache.org/solr/guide/8_6/docvalues.html
>>> 
>>> DocValues is a different way to index/store values. Faceting is a
>>> primary use case where docValues are better than what 'indexed=true'
>>> gives you.
>>> 
>>> Regards,
>>>  Alex.
>>> 
>>> On Mon, 19 Oct 2020 at 12:51, uyilmaz  wrote:
>>>> 
>>>> 
>>>> Hey all,
>>>> 
>>>> From my little experiments, I see that (if I didn't make a stupid mistake) 
>>>> we can facet on fields marked as both indexed and stored being false:
>>>> 
>>>> <field name="..." type="..." indexed="false" stored="false" docValues="true"/>
>>>> 
>>>> I'm surprised by this, I thought I would need to index it. Can you confirm 
>>>> this?
>>>> 
>>>> Regards
>>>> 
>>>> --
>>>> uyilmaz 
>> 
>> 
>> -- 
>> uyilmaz 
> 



Re: converting string to solr.TextField

2020-10-17 Thread Walter Underwood
Because Solr is not updating documents. Solr is adding to indexes
of fields. You cannot add a TextField document to a StringField index.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 17, 2020, at 5:23 AM, Vinay Rajput  wrote:
> 
> Sorry to jump into this discussion. I also get confused whenever I see this
> strange Solr/Lucene behaviour. Probably, as @Erick said in his talk last year,
> this is how it has been designed, to avoid many problems that are
> hard/impossible to solve.
> 
> That said, one more time I want to come back to the same question: why
> solr/lucene can not handle this when we are updating all the documents?
> Let's take a couple of examples :-
> 
> *Ex 1:*
> Let's say I have only 10 documents in my index and all of them are in a
> single segment (Segment 1). Now, I change the schema (update field type in
> this case) and reindex all of them.
> This is what (according to me) should happen internally :-
> 
> 1st update req : Solr will mark 1st doc as deleted and index it again
> (might run the analyser chain based on config)
> 2nd update req : Solr will mark 2st doc as deleted and index it again
> (might run the analyser chain based on config)
> And so on..
> based on autoSoftCommit/autoCommit configuration, all new documents will be
> indexed and probably flushed to disk as part of new segment (Segment 2)
> 
> 
> Now, whenever segment merging happens (during commit or later in time),
> Lucene will create a new segment (Segment 3) and discard all the docs
> present in segment 1 as there are no live docs in it. And there would *NOT*
> be any situation to decide whether to choose the old config or new config
> as there is not even a single live document with the old config. Isn't it?
> 
> *Ex 2:*
> I see that it can be an issue if we think about reindexing millions of
> docs. Because in that case, merging can be triggered when indexing is half
> way through, and since there are some live docs in the old segment (with
> old config), things will blow up. Please correct me if I am wrong.
> 
> I am *NOT* a Solr/Lucene expert and just started learning the ways things
> are working internally. In the above example, I can be wrong at many
> places. Can someone confirm if scenarios like Ex-2 are the reasons behind
> the fact that even re-indexing all documents doesn't help if some
> incompatible schema changes are done?  Any other insight would also be
> helpful.
> 
> Thanks,
> Vinay
> 
> On Sat, Oct 17, 2020 at 5:48 AM Shawn Heisey  wrote:
> 
>> On 10/16/2020 2:36 PM, David Hastings wrote:
>>> sorry, i was thinking just using the
>>> *:*
>>> method for clearing the index would leave them still
>> 
>> In theory, if you delete all documents at the Solr level, Lucene will
>> delete all the segment files on the next commit, because they are empty.
>>  I have not confirmed with testing whether this actually happens.
>> 
>> It is far safer to use a new index as Erick has said, or to delete the
>> index directories completely and restart Solr ... so you KNOW the index
>> has nothing in it.
>> 
>> Thanks,
>> Shawn
>> 



Re: converting string to solr.TextField

2020-10-16 Thread Walter Underwood
In addition, what happens at query time when documents have
been indexed under a varying field type? Well, it doesn’t work well.

The full set of steps for uninterrupted searching is:

1. Add the new text field.
2. Reindex to populate that.
3. Switch querying to use the new text field.
4. Change the old string field to indexed=“false” stored=“false” and/or stop
including that field in search updates and/or populating it with copyField.
5. Reindex again to clean up all occurrences of the old field.
6. Remove the old field from the schema.
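
On the schema side, steps 1 and 4 might look roughly like this (a sketch only; the field and type names are assumptions):

<!-- Step 1: add the new analyzed field next to the old string field -->
<field name="description_txt" type="text_general" indexed="true" stored="true"/>
<copyField source="description" dest="description_txt"/>

<!-- Step 4, later: stop indexing and storing the old string field before removing it -->
<field name="description" type="string" indexed="false" stored="false"/>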

I just finished this process on two big clusters in prod. We had
created a bunch of extra fields for a series of A/B tests on 
relevance improvements. Those tests were finished, so we 
needed to remove those from the index. It was slightly simpler
because we had already stopped querying those fields.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 16, 2020, at 12:57 PM, David Hastings  
> wrote:
> 
> Gotcha, thanks for the explanation.  Another small question if you
> don't mind: when deleting docs they aren't actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well, so wouldn't that cause almost the same conflicts
> until then?
> 
> On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson 
> wrote:
> 
>> Doesn’t re-indexing a document just delete/replace….
>> 
>> It’s complicated. For the individual document, yes. The problem
>> comes because the field is inconsistent _between_ documents, and
>> segment merging blows things up.
>> 
>> Consider. I have segment1 with documents indexed with the old
>> schema (String in this case). I  change my schema and index the same
>> field as a text type.
>> 
>> Eventually, a segment merge happens and these two segments get merged
>> into a single new segment. How should the field be handled? Should it
>> be defined as String or Text in the new segment? If you convert the docs
>> with a Text definition for the field to String,
>> you’d lose the ability to search for individual tokens. If you convert the
>> String to Text, you don’t have any guarantee that the information is even
>> available.
>> 
>> This is just the tip of the iceberg in terms of trying to change the
>> definition of a field. Take the case of changing the analysis chain,
>> say you use a phonetic filter on a field then decide to remove it and
>> do not store the original. Erick might be encoded as “ENXY” so the
>> original data is simply not there to convert. Ditto removing a
>> stemmer, lowercasing, applying a regex, …...
>> 
>> 
>> From Mike McCandless:
>> 
>> "This really is the difference between an index and a database:
>> we do not store, precisely, the original documents.  We store
>> an efficient derived/computed index from them.  Yes, Solr/ES
>> can add database-like behavior where they hold the true original
>> source of the document and use that to rebuild Lucene indices
>> over time.  But Lucene really is just a "search index" and we
>> need to be free to make important improvements with time."
>> 
>> And all that aside, you have to re-index all the docs anyway or
>> your search results will be inconsistent. So leaving aside the
>> impossible task of covering all the possibilities on the fly, it’s
>> better to plan on re-indexing….
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Oct 16, 2020, at 3:16 PM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> "If you want to
>>> keep the same field name, you need to delete all of the
>>> documents in the index, change the schema, and reindex."
>>> 
>>> actually doesn't re-indexing a document just delete/replace anyway,
>>> assuming
>>> the same id?
>>> 
>>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <
>> arafa...@gmail.com>
>>> wrote:
>>> 
>>>> Just as a side note,
>>>> 
>>>>> indexed="true"
>>>> If you are storing 32K message, you probably are not searching it as a
>>>> whole string. So, don't index it. You may also want to mark the field
>>>> as 'large' (and lazy):
>>>> 
>>>> 
>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
>>>> 
>>>> When you are going to make it a text field, you will probably be
>>>> having the same issues as well.
>>>> 
>>>> And honestly, if you are not storing those field

Re: converting string to solr.TextField

2020-10-16 Thread Walter Underwood
No. The data is already indexed as a StringField.

You need to make a new field and reindex. If you want to 
keep the same field name, you need to delete all of the 
documents in the index, change the schema, and reindex.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 16, 2020, at 11:01 AM, yaswanth kumar  wrote:
> 
> I am using solr 8.2
> 
> Can I change the schema fieldtype from string to solr.TextField
> without indexing?
> 
>
> 
> The reason is that string has only 32K char limit where as I am looking to
> store more than 32K now.
> 
> The contents on this field doesn't require any analysis or tokenized but I
> need this field in the queries and as well as output fields.
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Solr 8.6.3

2020-10-15 Thread Walter Underwood
Solr does not index XML. It has an XML data format for indexing text.

If you want to index and search XML, get MarkLogic. I used to work there.
It is seriously awesome technology.

https://www.marklogic.com <https://www.marklogic.com/>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 15, 2020, at 1:10 PM, Kris Gurusamy  
> wrote:
> 
> I've just downloaded solr 8.6.3 and trying to create DIH for loading 
> structured XML. I found out that DIH will be deprecated soon with version 
> 9.0. What is the equivalent of DIH in new solr version? How do I import 
> structured XML data which is very custom and index in Solr new version? Any 
> help is appreciated.
> 
> Regards
> 
> Kris Gurusamy
> Director, Engineering
> kgurus...@xpanse.com
> www.xpanse.com
> 
> On 10/15/20, 1:08 PM, "Anshum Gupta (Jira)"  wrote:
> 
> 
> [ 
> https://issues.apache.org/jira/browse/SOLR-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
> 
>Anshum Gupta resolved SOLR-14938.
>-
>Resolution: Invalid
> 
>[~krisgurusamy] - Please ask questions regarding usage on the Solr user 
> mailing list. 
> 
>JIRA is meant for issue tracking purposes.
> 
>> Solr 8.6.3
>> --
>> 
>>Key: SOLR-14938
>>URL: https://issues.apache.org/jira/browse/SOLR-14938
>>Project: Solr
>> Issue Type: Bug
>> Security Level: Public(Default Security Level. Issues are Public) 
>> Components: contrib - DataImportHandler
>>   Reporter: Krishnan
>>   Priority: Major
>> 
>> I've just downloaded solr 8.6.3 and trying to create DIH for loading 
>> structured XML. I found out that DIH will be deprecated soon with version 
>> 9.0. What is the equivalent of DIH in new solr version? How do I import 
>> structured XML data which is very custom and index in Solr new version? Any 
>> help is appreciated.
> 
> 
> 
>--
>This message was sent by Atlassian Jira
>(v8.3.4#803005)
> 



Re: Memory line in status output

2020-10-13 Thread Walter Underwood
I recommend using the options mentioned in recent messages on this list.

Solr has pretty specific memory demands, with lots of allocations with a
lifetime of a single request, plus very long-lived allocations that aren’t freed
until they are evicted from a cache.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 13, 2020, at 1:35 PM, Ryan W  wrote:
> 
> Thanks.  The G1 docs say "G1 is designed to provide good overall
> performance without the need to specify additional options."
> 
> Would that look like this...
> 
> GC_TUNE=" \
> -XX:+UseG1GC \
> "
> 
> Is that the most minimal config? Is it typical to use it without options?
> 
> On Tue, Oct 13, 2020 at 4:22 PM Walter Underwood 
> wrote:
> 
>> The home page of the Solr admin UI shows all of the options to the JVM.
>> That will include the choice of garbage collector.
>> 
>> You can also see the options with “ps -ef | grep solr”.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Oct 13, 2020, at 1:19 PM, Ryan W  wrote:
>>> 
>>> I think I have it sorted. At this point I'm using G1GC, I take it,
>> because
>>> most recently I started Solr as a service...
>>> 
>>> service solr start
>>> 
>>> And that is running solr by way of /etc/init.d/solr because I don't have
>>> any systemd unit for solr, as explained here...
>>> 
>> https://askubuntu.com/questions/903354/difference-between-systemctl-and-service-commands
>>> 
>>> And I can see in the System V script for solr that /etc/default/
>> solr.in.sh
>>> is the relevant config file.
>>> 
>>> 
>>> On Tue, Oct 13, 2020 at 11:23 AM Ryan W  wrote:
>>> 
>>>> Or, perhaps if I start solr like so
>>>> 
>>>> service solr start
>>>> 
>>>> ...it will use the solr.in.sh at /etc/default/solr.in.sh ?
>>>> 
>>>> 
>>>> 
>>>> On Tue, Oct 13, 2020 at 11:19 AM Ryan W  wrote:
>>>> 
>>>>> This is how I start solr:
>>>>> 
>>>>> /opt/solr/bin/solr start
>>>>> 
>>>>> In my /etc/default/solr.in.sh, I have this...
>>>>> 
>>>>> GC_TUNE=" \
>>>>> -XX:+UseG1GC \
>>>>> -XX:+ParallelRefProcEnabled \
>>>>> -XX:G1HeapRegionSize=8m \
>>>>> -XX:MaxGCPauseMillis=200 \
>>>>> -XX:+UseLargePages \
>>>>> -XX:+AggressiveOpts \
>>>>> "
>>>>> 
>>>>> But I don't know how to tell if Solr is using that file.
>>>>> 
>>>>> In my /opt/solr/bin there is no solr.in.sh, but there is a
>>>>> solr.in.sh.orig -- perhaps I should copy my /etc/default/solr.in.sh to
>>>>> /opt/solr/bin ?
>>>>> 
>>>>> I am running Linux (RHEL).  The Solr version is 7.7.2.  Solr 8.x is not
>>>>> compatible with my application.
>>>>> 
>>>>> Thank you.
>>>>> 
>>>>> 
>>>>> On Mon, Oct 12, 2020 at 9:46 PM Shawn Heisey 
>>>>> wrote:
>>>>> 
>>>>>> On 10/12/2020 5:11 PM, Ryan W wrote:
>>>>>>> Thanks.  How do I activate the G1GC collector?  Do I do this by
>>>>>> editing a
>>>>>>> config file, or by adding a parameter when I start solr?
>>>>>>> 
>>>>>>> Oracle's docs are pointing me to a file that supposedly is at
>>>>>>> instance-dir/OUD/config/java.properties, but I don't have that path.
>>>>>> I am
>>>>>>> not sure what is meant by instance-dir here, but perhaps it means my
>>>>>> JRE
>>>>>>> install, which is at
>>>>>>> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre --
>> but
>>>>>>> there is no "OUD" directory in this location.
>>>>>> 
>>>>>> The collector is chosen by the startup options given to Java, in this
>>>>>> case by the start script for Solr.  I've never heard of it being set
>> by
>>>>>> a config in the JRE.
>>>>>> 
>>>>>> In Solr 7, the start script defaults to the CMS collector.  We have
>>>>>> updated that to G1 in the latest Solr 8.x versions, because CMS has
>> been
>>>>>> deprecated by Oracle.
>>>>>> 
>>>>>> Adding the following lines to the correct solr.in.sh would change the
>>>>>> garbage collector to G1.  I got this from the "bin/solr" script in
>> Solr
>>>>>> 8.5.1:
>>>>>> 
>>>>>>  GC_TUNE=('-XX:+UseG1GC' \
>>>>>>'-XX:+PerfDisableSharedMem' \
>>>>>>'-XX:+ParallelRefProcEnabled' \
>>>>>>'-XX:MaxGCPauseMillis=250' \
>>>>>>'-XX:+UseLargePages' \
>>>>>>'-XX:+AlwaysPreTouch')
>>>>>> 
>>>>>> If you used the service installer script to install Solr, then the
>>>>>> correct file to add this to is usually /etc/default/solr.in.sh ...
>> but
>>>>>> if you did the install manually, it may be in the same bin directory
>>>>>> that contains the solr script itself.  Your initial message says the
>>>>>> solr home is /opt/solr/server/solr so I am assuming it's not running
>> on
>>>>>> Windows.
>>>>>> 
>>>>>> Thanks,
>>>>>> Shawn
>>>>>> 
>>>>> 
>> 
>> 



Re: Memory line in status output

2020-10-13 Thread Walter Underwood
The home page of the Solr admin UI shows all of the options to the JVM.
That will include the choice of garbage collector.

You can also see the options with “ps -ef | grep solr”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 13, 2020, at 1:19 PM, Ryan W  wrote:
> 
> I think I have it sorted. At this point I'm using G1GC, I take it, because
> most recently I started Solr as a service...
> 
> service solr start
> 
> And that is running solr by way of /etc/init.d/solr because I don't have
> any systemd unit for solr, as explained here...
> https://askubuntu.com/questions/903354/difference-between-systemctl-and-service-commands
> 
> And I can see in the System V script for solr that /etc/default/solr.in.sh
> is the relevant config file.
> 
> 
> On Tue, Oct 13, 2020 at 11:23 AM Ryan W  wrote:
> 
>> Or, perhaps if I start solr like so
>> 
>> service solr start
>> 
>> ...it will use the solr.in.sh at /etc/default/solr.in.sh ?
>> 
>> 
>> 
>> On Tue, Oct 13, 2020 at 11:19 AM Ryan W  wrote:
>> 
>>> This is how I start solr:
>>> 
>>> /opt/solr/bin/solr start
>>> 
>>> In my /etc/default/solr.in.sh, I have this...
>>> 
>>> GC_TUNE=" \
>>> -XX:+UseG1GC \
>>> -XX:+ParallelRefProcEnabled \
>>> -XX:G1HeapRegionSize=8m \
>>> -XX:MaxGCPauseMillis=200 \
>>> -XX:+UseLargePages \
>>> -XX:+AggressiveOpts \
>>> "
>>> 
>>> But I don't know how to tell if Solr is using that file.
>>> 
>>> In my /opt/solr/bin there is no solr.in.sh, but there is a
>>> solr.in.sh.orig -- perhaps I should copy my /etc/default/solr.in.sh to
>>> /opt/solr/bin ?
>>> 
>>> I am running Linux (RHEL).  The Solr version is 7.7.2.  Solr 8.x is not
>>> compatible with my application.
>>> 
>>> Thank you.
>>> 
>>> 
>>> On Mon, Oct 12, 2020 at 9:46 PM Shawn Heisey 
>>> wrote:
>>> 
>>>> On 10/12/2020 5:11 PM, Ryan W wrote:
>>>>> Thanks.  How do I activate the G1GC collector?  Do I do this by
>>>> editing a
>>>>> config file, or by adding a parameter when I start solr?
>>>>> 
>>>>> Oracle's docs are pointing me to a file that supposedly is at
>>>>> instance-dir/OUD/config/java.properties, but I don't have that path.
>>>> I am
>>>>> not sure what is meant by instance-dir here, but perhaps it means my
>>>> JRE
>>>>> install, which is at
>>>>> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre -- but
>>>>> there is no "OUD" directory in this location.
>>>> 
>>>> The collector is chosen by the startup options given to Java, in this
>>>> case by the start script for Solr.  I've never heard of it being set by
>>>> a config in the JRE.
>>>> 
>>>> In Solr 7, the start script defaults to the CMS collector.  We have
>>>> updated that to G1 in the latest Solr 8.x versions, because CMS has been
>>>> deprecated by Oracle.
>>>> 
>>>> Adding the following lines to the correct solr.in.sh would change the
>>>> garbage collector to G1.  I got this from the "bin/solr" script in Solr
>>>> 8.5.1:
>>>> 
>>>>   GC_TUNE=('-XX:+UseG1GC' \
>>>> '-XX:+PerfDisableSharedMem' \
>>>> '-XX:+ParallelRefProcEnabled' \
>>>> '-XX:MaxGCPauseMillis=250' \
>>>> '-XX:+UseLargePages' \
>>>> '-XX:+AlwaysPreTouch')
>>>> 
>>>> If you used the service installer script to install Solr, then the
>>>> correct file to add this to is usually /etc/default/solr.in.sh ... but
>>>> if you did the install manually, it may be in the same bin directory
>>>> that contains the solr script itself.  Your initial message says the
>>>> solr home is /opt/solr/server/solr so I am assuming it's not running on
>>>> Windows.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>> 



Re: Any solr api to force leader on a specified node

2020-10-11 Thread Walter Underwood
Don’t use DIH. DIH has a lot of limitations and problems, as you are 
discovering.

Write a simple program that fetches from the database and sends documents 
in batches to Solr. I did this before DIH was invented (Solr 1.3) and I’m doing 
it
now.

You can send the updates to the load balancer for the Solr Cloud cluster. The
updates will be automatically routed to the right leader. It is very fast.

My loader is written in Python and I don’t even bother with a special Solr 
library.
It just sends JSON to the update handler with the right options.
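
A stripped-down sketch of that kind of loader (the host, collection, and field names are assumptions, and any HTTP client will do):

import json

import requests  # assumption: any HTTP client works; requests is just the shortest to show

SOLR_UPDATE_URL = "http://solr-lb.example.com:8983/solr/mycollection/update"  # assumed names

def send_batch(docs):
    # One batch of documents goes up as a plain JSON array; commitWithin lets Solr
    # decide when to commit instead of committing on every request.
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commitWithin": 60000},
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

def index_rows(rows, batch_size=500):
    # rows is whatever the database query returns; only an id/title mapping is shown.
    batch = []
    for row in rows:
        batch.append({"id": row["id"], "title": row["title"]})
        if len(batch) >= batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)

Point SOLR_UPDATE_URL at the load balancer and the cluster routes each document to the right leader.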

We do this for all of our clusters. Our biggest one is 48 hosts with 55 million
documents.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 11, 2020, at 8:40 PM, yaswanth kumar  wrote:
> 
> Hi wunder 
> 
> Thanks for replying on this..
> 
> I did set up SolrCloud with 4 nodes, with one node having DIH configured to 
> pull data from MS SQL every minute. If I install DIH on the rest of the nodes 
> it causes connection issues on the source DB, which I don't want, so I manage 
> with only one server polling the DB while the rest are used as replicas for search.
> 
> So now everything works fine, but when the servers are rebooted for maintenance 
> and come back up, if the leader is not the node that has DIH configured, it stops 
> pulling data from SQL. That's the reason why I always want to force a particular 
> node to be leader.
> 
> Sent from my iPhone
> 
>> On Oct 11, 2020, at 11:05 PM, Walter Underwood  wrote:
>> 
>> That requirement is not necessary. Let Solr choose a leader.
>> 
>> Why is someone making this bad requirement?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Oct 11, 2020, at 8:01 PM, yaswanth kumar  wrote:
>>> 
>>> Can someone pls help me to know if there is any solr api /config where we 
>>> can make sure to always opt leader on a particular solr node in solr cloud??
>>> 
>>> Using solr 8.2 and zoo 3.4
>>> 
>>> I have four nodes and my requirement is to always make a particular node as 
>>> leader
>>> 
>>> Sent from my iPhone
>> 



Re: Any solr api to force leader on a specified node

2020-10-11 Thread Walter Underwood
That requirement is not necessary. Let Solr choose a leader.

Why is someone making this bad requirement?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 11, 2020, at 8:01 PM, yaswanth kumar  wrote:
> 
> Can someone pls help me to know if there is any solr api /config where we can 
> make sure to always opt leader on a particular solr node in solr cloud??
> 
> Using solr 8.2 and zoo 3.4
> 
> I have four nodes and my requirement is to always make a particular node as 
> leader
> 
> Sent from my iPhone



Re: Help with uploading files to a core.

2020-10-11 Thread Walter Underwood
Solr is not a database. You can make a huge mess pretending it is a DB.

Also, it doesn’t store files.

What is your use case?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 11, 2020, at 1:28 PM, Guilherme dos Reis Meneguello 
>  wrote:
> 
> Hello! My name is Guilherme and I'm a new user of Solr.
> 
> Basically, I'm developing a database to help a research team in my
> university, but I'm having some problems uploading the files to the
> database. Either using curl commands or through the admin interface, I
> can't quite upload the files from my computer to Solr and set up the field
> types I want that file to have while indexed. I can do that through the
> document builder, but my intent was to have the research team I'm
> supporting just upload them through the terminal or something like that. My
> schema is all set up nicely, however the Solr's field class guessing isn't
> guessing correctly.
> 
> The reference guides in lucene apache's website didn't help me much. I'm
> pretty newbie when it comes to this field, but I feel it's something really
> basic that I'm missing. If anyone could help me or point me in the right
> direction, I'd be really thankful.
> 
> Regards,
> Guilherme.
> 



Re: Folding Repeated Letters

2020-10-09 Thread Walter Underwood
Actually, helping the humans to use proper spelling is a good approach. Include 
a
spelling correction step (non-optional) for user-generated content and spelling
suggestions for queries. Completion/suggestion is another way to guide people
to properly spelled words that exist in your index.
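
For the completion/suggestion route, a minimal suggester built from an existing field might look like this in solrconfig.xml (the component, handler, and field names are assumptions):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">titleSuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <!-- suggestions are drawn from values that already exist in this field -->
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">titleSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Queries then go to /suggest?suggest.q=... and the suggestions come from values that actually exist in the index.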

I agree that trying to fix this after you have the query is hard.

If edismax supported fuzzy matching, it would be much easier. I know that, 
because
we’ve been running that patch (SOLR-629) in prod for several years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2020, at 4:27 AM, Erick Erickson  wrote:
> 
> Anything you do will be wrong ;).
> 
> I suppose you could kick out words that weren’t in some dictionary and 
> accumulate a list of words not in the dictionary and just deal with them 
> “somehow", but that’s labor-intensive since you then have to deal with proper 
> names and the like. Sometimes you can get by with ignoring words with _only_ 
> the first letter capitalized, which is also not perfect but might get you 
> closer. You mentioned phonetic filters, but frankly I have no idea whether 
> YES and YY would reduce to the same code, I rather doubt 
> it.
> 
> In general, you _can’t_ solve this problem perfectly without inspecting each 
> input, you can only get an approximation. And at some point it’s worth asking 
> “is it worth it?”. I suppose you could try the regex Andy suggested in a 
> copyField destination and use that as well as the primary field in queries, 
> that might help at least find things like this.
> 
> If we were just able to require humans to use proper spelling, this would be 
> a lot easier….
> 
> Wish there were a solution
> 
> Best,
> Erick
> 
>> On Oct 8, 2020, at 10:59 PM, Mike Drob  wrote:
>> 
>> I was thinking about that, but there are words that are legitimately
>> different with repeated consonants. My primary school teacher lost hair
>> over getting us to learn the difference between desert and dessert.
>> 
>> Maybe we need something that can borrow the boosting behaviour of fuzzy
>> query - match the exact term, but also the neighbors with a slight deboost,
>> so that if the main term exists those others won't show up.
>> 
>> On Thu, Oct 8, 2020 at 5:46 PM Andy Webb  wrote:
>> 
>>> How about something like this?
>>> 
>>> {
>>>   "add-field-type": [
>>>   {
>>>   "name": "norepeat",
>>>   "class": "solr.TextField",
>>>   "analyzer": {
>>>   "tokenizer": {
>>>   "class": "solr.StandardTokenizerFactory"
>>>   },
>>>   "filters": [
>>>   {
>>>   "class": "solr.LowerCaseFilterFactory"
>>>   },
>>>   {
>>>   "class": "solr.PatternReplaceFilterFactory",
>>>   "pattern": "(.)\\1+",
>>>   "replacement": "$1"
>>>   }
>>>   ]
>>>   }
>>>   }
>>>   ]
>>> }
>>> 
>>> This finds a match...
>>> 
>>> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes=YyyeeEssSs=norepeat
>>> 
>>> Andy
>>> 
>>> 
>>> 
>>> On Thu, 8 Oct 2020 at 23:02, Mike Drob  wrote:
>>> 
>>>> I'm looking for a way to transform words with repeated letters into the
>>>> same token - does something like this exist out of the box? Do our
>>> stemmers
>>>> support it?
>>>> 
>>>> For example, say I would want all of these terms to return the same
>>> search
>>>> results:
>>>> 
>>>> YES
>>>> YESSS
>>>> YYYEEESSS
>>>> YYEE[...]S
>>>> 
>>>> I don't know how long a user would hold down the S key at the end to
>>>> capture their level of excitement, and I don't want to manually define
>>>> synonyms for every length.
>>>> 
>>>> I'm pretty sure that I don't want PhoneticFilter here, maybe
>>>> PatternReplace? Not a huge fan of how that one is configured, and I think
>>>> I'd have to set up a bunch of patterns inline for it?
>>>> 
>>>> Mike
>>>> 
>>> 
> 



Re: Solr endpoint on the public internet

2020-10-08 Thread Walter Underwood
Let me know where it is and I’ll delete all the documents in your collection.
It is easy, just one HTTP request.

https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 8, 2020, at 11:49 AM, Alexandre Rafalovitch  wrote:
> 
> I think there were past discussions about people doing this, but they really,
> really knew what they were doing from a security perspective, not just
> a Solr one.
> 
> You are increasing your risk factor a lot, so you need to think
> through this. What are you protecting and what are you exposing. Are
> you trying to protect the updates? You may be able to do it with - for
> example - read-only docker container, or with embedded Solr or/and
> with reverse proxy.
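
One hedged sketch of the reverse-proxy idea, using nginx (paths, port, and collection name are assumptions; it narrows the surface but does not by itself stop expensive or abusive queries):

# only the select handler of one collection is reachable from outside
location /solr/mycollection/select {
    proxy_pass http://127.0.0.1:8983;
}
# everything else under /solr/ (updates, admin, other handlers) is refused
location /solr/ {
    deny all;
}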
> 
> Are you trying to protect some of the data from being read? Even harder.
> 
> There are implicit handlers, admin handlers, 'qt' to select query
> parser, etc. Lots of things to think about.
> 
> It just may not be worth it.
> 
> Regards,
>   Alex.
> 
> 
> On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  wrote:
>> 
>> Hi!
>> 
>> We're looking into the option of setting up search with Solr without an
>> intermediary application. This would mean our backend would index data into
>> Solr and we would have a public Solr endpoint on the internet that would
>> receive search requests directly.
>> 
>> Since I couldn't find an existing solution similar to ours, I would like to
>> know whether it's possible to secure Solr in a way that allows anyone only
>> read-access only to collections and how to achieve that. Specifically
>> because of this part of the documentation
>> <https://lucene.apache.org/solr/guide/8_5/securing-solr.html>:
>> 
>> *No Solr API, including the Admin UI, is designed to be exposed to
>> non-trusted parties. Tune your firewall so that only trusted computers and
>> people are allowed access. Because of this, the project will not regard
>> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
>> ask you to report such issues in JIRA.*
>> Is there a way we can restrict read-only access to Solr collections so as
>> to allow users to make search requests directly to it or should we always
>> keep our Solr instances completely private?
>> 
>> Thanks in advance!
>> 
>> Best regards,
>> Marco Godinho



Re: Term too complex for spellcheck.q param

2020-10-07 Thread Walter Underwood
The spellcheck feature was replaced by the suggester in Solr 4, released in 
2012,
so I would not expect any changes in spellcheck.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 7, 2020, at 3:53 PM, gnandre  wrote:
> 
> Is there a way to truncate spellcheck.q param value from Solr side?
> 
> On Wed, Oct 7, 2020, 6:22 PM gnandre  wrote:
> 
>> Thanks. Is this going to be fixed in some future version?
>> 
>> On Wed, Oct 7, 2020, 4:15 PM Mike Drob  wrote:
>> 
>>> Right now the only solution is to use a shorter term.
>>> 
>>> In a fuzzy query you could also try using a lower edit distance e.g.
>>> term~1
>>> (default is 2), but I’m not sure what the syntax for a spellcheck would
>>> be.
>>> 
>>> Mike
>>> 
>>> On Wed, Oct 7, 2020 at 2:59 PM gnandre  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I am getting following error when I pass '
>>>> 김포오피➬유유닷컴➬✗UUDAT3.COM유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마
>>>> ' in spellcheck.q param. How to avoid this error? I am using Solr 8.5.2
>>>> 
>>>> {
>>>>  "error": {
>>>>"code": 500,
>>>>"msg": "Term too complex: 김포오피➬유유닷컴➬✗uudat3.com
>>>> 유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마",
>>>>"trace":
>>> "org.apache.lucene.search.FuzzyTermsEnum$FuzzyTermsException:
>>>> Term too complex:
>>>> 김포오피➬유유닷컴➬✗uudat3.com유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.FuzzyAutomatonBuilder.buildAutomatonSet(FuzzyAutomatonBuilder.java:63)\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.FuzzyTermsEnum$AutomatonAttributeImpl.init(FuzzyTermsEnum.java:365)\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.FuzzyTermsEnum.(FuzzyTermsEnum.java:125)\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.FuzzyTermsEnum.(FuzzyTermsEnum.java:92)\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.spell.DirectSpellChecker.suggestSimilar(DirectSpellChecker.java:425)\n\tat
>>>> 
>>>> 
>>> org.apache.lucene.search.spell.DirectSpellChecker.suggestSimilar(DirectSpellChecker.java:376)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.spelling.DirectSolrSpellChecker.getSuggestions(DirectSolrSpellChecker.java:196)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:195)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)\n\tat
>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)\n\tat
>>>> 
>>> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802)\n\tat
>>>> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420)\n\tat
>>>> 
>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat
>>>> 
>>>> 
>>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat
>>>> 
>>>> 
>&

Re: Java GC issue investigation

2020-10-07 Thread Walter Underwood
First thing is to stop using CMS and use G1GC.

We’ve been using these settings with over a hundred machines
in prod for nearly four years.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 7, 2020, at 2:39 AM, Karol Grzyb  wrote:
> 
> Hi Matthew, Erick!
> 
> Thank you very much for the feedback, I'll try to convince them to
> reduce the heap size.
> 
> current GC settings:
> 
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> 
> Kind regards,
> Karol
> 
> 
> wt., 6 paź 2020 o 16:52 Erick Erickson  napisał(a):
>> 
>> 12G is not that huge, it’s surprising that you’re seeing this problem.
>> 
>> However, there are a couple of things to look at:
>> 
>> 1> If you’re saying that you have 16G total physical memory and are 
>> allocating 12G to Solr, that’s an anti-pattern. See:
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> If at all possible, you should allocate between 25% and 50% of your physical 
>> memory to Solr...
>> 
>> 2> what garbage collector are you using? G1GC might be a better choice.
>> 
>>> On Oct 6, 2020, at 10:44 AM, matthew sporleder  wrote:
>>> 
>>> Your index is so small that it should easily get cached into OS memory
>>> as it is accessed.  Having a too-big heap is a known problem
>>> situation.
>>> 
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
>>> 
>>> On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb  wrote:
>>>> 
>>>> Hi Matthew,
>>>> 
>>>> Thank you for the answer. I cannot reproduce the setup locally, so I'll
>>>> try to convince them to reduce Xmx; I guess they will not agree
>>>> to 1GB, but something less than 12G for sure.
>>>> I'll also push for a proper dev setup, because for now we can only test prod
>>>> or stage, which are difficult to adjust.
>>>> 
>>>> Is being stuck in GC common behaviour when the index is small compared
>>>> to available heap during bigger load? I was more worried about the
>>>> ratio of heap to total host memory.
>>>> 
>>>> Regards,
>>>> Karol
>>>> 
>>>> 
>>>> wt., 6 paź 2020 o 14:39 matthew sporleder  
>>>> napisał(a):
>>>>> 
>>>>> You have a 12G heap for a 200MB index?  Can you just try changing Xmx
>>>>> to, like, 1g ?
>>>>> 
>>>>> On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm involved in investigation of issue that involves huge GC overhead
>>>>>> that happens during performance tests on Solr Nodes. Solr version is
>>>>>> 6.1. Last test were done on staging env, and we run into problems for
>>>>>> <100 requests/second.
>>>>>> 
>>>>>> The size of the index itself is ~200MB ~ 50K docs
>>>>>> Index has small updates every 15min.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Queries involve sorting and faceting.
>>>>>> 
>>>>>> I've gathered some heap dumps, I can see from them that most of heap
>>>>>> memory is retained because of object of following classes:
>>>>>> 
>>>>>> -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
>>>>>> (>4G, 91% of heap)
>>>>>> -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
>>>>>> -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
>>>>>> -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
>>>>>> (>3.7G 76% of heap)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Based on the information above, is there anything generic that can be
>>>>>> looked at as a source of potential improvement without diving deeply
>>>>>> into the schema and queries (which may be very difficult to change at this
>>>>>> moment)? I don't see docValues being enabled - could this help, as if
>>>>>> I read the docs correctly, it's specifically helpful when there are
>>>>>> many sorts/grouping/facets? Or I
>>>>>> 
>>>>>> Additionally I see that many threads are blocked on LRUCache.get;
>>>>>> should I recommend switching to FastLRUCache?
>>>>>> 
>>>>>> Also, I wonder if -Xmx12288m for java heap is not too much for 16G
>>>>>> memory? I see some (~5/s) page faults in Dynatrace during the biggest
>>>>>> traffic.
>>>>>> 
>>>>>> Thank you very much for any help,
>>>>>> Kind regards,
>>>>>> Karol
>> 



Re: Order of applying tokens/filter

2020-10-06 Thread Walter Underwood
Synonyms only need to be done once. Generally, expand synonyms at index time 
only.

Also, consider the StandardTokenizer. It is a bit smarter and that can be 
useful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 5, 2020, at 9:08 PM, Jayadevan Maymala  
> wrote:
> 
>> 
>> ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
>> WhitespaceTokenizerFactory
>> SynonymGraphFilterFactory
>> FlattenGraphFilterFactory
>> KStemFilterFactory
>> RemoveDuplicatesFilterFactory
>> 
>> One doubt related to this. Ideally, the same sequence should be followed
> for indexing and querying, right?
> Regards,
> Jayadevan



Re: Order of applying tokens/filter

2020-10-04 Thread Walter Underwood
Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space. 
Want to search for “vitamin a”? Better not remove stopwords.
2. Synonyms are before the stemmer, especially the Porter stemmer, where the 
output isn’t English words.
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have 
better technology now.
4. Add RemoveDuplicatesFilter as the last step, just in case your synonyms stem 
to the same word. It is cheap insurance.

Also, I really recommend using the ICUNormalizer2CharFilterFactory with “nfkc” 
mode as the first step before the tokenizer. Otherwise, you’ll get bitten by 
some weird Unicode thing that takes days to debug. And if you are going to 
lower-case everything, let ICU do that for you with “nfkc_cf” mode.

So that gives:

ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesFilterFactory
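
Spelled out as a fieldType, that chain might look like this (a sketch; the type name and synonyms file are assumptions, and the ICU factories need the analysis-extras module on the classpath):

<fieldType name="text_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms expanded at index time only -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>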

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala  
> wrote:
> 
> Hi all,
> 
> Is this the best (in terms of both performance and efficacy) order of applying
> analyzers/filters? We have an eCom site where many products are listed,
> and users may type in search words and get relevant results.
> 
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase  (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
> 
> Any possible gotchas?
> 
> Regards,
> Jayadevan



Re: advice on whether to use stopwords for use case

2020-10-01 Thread Walter Underwood
I can’t think of an easy way to do this in Solr.

Do a bunch of string searches on the query on the client side. If any of them 
match, 
make a “no hits” result page.
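
A minimal sketch of that client-side check, with an assumed term list:

# Assumed blocklist; the real one comes from the business requirement.
BLOCKED_TERMS = {"cigar", "cigars", "cigarette", "cigarettes"}

def is_blocked(query: str) -> bool:
    # Check the raw query before it is ever sent to Solr.
    return any(term in BLOCKED_TERMS for term in query.lower().split())

The module that has to return nothing just renders its empty-results page when is_blocked() is true; every other query goes to Solr unchanged.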

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> 
> Yes, the requirement (for now) is not to return any results. I think they 
> may change the requirements, pending their return from the holidays.
> 
>> If so, then check for those words in the query before sending it to Solr.
> That is what I think so too.
> 
> Thinking further, using stopwords for this, there will still be results 
> returned when the number of words in the search keywords is more than the 
> number of stopwords.
> 
> On 1/10/2020 2:57 am, Walter Underwood wrote:
>> I’m not clear on the requirements. It sounds like the query “cigar” or 
>> “cuban cigar”
>> should return zero results. Is that right?
>> 
>> If so, then check for those words in the query before sending it to Solr.
>> 
>> But the stopwords approach seems like the requirement is different. Could 
>> you give
>> some examples?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
>>> <mailto:arafa...@gmail.com> wrote:
>>> 
>>> You may also want to look at something like: 
>>> https://docs.querqy.org/index.html <https://docs.querqy.org/index.html>
>>> 
>>> ApacheCon had (is having..) a presentation on it that seemed quite
>>> relevant to your needs. The videos should be live in a week or so.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
>>> <mailto:arafa...@gmail.com> wrote:
>>>> I am not sure why you think stop words are your first choice. Maybe I
>>>> misunderstand the question. I read it as that you need to exclude
>>>> completely a set of documents that include specific keywords when
>>>> called from specific module.
>>>> 
>>>> If I wanted to differentiate the searches from specific module, I
>>>> would give that module a different end-point (Request Query Handler),
>>>> instead of /select. So, /nocigs or whatever.
>>>> 
>>>> Then, in that end-point, you could do all sorts of extra things, such
>>>> as setting appends or even invariants parameters, which would include
>>>> filter query to exclude any documents matching specific keywords. I
>>>> assume it is ok to return documents that are matching for other
>>>> reasons.
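
A sketch of that endpoint in solrconfig.xml (the handler name and flag field are assumptions; the flag would be set on cigarette-related documents at index time):

<requestHandler name="/nocigs" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- the module that must never see these documents is forced through this filter -->
    <str name="fq">-cigs_flag:true</str>
  </lst>
</requestHandler>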
>>>> 
>>>> Ideally, you would mark the cigs documents during indexing with a
>>>> binary or enumeration flag and then during search you just need to
>>>> check against that flag. In that case, you could copyField  your text
>>>> and run it against something like
>>>> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>>>>  
>>>> <https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter>
>>>> combined with Shingles for multiwords. Or similar. And just transform
>>>> it as index-only so that the result is basically a yes/no flag.
>>>> Similar thing could be done with UpdateRequestProcessor pipeline if
>>>> you want to end up with a true boolean flag. The idea is the same,
>>>> just to have an index-only flag that you force lock into for any
>>>> request from specific module.
>>>> 
>>>> Or even with something like ElevationSearchComponent. Same idea.
>>>> 
>>>> Hope this helps.
>>>> 
>>>> Regards,
>>>>   Alex.
>>>> 
>>>> On Tue, 29 Sep 2020 at 22:28, Derek Poh  
>>>> <mailto:d...@globalsources.com> wrote:
>>>>> Hi
>>>>> 
>>>>> I have read in the mailings list that we should try to avoid using stop
>>>>> words.
>>>>> 
>>>>> I have a use case where I would like to know if there is other
>>>>> alternative solutions beside using stop words.
>>>>> 
>>>>> There is business requirement to return zero result when the search is
>>>>> cigarette related words and the search is coming from a particular
>>>>> module on our site. It does not ap

Re: Master/Slave

2020-09-30 Thread Walter Underwood
We do this sort of thing outside of Solr. The indexing process includes creating
a feed file with one JSON object per line. The feed files are stored in S3 with
names that are ISO 8601 timestamps. Those files are picked up and loaded into
Solr. Because S3 is cross-region in AWS, those files are also our disaster
recovery method for indexing. And of course, two clusters could be loaded
from the same file.
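
A stripped-down sketch of that flow (the bucket name, key prefix, and use of boto3
here are illustrative placeholders, not our actual tooling):

import json
from datetime import datetime, timezone
import boto3

docs = [{"id": "1", "title": "first doc"}, {"id": "2", "title": "second doc"}]

# One JSON object per line, file named with an ISO 8601 timestamp.
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
filename = f"feed-{stamp}.jsonl"
with open(filename, "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Park it in S3 so any cluster (or the DR site) can pick it up and load it.
boto3.client("s3").upload_file(filename, "example-feed-bucket", f"feeds/{filename}")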

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 12:09 PM, David Hastings  
> wrote:
> 
>> whether we should expect Master/Slave replication also to be deprecated
> 
> it better not ever be deprecated.  it has been the most reliable mechanism
> for its purpose, solr cloud isn't going to replace standalone, if it does,
> that's when I guess I stop upgrading or move to elastic
> 
> On Wed, Sep 30, 2020 at 2:58 PM Oakley, Craig (NIH/NLM/NCBI) [C]
>  wrote:
> 
>> Based on the thread below (reading "legacy" as meaning "likely to be
>> deprecated in later versions"), we have been working to extract ourselves
>> from Master/Slave replication
>> 
>> Most of our collections need to be in two data centers (a read/write copy
>> in one local data center: the disaster-recovery-site SolrCloud could be
>> read-only). We also need redundancy within each data center for when one
>> host or another is unavailable. We implemented this by having different
>> SolrClouds in the different data centers; with Master/Slave replication
>> pulling data from one of the read/write replicas to each of the Slave
>> replicas in the disaster-recovery-site read-only SolrCloud. Additionally,
>> for some collections, there is a desire to have local read-only replicas
>> remain unchanged for querying during the loading process: for these
>> collections, there is a local read/write loading SolrCloud, a local
>> read-only querying SolrCloud (normally configured for Master/Slave
>> replication from one of the replicas of the loader SolrCloud to both
>> replicas of the query SolrCloud, but with Master/Slave disabled when the
>> load was in progress on the loader SolrCloud, and with Master/Slave resumed
>> after the loaded data passes QA checks).
>> 
>> Based on the thread below, we made an attempt to switch to CDCR. The main
>> reason for wanting to change was that CDCR was said to be the supported
>> mechanism, and the replacement for Master/Slave replication.
>> 
>> After multiple unsuccessful attempts to get CDCR to work, we ended up with
>> reproducible cases of CDCR losing data in transit. In June, I initiated a
>> thread in this group asking for clarification of how/whether CDCR could be
>> made reliable. This seemed to me to be met with deafening silence until the
>> announcement in July of the release of Solr8.6 and the deprecation of CDCR.
>> 
>> So we are left with the question whether we should expect Master/Slave
>> replication also to be deprecated; and if so, with what is it expected to
>> be replaced (since not with CDCR)? Or is it now sufficiently safe to assume
>> that Master/Slave replication will continue to be supported after all
>> (since the assertion that it would be replaced by CDCR has been
>> discredited)? In either case, are there other suggested implementations of
>> having a read-only SolrCloud receive data from a read/write SolrCloud?
>> 
>> 
>> Thanks
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Tuesday, May 21, 2019 11:15 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: SolrCloud (7.3) and Legacy replication slaves
>> 
>> On 5/21/2019 8:48 AM, Michael Tracey wrote:
>>> Is it possible set up an existing SolrCloud cluster as the master for
>>> legacy replication to a slave server or two?   It looks like another
>> option
>>> is to use Uni-direction CDCR, but not sure what is the best option in
>> this
>>> case.
>> 
>> You're asking for problems if you try to combine legacy replication with
>> SolrCloud.  The two features are not guaranteed to work together.
>> 
>> CDCR is your best bet.  This replicates from one SolrCloud cluster to
>> another.
>> 
>> Thanks,
>> Shawn
>> 



Re: advice on whether to use stopwords for use case

2020-09-30 Thread Walter Underwood
I’m not clear on the requirements. It sounds like the query “cigar” or “cuban 
cigar”
should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you 
give
some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
> wrote:
> 
> You may also want to look at something like: 
> https://docs.querqy.org/index.html
> 
> ApacheCon had (is having..) a presentation on it that seemed quite
> relevant to your needs. The videos should be live in a week or so.
> 
> Regards,
>   Alex.
> 
> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
> wrote:
>> 
>> I am not sure why you think stop words are your first choice. Maybe I
>> misunderstand the question. I read it as that you need to exclude
>> completely a set of documents that include specific keywords when
>> called from specific module.
>> 
>> If I wanted to differentiate the searches from specific module, I
>> would give that module a different end-point (Request Query Handler),
>> instead of /select. So, /nocigs or whatever.
>> 
>> Then, in that end-point, you could do all sorts of extra things, such
>> as setting appends or even invariants parameters, which would include
>> filter query to exclude any documents matching specific keywords. I
>> assume it is ok to return documents that are matching for other
>> reasons.
>> 
>> Ideally, you would mark the cigs documents during indexing with a
>> binary or enumeration flag and then during search you just need to
>> check against that flag. In that case, you could copyField  your text
>> and run it against something like
>> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>> combined with Shingles for multiwords. Or similar. And just transform
>> it as index-only so that the result is basically a yes/no flag.
>> Similar thing could be done with UpdateRequestProcessor pipeline if
>> you want to end up with a true boolean flag. The idea is the same,
>> just to have an index-only flag that you force lock into for any
>> request from specific module.
>> 
>> Or even with something like ElevationSearchComponent. Same idea.
>> 
>> Hope this helps.
>> 
>> Regards,
>>   Alex.
>> 
>> On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
>>> 
>>> Hi
>>> 
>>> I have read in the mailings list that we should try to avoid using stop
>>> words.
>>> 
>>> I have a use case where I would like to know if there is other
>>> alternative solutions beside using stop words.
>>> 
>>> There is business requirement to return zero result when the search is
>>> cigarette related words and the search is coming from a particular
>>> module on our site. It does not apply to all searches from our site.
>>> There is a list of these cigarette related words. This list contains
>>> single word, multiple words (Electronic cigar), multiple words with
>>> punctuation (e-cigarette case).
>>> I am planning to copy a different set of search fields, that will
>>> include the stopword filter in the index and query stage, for this
>>> module to use.
>>> 
>>> For this use case, other than using stop words to handle it, is there
>>> any alternative solution?
>>> 
>>> Derek
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE
>>> 
>>> This e-mail (including any attachments) may contain confidential and/or 
>>> privileged information. If you are not the intended recipient or have 
>>> received this e-mail in error, please inform the sender immediately and 
>>> delete this e-mail (including any attachments) from your computer, and you 
>>> must not use, disclose to anyone else or copy this e-mail (including any 
>>> attachments), whether in whole or in part.
>>> 
>>> This e-mail and any reply to it may be monitored for security, legal, 
>>> regulatory compliance and/or other appropriate reasons.



Re: Doing what copyField does using SolrJ API

2020-09-17 Thread Walter Underwood
If you want to ignore a field being sent to Solr, you can set indexed=false and 
stored=false for that field in schema.xml. It will take up room in schema.xml 
but
zero room on disk.
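
For example (the field name here is only an example):

<field name="ignore_me" type="string" indexed="false" stored="false" multiValued="true"/>

A dynamic field with the same attributes works too if there are many such incoming
fields to throw away.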

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 17, 2020, at 10:23 AM, Alexandre Rafalovitch  
> wrote:
> 
> Solr has a whole pipeline that you can run during document ingesting before
> the actual indexing happens. It is called Update Request Processor (URP)
> and is defined in solrconfig.xml or in an override file. Obviously, since
> you are indexing from SolrJ client, you have even more flexibility, but it
> is good to know about anyway.
> 
> You can read all about it at:
> https://lucene.apache.org/solr/guide/8_6/update-request-processors.html and
> see the extensive list of processors you can leverage. The specific
> mentioned one is this one:
> https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
> 
> Just a word of warning that Stateless URP is using Javascript, which is
> getting a bit of a complicated story as underlying JVM is upgraded (Oracle
> dropped their javascript engine in JDK 14). So if one of the simpler URPs
> will do the job or a chain of them, that may be a better path to take.
> 
> Regards,
>   Alex.
> 
> 
> On Thu, 17 Sep 2020 at 13:13, Steven White  wrote:
> 
>> Thanks Erick.  Where can I learn more about "stateless script update
>> processor factory".  I don't know what you mean by this.
>> 
>> Steven
>> 
>> On Thu, Sep 17, 2020 at 1:08 PM Erick Erickson 
>> wrote:
>> 
>>> 1000 fields is fine, you'll waste some cycles on bookkeeping, but I
>> really
>>> doubt you'll notice. That said, are these fields used for searching?
>>> Because you do have control over what gous into the index if you can put
>> a
>>> "stateless script update processor factory" in your update chain. There
>> you
>>> can do whatever you want, including combine all the fields into one and
>>> delete the original fields. There's no point in having your index
>> cluttered
>>> with unused fields, OTOH, it may not be worth the effort just to satisfy
>> my
>>> sense of aesthetics 
>>> 
>>> On Thu, Sep 17, 2020, 12:59 Steven White  wrote:
>>> 
>>>> Hi Eric,
>>>> 
>>>> Yes, this is coming from a DB.  Unfortunately I have no control over
>> the
>>>> list of fields.  Out of the 1000 fields that there maybe, no document,
>>> that
>>>> gets indexed into Solr will use more then about 50 and since i'm
>> copying
>>>> the values of those fields to the catch-all field and the catch-all
>> field
>>>> is my default search field, I don't expect any problem for having 1000
>>>> fields in Solr's schema, or should I?
>>>> 
>>>> Thanks
>>>> 
>>>> Steven
>>>> 
>>>> 
>>>> On Thu, Sep 17, 2020 at 8:23 AM Erick Erickson <
>> erickerick...@gmail.com>
>>>> wrote:
>>>> 
>>>>> “there over 1000 of them[fields]”
>>>>> 
>>>>> This is often a red flag in my experience. Solr will handle that many
>>>>> fields, I’ve seen many more. But this is often a result of
>>>>> “database thinking”, i.e. your mental model of how all this data
>>>>> is from a DB perspective rather than a search perspective.
>>>>> 
>>>>> It’s unwieldy to have that many fields. Obviously I don’t know the
>>>>> particulars of
>>>>> your app, and maybe that’s the best design. Particularly if many of
>> the
>>>>> fields
>>>>> are sparsely populated, i.e. only a small percentage of the documents
>>> in
>>>>> your
>>>>> corpus have any value for that field then taking a step back and
>>> looking
>>>>> at the design might save you some grief down the line.
>>>>> 
>>>>> For instance, I’ve seen designs where instead of
>>>>> field1:some_value
>>>>> field2:other_value….
>>>>> 
>>>>> you use a single field with _tokens_ like:
>>>>> field:field1_some_value
>>>>> field:field2_other_value
>>>>> 
>>>>> that drops the complexity and increases performance.
>>>>> 
>>>>> Anyway, just a thought you might want to consider.
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Sep 16, 2020, at 9:31 PM, Steven White 
>>>> wrote:
>>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I figured it out.  It is as simple as creating a List and
>>> using
>>>>>> that as the value part for SolrInputDocument.addField() API.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Steven
>>>>>> 
>>>>>> 
>>>>>> On Wed, Sep 16, 2020 at 9:13 PM Steven White >> 
>>>>> wrote:
>>>>>> 
>>>>>>> Hi everyone,
>>>>>>> 
>>>>>>> I want to avoid creating a >>>>>> source="OneFieldOfMany"/> in my schema (there will be over 1000 of
>>>> them
>>>>> and
>>>>>>> maybe more so managing it will be a pain).  Instead, I want to use
>>>> SolrJ
>>>>>>> API to do what  does.  Any example of how I can do
>> this?
>>>> If
>>>>>>> there is an example online, that would be great.
>>>>>>> 
>>>>>>> Thanks in advance.
>>>>>>> 
>>>>>>> Steven
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 



Re: Updating configset

2020-09-11 Thread Walter Underwood
I wrote some Python to get the Zookeeper address from CLUSTERSTATUS, then
use the Kazoo library to upload a configset. Then it goes back to the cluster 
and
runs an async command to RELOAD.

I really should open source that thing (in my copious free time).
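
Roughly, the shape of it is something like this (a sketch, not the actual script; the
ZooKeeper string, configset and collection names are placeholders, and it assumes the
configset files sit in a local conf/ directory):

from pathlib import Path
import requests
from kazoo.client import KazooClient

SOLR = "http://localhost:8983/solr"        # any node in the cluster
ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"    # e.g. pulled from CLUSTERSTATUS
CONFIGSET = "my_configset"
COLLECTION = "my_collection"

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
for path in Path("conf").rglob("*"):
    if path.is_file():
        znode = f"/configs/{CONFIGSET}/{path.relative_to('conf').as_posix()}"
        data = path.read_bytes()
        if zk.exists(znode):
            zk.set(znode, data)
        else:
            zk.create(znode, data, makepath=True)
zk.stop()

# Reload asynchronously, then poll for completion.
requests.get(f"{SOLR}/admin/collections",
             params={"action": "RELOAD", "name": COLLECTION, "async": "reload-1"})
requests.get(f"{SOLR}/admin/collections",
             params={"action": "REQUESTSTATUS", "requestid": "reload-1"})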

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 11, 2020, at 9:35 AM, Tomás Fernández Löbbe  
> wrote:
> 
> I was in the same situation recently. I think it would be nice to have the
> configset UPLOAD command be able to override the existing configset instead
> of just fail (with a parameter such as override=true or something). We need
> to be careful with the trusted/unstrusted flag there, but that should be
> possible.
> 
>> If we can’t modify the configset wholesale this way, is it possible to
> create a new configset and swap the old collection to it?
> You can create a new one and then call MODIFYCOLLECTION on the collection
> that uses it:
> https://lucene.apache.org/solr/guide/8_6/collection-management.html#modifycollection-parameters.
> I've never used that though.
> 
> On Fri, Sep 11, 2020 at 7:26 AM Carroll, Michael (ELS-PHI) <
> m.carr...@elsevier.com> wrote:
> 
>> Hello,
>> 
>> I am running SolrCloud in Kubernetes with Solr version 8.5.2.
>> 
>> Is it possible to update a configset being used by a collection using a
>> SolrCloud API directly? I know that this is possible using the zkcli and a
>> collection RELOAD. We essentially want to be able to checkout our configset
>> from source control, and then replace everything in the active configset in
>> SolrCloud (other than the schema.xml).
>> 
>> We have a couple of custom plugins that use config files that reside in
>> the configset, and we don’t want to have to rebuild the collection or
>> access zookeeper directly if we don’t have to. If we can’t modify the
>> configset wholesale this way, is it possible to create a new configset and
>> swap the old collection to it?
>> 
>> Best,
>> Michael Carroll
>> 



Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Walter Underwood
It is very common for us to do more processing in the index analysis chain. In 
general, we do that when we want additional terms in the index to be 
searchable. Some examples:

* synonyms: If the book title is “EMT” add “Emergency Medical Technician”.
* ngrams: For prefix matching, generate all edge ngrams, for example for 
“french” add “f”, “fr”, “fre”, “fren”, and “frenc”.
* shingles: Make pairs, so the query “babysitter” can match “baby sitter”.
* split on delimiters: break up compounds, so “baby sitter” can match 
“baby-sitter”. Do this before shingles and you get matches for “babysitter”, 
“baby-sitter”, and “baby sitter”.
* remove HTML: we rarely see HTML in queries, but we never know when someone 
will get clever with the source text, sigh.
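
A sketch of what that looks like in schema.xml (the type name, synonym file, and ngram
sizes are only examples, not our production settings):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The extra terms (synonyms, edge ngrams) only exist in the index; the query side stays
plain so user input is matched as typed.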

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 10, 2020, at 9:48 AM, Erick Erickson  wrote:
> 
> When you want to do something different and index and query time. There, an 
> answer that’s almost, but not quite, completely useless while being accurate 
> ;)
> 
> A concrete example is synonyms as have been mentioned. Say you have an 
> index-time synonym definition of
> A,B,C
> 
> These three tokens will be “stacked” in the index wherever any of them are 
> found. 
> A query "q=field:B” would find a document with any of the three tokens in the 
> original. It would be wasteful for the query to be transformed into 
> “q=field:(A B C)”…
> 
> And take a very close look at WordDelimiterGraphFilterFactory. I’m pretty 
> sure you’ll find the parameters are different. Say the parameters for the 
> input 123-456-7890 cause WDGFF to add
> 123, 456, 7890, 1234567890 to the index. Again, at query time you don’t need 
> to repeat and have all of those tokens in the search itself.
> 
> Best,
> Erick
> 
>> On Sep 10, 2020, at 12:41 PM, Alexandre Rafalovitch  
>> wrote:
>> 
>> There are a lot of different use cases and the separate analyzers for
>> indexing and query is part of the Solr power. For example, you could
>> apply ngram during indexing time to generate multiple substrings. But
>> you don't want to do that during the query, because otherwise you are
>> matching on 'shared prefix' instead of on what user entered. Thinking
>> phone number directory where people may enter any suffix and you want
>> to match it.
>> See for example
>> https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
>> , starting slide 16 onwards.
>> 
>> Or, for non-production but fun use case:
>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
>> (search phonetically mapped Thai text in English).
>> 
>> Similarly, you may want to apply synonyms at query time only if you
>> want to avoid diluting some relevancy. Or at index type to normalize
>> spelling and help relevancy.
>> 
>> Or you may want to be doing some accent folding for sorting or
>> faceting (which uses indexed tokens).
>> 
>> Regards,
>>  Alex.
>> 
>> On Thu, 10 Sep 2020 at 11:19, Steven White  wrote:
>>> 
>>> Hi everyone,
>>> 
>>> In Solr's schema, I have come across field types that use a different logic
>>> for "index" than for "query".  To be clear, I"m talking about this block:
>>> 
>>>    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>>>      <analyzer type="index">
>>>        ...
>>>      </analyzer>
>>>      <analyzer type="query">
>>>        ...
>>>      </analyzer>
>>>    </fieldType>
>>> 
>>> Why would one want to not use the same logic for both and simply use:
>>> 
>>>    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>>>      <analyzer>
>>>        ...
>>>      </analyzer>
>>>    </fieldType>
>>> 
>>> What are real world use cases to use a different analyzer for index and
>>> query?
>>> 
>>> Thanks,
>>> 
>>> Steve
> 



Re: Understanding Solr heap %

2020-09-01 Thread Walter Underwood
This is misleading and not particularly good advice.

Solr 8 does NOT contain G1. G1GC is a feature of the JVM. We’ve been using
it with Java 8 and Solr 6.6.2 for a few years.

A test with eighty documents doesn’t test anything. Try a million documents to
get Solr memory usage warmed up.

GC_TUNE has been in the solr.in.sh file for a long time. Here are the settings
we use with Java 8. We have about 120 hosts running Solr in six prod clusters.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 1, 2020, at 8:39 AM, Joe Doupnik  wrote:
> 
> Erick states this correctly. To give some numbers from my experiences, 
> here are two slides from my presentation about installing Solr 
> (https://netlab1.net/ <https://netlab1.net/>, locate item "Solr/Lucene Search 
> Service"):
> [slide images omitted in the plain-text archive]
> Thus we see a) experiments are the key, just as Erick says, and b) the 
> choice of garbage collection algorithm plays a major role.
> In my setup I assigned SOLR_HEAP to be 2048m, SOLR_OPTS has -Xss1024k, 
> plus stock GC_TUNE values. Your "memorage" may vary.
> Thanks,
> Joe D.
> 
> On 01/09/2020 15:33, Erick Erickson wrote:
>> You want to run with the smallest heap you can due to Lucene’s use of 
>> MMapDirectory, 
>> see the excellent:
>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 
>> <https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>
>> 
>> There’s also little reason to have different Xms and Xmx values, that just 
>> means you’ll
>> eventually move a bunch of memory around as the heap expands, I usually set 
>> them both
>> to the same value.
>> 
>> How to determine what “the smallest heap you can” is? Unfortunately there’s 
>> no good way
>> outside of stress-testing your application with less and less memory until 
>> you have problems,
>> then add some extra…
>> 
>> Best,
>> Erick
>> 
>>> On Sep 1, 2020, at 10:27 AM, Dominique Bejean  
>>> <mailto:dominique.bej...@eolya.fr> wrote:
>>> 
>>> Hi,
>>> 
>>> As all Java applications the Heap memory is regularly cleaned by the
>>> garbage collector (some young items moved to the old generation heap zone
>>> and unused old items removed from the old generation heap zone). This
>>> causes heap usage to continuously grow and reduce.
>>> 
>>> Regards
>>> 
>>> Dominique
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Sep 1, 2020 at 13:50, yaswanth kumar  
>>> <mailto:yaswanth...@gmail.com> wrote:
>>> 
>>>> Can someone make me understand on how the value % on the column Heap is
>>>> calculated.
>>>> 
>>>> I did created a new solr cloud with 3 solr nodes and one zookeeper, its
>>>> not yet live neither interms of indexing or searching, but I do see some
>>>> spikes in the HEAP column against nodes when I refresh the page multiple
>>>> times. Its like almost going to 95% (sometimes) and then coming down to 50%
>>>> 
>>>> Solr version: 8.2
>>>> Zookeeper: 3.4
>>>> 
>>>> JVM size configured in solr.in.sh is min of 1GB to max of 10GB (actually
>>>> RAM size on the node is 16GB)
>>>> 
>>>> Basically need to understand if I need to worry about this heap % which
>>>> was quite altering before making it live? or is that quite normal, because
>>>> this is new UI change on solr cloud is kind of new to us as we used to have
>>>> solr 5 version before and this UI component doesn't exists then.
>>>> 
>>>> --
>>>> Thanks & Regards,
>>>> Yaswanth Kumar Konathala.
>>>> yaswanth...@gmail.com <mailto:yaswanth...@gmail.com>
>>>> 
>>>> Sent from my iPhone
> 



Re: Exclude a folder/directory from indexing

2020-08-28 Thread Walter Underwood
For building a crawler, I’d start with Scrapy (https://scrapy.org 
<https://scrapy.org/>). It is a solid design and
should be easy to use for crawling web pages, files, or an API. 
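
A toy spider, just to show the shape of it (the start URL and extracted fields are
placeholders):

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    start_urls = ["https://www.example.com/"]

    def parse(self, response):
        # One item per page; a pipeline (or a later step) can post these to Solr.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow links and keep crawling.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with something like "scrapy runspider spider.py -o pages.jl" and feed the
resulting JSON lines to Solr however you like.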

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2020, at 4:16 AM, Joe Doupnik  wrote:
> 
> Some time ago I faced a roughly similar challenge. After many trials and 
> tests I ended up creating my own programs to accomplish the tasks of fetching 
> files, selecting which are allowed to be indexed, and feeding them into Solr 
> (POST style). This work is open source, found on https://netlab1.net/, web 
> page section titled Presentations of long term utility, item Solr/Lucene 
> Search Service. This is a set of docs, three small PHP programs, and a Solr 
> schema etc bundle, all within one downloadable zip file.
> On filtering found files, my solution uses a list of regular expressions 
> which are simple to state and to process. The docs discuss the rules. 
> Luckily, the code dealing with rules per se and doing the filtering is very 
> short and simple; see crawler.php for convertfilter() and filterbyname(). 
> Thus you may wish to consider them or equivalents for inclusion in your 
> system, whatever that may be.
> Thanks,
> Joe D.
> 
> On 27/08/2020 20:32, Alexandre Rafalovitch wrote:
>> If you are indexing from Drupal into Solr, that's the question for
>> Drupal's solr module. If you are doing it some other way, which way
>> are you doing it? bin/post command?
>> 
>> Most likely this is not the Solr question, but whatever you have
>> feeding data into Solr.
>> 
>> Regards,
>>   Alex.
>> 
>> On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
>>  wrote:
>>> Can you or how do you exclude a specific folder/directory from indexing in 
>>> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>>> 
>>> Thanks,
>>> 
>>> Phil Staley
>>> DCF Webmaster
>>> 608 422-6569
>>> phil.sta...@wisconsin.gov
>>> 
>>> 
> 



Re: PDF extraction using Tika

2020-08-26 Thread Walter Underwood
When I worked for a search engine vendor, we did exactly the same thing.

We always ran the document crackers in a different process because they tended 
to hang, crash, run forever, or use all of memory. Adobe PDFlib was not an 
exception to that rule.

wunder
Walter Underwood
Ultraseek Server (at Infoseek, Disney/GO.com, Inktomi, Verity, Autonomy)
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 26, 2020, at 5:18 AM, Jan Høydahl  wrote:
> 
> When I worked for a search engine vendor in my previous life, the PDF parsing 
> pipeline looked something like this
> 
> Try parsing the PDF file with tool X
> If failure or timeout, try instead with tool Y
> If failure or timeout, try instead with tool Z
> 
> In this case X would be the preferred parser, but Y and Z would be fallbacks 
> that would hopefully not fail in the same place as X.
> 
> Agree that PDFBox and Tika is impressive. However, in your own code you could 
> also fallback to some other tool if you want a more robust pipeline.
> 
> Jan
> 
>> 26. aug. 2020 kl. 11:06 skrev Charlie Hull :
>> 
>> Hi Joe,
>> 
>> Tika is pretty amazing at coping with the things people throw at it and I 
>> know the team behind it have added a very extensive testing framework. 
>> However, the reality is that malformed, huge or just plain crazy documents 
>> may cause crashes - PDFs are mad, you can even embed Javascript in them I 
>> believe, and I've also seen PDFs running to thousands of pages. There's *no 
>> way* to design out every possible crash, and it's far better to design your 
>> system to cope if necessary by separating the PDF processing from Solr.
>> 
>> Charlie
>> 
>> On 25/08/2020 11:46, Joe Doupnik wrote:
>>> More properly, it would be best to fix Tika and thus not push extra 
>>> complexity upon many many users. Error handling is one thing, crashes 
>>> though ought to be designed out.
>>>Thanks,
>>>Joe D.
>>> 
>>> On 25/08/2020 10:54, Charlie Hull wrote:
>>>> On 25/08/2020 06:04, Srinivas Kashyap wrote:
>>>>> Hi Alexandre,
>>>>> 
>>>>> Yes, these are the same PDF files running on Windows and Linux. There are 
>>>>> around 30 PDF files and I tried indexing a single file, but faced the same 
>>>>> error. Is it related to how the PDF is stored on Linux?
>>>> Did you try running Tika (the same version as you're using in Solr) 
>>>> standalone on the file as Alexandre suggested?
>>>>> 
>>>>> And with regard to DIH and TIKA going away, can you share any program 
>>>>> which extracts from PDF and pushes into Solr?
>>>> 
>>>> https://lucidworks.com/post/indexing-with-solrj/ is one example. You 
>>>> should run Tika separately as it's entirely possible for it to fail to 
>>>> parse a PDF and crash - and if you're running it in DIH & Solr it then 
>>>> brings down everything. Separate your PDF processing from your Solr 
>>>> indexing.
>>>> 
>>>> 
>>>> Cheers
>>>> 
>>>> Charlie
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Srinivas Kashyap
>>>>> 
>>>>> -Original Message-
>>>>> From: Alexandre Rafalovitch 
>>>>> Sent: 24 August 2020 20:54
>>>>> To: solr-user 
>>>>> Subject: Re: PDF extraction using Tika
>>>>> 
>>>>> The issue seems to be more with a specific file and at the level way 
>>>>> below Solr's or possibly even Tika's:
>>>>> Caused by: java.io.IOException: expected='>' actual='
>>>>> ' at offset 2383
>>>>> at
>>>>> org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
>>>>>  
>>>>> 
>>>>> Are you indexing the same files on Windows and Linux? I am guessing not. 
>>>>> I would try to narrow down which of the files it is. One way could be to 
>>>>> get a standalone Tika (make sure to match the version Solr
>>>>> embeds) and run it over the documents by itself. It will probably 
>>>>> complain with the same error.
>>>>> 
>>>>> Regards,
>>>>>Alex.
>>>>> P.s. Additionally, both DIH and Embedded Tika are not recommended for 
>>>>> production. And both will be going away in future Solr versions. You may 
>>>>> have a much less brittle pipeline if

Re: Solr doesn't run after editing solr.in.sh

2020-08-23 Thread Walter Underwood
Also, what platform is this on and what editor did you use (especially if you 
are on Windows)?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 23, 2020, at 4:35 PM, Erick Erickson  wrote:
> 
> Well, first show exactly what you uncommented. I doubt you uncommented them 
> one by one and tried everything, so you leave us guessing. Uncommenting 
> SOLR_HOME for instance would be shooting yourself in the foot since Solr 
> wouldn’t know where to start. Uncommenting some the authorization parameters 
> without providing the proper values would cause Solr not to run. Uncommenting 
> #SOLR_OPTS="$SOLR_OPTS -Dsolr.environment=prod” should be fine.
> 
> 
> Second, show us the output when you _do_ try to run. You can use the -f 
> option to dump logging to the console.
> 
> Best,
> Erick
> 
>> On Aug 23, 2020, at 9:58 AM, Joe Doupnik  wrote:
>> 
>> On 22/08/2020 22:08, maciejpreg...@tutanota.com.INVALID wrote:
>>> Good morning.
>>> When I uncomment any of commands in solr.in.sh, Solr doesn't run. What do I 
>>> have to do to fix a problem?
>>> Best regards,
>>> Maciej Pregiel
>> On 22/08/2020 22:08, maciejpreg...@tutanota.com.INVALID wrote:
>>> Good morning.
>>> When I uncomment any of commands in solr.in.sh, Solr doesn't run. What do I 
>>> have to do to fix a problem?
>>> Best regards,
>>> Maciej Pregiel
>> -
>>My approach has been to add local configuration options to the end of the 
>> file and leave intact the original text. Here is the end of my file, which 
>> has no changes above this material:
>> 
>> #SOLR_SECURITY_MANAGER_ENABLED=false
>> ## JRD values
>> SOLR_ULIMIT_CHECKS=false
>> GC_TUNE=" \
>> -XX:SurvivorRatio=4 \
>> -XX:TargetSurvivorRatio=90 \
>> -XX:MaxTenuringThreshold=8 \
>> -XX:+UseConcMarkSweepGC \
>> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
>> -XX:+CMSScavengeBeforeRemark \
>> -XX:PretenureSizeThreshold=64m \
>> -XX:+UseCMSInitiatingOccupancyOnly \
>> -XX:CMSInitiatingOccupancyFraction=50 \
>> -XX:CMSMaxAbortablePrecleanTime=6000 \
>> -XX:+CMSParallelRemarkEnabled \
>> -XX:+ParallelRefProcEnabled \
>> -XX:-OmitStackTraceInFastThrow"
>> #JRD give more memory
>> ##SOLR_HEAP="4096m"
>> SOLR_HEAP="2048m"
>> ##JRD enlarge this
>> #SOLR_OPTS="$SOLR_OPTS -Xss512k"
>> SOLR_OPTS="$SOLR_OPTS -Xss1024k"
>> SOLR_STOP_WAIT=30
>> SOLR_JAVA_HOME="/usr/java/latest/"
>> SOLR_PID_DIR="/home/search/solr"
>> SOLR_HOME="/home/search/solr/data"
>> SOLR_LOGS_DIR="/home/search/solr/logs"
>> SOLR_PORT="8983"
>> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=3000"
>> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=6"
>> SOLR_OPTS="$SOLR_OPTS -Djava.io.tmpdir=/home/search/tmp"
>> 
>>Thanks,
>>Joe D.
> 



Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu <https://www.sql-workbench.eu/>

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings  
> wrote:
> 
> Another thing to mention is to make sure the indexer you build doesn't send
> commits until it's actually done.  Made that mistake with some early in-house
> indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-17 Thread Walter Underwood
I’m seeing multiple red flags for performance here. The top ones are “DIH”,
“MongoDB”, and “SQL on MongoDB”. MongoDB is not a relational database.

Our multi-threaded extractor using the Mongo API was still three times slower
than the same approach on MySQL.

Check the CPU usage on the Solr hosts while you are indexing. If it is under 
50%, the bottleneck is MongoDB and single-threaded indexing.

For another check, run that same query in a regular database client and time it.
The Solr indexing will never be faster than that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 17, 2020, at 11:58 AM, Abhijit Pawar  wrote:
> 
> Sure Divye,
> 
> *Here's the config.*
> 
> *conf/solr-config.xml:*
> 
> 
> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
>   </lst>
> </requestHandler>
> 
> *schema.xml:*
> has all the field definitions
> 
> *conf/dataimport/data-source-config.xml*
> 
> 
> <dataConfig>
>   <dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver" url="mongodb://< ADDRESS>>:27017/<>"/>
>   <document>
>     <entity name="..." dataSource="mongod"
>             transformer="<>,TemplateTransformer"
>             onError="continue"
>             pk="uuid"
>             query="SELECT field1,field2,field3,.. FROM products"
>             deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE orgidStr = '${dataimporter.request.orgid}' AND idStr = '${dataimporter.delta.idStr}'"
>             deltaQuery="SELECT idStr FROM products WHERE orgidStr = '${dataimporter.request.orgid}' AND updatedAt > '${dataimporter.last_index_time}'">
>> 
> 
> 
> 
> 
> .
> .
> . 4-5 more nested entities...
> 
> On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
> wrote:
> 
>> Can you share the dih configuration you are using for same?
>> 
>> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>> 
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 



Looking for Solr contractor at Chegg

2020-08-17 Thread Walter Underwood
We plan to upgrade all of our clusters to Solr 8.x and are looking for a 
contractor. The Solr Cloud clusters are on 6.6.2 and we have a master/slave 
cluster on 4.10.4 with a customized edismax query parser (eedismax?).

https://jobs.chegg.com/job/CHEGA0056OWKCDFWP/Senior-Software-Engineer-Search-6-month-contract
 
<https://jobs.chegg.com/job/CHEGA0056OWKCDFWP/Senior-Software-Engineer-Search-6-month-contract>

Glad to answer further questions by direct email.

Sorry if this is off-topic for the solr-user list. I checked the Apache page 
and job listings aren’t prohibited, though they are somewhat different than the 
main topic of helping Solr users.

wunder
Walter Underwood
Principal Software Engineer, Chegg
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Multiple Collections in a Alias.

2020-08-12 Thread Walter Underwood
Different absolute scores from different collections are OK, because
the exact values depend on the number of deleted documents.

For the set of documents that are in different orders from different
collections, are the scores of that set identical? If they are, then it
is normal to have a different order from different collections.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 12, 2020, at 4:29 PM, Jae Joo  wrote:
> 
> Good question. How can I validate if the replicas are all synched?
> 
> 
> On Wed, Aug 12, 2020 at 7:28 PM Jae Joo  wrote:
> 
>> numFound  is same but different score.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 12, 2020 at 6:01 PM Aroop Ganguly
>>  wrote:
>> 
>>> Try a simple test of querying each collection 5 times in a row, if the
>>> numFound are different for a single collection within tase 5 calls then u
>>> have it.
>>> Please try it, what you may think is sync’d may actually not be. How do
>>> you validate correct sync ?
>>> 
>>>> On Aug 12, 2020, at 10:55 AM, Jae Joo  wrote:
>>>> 
>>>> The replications are all synched and there are no updates while I was
>>>> testing.
>>>> 
>>>> 
>>>> On Wed, Aug 12, 2020 at 1:49 PM Aroop Ganguly
>>>>  wrote:
>>>> 
>>>>> Most likely you have 1 or more collections behind the alias that have
>>>>> replicas out of sync :)
>>>>> 
>>>>> Try querying each collection to find the one out of sync.
>>>>> 
>>>>>> On Aug 12, 2020, at 10:47 AM, Jae Joo  wrote:
>>>>>> 
>>>>>> I have 10 collections in single alias and having different result sets
>>>>> for
>>>>>> every time with the same query.
>>>>>> 
>>>>>> Is it as designed or do I miss something?
>>>>>> 
>>>>>> The configuration and schema for all 10 collections are identical.
>>>>>> Thanks,
>>>>>> 
>>>>>> Jae
>>>>> 
>>>>> 
>>> 
>>> 



Re: Multiple Collections in a Alias.

2020-08-12 Thread Walter Underwood
Are the scores the same for the documents that are ordered differently?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 12, 2020, at 10:55 AM, Jae Joo  wrote:
> 
> The replications are all synched and there are no updates while I was
> testing.
> 
> 
> On Wed, Aug 12, 2020 at 1:49 PM Aroop Ganguly
>  wrote:
> 
>> Most likely you have 1 or more collections behind the alias that have
>> replicas out of sync :)
>> 
>> Try querying each collection to find the one out of sync.
>> 
>>> On Aug 12, 2020, at 10:47 AM, Jae Joo  wrote:
>>> 
>>> I have 10 collections in single alias and having different result sets
>> for
>>> every time with the same query.
>>> 
>>> Is it as designed or do I miss something?
>>> 
>>> The configuration and schema for all 10 collections are identical.
>>> Thanks,
>>> 
>>> Jae
>> 
>> 



Re: Searching for credit card numbers

2020-07-28 Thread Walter Underwood
If you reindex, I’ve become a big fan of adding a date field with an index 
timestamp.
That will allow you to check whether everything has been reindexed.
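
Something like this in schema.xml does the trick (field and type names are only
examples; it assumes a date/DatePointField type is defined in the schema):

<field name="index_time" type="pdate" indexed="true" stored="true" docValues="true" default="NOW"/>

Then a query such as index_time:[* TO 2021-03-01T00:00:00Z] shows what has not been
reindexed since the cutover.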

   

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 28, 2020, at 2:11 PM, Jörn Franke  wrote:
> 
> A regex search at query time would leave room for attacks (eg a regex can 
> easily be designed to block the Solr server forever).
> 
> If the field is store you can also try to use a cursor to go through all 
> entries using a cursor and reindex the doc based on the field:
> 
> https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html
> 
> This would also imply that you have the other fields stored. Otherwise 
> reindex.
> You can do this in parallel to the existing index and once finished simply 
> change the alias for the collection (that would be without any downtime for 
> the users but you require of course corresponding space).
> 
>> Am 28.07.2020 um 21:06 schrieb lstusr 5u93n4 :
>> 
>> Possible... yes. Agreed that this is the right approach. But if we already
>> have a big index that we're searching through? Any way to "hack it"?
>> 
>>> On Tue, 28 Jul 2020 at 14:55, Walter Underwood 
>>> wrote:
>>> 
>>> I’d do that at index time. Add an update request processor script that
>>> does the regex and adds a field has_credit_card_number:true.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>>> On Jul 28, 2020, at 11:50 AM, lstusr 5u93n4  wrote:
>>>> 
>>>> Let's say I have a text field that's been indexed with the standard
>>>> tokenizer, and I want to match the docs that have credit card numbers in
>>>> them (this is for altruistic purposes, not nefarious ones!). What's the
>>>> best way to build a search that will do this?
>>>> 
>>>> Searching for "   " seems to return inconsistent results.
>>>> 
>>>> Maybe a regex search? "[0-9]{4}?[0-9]{4}?[0-9]{4}?[0-9]{4}" seems like it
>>>> should work, but that's not matching the docs I think it should either...
>>>> 
>>>> Any suggestions?
>>>> 
>>>> Thanks In Advance!
>>> 
>>> 



Re: Searching for credit card numbers

2020-07-28 Thread Walter Underwood
I’d do that at index time. Add an update request processor script that
does the regex and adds a field has_credit_card_number:true.
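
One possible wiring, as a sketch (the chain name and script file are hypothetical; the
16-digit regex, plus a Luhn check if you want one, would live in the script, which sets
the boolean field before the document is indexed):

<updateRequestProcessorChain name="flag-ccn">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">flag-credit-cards.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Queries then just filter on fq=has_credit_card_number:true. Note it only covers
documents indexed after the chain is in place.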

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 28, 2020, at 11:50 AM, lstusr 5u93n4  wrote:
> 
> Let's say I have a text field that's been indexed with the standard
> tokenizer, and I want to match the docs that have credit card numbers in
> them (this is for altruistic purposes, not nefarious ones!). What's the
> best way to build a search that will do this?
> 
> Searching for "   " seems to return inconsistent results.
> 
> Maybe a regex search? "[0-9]{4}?[0-9]{4}?[0-9]{4}?[0-9]{4}" seems like it
> should work, but that's not matching the docs I think it should either...
> 
> Any suggestions?
> 
> Thanks In Advance!



Re: tlog keeps growing

2020-07-23 Thread Walter Underwood
This is a long shot, but look in the overseer queue to see if stuff is stuck. 
We ran into that with 6.x.
We restarted the instance that was the overseer and the newly-elected overseer 
cleared the queue.
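
For what it is worth, the Collections API can show the current overseer and its queue
sizes (host and port here are placeholders):

http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json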

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 23, 2020, at 10:43 AM, Erick Erickson  wrote:
> 
> Yes, you should have seen a new tlog after:
> - a doc was indexed
> - 15 minutes had passed
> - another doc was indexed
> 
> Well, yes, a leader can be in recovery. It looks like this:
> 
> - You’re indexing and docs are written to the tlog.
> - Solr un-gracefully shuts down so the segments haven’t been closed. Note, 
> these are thrown away on restart.
> - Solr is restarted and starts replaying the tlog.
> 
> But, the node shouldn’t be active during this time.
> 
> Of course it’s possible that for some strange reason, the tlog gets set to 
> the buffering state and never gets back to active, which is what the message 
> you posted seems to be indicating.
> 
> So I’m puzzled, let us know what you find…
> 
> Erick
> 
>> On Jul 23, 2020, at 12:56 PM, Gael Jourdan-Weil 
>>  wrote:
>> 
>>> Note that for my previous e-mail you’d have to wait 15 minutes after you 
>>> started indexing to see a new tlog and also wait until at least 1,000 new 
>>> document after _that_ before the large tlog went away. I don't think that’s 
>>> your issue though.
>> Indeed I did wait 15 minutes but not sure 1000 documents were indexed in the 
>> meantime. Though I should've seen a new tlog even if the large one was still 
>> there, right?
>> 
>>> So I think that’s the place to focus. Did the node recover completely and 
>>> go active? Just checking the admin UI and seeing it be green is sometimes 
>>> not enough. Check the state.json znode and see if the state is also 
>>> “active” there.
>> On ZooKeeper (through the Solr UI or directly connecting to ZK) I can see 
>> "state":"active" in the state.json. This seems fine.
>> To be more weird, this is the leader node. Can a leader be in recovery??
>> 
>>> Next, try sending a request directly to that replica. Frankly I’m not sure 
>>> what to expect, but if you get something weird that’d be a “smoking gun” 
>>> that no matter what the admin UI says, the replica isn’t really active. 
>>> Something like “http://blah blah 
>>> blah/solr/collection1_shard1_replica_n1?q=some_query=false. The 
>>> “distrib=false” is important, otherwise the request will be forwarded to a 
>>> truly active node.
>> The request works fine, I don't see anything weird at that time in the logs.
>> 
>> I will investigate further and take a look at all what you mentionned.
>> 
>> Kind regards,
>> Gaël
> 



Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-21 Thread Walter Underwood
Upgrade to 6.6.2. That will be compatible, but will fix several bugs that were
discovered during the 6.x releases.

If the problem happens after that, ask again. It might, we’ve had some issues
with 6.6.2, but upgrade first.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2020, at 10:37 PM, vishal patel  
> wrote:
> 
> I am using Solr version 6.1.0, Java 8 version and G1GC on production. We have 
> 2 shards and each shard has 1 replica.
> Sometimes my replica goes into recovery mode and when I check my GC log, I 
> cannot find a GC pause time of more than 600 milliseconds. Sometimes the GC pause 
> time goes near to 1 second but at that time the replica does not go into 
> recovery mode.
> 
> My Error Log:
> shard: https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view
> replica: 
> https://drive.google.com/file/d/1y0fC_n5u3MBMQbXrvxtqaD8vBBXDLR6I/view
> 
> When I searched my error "org.apache.http.NoHttpResponseException:  failed to 
> respond" in Google, I found the one Solr jira case : 
> https://issues.apache.org/jira/browse/SOLR-7483
> 
> Any one gives me details about that jira case? is it resolved in other jira 
> case?
> 
> Regards,
> Vishal patel
> 
> 
> 
> 
> Sent from Outlook<http://aka.ms/weboutlook>



Re: Solr fails to start with G1 GC

2020-07-16 Thread Walter Underwood
Instead of editing bin/solr, you should be able to set GC_TUNE in 
solr.in.sh, as I showed in my post below.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 16, 2020, at 7:52 AM, krishan goyal  wrote:
> 
> The issue was figured out by starting solr with the -f parameter which
> starts solr in foreground and provides the errors if any
> 
> Got an error - "Conflicting collector combinations in option list; please
> refer to the release notes for the combinations allowed"
> 
> Turns out the bin/solr script starts with CMS by default and I had to disable that
> to resolve the conflict.
> 
> 
> On Wed, Jul 15, 2020 at 10:20 PM Walter Underwood 
> wrote:
> 
>> I don’t see a heap size specified, so it is probably trying to run with
>> a 512 Megabyte heap. That might just not work with the 32M region
>> size.
>> 
>> Here are the options we have been using for 3+ years on about 150 hosts.
>> 
>> SOLR_HEAP=8g
>> # Use G1 GC  -- wunder 2017-01-23
>> # Settings from https://wiki.apache.org/solr/ShawnHeisey
>> GC_TUNE=" \
>> -XX:+UseG1GC \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=8m \
>> -XX:MaxGCPauseMillis=200 \
>> -XX:+UseLargePages \
>> -XX:+AggressiveOpts \
>> "
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 15, 2020, at 4:24 AM, krishan goyal 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I am using Solr 7.7
>>> 
>>> I am trying to start my solr server with G1 GC instead of the default CMS
>>> but the solr service doesn't get up.
>>> 
>>> The command I use to start solr is
>>> 
>>> bin/solr start -p 25280 -a "-Dsolr.solr.home=
>>> -Denable.slave=true -Denable.master=false -XX:+UseG1GC
>>> -XX:MaxGCPauseMillis=500 -XX:+UnlockExperimentalVMOptions
>>> -XX:G1MaxNewSizePercent=30 -XX:G1NewSizePercent=5
>> -XX:G1HeapRegionSize=32M
>>> -XX:InitiatingHeapOccupancyPercent=70"
>>> 
>>> I have tried various permutations of the start command by dropping /
>> adding
>>> other parameters but it doesn't work. However starts up just fine with
>>> just "-Dsolr.solr.home= -Denable.slave=true
>>> -Denable.master=false" and starts up with the default CMS collector
>>> 
>>> I don't get any useful error logs too. It waits for default 180 secs and
>>> then prints
>>> 
>>> Warning: Available entropy is low. As a result, use of the UUIDField,
>> SSL,
>>> or any other features that require
>>> RNG might not work properly. To check for the amount of available
>> entropy,
>>> use 'cat /proc/sys/kernel/random/entropy_avail'.
>>> 
>>> Waiting up to 180 seconds to see Solr running on port 25280 [|]  Still
>> not
>>> seeing Solr listening on 25280 after 180 seconds!
>>> 2020-07-15 07:07:52.042 INFO  (coreCloseExecutor-60-thread-6) [
>>> x:coreName] o.a.s.c.SolrCore [coreName]  CLOSING SolrCore
>>> org.apache.solr.core.SolrCore@7cc638d8
>>> 2020-07-15 07:07:52.099 INFO  (coreCloseExecutor-60-thread-6) [
>>> x:coreName] o.a.s.m.SolrMetricManager Closing metric reporters for
>>> registry=solr.core.coreName, tag=7cc638d8
>>> 2020-07-15 07:07:52.100 INFO  (coreCloseExecutor-60-thread-6) [
>>> x:coreName] o.a.s.m.r.SolrJmxReporter Closing reporter
>>> [org.apache.solr.metrics.reporters.SolrJmxReporter@5216981f: rootName =
>>> null, domain = solr.core.coreName, service url = null, agent id = null]
>> for
>>> registry solr.core.coreName /
>> com.codahale.metrics.MetricRegistry@32988ddf
>>> 2020-07-15 07:07:52.173 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.SolrMetricManager Closing metric reporters for
>> registry=solr.node,
>>> tag=null
>>> 2020-07-15 07:07:52.173 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.r.SolrJmxReporter Closing reporter
>>> [org.apache.solr.metrics.reporters.SolrJmxReporter@28952dea: rootName =
>>> null, domain = solr.node, service url = null, agent id = null] for
>> registry
>>> solr.node / com.codahale.metrics.MetricRegistry@655f4a3f
>>> 2020-07-15 07:07:52.175 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.jvm,
>>> tag=null
>>> 2020-07-15 07:07:52.175 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.r.SolrJmxReporter Closing reporter
>>> [org.apache.solr.metrics.reporters.SolrJmxReporter@69c6161d: rootName =
>>> null, domain = solr.jvm, service url = null, agent id = null] for
>> registry
>>> solr.jvm / com.codahale.metrics.MetricRegistry@1252ce77
>>> 2020-07-15 07:07:52.176 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.SolrMetricManager Closing metric reporters for
>> registry=solr.jetty,
>>> tag=null
>>> 2020-07-15 07:07:52.176 INFO  (ShutdownMonitor) [   ]
>>> o.a.s.m.r.SolrJmxReporter Closing reporter
>>> [org.apache.solr.metrics.reporters.SolrJmxReporter@3aefae67: rootName =
>>> null, domain = solr.jetty, service url = null, agent id = null] for
>>> registry solr.jetty / com.codahale.metrics.MetricRegistry@3a538ecd
>> 
>> 



Re: Solr fails to start with G1 GC

2020-07-15 Thread Walter Underwood
I don’t see a heap size specified, so it is probably trying to run with
a 512 Megabyte heap. That might just not work with the 32M region
size.

Here are the options we have been using for 3+ years on about 150 hosts.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 15, 2020, at 4:24 AM, krishan goyal  wrote:
> 
> Hi,
> 
> I am using Solr 7.7
> 
> I am trying to start my solr server with G1 GC instead of the default CMS
> but the solr service doesn't get up.
> 
> The command I use to start solr is
> 
> bin/solr start -p 25280 -a "-Dsolr.solr.home=
> -Denable.slave=true -Denable.master=false -XX:+UseG1GC
> -XX:MaxGCPauseMillis=500 -XX:+UnlockExperimentalVMOptions
> -XX:G1MaxNewSizePercent=30 -XX:G1NewSizePercent=5 -XX:G1HeapRegionSize=32M
> -XX:InitiatingHeapOccupancyPercent=70"
> 
> I have tried various permutations of the start command by dropping / adding
> other parameters but it doesn't work. However starts up just fine with
> just "-Dsolr.solr.home= -Denable.slave=true
> -Denable.master=false" and starts up with the default CMS collector
> 
> I don't get any useful error logs too. It waits for default 180 secs and
> then prints
> 
> Warning: Available entropy is low. As a result, use of the UUIDField, SSL,
> or any other features that require
> RNG might not work properly. To check for the amount of available entropy,
> use 'cat /proc/sys/kernel/random/entropy_avail'.
> 
> Waiting up to 180 seconds to see Solr running on port 25280 [|]  Still not
> seeing Solr listening on 25280 after 180 seconds!
> 2020-07-15 07:07:52.042 INFO  (coreCloseExecutor-60-thread-6) [
> x:coreName] o.a.s.c.SolrCore [coreName]  CLOSING SolrCore
> org.apache.solr.core.SolrCore@7cc638d8
> 2020-07-15 07:07:52.099 INFO  (coreCloseExecutor-60-thread-6) [
> x:coreName] o.a.s.m.SolrMetricManager Closing metric reporters for
> registry=solr.core.coreName, tag=7cc638d8
> 2020-07-15 07:07:52.100 INFO  (coreCloseExecutor-60-thread-6) [
> x:coreName] o.a.s.m.r.SolrJmxReporter Closing reporter
> [org.apache.solr.metrics.reporters.SolrJmxReporter@5216981f: rootName =
> null, domain = solr.core.coreName, service url = null, agent id = null] for
> registry solr.core.coreName / com.codahale.metrics.MetricRegistry@32988ddf
> 2020-07-15 07:07:52.173 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.node,
> tag=null
> 2020-07-15 07:07:52.173 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.r.SolrJmxReporter Closing reporter
> [org.apache.solr.metrics.reporters.SolrJmxReporter@28952dea: rootName =
> null, domain = solr.node, service url = null, agent id = null] for registry
> solr.node / com.codahale.metrics.MetricRegistry@655f4a3f
> 2020-07-15 07:07:52.175 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.jvm,
> tag=null
> 2020-07-15 07:07:52.175 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.r.SolrJmxReporter Closing reporter
> [org.apache.solr.metrics.reporters.SolrJmxReporter@69c6161d: rootName =
> null, domain = solr.jvm, service url = null, agent id = null] for registry
> solr.jvm / com.codahale.metrics.MetricRegistry@1252ce77
> 2020-07-15 07:07:52.176 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.jetty,
> tag=null
> 2020-07-15 07:07:52.176 INFO  (ShutdownMonitor) [   ]
> o.a.s.m.r.SolrJmxReporter Closing reporter
> [org.apache.solr.metrics.reporters.SolrJmxReporter@3aefae67: rootName =
> null, domain = solr.jetty, service url = null, agent id = null] for
> registry solr.jetty / com.codahale.metrics.MetricRegistry@3a538ecd



Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-10 Thread Walter Underwood
Sorting and faceting take a lot of memory. From your charts, I would try
a 31 GB heap. That would make GC faster. 680 ms is very long for a GC
and can cause problems.

Combine a 680 ms GC with a 100 ms soft commit time and you can have
lots of trouble.

Change your soft commit time to 10000 (ten seconds) or longer.
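
For reference, against the solr.in.cmd override used in this setup, that is a
one-line change (ten seconds shown here only as the example value; pick whatever
the real freshness requirement needs):

set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=10000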

Look at a 24 hour graph of heap usage. It should look like a sawtooth,
increasing, then dropping after every full GC. The bottom of the sawtooth
is the memory that Solr actually needs. Take the highest number from
the bottom of the sawtooth and add some extra, maybe 2 GB. Try that
heap size.
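
If there is no monitoring tool handy, the JVM can write the raw data itself.
On Java 8, adding something like the following to the Solr options produces a
GC log that any GC log viewer can graph (flag set and file name illustrative):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:solr_gc.log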

Upgrade to 6.6.2. That includes all bug fixes for the 6.x release. The 6.x 
release had several bad bugs, especially in the middle releases. We were
switching prod to Solr Cloud while those were being released and it was
not fun.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 10, 2020, at 4:59 AM, vishal patel  
> wrote:
> 
> Thanks for quick reply.
> 
> I assume caches (are they too large?), perhaps uninverted indexes.
> Docvalues would help with latter ones. Do you use them?
>>> We do not use any cache. we disabled the cache from solrconfig.xml
> here is my solrconfig .xml and schema.xml
> https://drive.google.com/file/d/12SHl3YGP7jT4goikBkeyB2s1NX5_C2gz/view
> https://drive.google.com/file/d/1LwA1d4OiMhQQv806tR0HbZoEjA8IyfdR/view
> 
> We used Docvalues on that field which is used for sorting or faceting.
> 
> You could also try upgrading to the latest version in 6.x series as a starter.
>>> I will surely try.
> 
> So, the node in question isn't responding quickly enough to http requests and 
> gets put into recovery. The log for the recovering node starts too late, so I 
> can't say anything about what happened before 14:42:43.943 that lead to 
> recovery.
>>> There is no error before 14:42:43.943 just search and insert requests are 
>>> there. I got that node is responding but why it is not responding? Due to 
>>> lack of memory or any other cause
> why we cannot get idea from log for reason of not responding.
> 
> Is there any monitor for Solr from where we can find the root cause?
> 
> Regards,
> Vishal Patel
> 
> 
> 
> From: Ere Maijala 
> Sent: Friday, July 10, 2020 4:27 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
> 
> vishal patel kirjoitti 10.7.2020 klo 12.45:
>> Thanks for your input.
>> 
>> Walter already said that setting soft commit max time to 100 ms is a recipe 
>> for disaster
>>>> I know that but our application is already developed and run on live 
>>>> environment since last 5 years. Actually, we want to show a data very 
>>>> quickly after the insert.
>> 
>> you have huge JVM heaps without an explanation for the reason
>>>> We gave the 55GB ram because our usage is like that large query search and 
>>>> very frequent searching and indexing.
>> Here is my memory snapshot which I have taken from GC.
> 
> Yes, I can see that a lot of memory is in use, but the question is why.
> I assume caches (are they too large?), perhaps uninverted indexes.
> Docvalues would help with latter ones. Do you use them?
> 
>> I have tried Solr upgrade from 6.1.0 to 8.5.1 but due to some issue we 
>> cannot do. I have also asked in here
>> https://lucene.472066.n3.nabble.com/Sorting-in-other-collection-in-Solr-8-5-1-td4459506.html#a4459562
> 
> You could also try upgrading to the latest version in 6.x series as a
> starter.
> 
>> Why we cannot find the reason of recovery from log? like memory or CPU 
>> issue, frequent index or search, large query hit,
>> My log at the time of recovery
>> https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view
>> [https://lh5.googleusercontent.com/htOUfpihpAqncFsMlCLnSUZPu1_9DRKGNajaXV1jG44fpFzgx51ecNtUK58m5lk=w1200-h630-p]<https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view>
>> recovery_shard.txt<https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view>
>> drive.google.com
> 
> Isn't it right there on the first lines?
> 
> 2020-07-09 14:42:43.943 ERROR
> (updateExecutor-2-thread-21007-processing-http:11.200.212.305:8983//solr//products
> x:products r:core_node1 n:11.200.212.306:8983_solr s:shard1 c:products)
> [c:products s:shard1 r:core_node1 x:products]
> o.a.s.u.StreamingSolrClients error
> org.apache.http.NoHttpResponseException: 11.200.212.305:8983 failed to
> respond
> 
> followed by a couple more error messages about the same problem and then
> initiation of recovery:
> 
> 2020-07-09 14:42

Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-09 Thread Walter Underwood
Those are extremely large JVMs. Unless you have proven that you MUST
have 55 GB of heap, use a smaller heap.

I’ve been running Solr for a dozen years and I’ve never needed a heap
larger than 8 GB.

Also, there is usually no need to use one JVM per replica.

Your configuration is using 110 GB (two JVMs) just for Java
where I would configure it with a single 8 GB JVM. That would
free up 100 GB for file caches.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 8, 2020, at 10:10 PM, vishal patel  
> wrote:
> 
> Thanks for reply.
> 
> what you mean by "Shard1 Allocated memory”
>>> It means JVM memory of one solr node or instance.
> 
> How many Solr JVMs are you running?
>>> In one server 2 solr JVMs in which one is shard and other is replica.
> 
> What is the heap size for your JVMs?
>>> 55GB of one Solr JVM.
> 
> Regards,
> Vishal Patel
> 
> Sent from Outlook<http://aka.ms/weboutlook>
> 
> From: Walter Underwood 
> Sent: Wednesday, July 8, 2020 8:45 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
> 
> I don’t understand what you mean by "Shard1 Allocated memory”. I don’t know of
> any way to dedicate system RAM to an application object like a replica.
> 
> How many Solr JVMs are you running?
> 
> What is the heap size for your JVMs?
> 
> Setting soft commit max time to 100 ms does not magically make Solr super 
> fast.
> It makes Solr do too much work, makes the work queues fill up, and makes it 
> fail.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jul 7, 2020, at 10:55 PM, vishal patel  
>> wrote:
>> 
>> Thanks for your reply.
>> 
>> One server has total 320GB ram. In this 2 solr node one is shard1 and second 
>> is shard2 replica. Each solr node have 55GB memory allocated. shard1 has 
>> 585GB data and shard2 replica has 492GB data. means almost 1TB data in this 
>> server. server has also other applications and for that 60GB memory 
>> allocated. So total 150GB memory is left.
>> 
>> Proper formatting details:
>> https://drive.google.com/file/d/1K9JyvJ50Vele9pPJCiMwm25wV4A6x4eD/view
>> 
>> Are you running multiple huge JVMs?
>>>> Not huge but 60GB memory allocated for our 11 application. 150GB memory 
>>>> are still free.
>> 
>> The servers will be doing a LOT of disk IO, so look at the read and write 
>> iops. I expect that the solr processes are blocked on disk reads almost all 
>> the time.
>>>> is it chance to go in recovery mode if more IO read and write or blocked?
>> 
>> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
>>>> Our requirement is NRT so we keep the less time
>> 
>> Regards,
>> Vishal Patel
>> 
>> From: Walter Underwood 
>> Sent: Tuesday, July 7, 2020 8:15 PM
>> To: solr-user@lucene.apache.org 
>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>> 
>> This isn’t a support list, so nobody looks at issues. We do try to help.
>> 
>> It looks like you have 1 TB of index on a system with 320 GB of RAM.
>> I don’t know what "Shard1 Allocated memory” is, but maybe half of
>> that RAM is used by JVMs or some other process, I guess. Are you
>> running multiple huge JVMs?
>> 
>> The servers will be doing a LOT of disk IO, so look at the read and
>> write iops. I expect that the solr processes are blocked on disk reads
>> almost all the time.
>> 
>> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
>> That is probably causing your outages.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 7, 2020, at 5:18 AM, vishal patel  
>>> wrote:
>>> 
>>> Any one is looking my issue? Please guide me.
>>> 
>>> Regards,
>>> Vishal Patel
>>> 
>>> 
>>> 
>>> From: vishal patel 
>>> Sent: Monday, July 6, 2020 7:11 PM
>>> To: solr-user@lucene.apache.org 
>>> Subject: Replica goes into recovery mode in Solr 6.1.0
>>> 
>>> I am using Solr version 6.1.0, Java 8 version and G1GC on production. We 
>>> have 2 shards and each shard has 1 replica. We have 3 collection.
>>> We do not use any cache and also disable in Solr config.xml. Search and 
>>> Update requests are coming frequent

Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-08 Thread Walter Underwood
I don’t understand what you mean by "Shard1 Allocated memory”. I don’t know of
any way to dedicate system RAM to an application object like a replica.

How many Solr JVMs are you running?

What is the heap size for your JVMs?

Setting soft commit max time to 100 ms does not magically make Solr super fast.
It makes Solr do too much work, makes the work queues fill up, and makes it 
fail.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 7, 2020, at 10:55 PM, vishal patel  
> wrote:
> 
> Thanks for your reply.
> 
> One server has total 320GB ram. In this 2 solr node one is shard1 and second 
> is shard2 replica. Each solr node have 55GB memory allocated. shard1 has 
> 585GB data and shard2 replica has 492GB data. means almost 1TB data in this 
> server. server has also other applications and for that 60GB memory 
> allocated. So total 150GB memory is left.
> 
> Proper formatting details:
> https://drive.google.com/file/d/1K9JyvJ50Vele9pPJCiMwm25wV4A6x4eD/view
> 
> Are you running multiple huge JVMs?
>>> Not huge but 60GB memory allocated for our 11 application. 150GB memory are 
>>> still free.
> 
> The servers will be doing a LOT of disk IO, so look at the read and write 
> iops. I expect that the solr processes are blocked on disk reads almost all 
> the time.
>>> is it chance to go in recovery mode if more IO read and write or blocked?
> 
> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
>>> Our requirement is NRT so we keep the less time
> 
> Regards,
> Vishal Patel
> 
> From: Walter Underwood 
> Sent: Tuesday, July 7, 2020 8:15 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
> 
> This isn’t a support list, so nobody looks at issues. We do try to help.
> 
> It looks like you have 1 TB of index on a system with 320 GB of RAM.
> I don’t know what "Shard1 Allocated memory” is, but maybe half of
> that RAM is used by JVMs or some other process, I guess. Are you
> running multiple huge JVMs?
> 
> The servers will be doing a LOT of disk IO, so look at the read and
> write iops. I expect that the solr processes are blocked on disk reads
> almost all the time.
> 
> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
> That is probably causing your outages.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jul 7, 2020, at 5:18 AM, vishal patel  
>> wrote:
>> 
>> Any one is looking my issue? Please guide me.
>> 
>> Regards,
>> Vishal Patel
>> 
>> 
>> 
>> From: vishal patel 
>> Sent: Monday, July 6, 2020 7:11 PM
>> To: solr-user@lucene.apache.org 
>> Subject: Replica goes into recovery mode in Solr 6.1.0
>> 
>> I am using Solr version 6.1.0, Java 8 version and G1GC on production. We 
>> have 2 shards and each shard has 1 replica. We have 3 collection.
>> We do not use any cache and also disable in Solr config.xml. Search and 
>> Update requests are coming frequently in our live platform.
>> 
>> *Our commit configuration in solr.config are below
>> 
>> 60
>>  2
>>  false
>> 
>> 
>>  ${solr.autoSoftCommit.maxTime:-1}
>> 
>> 
>> *We used Near Real Time Searching So we did below configuration in 
>> solr.in.cmd
>> set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100
>> 
>> *Our collections details are below:
>> 
>> Collection    Shard1                 Shard1 Replica         Shard2                 Shard2 Replica
>>               Documents    Size(GB)  Documents    Size(GB)  Documents    Size(GB)  Documents    Size(GB)
>> collection1   26913364     201       26913379     202       26913380     198       26913379     198
>> collection2   13934360     310       13934367     310       13934368     219       13934367     219
>> collection3   351539689    73.5      351540040    73.5      351540136    75.2      351539722    75.2
>> 
>> *My server configurations are below:
>> 
>>   Server1 Server2
>> CPU Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 2301 Mhz, 10 Core(s), 20 
>> Logical Processor(s)Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 2301 
>> Mhz, 10 Core(s), 20 Logical Processor(s)
>> HardDisk(GB)3845 ( 3.84 TB) 3485 GB (3.48 TB)
>> Total memory(GB)320 320
>> Shard1 Allocated memory(GB) 55
>> Shard2 Replica Allocated memory(GB) 55
>> Shard2 Allocated memory(GB) 55
>> Shard1 Replica Allocated memory(GB) 55
>> Other Applications Allocated Memory(GB) 60  22
>> Other Number Of Applications11  7
>> 
>> 
>> Sometimes, any one replica goes into recovery mode. Why replica goes into 
>> recovery? Due to heavy search OR heavy update/insert OR long GC pause time? 
>> If any one of them then what should we do in configuration?
>> Should we increase the shard for recovery issue?
>> 
>> Regards,
>> Vishal Patel
>> 
> 



Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-07 Thread Walter Underwood
This isn’t a support list, so nobody looks at issues. We do try to help.

It looks like you have 1 TB of index on a system with 320 GB of RAM.
I don’t know what "Shard1 Allocated memory” is, but maybe half of
that RAM is used by JVMs or some other process, I guess. Are you
running multiple huge JVMs?

The servers will be doing a LOT of disk IO, so look at the read and
write iops. I expect that the solr processes are blocked on disk reads
almost all the time. 

"-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms). 
That is probably causing your outages.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 7, 2020, at 5:18 AM, vishal patel  
> wrote:
> 
> Any one is looking my issue? Please guide me.
> 
> Regards,
> Vishal Patel
> 
> 
> 
> From: vishal patel 
> Sent: Monday, July 6, 2020 7:11 PM
> To: solr-user@lucene.apache.org 
> Subject: Replica goes into recovery mode in Solr 6.1.0
> 
> I am using Solr version 6.1.0, Java 8 version and G1GC on production. We have 
> 2 shards and each shard has 1 replica. We have 3 collection.
> We do not use any cache and also disable in Solr config.xml. Search and 
> Update requests are coming frequently in our live platform.
> 
> *Our commit configuration in solr.config are below
> 
> 60
>   2
>   false
> 
> 
>   ${solr.autoSoftCommit.maxTime:-1}
> 
> 
> *We used Near Real Time Searching So we did below configuration in solr.in.cmd
> set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100
> 
> *Our collections details are below:
> 
> Collection    Shard1                 Shard1 Replica         Shard2                 Shard2 Replica
>               Documents    Size(GB)  Documents    Size(GB)  Documents    Size(GB)  Documents    Size(GB)
> collection1   26913364     201       26913379     202       26913380     198       26913379     198
> collection2   13934360     310       13934367     310       13934368     219       13934367     219
> collection3   351539689    73.5      351540040    73.5      351540136    75.2      351539722    75.2
> 
> *My server configurations are below:
> 
>Server1 Server2
> CPU Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 2301 Mhz, 10 Core(s), 20 
> Logical Processor(s)Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 2301 
> Mhz, 10 Core(s), 20 Logical Processor(s)
> HardDisk(GB)3845 ( 3.84 TB) 3485 GB (3.48 TB)
> Total memory(GB)320 320
> Shard1 Allocated memory(GB) 55
> Shard2 Replica Allocated memory(GB) 55
> Shard2 Allocated memory(GB) 55
> Shard1 Replica Allocated memory(GB) 55
> Other Applications Allocated Memory(GB) 60  22
> Other Number Of Applications11  7
> 
> 
> Sometimes, any one replica goes into recovery mode. Why replica goes into 
> recovery? Due to heavy search OR heavy update/insert OR long GC pause time? 
> If any one of them then what should we do in configuration?
> Should we increase the shard for recovery issue?
> 
> Regards,
> Vishal Patel
> 



Re: Max number of documents in update request

2020-07-07 Thread Walter Underwood
Agreed, I do something between 20 and 1000. If the master node is not 
handling any search traffic, use twice as many client threads as there are
CPUs in the node. That should get you close to 100% CPU utilization.
One thread will be waiting while a batch is being processed and another
thread will be sending the next batch so there is no pause in processing.
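
A rough sketch of that pattern in Python (not from any particular codebase; the
URL, collection name, field names, batch size, and thread count are all
placeholders to be tuned for the actual setup):

import requests
from concurrent.futures import ThreadPoolExecutor

SOLR_UPDATE = "http://localhost:8983/solr/products/update"   # placeholder core
BATCH_SIZE = 1000
THREADS = 8   # roughly 2x the CPUs on the indexing master

def send_batch(batch):
    # JSON update format: a list of documents posted to /update
    r = requests.post(SOLR_UPDATE, json=batch, timeout=120)
    r.raise_for_status()

def index(docs):
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        # one batch in flight per thread; exceptions surface when the map is drained
        list(pool.map(send_batch, batches))
    # a single explicit commit at the end instead of committing per batch
    requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()

if __name__ == "__main__":
    index([{"id": str(i), "title_t": "doc %d" % i} for i in range(10000)])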

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 7, 2020, at 6:12 AM, Erick Erickson  wrote:
> 
> As many as you can send before blowing up.
> 
> Really, the question is not answerable. 1K docs? 1G docs? 1 field or 500?
> 
> And I don’t think it’s a good use of time to pursue much. See:
> 
> https://lucidworks.com/post/really-batch-updates-solr-2/
> 
> If you’re looking at trying to maximize throughput, adding
> client threads that send Solr documents is a better approach.
> 
> All that said, I usually just pick 1,000 and don’t worry about it.
> 
> Best,
> Erick
> 
>> On Jul 7, 2020, at 8:59 AM, Sidharth Negi  wrote:
>> 
>> Hi,
>> 
>> Could someone help me with the best way to go about determining the maximum
>> number of docs I can send in a single update call to Solr in a master /
>> slave architecture.
>> 
>> Thanks!
> 



Re: How to use two search string in a single solr query

2020-07-02 Thread Walter Underwood
First, remove the “mm” parameter from the request handler definition. That can 
be added back in and tweaked later, or just left out.

Second, you don’t need any query syntax to search for two words. This query 
should work fine:

  books bags
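
For example, against the same handler as in the question (the parameter names
below are a guess reconstructed from the quoted URLs, shown only to be concrete):

http://localhost:8984/solr/core_name/select?q=books%20bags&qt=handler1&rows=10&wt=json

With mm removed and the default OR behavior, documents matching either word are
returned, and documents matching both are ranked higher.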

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 1, 2020, at 10:22 PM, Tushar Arora  wrote:
> 
> Hi,
> I have a scenario with following entry in the request handler(handler1) of
> solrconfig.xml.(defType=edismax is used)
> description category  "qf">title^4 demand^0.3
> 2-1 4-30%
> 
> When I searched 'bags' as a search string, solr returned 15000 results.
> Query Used :
> http://localhost:8984/solr/core_name/select?fl=title=on=bags=handler1=10=json
> 
> And when searched 'books' as a search string, solr returns say 3348 results.
> Query Used :
> http://localhost:8984/solr/core_name/select?fl=title=on=books=handler1=10=json
> 
> I want to use both 'bags' and 'books' as a search string in a single query.
> I used the following query:
> http://localhost:8984/solr/core_name/select?fl=title=on=%22bags%22+OR+%22books%22=handler1=10=json
> But OR operator not working. It will only give 7 results.
> 
> 
> I even tried this :
> http://localhost:8984/solr/core_name/select?fl=title=on=(bags)+OR+(books)=handler1=10=json
> But it also gives 7 results.
> 
> But my concern is to include the result of both 'bags' OR 'books' in a
> single query.
> Is there any way to use two search strings in a single query?



Re: Query in quotes cannot find results

2020-06-30 Thread Walter Underwood
This is exactly why the “mm” (minimum match) parameter exists, to reduce the 
number of hits with fewer matches. Think of it as a sliding scale between OR 
and AND.

On the other hand, I don’t usually worry about hits with fewer matches. Those 
are not on the first page, so I don’t care.

In general, you can either optimize more related hits or optimize fewer 
unrelated hits. Everything you do to reduce the unrelated hits will cause some 
related hits to not match. 
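
To make that concrete, a value like

  mm=2<-1 5<-2 6<90%

(numbers purely illustrative) requires every term for one- or two-term queries,
lets three- to five-term queries drop one term and six-term queries drop two,
and requires anything longer to match at least 90% of the terms.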

Also, do all of your tuning with real user queries from logs. Making up queries 
for testing will lead to fixing problems that never occur in production and to 
missing problems that do occur.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 30, 2020, at 11:07 AM, Permakoff, Vadim  
> wrote:
> 
> Hi Erick,
> Thank you for the suggestion, I should of add it. Actually before asking this 
> question here, I tried to add and remove the FlattenGraphFilterFactory, plus 
> other variations, like expand / not expand, autoGeneratePhraseQueries / not 
> autoGeneratePhraseQueries - it just does not work with this particular 
> example. You can try it yourself.
> 
> Regarding removing the stopwords, I agree, there are many cases when you 
> don't want to remove the stopwords, but there is one very compelling case 
> when you want them to be removed.
> 
> Imagine, you have one document with the following text: 
> 1. "to expand the methods for mailing cancellation" 
> And another document with the text: 
> 2. "to expand methods for mailing cancellation"
> 
> The user query is (without quotes): q=expand the methods for mailing 
> cancellation
> I don't want to bring all the documents with condition q.op=OR, it will find 
> too many unrelated documents, so I want to search with q.op=AND. 
> Unfortunately, the document 2 will not be found as it has no stop word "the" 
> in it.
> What should I do now?
> 
> Best Regards,
> Vadim Permakoff
> 
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Tuesday, June 30, 2020 12:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> Well, the first thing is that you haven’t include FlattenGraphFilterFactory 
> in the index analysis chain, see: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_7-5F5_filter-2Ddescriptions.html-23synonym-2Dgraph-2Dfilter=DwIFaQ=birp9sjcGzT9DCP3EIAtLA=T7Y0P9fY-fUzzabuVL6cMrBieBBqDIpnUbUy8vL_a1g=v9L0OP7Vty3QDsAE5HHzmT17u-0nP9KxGEYASOsZDRc=LALOI9o1-14JCwd0WYWGCPwTSfWMg0K23bAk3wDp-g4=
>  . IDK whether that actually pertains, but I’d reindex with that included 
> before pursuing.
> 
> Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s 
> necessary? Is there any evidence for this or any use-case that shows it _is_ 
> necessary? Removing stopwords became common in the long-ago days when memory 
> and disk capacity were vastly more constrained than now. At this point, I 
> require proof that it’s _necessary_ to remove them before accepting this kind 
> of requirement.
> 
> There are situations where removing stopwords is worth the difficulty it 
> causes. But I’ve seen far too many unnecessary requirements to let that one 
> pass without pushing back ;).
> 
> And you can hack around this by adding slop to the phrase, perhaps you can 
> get “good enough” results by adding one slop for every stopword, i.e. if the 
> input is “expand the methods”, detect that there’s one stopword and change it 
> to “expand the methods”~1. That’ll introduce other problems of course.
> 
> Best,
> Erick
> 
>> On Jun 30, 2020, at 11:56 AM, Permakoff, Vadim  
>> wrote:
>> 
>> Hi Erik,
>> That's what I did in the past, but this is an enterprise search and I have a 
>> requirement to remove the stopwords.
>> To have both features I can add synonyms in the front-end application, I 
>> know it will work, but I need a justification why I have to do it in the 
>> application as it is an additional effort.
>> I thought there is a bug for such case to which I can refer, because 
>> according to documentation it should work, right?
>> Anyway, there is more to it. If I'll add the same synonym processing to the 
>> indexing part, i.e. the configuration will be like this:
>> 
>>   > positionIncrementGap="100" autoGeneratePhraseQueries="true">
>> 
>>   
>>   > ignoreCase="true"/>
>>   > words="stopwords.txt"/>
>>   
>> 
>> 
>>   
>>   > ignoreCase="true" expand="true"/>
>>   > words=&quo

Re: Query in quotes cannot find results

2020-06-30 Thread Walter Underwood
Removing stopwords is a dumb requirement. “Doctor, it hurts when I shove 
hedgehogs up my arse.”

Part of our job as search engineers is to solve the real problem, not implement 
a pile of requirements from people who don’t understand how search works.

Here is an article I wrote 13 years ago about why we didn’t remove stopwords at 
Netflix.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 30, 2020, at 8:56 AM, Permakoff, Vadim  
> wrote:
> 
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a 
> requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application, I know 
> it will work, but I need a justification why I have to do it in the 
> application as it is an additional effort.
> I thought there is a bug for such case to which I can refer, because 
> according to documentation it should work, right?
> Anyway, there is more to it. If I'll add the same synonym processing to the 
> indexing part, i.e. the configuration will be like this:
> 
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>  
>
> ignoreCase="true"/>
> words="stopwords.txt"/>
>
>  
>  
>
> ignoreCase="true" expand="true"/>
> words="stopwords.txt"/>
>
>  
>
> 
> The analysis shows the parsing is matching now for indexing and querying 
> path, but the exact match result still cannot be found! This is weird.
> Any thoughts?
> 
> Best Regards,
> Vadim Permakoff
> 
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> Looks like you’re removing stopwords. Stopwords cause issues like this with 
> the positions being off.
> 
> It’s becoming more and more common to _NOT_ remove stopwords, is that an 
> option?
> 
> 
> 
> Best,
> Erick
> 
>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim  
>> wrote:
>> 
>> Hi Shawn,
>> Many thanks for the response, I checked the field and it is correct. Let's 
>> call it _text_ to make it easier.
>> I believe the parsing is also correct, please see below:
>> - Query without quotes (works):
>>   "querystring":"expand the methods",
>>   "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) 
>> _text_:methods",
>> 
>> - Query with quotes (does not work):
>>   "querystring":"\"expand the methods\"",
>>   "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, 
>> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>> 
>> The document has text:
>> "to expand the methods for mailing cancellation"
>> 
>> The analysis on this field shows that all words are present in the index and 
>> the query, the order is also correct, but the word "methods" in moved one 
>> position, I guess that's why the result is not found.
>> 
>> Best Regards,
>> Vadim Permakoff
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Monday, June 29, 2020 6:28 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>> 
>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>> The basic query q=expand the methods   <<< finds the document,
>>> the query (in quotes) q="expand the methods"   <<< cannot find the document
>>> 
>>> Am I doing something wrong, or is it known bug (I saw similar issues 
>>> discussed in the past, but not for exact match query) and if yes - what is 
>>> the Jira for it?
>> 
>> The most helpful information will come from running both queries with debug 
>> enabled, so you can see how the query is parsed.  If you add a parameter 
>> "debugQuery=true" to the URL, then the response should include the parsed 
>> query.  Compare those, and see if you can tell what the differences are.
>> 
>> One of the most common problems for queries like this is that you're not 
>> searching the field that you THINK you're searching.  I don't know whether 
>> this is the problem, I just mention it because it is a common error.
>> 
>> Thanks,
>> Shawn
>> 
>> 
>> 
>> This email is intended solely for the recipient. It may contain privileged, 
>> proprietary or confidential information or material. If you are not the 
>> intended recipient, please delete this email and any attachments and notify 
>> the sender of the error.
> 



Re: Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Walter Underwood
If you don’t want to buy disk space for deleted docs, you should not be 
using Solr. That is an essential part of a reliable Solr installation.

To avoid reindexing unchanged documents, use a bookkeeping RDBMS
table. In that table, put the document ID and the most recent successful
update to Solr. You can check if the fields are the same with a checksum
of the data. MD5 is fine for that. Check that database before sending the
document and update it after new documents are indexed.

You may also want to record deletes in the database.
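
A minimal sketch of that bookkeeping, using SQLite only to keep it
self-contained (the LASTUPDATETIME field name comes from the question below;
the table layout is a placeholder for whatever the real pipeline uses):

import hashlib
import json
import sqlite3

conn = sqlite3.connect("indexed_docs.db")
conn.execute("CREATE TABLE IF NOT EXISTS indexed (id TEXT PRIMARY KEY, checksum TEXT, last_sent TEXT)")

def checksum(doc):
    # hash everything except the field that changes on every update
    stable = {k: v for k, v in doc.items() if k != "LASTUPDATETIME"}
    return hashlib.md5(json.dumps(stable, sort_keys=True).encode("utf-8")).hexdigest()

def needs_reindex(doc):
    row = conn.execute("SELECT checksum FROM indexed WHERE id = ?", (doc["id"],)).fetchone()
    return row is None or row[0] != checksum(doc)

def record_indexed(doc, when):
    conn.execute("INSERT OR REPLACE INTO indexed (id, checksum, last_sent) VALUES (?, ?, ?)",
                 (doc["id"], checksum(doc), when))
    conn.commit()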

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 26, 2020, at 1:12 AM, Anshuman Singh  wrote:
> 
> I was reading about in-place updates
> https://lucene.apache.org/solr/guide/7_4/updating-parts-of-documents.html,
> In my use case I have to update the field "LASTUPDATETIME", all other
> fields are same. Updates are very frequent and I can't bear the cost of
> deleted docs.
> 
> If I provide all the fields, it deletes the document and re-index it. But
> if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored,
> docValue field), it does an in-place update without deletion. But the
> problem is I don't know if the document is present or I'm indexing it the
> first time.
> 
> Is there a way to prevent re-indexing if other fields are the same?
> 
> *P.S. I'm looking for a solution that doesn't require looking up if doc is
> present in the Collection or not.*



Re: Retrieve disk usage & release disk space after delete

2020-06-23 Thread Walter Underwood
We get disk usage on volumes using Telegraf.

I’m planning on writing something that gathers size info (docs and bytes) 
by getting core info from the CLUSTERSTATUS request then using the
CoreAdmin API to get the detailed info about cores. It doesn’t look hard,
just complicated. Fire up Python and start walking JSON data.
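
A rough sketch of that walk (untested; assumes no authentication and that the
base_url values reported by CLUSTERSTATUS are reachable from wherever it runs):

import requests

SOLR = "http://localhost:8983/solr"   # any node in the cluster

def collection_sizes():
    cluster = requests.get(SOLR + "/admin/collections",
                           params={"action": "CLUSTERSTATUS", "wt": "json"}).json()
    totals = {}
    for coll, cdata in cluster["cluster"]["collections"].items():
        docs = size = 0
        for shard in cdata["shards"].values():
            for replica in shard["replicas"].values():
                status = requests.get(replica["base_url"] + "/admin/cores",
                                      params={"action": "STATUS", "core": replica["core"], "wt": "json"}).json()
                index = status["status"][replica["core"]]["index"]
                docs += index["numDocs"]        # sums every replica, i.e. total footprint
                size += index["sizeInBytes"]
        totals[coll] = (docs, size)
    return totals

if __name__ == "__main__":
    for coll, (docs, size) in collection_sizes().items():
        print(coll, docs, "docs", size, "bytes")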

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 23, 2020, at 4:27 AM, Erick Erickson  wrote:
> 
> Q1: If you’re talking about disk space used up by deleted documents,
> then yes, optimize or expungeDeletes will recover it. The former
>will recover it all, the latter will rewrite segments with > 10% deleted
>   documents. HOWEVER: optimize is an expensive operation, and
>can have deleterious side-effects, especially before Solr 7.5, see:
>   
> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
>   and
>   https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> 
>   NOTE: if you just ignore it, the deleted data will be merged away as
>   part of normal indexing so you may have to do nothing.
> 
> Q2: The data if you delete the collections should be removed from 
>   disk, assuming you’re  talking about using the Collections API, 
>   DELETE command. Optimize won’t help because the collection is gone.
>   If you delete the collection and the data dirs are still hanging around,
>   you should look at your logs to see if there’s any information.
> 
> Best,
> Erick
> 
>> On Jun 22, 2020, at 9:04 PM, ChienHuaWang  wrote:
>> 
>> Hi Solr users,
>> 
>> Q1: Wondering if there is any way to retrieve disk usage by host? Could we
>> get thru metrics API or any other methods? I know the data shows in Solr
>> Admin UI, but have other approach for this kind of data.
>> 
>> Q2: 
>> After delete the collections, it seems not physically removed from the disk.
>> Did the research, someone suggest to run an optimize which re-writes the
>> index out to disk without the deleted documents, then deletes the original. 
>> Is there any other way to do clean up without re-writes the index? have to
>> manually clean up now, and look for better approach
>> 
>> Appreciate your feedback.
>> 
>> 
>> Regards,
>> Chien
>> 
>> 
>> 
>> 
>> 
>> --
>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: Deleting on exact match

2020-06-21 Thread Walter Underwood
I would add a new field with the new behavior. Then any document with
content in the new field would not need to be deleted. Find the deletable
content with:

*:* -new_field:*

I generally add a field that records when the document was indexed or
updated. That can be really handy.
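
Once the new field is populated, the cleanup itself can be a single
delete-by-query; a minimal sketch (the core name and field name here are the
placeholders used above, not the real schema):

import requests

r = requests.post("http://localhost:8983/solr/mail/update",
                  json={"delete": {"query": "*:* -new_field:*"}},
                  params={"commit": "true"})
r.raise_for_status()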



wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 21, 2020, at 12:32 PM, Scott Q.  wrote:
> 
> Also note that I didn't apply the new schema yet because I don't
> think it will let me change it mid-way like this without deleting all
> data and starting anew...
> 
> On Sunday, 21/06/2020 at 15:12 Scott Q. wrote:
> 
> 
> My apologies, it appears the configuration tags were escaped and
> completely removed from my original e-mail.
> 
> I am including them via pastebin.com
> 
> 
> https://pastebin.com/BSUqgEke
> 
> 
> 
> 
> On Sunday, 21/06/2020 at 15:04 Scott Q. wrote:
> 
> 
> Hello,
> 
> I use Solr with Dovecot and I made a mistake when I initially created
> my schema for my instance. I created the username field with partial
> matches enabled.
> Aka, like this:
> 
> (fieldType definition lost in the list archive; see the pastebin link above)
> 
> I already indexed millions of documents using this schema before I
> fixed it and changed it to
> 
> (revised field definition lost in the list archive; see the pastebin link above)
> 
> The task at hand is to remove all documents indexed the old way, but
> how can I do that ? user is of the form u...@domain.com and if I
> search for u...@domain.com it matches all of 'user' or 'domain.com'
> which has obvious unwanted consequences.
> 
> Therefore, how can I remove older documents which were indexed with
> partial match ? 



Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-19 Thread Walter Underwood
> On Jun 19, 2020, at 7:48 AM, Phill Campbell  
> wrote:
> 
> Delegator - Handler
> 
> A common pattern we are all aware of. Pretty simple.

The Solr master does not delegate and the slave does not handle.
The master is a server that handles replication requests from the
slave.

Delegator/handler is a common pattern, but it is not the pattern
that describes traditional Solr replication.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Getting rid of Overseer nomenclature in Solr

2020-06-19 Thread Walter Underwood
I just split this off with a different subject line for the “overseer” 
discussion.
That seems independent of the other choices.

I’ve heard these suggestions:

* orchestrator
* director
* coordinator
* cluster manager
* manager

There is a thing called “process orchestration” which is at a higher level than
what the overseer does. It might be something like all the customer interactions
in a billing process. That usage might be confusing for the term “orchestrator”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 18, 2020, at 10:44 PM, Thomas Corthals  wrote:
> 
> Since "overseer" is also problematic, I'd like to propose "orchestrator" as
> an alternative.
> 
> Thomas



Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Walter Underwood
I sometimes describe them as tightly-coupled and loosely-coupled. There is a 
vast
difference in the amount of shared state in the two kinds of clusters. Old 
school
clusters are essentially a REST system. The primary server knows nothing 
about the leeches. The replication only assumes that the trusted repository
is a later generation of the same index.

Solr Cloud has massive amounts of shared state. Just last week, we couldn’t
delete some replicas in two of the shards. I finally found a long queue at the
overseer and started rebooting server processes in those two shards. It fixed 
whatever state was broken, but it was pretty much turning it off and on again.
That kind of thing just cannot happen with master/slave.

Not sure about “manual”. We do a lot more manual management of our Solr
Cloud clusters. Scaling out the master/slave cluster is stupid simple. Bring
up a clone of a current slave, add it to the load balancer, and walk away.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 18, 2020, at 7:40 PM, Trey Grainger  wrote:
> 
>> 
>> Let’s instead find a new good name for the cluster type. Standalone kind
>> of works
>> for me, but I see it can be confused with single-node.
> 
> Yeah, I've typically referred to it as "standalone", but I don't think it's
> descriptive enough. I can see why some people have been calling it
> "master/slave" mode in lieu of a more descriptive alternative. I think a
> new name (other than "standalone" or "legacy") would be superb.
> 
> We have also discussed replacing SolrCloud (which is a terrible name) with
>> something more descriptive.
> 
> Today: SolrCloud vs Master/slave
>> Alt A: SolrCloud vs Standalone
>> Alt B: SolrCloud vs Legacy
>> Alt C: Clustered vs Independent
>> Alt D: Clustered vs Manual mode
> 
> 
> +1 SolrCloud is even less descriptive and IMHO just sounds silly at this
> point.
> 
> re: "Clustered" vs Independent/Manual. The thing I don't like about that is
> that you typically have clusters in both modes. I think the key distinction
> is whether Solr "manages" the cluster automatically for you or whether you
> manage it manually yourself.
> 
> What do you think about:
> Alt E: "Managed Clustering" vs. "Unmanaged Clustering" Mode
> Alt F:  "Managed Clustering" vs. "Manual Clustering" Mode
> ?
> 
> I think I prefer option F.
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> On Thu, Jun 18, 2020 at 5:59 PM Jan Høydahl  wrote:
> 
>> I support Mike Drob and Trey Grainger. We shuold re-use the leader/replica
>> terminology from Cloud. Even if you hand-configure a master/slave cluster
>> and orchestrate what doc goes to which node/shard, and hand-code your
>> shards
>> parameter, you will still have a cluster where you’d send updates to the
>> leader of
>> each shard and the replicas would replicate the index from the leader.
>> 
>> Let’s instead find a new good name for the cluster type. Standalone kind
>> of works
>> for me, but I see it can be confused with single-node. We have also
>> discussed
>> replacing SolrCloud (which is a terrible name) with something more
>> descriptive.
>> 
>> Today: SolrCloud vs Master/slave
>> Alt A: SolrCloud vs Standalone
>> Alt B: SolrCloud vs Legacy
>> Alt C: Clustered vs Independent
>> Alt D: Clustered vs Manual mode
>> 
>> Jan
>> 
>>> 18. jun. 2020 kl. 15:53 skrev Mike Drob :
>>> 
>>> I personally think that using Solr cloud terminology for this would be
>> fine
>>> with leader/follower. The leader is the one that accepts updates,
>> followers
>>> cascade the updates somehow. The presence of ZK or election doesn’t
>> really
>>> change this detail.
>>> 
>>> However, if folks feel that it’s confusing, then I can’t tell them that
>>> they’re not confused. Especially when they’re working with others who
>> have
>>> less Solr experience than we do and are less familiar with the
>> intricacies.
>>> 
>>> Primary/Replica seems acceptable. Coordinator instead of Overseer seems
>>> acceptable.
>>> 
>>> Would love to see this in 9.0!
>>> 
>>> Mike
>>> 
>>> On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
>>>  wrote:
>>> 
>>>> While on the topic of renaming roles, I'd like to propose finding a
>> better
>>>> term than "overseer" which has historical slavery connotations as well.
>>>> Director, p

Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Walter Underwood
We don’t get to decide whether “master” is a problem. The rest of the world
has already decided that it is a problem.

Our task is to replace the terms “master” and “slave” in Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 18, 2020, at 6:50 PM, Rahul Goswami  wrote:
> 
> I agree with Phill, Noble and Ilan above. The problematic term is "slave"
> (not master) which I am all for changing if it causes less regression than
> removing BOTH master and slave. Since some people have pointed out Github
> changing the "master" terminology, in my personal opinion, it was not a
> measured response to addressing the bigger problem we are all trying to
> tackle. There is no concept of a "slave" branch, and "master" by itself is
> a pretty generic term (Is someone having "mastery" over a skill a bad
> thing?). I fear all it would end up achieving in the end with Github is a
> mess of broken build scripts at best.
> So +1 on "slave" being the problematic term IMO, not "master".
> 
> On Thu, Jun 18, 2020 at 8:19 PM Phill Campbell
>  wrote:
> 
>> Master - Worker
>> Master - Peon
>> Master - Helper
>> Master - Servant
>> 
>> The term that is not wanted is “slave’. The term “master” is not a problem
>> IMO.
>> 
>>> On Jun 18, 2020, at 3:59 PM, Jan Høydahl  wrote:
>>> 
>>> I support Mike Drob and Trey Grainger. We shuold re-use the
>> leader/replica
>>> terminology from Cloud. Even if you hand-configure a master/slave cluster
>>> and orchestrate what doc goes to which node/shard, and hand-code your
>> shards
>>> parameter, you will still have a cluster where you’d send updates to the
>> leader of
>>> each shard and the replicas would replicate the index from the leader.
>>> 
>>> Let’s instead find a new good name for the cluster type. Standalone kind
>> of works
>>> for me, but I see it can be confused with single-node. We have also
>> discussed
>>> replacing SolrCloud (which is a terrible name) with something more
>> descriptive.
>>> 
>>> Today: SolrCloud vs Master/slave
>>> Alt A: SolrCloud vs Standalone
>>> Alt B: SolrCloud vs Legacy
>>> Alt C: Clustered vs Independent
>>> Alt D: Clustered vs Manual mode
>>> 
>>> Jan
>>> 
>>>> 18. jun. 2020 kl. 15:53 skrev Mike Drob :
>>>> 
>>>> I personally think that using Solr cloud terminology for this would be
>> fine
>>>> with leader/follower. The leader is the one that accepts updates,
>> followers
>>>> cascade the updates somehow. The presence of ZK or election doesn’t
>> really
>>>> change this detail.
>>>> 
>>>> However, if folks feel that it’s confusing, then I can’t tell them that
>>>> they’re not confused. Especially when they’re working with others who
>> have
>>>> less Solr experience than we do and are less familiar with the
>> intricacies.
>>>> 
>>>> Primary/Replica seems acceptable. Coordinator instead of Overseer seems
>>>> acceptable.
>>>> 
>>>> Would love to see this in 9.0!
>>>> 
>>>> Mike
>>>> 
>>>> On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
>>>>  wrote:
>>>> 
>>>>> While on the topic of renaming roles, I'd like to propose finding a
>> better
>>>>> term than "overseer" which has historical slavery connotations as well.
>>>>> Director, perhaps?
>>>>> 
>>>>> 
>>>>> John Gallagher
>>>>> 
>>>>> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski >> 
>>>>> wrote:
>>>>> 
>>>>>> +1 to rename master/slave, and +1 to choosing terminology distinct
>>>>>> from what's used for SolrCloud.  I could be happy with several of the
>>>>>> proposed options.  Since a good few have been proposed though, maybe
>>>>>> an eventual vote thread is the most organized way to aggregate the
>>>>>> opinions here.
>>>>>> 
>>>>>> I'm less positive about the prospect of changing the name of our
>>>>>> primary git branch.  Most projects that contributors might come from,
>>>>>> most tutorials out there to learn git, most tools built on top of git
>>>>>> - the majority are going to assume "master" as the main branch.  I
>>>>&

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Walter Underwood
Actually, the term “master” is a problem, so master/follower doesn’t work.

GitLab is renaming the master branch to main.

Rice University renamed College Masters to College Magisters in 2017.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 18, 2020, at 1:03 AM, Atita Arora  wrote:
> 
> +1 Noble and Ilan !!
> 
> 
> 
> On Thu, Jun 18, 2020 at 7:51 AM Noble Paul  wrote:
> 
>> Looking at the code I see a 692 occurrences of the word "slave".
>> Mostly variable names and ref guide docs.
>> 
>> The word "slave" is present in the responses as well. Any change in
>> the request param/response payload is backward incompatible.
>> 
>> I have no objection to changing the names in ref guide and other
>> internal variables. Going ahead with backward incompatible changes is
>> painful. If somebody has the appetite to take it up, it's OK
>> 
>> If we must change, master/follower can be a good enough option.
>> 
>> master (noun): A man in charge of an organization or group.
>> master(adj) : having or showing very great skill or proficiency.
>> master(verb): acquire complete knowledge or skill in (a subject,
>> technique, or art).
>> master (verb): gain control of; overcome.
>> 
>> I hope nobody has a problem with the term "master"
>> 
>> On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg  wrote:
>>> 
>>> Would master/follower work?
>>> 
>>> Half the rename work while still getting rid of the slavery
>> connotation...
>>> 
>>> 
>>> On Thu 18 Jun 2020 at 07:13, Walter Underwood 
>> wrote:
>>> 
>>>>> On Jun 17, 2020, at 4:00 PM, Shawn Heisey 
>> wrote:
>>>>> 
>>>>> It has been interesting watching this discussion play out on multiple
>>>> open source mailing lists.  On other projects, I have seen a VERY high
>>>> level of resistance to these changes, which I find disturbing and
>>>> surprising.
>>>> 
>>>> Yes, it is nice to see everyone just pitch in and do it on this list.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>> 
>> 
>> 
>> --
>> -
>> Noble Paul
>> 



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
> On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> 
> It has been interesting watching this discussion play out on multiple open 
> source mailing lists.  On other projects, I have seen a VERY high level of 
> resistance to these changes, which I find disturbing and surprising.

Yes, it is nice to see everyone just pitch in and do it on this list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
Master/slave is not going away in our company. That cluster has had zero downtime
in five years. I can’t say that about our Solr Cloud clusters.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 9:36 PM, Noble Paul  wrote:
> 
> I really do not see a reason why a master/slave terminology is a problem.
> We do not have slavery anywhere in the world. Should we also remove it from
> the dictionary?
> 
> The old mode is going to go away anyway. Why waste time bikeshedding on
> this?
> 
> On Thu, Jun 18, 2020, 12:04 PM Trey Grainger  wrote:
> 
>> @Shawn,
>> 
>> Ok, yeah, apologies, my semantics were wrong.
>> 
>> I was thinking that a TLog replica is a follower role only and becomes an
>> NRT replica if it gets elected leader. From a pure semantics standpoint,
>> though, I guess technically the TLog replica doesn't "become" an NRT
>> replica, but just "acts the same" as if it was an NRT replica when it gets
>> elected as leader. From the docs regarding TLog replicas: "This type of
>> replica maintains a transaction log but does not index document changes
>> locally... When this type of replica needs to update its index, it does so
>> by replicating the index from the leader... If it does become a leader, it
>> will behave the same as if it was a NRT type of replica."
>> 
>> The Tlog replicas are a bit of a red herring to the point I was making,
>> though, which is that Pull Replicas in SolrCloud mode and Slaves in
>> non-SolrCloud mode both just pull the index from the leader/master and as
>> opposed to updates being pushed the other way. As such, I don't see a
>> meaningful distinction between master/slave and leader/follower behavior in
>> non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
>> talking about renaming (Solr cores that pull indices from other Solr
>> cores).
>> 
>> At any rate, this is not a hill I care to die on. My belief is that it's
>> better to have consistent terminology for what I see as essentially the
>> same functionality. I respect that others disagree and would rather
>> introduce new terminology to clearly distinguish between modes. Regardless
>> of the naming decided on, I'm in support of removing the master/slave
>> nomenclature.
>> 
>> Trey Grainger
>> Founder, Searchkernel
>> https://searchkernel.com
>> 
>> On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:
>> 
>>> On 6/17/2020 2:36 PM, Trey Grainger wrote:
>>>> 2) TLOG - which can only serve in the role of follower
>>> 
>>> This is inaccurate.  TLOG can become leader.  If that happens, then it
>>> functions exactly like an NRT leader.
>>> 
>>> I'm aware that saying the following is bikeshedding ... but I do think
>>> it would be as mistake to use any existing SolrCloud terminology for
>>> non-cloud deployments, including the word "replica".  The top contenders
>>> I have seen to replace master/slave in Solr are primary/secondary and
>>> publisher/subscriber.
>>> 
>>> It has been interesting watching this discussion play out on multiple
>>> open source mailing lists.  On other projects, I have seen a VERY high
>>> level of resistance to these changes, which I find disturbing and
>>> surprising.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
Master/slave is not just two roles, but a kind of cluster. I really don’t think
“Standalone” captures the non-Cloud cluster. Nobody in Chegg would 
have any idea that “standalone” meant “no Zookeeper”.

I’ve never thought that master/slave accurately described the traditional
replication model, but I can’t remember what terms I preferred because 
that was ten years ago. A master gives commands. That isn’t how Solr
masters work. It is closer to how an NRT or TLOG leader works, actually.

A Solr master just sits there and waits for other nodes to copy the index.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 3:03 PM, Trey Grainger  wrote:
> 
> Hi Walter,
> 
>> In Solr Cloud, the leader knows about each follower and updates them.
> Respectfully, I think you're mixing the "TYPE" of replica with the role of
> the "leader" and "follower"
> 
> In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the leader
> push updates those followers.
> 
> When the TYPE of a follower is PULL, then it does not.  In Standalone mode,
> the type of a (currently) master would be NRT, and the type of the
> (currently) slaves is always PULL.
> 
> As such, this behavior is consistent across both SolrCloud and Standalone
> mode. It is true that Standalone mode does not currently have support for
> two of the replica TYPES that SolrCloud mode does, but I maintain that
> leader vs. follower behavior is inconsistent here.
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> 
> 
> On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
> wrote:
> 
>> But they are not the same. In Solr Cloud, the leader knows about each
>> follower and updates them. In standalone, the master has no idea that
>> slaves exist until a replication request arrives.
>> 
>> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
>> config load time.
>> 
>> Looking ahead in my email inbox, publisher/subscriber is an excellent
>> choice.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
>>> 
>>> I guess I don't see it as polysemous, but instead simplifying.
>>> 
>>> In my proposal, the terms "leader" and "follower" would have the exact
>> same
>>> meaning in both SolrCloud and standalone mode. The only difference would
>> be
>>> that SolrCloud automatically manages the leaders and followers, whereas
>> in
>>> standalone mode you have to manage them manually (as is the case with
>> most
>>> things in SolrCloud vs. Standalone).
>>> 
>>> My view is that having an entirely different set of terminology
>> describing
>>> the same thing is way more cognitive overhead than having consistent
>>> terminology.
>>> 
>>> Trey Grainger
>>> Founder, Searchkernel
>>> https://searchkernel.com
>>> 
>>> On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
>>> wrote:
>>> 
>>>> I strongly disagree with using the Solr Cloud leader/follower
>> terminology
>>>> for non-Cloud clusters. People in my company are confused enough without
>>>> using polysemous terminology.
>>>> 
>>>> “This node is the leader, but it means something different than the
>> leader
>>>> in this other cluster.” I’m dreading that conversation.
>>>> 
>>>> I like “principal”. How about “clone” for the slave role? That suggests
>>>> that
>>>> it does not accept updates and that it is loosely-coupled, only
>> depending
>>>> on the state of the no-longer-called-master.
>>>> 
>>>> Chegg has five production Solr Cloud clusters and one production
>>>> master/slave
>>>> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
>> in
>>>> production.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
>>>>> 
>>>>> Proposal:
>>>>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
>> one
>>>>> or more REPLICAS. Each replica can have a ROLE of either:
>>>>> 1) A LEADER, which can process external updates for the shard
>>>

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
But they are not the same. In Solr Cloud, the leader knows about each
follower and updates them. In standalone, the master has no idea that
slaves exist until a replication request arrives.

In Solr Cloud, the leader is elected. In standalone, that role is fixed at
config load time.

Looking ahead in my email inbox, publisher/subscriber is an excellent choice.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
> 
> I guess I don't see it as polysemous, but instead simplifying.
> 
> In my proposal, the terms "leader" and "follower" would have the exact same
> meaning in both SolrCloud and standalone mode. The only difference would be
> that SolrCloud automatically manages the leaders and followers, whereas in
> standalone mode you have to manage them manually (as is the case with most
> things in SolrCloud vs. Standalone).
> 
> My view is that having an entirely different set of terminology describing
> the same thing is way more cognitive overhead than having consistent
> terminology.
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
> wrote:
> 
>> I strongly disagree with using the Solr Cloud leader/follower terminology
>> for non-Cloud clusters. People in my company are confused enough without
>> using polysemous terminology.
>> 
>> “This node is the leader, but it means something different than the leader
>> in this other cluster.” I’m dreading that conversation.
>> 
>> I like “principal”. How about “clone” for the slave role? That suggests
>> that
>> it does not accept updates and that it is loosely-coupled, only depending
>> on the state of the no-longer-called-master.
>> 
>> Chegg has five production Solr Cloud clusters and one production
>> master/slave
>> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
>> production.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
>>> 
>>> Proposal:
>>> "A Solr COLLECTION is composed of one or more SHARDS, which each have one
>>> or more REPLICAS. Each replica can have a ROLE of either:
>>> 1) A LEADER, which can process external updates for the shard
>>> 2) A FOLLOWER, which receives updates from another replica"
>>> 
>>> (Note: I prefer "role" but if others think it's too overloaded due to the
>>> overseer role, we could replace it with "mode" or something similar)
>>> ---
>>> 
>>> To be explicit with the above definitions:
>>> 1) In SolrCloud, the roles of leaders and followers can dynamically
>> change
>>> based upon the status of the cluster. In standalone mode, they can be
>>> changed by manual intervention.
>>> 2) A leader does not have to have any followers (i.e. only one active
>>> replica)
>>> 3) Each shard always has one leader.
>>> 4) A follower can also pull updates from another follower instead of a
>>> leader (traditionally known as a REPEATER). A repeater is still a
>> follower,
>>> but would not be considered a leader because it can't process external
>>> updates.
>>> 5) A replica cannot be both a leader and a follower.
>>> 
>>> In addition to the above roles, each replica can have a TYPE of one of:
>>> 1) NRT - which can serve in the role of leader or follower
>>> 2) TLOG - which can only serve in the role of follower
>>> 3) PULL - which can only serve in the role of follower
>>> 
>>> A replica's type may be changed automatically in the event that its role
>>> changes.
>>> 
>>> I think this terminology is consistent with the current Leader/Follower
>>> usage while also being able to easily accomodate a rename of the
>> historical
>>> master/slave terminology without mental gymnastics or the introduction or
>>> more cognitive load through new terminology. I think adopting the
>>> Primary/Replica terminology will be incredibly confusing given the
>> already
>>> specific and well established meaning of "replica" within Solr.
>>> 
>>> All the Best,
>>> 
>>> Trey Grainger
>>> Founder, Searchkernel
>>> https://searchkernel.com
>>> 
>>> 
>>> 
>>> On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
>> w

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
I strongly disagree with using the Solr Cloud leader/follower terminology
for non-Cloud clusters. People in my company are confused enough without
using polysemous terminology.

“This node is the leader, but it means something different than the leader
in this other cluster.” I’m dreading that conversation.

I like “principal”. How about “clone” for the slave role? That suggests that
it does not accept updates and that it is loosely-coupled, only depending 
on the state of the no-longer-called-master.

Chegg has five production Solr Cloud clusters and one production master/slave
cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in 
production.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> 
> Proposal:
> "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> or more REPLICAS. Each replica can have a ROLE of either:
> 1) A LEADER, which can process external updates for the shard
> 2) A FOLLOWER, which receives updates from another replica"
> 
> (Note: I prefer "role" but if others think it's too overloaded due to the
> overseer role, we could replace it with "mode" or something similar)
> ---
> 
> To be explicit with the above definitions:
> 1) In SolrCloud, the roles of leaders and followers can dynamically change
> based upon the status of the cluster. In standalone mode, they can be
> changed by manual intervention.
> 2) A leader does not have to have any followers (i.e. only one active
> replica)
> 3) Each shard always has one leader.
> 4) A follower can also pull updates from another follower instead of a
> leader (traditionally known as a REPEATER). A repeater is still a follower,
> but would not be considered a leader because it can't process external
> updates.
> 5) A replica cannot be both a leader and a follower.
> 
> In addition to the above roles, each replica can have a TYPE of one of:
> 1) NRT - which can serve in the role of leader or follower
> 2) TLOG - which can only serve in the role of follower
> 3) PULL - which can only serve in the role of follower
> 
> A replica's type may be changed automatically in the event that its role
> changes.
> 
> I think this terminology is consistent with the current Leader/Follower
> usage while also being able to easily accomodate a rename of the historical
> master/slave terminology without mental gymnastics or the introduction or
> more cognitive load through new terminology. I think adopting the
> Primary/Replica terminology will be incredibly confusing given the already
> specific and well established meaning of "replica" within Solr.
> 
> All the Best,
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> 
> 
> On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta  wrote:
> 
>> Hi everyone,
>> 
>> Moving a conversation that was happening on the PMC list to the public
>> forum. Most of the following is just me recapping the conversation that has
>> happened so far.
>> 
>> Some members of the community have been discussing getting rid of the
>> master/slave nomenclature from Solr.
>> 
>> While this may require a non-trivial effort, a general consensus so far
>> seems to be to start this process and switch over incrementally, if a
>> single change ends up being too big.
>> 
>> There have been a lot of suggestions around what the new nomenclature might
>> look like, a few people don’t want to overlap the naming here with what
>> already exists in SolrCloud i.e. leader/follower.
>> 
>> Primary/Replica was an option that was suggested based on what other
>> vendors are moving towards based on Wikipedia:
>> https://en.wikipedia.org/wiki/Master/slave_(technology)
>> , however there were concerns around the use of “replica” as that denotes a
>> very specific concept in SolrCloud. Current terminology clearly
>> differentiates the use of the traditional replication model from SolrCloud
>> and reusing the names would make it difficult for that to happen.
>> 
>> There were similar concerns around using Leader/follower.
>> 
>> Let’s continue this conversation here while making sure that we converge
>> without much bike-shedding.
>> 
>> -Anshum
>> 



Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Walter Underwood
From that short description, you should not be running optimize at all.

Just stop doing it. It doesn’t make that big a difference.

It may take your indexes a few weeks to get back to a normal state after the 
forced merges.
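
If you want to watch that happen, the segments handler shows the size and
deleted-doc count for each segment. A rough sketch in Python, with the core name
made up:

import requests

url = "http://localhost:8983/solr/mycore/admin/segments"   # hypothetical core name
segments = requests.get(url, params={"wt": "json"}).json()["segments"]
for name, info in segments.items():
    # size = docs in the segment, delCount = deleted docs waiting to be merged away
    print(name, info["size"], info.get("delCount", 0), info["sizeInBytes"])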

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 4:12 AM, Raveendra Yerraguntla 
>  wrote:
> 
> Thank you David, Walt, Eric.
> 1. The first time the bloated index was generated, there was no disk space issue. One 
> copy of the index is 1/6 of disk capacity. We only ran into disk capacity problems after 
> more than 2 bloated copies accumulated.
> 2. Solr was upgraded from 5.*. In 5.*, more than 5 segments was causing a performance 
> issue. Performance in 7.* has not been measured for increasing segment counts. I will 
> plan a performance test to get the optimum number. The application does incremental 
> indexing multiple times in a work week.
> I will keep you updated on the resolution.
> Thanks again 
> On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
>  wrote:  
> 
> It Depends (tm).
> 
> As of Solr 7.5, optimize is different. See: 
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> 
> So, assuming you have _not_ specified maxSegments=1, any very large
> segment (near 5G) that has _zero_ deleted documents won’t be merged.
> 
> So there are two scenarios:
> 
> 1> What Walter mentioned. The optimize process runs out of disk space
> and leaves lots of crud around
> 
> 2> your “older segments” are just max-sized segments with zero deletions.
> 
> 
> All that said… do you have demonstrable performance improvements after
> optimizing? The entire name “optimize” is misleading, of course who
> wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
> it made quite a difference. In more recent Solr releases, it’s not as clear
> cut. So before worrying about making optimize work, I’d recommend that
> you do some performance tests on optimized and un-optimized indexes. 
> If there are significant improvements, that’s one thing. Otherwise, it’s
> a waste.
> 
> Best,
> Erick
> 
>> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
>> 
>> For a full forced merge (mistakenly named “optimize”), the worst case disk 
>> space
>> is 3X the size of the index. It is common to need 2X the size of the index.
>> 
>> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
>> behavior.
>> I implemented a disk space check that would refuse to merge if there wasn’t 
>> enough
>> free space. It would log an error and send an email to the admin.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>>> wrote:
>>> 
>>> I cant give you a 100% true answer but ive experienced this, and what
>>> "seemed" to happen to me was that the optimize would start, and that will
>>> drive the size up by 3 fold, and if you run out of disk space in the process
>>> the optimize will quit since it cant optimize, and leave the live index
>>> pieces intact, so now you have the "current" index as well as the
>>> "optimized" fragments
>>> 
>>> i cant say for certain thats what you ran into, but we found that if you
>>> get an expanding disk it will keep growing and prevent this from happening,
>>> then the index will contract and the disk will shrink back to only what it
>>> needs.  saved me a lot of headaches not needing to ever worry about disk
>>> space
>>> 
>>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>>  wrote:
>>> 
>>>> 
>>>> When the optimize command is issued, the expectation after the completion of
>>>> the optimization process is that the index size either decreases or at most
>>>> remains the same. In a Solr 7.6 cluster with 50 plus shards, when the optimize
>>>> command is issued, some of the shard's transient or older segment files are not
>>>> deleted. This is happening randomly across all shards. When unnoticed, these
>>>> transient files make the disk full. Currently it is handled through monitors,
>>>> but the question is what is causing the transient/older files to remain there.
>>>> Are there any specific race conditions which leave the older files not
>>>> being deleted?
>>>> Any pointers around this will be helpful.
>>>> TIA
>> 



Re: Master Slave Terminology

2020-06-17 Thread Walter Underwood
I’ve long thought that master/slave was not the right metaphor for a pull model 
anyway.

We probably should not use “replica” since that already has a use in Solr Cloud.

Where is the discussion?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 16, 2020, at 11:51 PM, Jan Høydahl  wrote:
> 
> Hi Kaya,
> 
> Thanks for bringing it up. The topic is already being discussed by 
> developers, so expect to see some change in this area; Not over-night, but 
> incremental.
> Also, if you want to lend a helping hand, patches are more than welcome as 
> always.
> 
> Jan
> 
>> On 17 Jun 2020, at 04:22, Kayak28  wrote:
>> 
>> Hello, Community:
>> 
>> As GitHub and Python are replacing terminology related to slavery,
>> why don't we replace master-slave for Solr as well?
>> 
>> https://developers.srad.jp/story/18/09/14/0935201/
>> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
>> 
>> -- 
>> 
>> Sincerely,
>> Kaya
>> github: https://github.com/28kayak
> 



Re: Solr 7.6 optimize index size increase

2020-06-16 Thread Walter Underwood
For a full forced merge (mistakenly named “optimize”), the worst case disk space
is 3X the size of the index. It is common to need 2X the size of the index.

When I worked on Ultraseek Server 20+ years ago, it had the same merge behavior.
I implemented a disk space check that would refuse to merge if there wasn’t 
enough
free space. It would log an error and send an email to the admin.
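
If I were doing it today, the same check is a few lines of Python. A rough sketch,
with the core name and index directory made up, and using the 3X worst case as the
headroom:

import os, shutil
import requests

CORE_URL = "http://localhost:8983/solr/mycore"     # hypothetical core
INDEX_DIR = "/var/solr/data/mycore/data/index"     # hypothetical index path

def safe_to_merge(headroom=3):
    index_bytes = sum(os.path.getsize(os.path.join(INDEX_DIR, f))
                      for f in os.listdir(INDEX_DIR))
    return shutil.disk_usage(INDEX_DIR).free >= headroom * index_bytes

if safe_to_merge():
    # a forced merge is just an update request with optimize=true
    requests.get(CORE_URL + "/update", params={"optimize": "true"}, timeout=3600)
else:
    print("Not enough free disk for a forced merge, skipping.")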

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 16, 2020, at 1:58 PM, David Hastings  
> wrote:
> 
> I cant give you a 100% true answer but ive experienced this, and what
> "seemed" to happen to me was that the optimize would start, and that will
> drive the size up by 3 fold, and if you run out of disk space in the process
> the optimize will quit since it cant optimize, and leave the live index
> pieces intact, so now you have the "current" index as well as the
> "optimized" fragments
> 
> i cant say for certain thats what you ran into, but we found that if you
> get an expanding disk it will keep growing and prevent this from happening,
> then the index will contract and the disk will shrink back to only what it
> needs.  saved me a lot of headaches not needing to ever worry about disk
> space
> 
> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>  wrote:
> 
>> 
>> When the optimize command is issued, the expectation after the completion of
>> the optimization process is that the index size either decreases or at most
>> remains the same. In a Solr 7.6 cluster with 50 plus shards, when the optimize
>> command is issued, some of the shard's transient or older segment files are not
>> deleted. This is happening randomly across all shards. When unnoticed, these
>> transient files make the disk full. Currently it is handled through monitors,
>> but the question is what is causing the transient/older files to remain there.
>> Are there any specific race conditions which leave the older files not
>> being deleted?
>> Any pointers around this will be helpful.
>> TIA



Re: How to determine why solr stops running?

2020-06-11 Thread Walter Underwood
1. You have a tiny heap. 536 Megabytes is not enough.
2. I stopped using the CMS GC years ago.
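
Also, the dmesg excerpt below shows the kernel OOM killer picking off an httpd
process, which is a different failure from the Solr JVM running out of heap. A rough
way to check for both from Python (the log location is an assumption based on a
default install):

import glob
import subprocess

# which processes has the kernel OOM killer taken out?
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "Out of memory" in line or "Killed process" in line:
        print(line)

# if the Solr JVM itself ran out of heap, bin/oom_solr.sh leaves a log behind,
# named something like solr_oom_killer-8983-<timestamp>.log in the Solr logs directory
print(glob.glob("/opt/solr/server/logs/*oom*"))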

Here is the GC config we use on every one of our 150+ Solr hosts. We’re still 
on Java 8, but will be upgrading soon.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 11, 2020, at 10:52 AM, Ryan W  wrote:
> 
> On Wed, Jun 10, 2020 at 8:35 PM Hup Chen  wrote:
> 
>> I will check "dmesg" first, to find out any hardware error message.
>> 
> 
> Here is what I see toward the end of the output from dmesg:
> 
> [1521232.781785] [118857]48 118857   108785  677 201
> 901 0 httpd
> [1521232.781787] [118860]48 118860   108785  710 201
> 881 0 httpd
> [1521232.781788] [118862]48 118862   113063 5256 210
> 725 0 httpd
> [1521232.781790] [118864]48 118864   114085 6634 212
> 703 0 httpd
> [1521232.781791] [118871]48 118871   13968732323 262
> 620 0 httpd
> [1521232.781793] [118873]48 118873   108785  821 201
> 792 0 httpd
> [1521232.781795] [118879]48 118879   14026332719 263
> 621 0 httpd
> [1521232.781796] [118903]48 118903   108785  812 201
> 771 0 httpd
> [1521232.781798] [118905]48 118905   113575 5606 211
> 660 0 httpd
> [1521232.781800] [118906]48 118906   113563 5694 211
> 626 0 httpd
> [1521232.781801] Out of memory: Kill process 117529 (httpd) score 9 or
> sacrifice child
> [1521232.782908] Killed process 117529 (httpd), UID 48, total-vm:675824kB,
> anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB
> 
> Is this a relevant "Out of memory" message?  Does this suggest an OOM
> situation is the culprit?
> 
> When I grep in the solr logs for oom, I see some entries like this...
> 
> ./solr_gc.log.4.current:CommandLine flags: -XX:CICompilerCount=4
> -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark
> -XX:ConcGCThreads=4 -XX:GCLogFileSize=20971520
> -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
> -XX:MaxNewSize=134217728 -XX:MaxTenuringThreshold=8
> -XX:MinHeapDeltaBytes=196608 -XX:NewRatio=3 -XX:NewSize=134217728
> -XX:NumberOfGCLogFiles=9 -XX:OldPLABSize=16 -XX:OldSize=402653184
> -XX:-OmitStackTraceInFastThrow
> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /opt/solr/server/logs
> -XX:ParallelGCThreads=4 -XX:+ParallelRefProcEnabled
> -XX:PretenureSizeThreshold=67108864 -XX:+PrintGC
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90 -XX:ThreadStackSize=256
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers
> -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation
> -XX:+UseParNewGC
> 
> Buried in there I see "OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh". But I
> think this is just a setting that indicates what to do in case of an OOM.
> And if I look in that oom_solr.sh file, I see it would write an entry to a
> solr_oom_kill log. And there is no such log in the logs directory.
> 
> Many thanks.
> 
> 
> 
> 
>> Then use some system admin tools to monitor that server,
>> for instance, top, vmstat, lsof, iostat ... or simply install some nice
>> free monitoring tool into this system, like monit, monitorix, nagios.
>> Good luck!
>> 
>> 
>> From: Ryan W 
>> Sent: Thursday, June 11, 2020 2:13 AM
>> To: solr-user@lucene.apache.org 
>> Subject: Re: How to determine why solr stops running?
>> 
>> Hi all,
>> 
>> People keep suggesting I check the logs for errors.  What do those errors
>> look like?  Does anyone have examples of the text of a Solr oom error?  Or
>> the text of any other errors I should be looking for the next time solr
>> fails?  Are there phrases I should grep for in the logs?  Should I be
>> looking in the Solr logs for an OOM error, or in the Apache logs?
>> 
>> There is nothing failing on the server except for solr -- at least not that
>> I can see.  There is no apparent problem with the hardware or anything else
>> on the server.  The OS is Red Hat Enterprise Linux. The serv

Re: Getting rid of zookeeper

2020-06-09 Thread Walter Underwood
Zookeeper was created because fault-tolerant algorithms are extremely hard to 
test and get correct. Maybe the hardest thing in computing. Using a trusted 
implementation frees up lots of developer time.

To get an idea of the difficulty, read through the kinds of things fixed in the 
Zookeeper release notes.

https://zookeeper.apache.org/releases.html

Elasticsearch does not have a good record on fault-tolerance. I haven’t checked 
recently, but it was losing updates during leader elections for several years 
worth of software releases.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 9, 2020, at 12:37 PM, David Hastings  
> wrote:
> 
> Zookeeper is annoying to both set up and manage, but then again the same
> thing can be said about solr cloud.  not certain why you would want to deal
> with either
> 
> On Tue, Jun 9, 2020 at 3:29 PM S G  wrote:
> 
>> Hello,
>> 
>> I recently stumbled across KIP-500: Replace ZooKeeper with a Self-Managed
>> Metadata Quorum
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
>>> 
>> Elastic-search does this too.
>> And so do many other systems.
>> 
>> Is there some work to go in this direction?
>> It would be nice to get rid of another totally disparate system.
>> Hardware savings would be nice to have too.
>> 
>> Best,
>> SG
>> 



Re: Script to check if solr is running

2020-06-08 Thread Walter Underwood
I could write a script, too, though I’d do it with straight shell code. But 
then I’d have to test it, check it in somewhere, document it for ops, install 
it, ...
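
The whole thing is only something like this (untested, which is exactly the problem):

import subprocess
import requests

def solr_is_up(url="http://localhost:8983/solr/admin/info/system?wt=json"):
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

if not solr_is_up():
    # assumes Solr was installed with the service installer, so a "solr" service exists
    subprocess.run(["sudo", "service", "solr", "restart"], check=False)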

Instead, when we switch from monit, I'll start with one of these systemd 
configs.

https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1 
<https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1>
https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup 
<https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup>
https://issues.apache.org/jira/browse/SOLR-14410 
<https://issues.apache.org/jira/browse/SOLR-14410>

Why have a cold backup and then switch? Every time I see that config, I wonder 
why people don’t have both servers live behind a load balancer. How do you know 
the cold server will work?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 8, 2020, at 9:20 AM, Dave  wrote:
> 
> A simple Perl script would be able to cover this, I have a cron job Perl 
> script that does a search with an expected result, if the result isn’t there 
> it fails over to a backup search server, sends me an email, and I fix what’s 
> wrong. The backup search server is a direct clone of the live server and just 
> as strong, no interruption (aside from the five minute window) 
> 
> If you need a hand with this I’d gladly help, everything I run is Linux based 
> but it’s a simple curl command and server switch on failure. 
> 
>> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
>> 
>> Use the solution described by Walter. This allows you to automatically 
>> restart in case of failure and is also cleaner than defining a cronjob. 
>> Otherwise this would be another dependency one needs to keep in mind - it means 
>> that if there is an issue, someone who does not know the system has to look in 
>> different places, which is never good.
>> 
>>> On 04.06.2020 at 18:36, Ryan W  wrote:
>>> 
>>> Does anyone have a script that checks if solr is running and then starts it
>>> if it isn't running?  Occasionally my solr stops running even if there has
>>> been no Apache restart.  I haven't been able to determine the root cause,
>>> so the next best thing might be to check every 15 minutes or so if it's
>>> running and run it if it has stopped.
>>> 
>>> Thanks.



Re: Script to check if solr is running

2020-06-05 Thread Walter Underwood
Most Linux distros are using systemd to manage server processes.

https://en.wikipedia.org/wiki/Systemd <https://en.wikipedia.org/wiki/Systemd>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 5, 2020, at 8:08 AM, Mark H. Wood  wrote:
> 
> On Thu, Jun 04, 2020 at 12:36:30PM -0400, Ryan W wrote:
>> Does anyone have a script that checks if solr is running and then starts it
>> if it isn't running?  Occasionally my solr stops running even if there has
>> been no Apache restart.  I haven't been able to determine the root cause,
>> so the next best thing might be to check every 15 minutes or so if it's
>> running and run it if it has stopped.
> 
> I've used Monit for things that must be kept running:
> 
>  https://mmonit.com/monit/
> 
> -- 
> Mark H. Wood
> Lead Technology Analyst
> 
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu



Re: Multiple Solr instances using same ZooKeepers

2020-06-03 Thread Walter Underwood
If your clusters are able to use the same Zookeeper, then they are in the same 
data center (or AWS region), so you should not need CDCR. That is for clusters 
in different data centers. Also, CDCR has some known problems.

What are you trying to solve with CDCR? There may be a better way to solve it.
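
For what it's worth, sharing one ensemble is just a matter of giving each cluster its
own chroot in the connection string, something like
ZK_HOST=zk1:2181,zk2:2181,zk3:2181/solr-clusterA for one cloud and /solr-clusterB for
the other (the names here are made up, and the chroot path has to be created first,
for example with bin/solr zk mkroot).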

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 2, 2020, at 6:35 AM, Gell-Holleron, Daniel 
>  wrote:
> 
> Many thanks for this information! 
> 
> 
> -Original Message-
> From: Colvin Cowie  
> Sent: 02 June 2020 09:46
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple Solr instances using same ZooKeepers
> 
> You can specify a different "chroot" directory path in zookeeper for each 
> cloud 
> https://lucene.apache.org/solr/guide/8_5/setting-up-an-external-zookeeper-ensemble.html#using-a-chroot
> 
> On Tue, 2 Jun 2020 at 09:33, Gell-Holleron, Daniel < 
> daniel.gell-holle...@gb.unisys.com> wrote:
> 
>> Hi there,
>> 
>> We are in the process of deploying Solr Cloud with CDCR.
>> 
>> I would like to know if multiple instances of Solr (4 Solr servers for 
>> one instance, 4 for another instance) can use the same ZooKeeper servers?
>> 
>> This would prevent us from needing multiple ZooKeepers servers to 
>> serve each instance of Solr.
>> 
>> Regards,
>> 
>> Daniel
>> 
>> 



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Walter Underwood

> On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:
> 
> If it was me I'd probably build a standalone indexer script in Python that 
> did the file handling, called out to a separate Tika service for extraction, 
> posted to Solr.

I would do the same thing, and I would base that script on Scrapy 
(https://scrapy.org <https://scrapy.org/>). I worked on a Python-based web 
spider for about ten years.
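
A bare-bones version of that kind of script, ignoring Scrapy and just using requests,
might look like this. It assumes a Tika server on its default port 9998, a made-up
collection name, and the default schema's *_txt dynamic field:

import pathlib
import requests

TIKA = "http://localhost:9998/tika"                              # tika-server default endpoint
SOLR = "http://localhost:8983/solr/emails/update/json/docs"      # hypothetical collection

def index_eml(path):
    raw = path.read_bytes()
    # ask Tika for plain text; its /meta endpoint would give headers such as From/Subject
    text = requests.put(TIKA, data=raw, headers={"Accept": "text/plain"}).text
    doc = {"id": str(path), "body_txt": text}
    requests.post(SOLR, json=doc, params={"commitWithin": "60000"}).raise_for_status()

for p in pathlib.Path("/data/mail").glob("**/*.eml"):            # hypothetical mail directory
    try:
        index_eml(p)
    except Exception as e:
        print("failed:", p, e)   # log the bad file and keep going instead of dying mid-run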

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: SOLR cache tuning

2020-06-01 Thread Walter Underwood
Reading all the documents is going to be slow. If you want to do that, use a 
database.
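
If you really do have to pull every document out of Solr, use cursorMark paging rather
than one huge rows= request. A rough sketch, with the collection name made up:

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"   # hypothetical collection

params = {"q": "*:*", "rows": 1000, "fl": "id",
          "sort": "id asc",        # the sort must include the uniqueKey field
          "cursorMark": "*", "wt": "json"}
while True:
    rsp = requests.get(SOLR, params=params).json()
    for doc in rsp["response"]["docs"]:
        pass                       # do something with each document
    if rsp["nextCursorMark"] == params["cursorMark"]:
        break                      # the cursor stopped moving, so we have everything
    params["cursorMark"] = rsp["nextCursorMark"]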

You do NOT keep all of the index in heap. Solr doesn’t work like that.

Your JVM heap is probably way too big for 2 million documents, but I doubt that 
is the performance issue. We use an 8 GB heap for all of our Solr instances, 
including one with about 5 million docs per shard.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 1, 2020, at 8:28 AM, Tarun Jain  wrote:
> 
> Hi, I have a SOLR installation in master-slave configuration. The slave is 
> used only for reads and master for writes.
> I wanted to know if there is anything I can do to improve the performance of 
> the readonly Slave instance?
> I am running SOLR 8.5 and Java 14. The JVM has 24GB of ram allocated. Server 
> has 256 GB of RAM with about 50gb free (rest being used by other services on 
> the server). The index is 15gb in size with about 2 million documents.
> We do a lot of queries where documents are fetched using filter queries and a 
> few times all 2 million documents are read. My initial idea to speed up SOLR 
> is that given the amount of memory available, SOLR should be able to keep the 
> entire index on the heap (I know OS will also cache the disk blocks) 
> My solrconfig has the following:
> 20
> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0" />
> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />
> <documentCache class="solr.LRUCache" size="8192" initialSize="8192" autowarmCount="0" />
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10" initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> <enableLazyFieldLoading>true</enableLazyFieldLoading>
> <queryResultWindowSize>20</queryResultWindowSize>
> <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
> <useColdSearcher>false</useColdSearcher>
> <maxWarmingSearchers>2</maxWarmingSearchers>
> I have modified the documentCache size to 8192 from 512 but it has not helped 
> much. 
> I know this question has probably been asked a few times and I have read 
> everything I could find out about SOLR cache tuning. I am looking for some 
> more ideas.
> 
> Any ideas?
> Tarun Jain-=-



Re: JMX metrics for solr cloud cluster state

2020-05-31 Thread Walter Underwood
I gave up on JMX ages ago, so I can’t help there.

I’d open a bug with New Relic.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 31, 2020, at 7:59 PM, Ganesh Sethuraman  
> wrote:
> 
> Can you suggest Solr Cloud JMX metrics for collection and replica status?
> Trying to centralize the alert generation in NewRelic. New Relic only seems
> to support JMX for the same.
> 
> On Sun, May 31, 2020, 7:29 PM Walter Underwood 
> wrote:
> 
>> I wrote a Python demon that gets clusterstatus from the API, parses it,
>> and sends the counts of replicas in each state to InfluxDB. From there, we
>> chart and alert in Grafana. New Relic is good, but we need other kinds of
>> metrics, like the load balancer status from CloudWatch.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On May 31, 2020, at 4:24 PM, matthew sporleder 
>> wrote:
>>> 
>>> complain to new relic on their lagging solr support!!!  I have and
>>> could use some support!
>>> 
>>> To address your actual question I have found JMX in solr to be crazy
>>> unreliable but the admin/metrics web endpoint is pretty good.
>>> 
>>> I have some (crappy) python for parsing it for datadog:
>>> https://github.com/msporleder/dd-solrcloud  you might be able to ship
>>> something similar to insights if you were so inclined
>>> 
>>> On Sun, May 31, 2020 at 7:15 PM Ganesh Sethuraman
>>>  wrote:
>>>> 
>>>> Hi
>>>> 
>>>> We use New Relic to monitor Sold Cloud Cluster 7.2.1. we would like to
>> get
>>>> alerted on any cluster state change. Like for example degraded shard.
>>>> Replica down. New relic can monitor any JMX metrices.
>>>> 
>>>> Can you suggest JMX metrics that will help monitor degraded cluster,
>>>> replica recovering, shard replica down, etc?
>>>> 
>>>> I couldn't find any metric on Solr documents.
>>>> 
>>>> Regards
>>>> Ganesh
>> 
>> 



Re: JMX metrics for solr cloud cluster state

2020-05-31 Thread Walter Underwood
I wrote a Python demon that gets clusterstatus from the API, parses it, and 
sends the counts of replicas in each state to InfluxDB. From there, we chart 
and alert in Grafana. New Relic is good, but we need other kinds of metrics, 
like the load balancer status from CloudWatch.
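
The parsing part is only a dozen lines or so. Roughly (the node name is made up, and
the InfluxDB write is left out):

from collections import Counter
import requests

url = "http://solr-1.example.com:8983/solr/admin/collections"   # any node in the cluster works
rsp = requests.get(url, params={"action": "CLUSTERSTATUS", "wt": "json"}).json()

counts = Counter()
for coll in rsp["cluster"]["collections"].values():
    for shard in coll["shards"].values():
        for replica in shard["replicas"].values():
            counts[replica["state"]] += 1    # active, recovering, down, recovery_failed

print(dict(counts))   # the real daemon pushes these counts to InfluxDB instead of printing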

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 31, 2020, at 4:24 PM, matthew sporleder  wrote:
> 
> complain to new relic on their lagging solr support!!!  I have and
> could use some support!
> 
> To address your actual question I have found JMX in solr to be crazy
> unreliable but the admin/metrics web endpoint is pretty good.
> 
> I have some (crappy) python for parsing it for datadog:
> https://github.com/msporleder/dd-solrcloud  you might be able to ship
> something similar to insights if you were so inclined
> 
> On Sun, May 31, 2020 at 7:15 PM Ganesh Sethuraman
>  wrote:
>> 
>> Hi
>> 
>> We use New Relic to monitor Sold Cloud Cluster 7.2.1. we would like to get
>> alerted on any cluster state change. Like for example degraded shard.
>> Replica down. New relic can monitor any JMX metrices.
>> 
>> Can you suggest JMX metrics that will help monitor degraded cluster,
>> replica recovering, shard replica down, etc?
>> 
>> I couldn't find any metric on Solr documents.
>> 
>> Regards
>> Ganesh



Re: Why Did It Match?

2020-05-28 Thread Walter Underwood
Are you sure they will wonder? I’d try it without that and see if the simpler 
UI is easier to use. Simple almost always wins the A/B test.

You can use the highlighter to see if a field matched a term. Only use explain 
if you need all the scores.
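
That is usually enough to rebuild the Endeca-style "Match Criteria" display. A rough
sketch, with the collection and query made up; note that only stored fields can be
highlighted:

import requests

SOLR = "http://localhost:8983/solr/products/select"   # hypothetical collection

params = {"q": "acetone", "fl": "id", "wt": "json",
          "hl": "true",
          "hl.fl": "*",                     # ask for highlights on any stored field
          "hl.requireFieldMatch": "true"}   # only report fields that actually matched
rsp = requests.get(SOLR, params=params).json()
for doc_id, fields in rsp["highlighting"].items():
    matched = [f for f, snippets in fields.items() if snippets]
    print(doc_id, "matched on:", ", ".join(matched))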

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 28, 2020, at 3:37 PM, Webster Homer  
> wrote:
> 
> Thank you.
> 
> The problem is that Endeca just provided this information. The website users 
> see how each search result matched the query.
> For example this is displayed for a hit:
> 1 Product Result
> 
> |  Match Criteria: Material, Product Number
> 
> The business users will wonder why we cannot provide this information with 
> the new system.
> 
> -Original Message-
> From: Erick Erickson 
> Sent: Thursday, May 28, 2020 4:38 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Why Did It Match?
> 
> Yes, debug=explain is expensive. Expensive in the sense that I’d never add it 
> to every query. But if your business users are trying to understand why query 
> X came back the way it did by examining individual queries, then I wouldn’t 
> worry.
> 
> You can easily see how expensive it is in your situation by looking at the 
> timings returned. Debug is just a component just like facet etc and the time 
> it takes is listed separately in the timings section of debug output…
> 
> Best,
> Erick
> 
>> On May 28, 2020, at 4:52 PM, Webster Homer 
>>  wrote:
>> 
>> My concern was that I thought that explain is resource heavy, and was only 
>> used for debugging queries.
>> 
>> -Original Message-
>> From: Doug Turnbull 
>> Sent: Thursday, May 21, 2020 4:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Why Did It Match?
>> 
>> Is your concern that the Solr explain functionality is slower than Endecas?
>> Or harder to understand/interpret?
>> 
>> If the latter, I might recommend http://splainer.io as one solution
>> 
>> On Thu, May 21, 2020 at 4:52 PM Webster Homer < 
>> webster.ho...@milliporesigma.com> wrote:
>> 
>>> My company is working on a new website. The old/current site is
>>> powered by Endeca. The site under development is powered by Solr
>>> (currently 7.7.2)
>>> 
>>> Out of the box, Endeca provides the capability to show how a query
>>> was matched in the search. The business users like this
>>> functionality, in solr this functionality is an expensive debug
>>> option. Is there another way to get this information from a query?
>>> 
>>> Webster Homer
>>> 
>>> 
>>> 
>>> This message and any attachment are confidential and may be
>>> privileged or otherwise protected from disclosure. If you are not the
>>> intended recipient, you must not copy this message or attachment or
>>> disclose the contents to any other person. If you have received this
>>> transmission in error, please notify the sender immediately and
>>> delete the message and any attachment from your system. Merck KGaA,
>>> Darmstadt, Germany and any of its subsidiaries do not accept
>>> liability for any omissions or errors in this message which may arise
>>> as a result of E-Mail-transmission or for damages resulting from any
>>> unauthorized changes of the content of this message and any
>>> attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>>> subsidiaries do not guarantee that this message is free of viruses
>>> and does not accept liability for any damages caused by any virus 
>>> transmitted therewith.
>>> 
>>> 
>>> 
>>> Click http://www.merckgroup.com/disclaimer to access the German,
>>> French, Spanish and Portuguese versions of this disclaimer.
>>> 
>> 
>> 
>> --
>> *Doug Turnbull **| CTO* | OpenSource Connections
>> <http://opensourceconnections.com>, LLC | 240.476.9983
>> Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI 
>> Powered Search <http://aipoweredsearch.com>* This e-mail and all contents, 
>> including attachments, is considered to be Company Confidential unless 
>> explicitly stated otherwise, regardless of whether attachments are marked as 
>> such.
>> 
>> 
>> This message and any attachment are confidential and may be privileged or 
>> otherwise protected from disclosure. If you are not the intended recipient, 
>> you must not copy this message or attachment or disclose the contents to any 
>> other person. If you have received thi
