Fw: TolerantUpdateProcessorFactory not functioning

2020-06-08 Thread Hup Chen
Any idea?
I still haven't been able to get TolerantUpdateProcessorFactory working; Solr 
exits at any error without any tolerance. Any suggestions will be appreciated.
curl "http://localhost:7070/solr/mycore/update?update.chain=tolerant-chain&maxErrors=100" -d @data.xml

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <arr name="errors"/>
    <int name="maxErrors">100</int>
    <int name="status">400</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="error">
    <lst name="metadata">
      <str name="error-class">org.apache.solr.common.SolrException</str>
      <str name="root-error-class">com.ctc.wstx.exc.WstxEOFException</str>
    </lst>
    <str name="msg">Unexpected EOF; was expecting a close tag for element &lt;field&gt;
 at [row,col {unknown-source}]: [1,8191]</str>
    <int name="code">400</int>
  </lst>
</response>


From: Hup Chen
Sent: Friday, May 29, 2020 7:29 PM
To: solr-user@lucene.apache.org 
Subject: TolerantUpdateProcessorFactory not functioning

Hi,

My Solr indexing did not tolerate a bad record but simply exited, even though 
I have configured TolerantUpdateProcessorFactory in solrconfig.xml.
Please advise how I can get TolerantUpdateProcessorFactory working.

solrconfig.xml:

<updateRequestProcessorChain name="tolerant-chain">
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">100</int>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

restarted solr before indexing:
service solr stop
service solr start

curl "http://localhost:7070/solr/mycore/update?update.chain=tolerant-chain&maxErrors=100" -d @test.json

The first record in test.json is a bad record; the rest were not indexed.

{
  "responseHeader":{
"errors":[{
"type":"ADD",
"id":"0007264097",
"message":"ERROR: [doc=0007264097] Error adding field 'usedshipping'='' 
msg=empty String"}],
"maxErrors":100,
"status":400,
"QTime":0},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"Cannot parse provided JSON: Expected key,value separator ':': 
char=\",position=1240 AFTER='isbn\":\"4032171203\", \"sku\":\"\", 
\"title\":\"ãã³ãã¡ã¡ããã³ã \"author\"' BEFORE=':\"Sachiko OÃtomo\", 
ãã, \"ima'",
"code":400}}



Re: How to determine why solr stops running?

2020-06-08 Thread Radu Gheorghe
I assumed it does, based on your description. If you installed it as a service 
(systemd), then systemd can start the service again if it fails. (something 
like Restart=always in your [Service] definition).

But if it doesn’t restart automatically now, I think it’s easier to 
troubleshoot: just check the last logs after it crashed.

Best regards,
Radu

https://sematext.com

> On 8 Jun 2020, at 16:28, Ryan W  wrote:
> 
> "If Solr auto-restarts"
> 
> It doesn't auto-restart.  Is there some auto-restart functionality?  I'm
> not aware of that.
> 
> On Mon, Jun 8, 2020 at 7:10 AM Radu Gheorghe 
> wrote:
> 
>> Hi Ryan,
>> 
>> If Solr auto-restarts, I suppose it's systemd doing that. When it restarts
>> the Solr service, systemd should log this (maybe something like: journalctl
>> --no-pager | grep -i solr).
>> 
>> Then you can go in your Solr logs and check what happened right before that
>> time. Also, check system logs for what happened before Solr was restarted.
>> 
>> Best regards,
>> Radu
>> 
>> https://sematext.com/
>> 
>> On Thu, 4 Jun 2020 at 19:24, Ryan W wrote:
>> 
>>> Happened again today. Solr stopped running. Apache hasn't stopped in 10
>>> days, so this is not due to a server reboot.
>>> 
>>> Solr is not being run with the oom-killer.  And when I grep for ERROR in
>>> the logs, there is nothing from today.
>>> 
>>> On Mon, May 18, 2020 at 3:15 PM James Greene <
>> ja...@jamesaustingreene.com>
>>> wrote:
>>> 
 I usually do a combination of grepping for ERROR in solr logs and
>>> checking
 journalctl to see if an external program may have killed the process.
 
 
 Cheers,
 
 /
 *   James Austin Greene
 *  www.jamesaustingreene.com
 *  336-lol-nerd
 /
 
 
 On Mon, May 18, 2020 at 1:39 PM Erick Erickson <
>> erickerick...@gmail.com>
 wrote:
 
> ps aux | grep solr
> 
> on a *.nix system will show you all the runtime parameters.
> 
>> On May 18, 2020, at 12:46 PM, Ryan W  wrote:
>> 
>> Is there a config file containing the start params?  I run solr
>>> like...
>> 
>> bin/solr start
>> 
>> I have not seen anything in the logs that seems informative. When I
 grep
> in
>> the logs directory for 'memory', I see nothing besides a couple
>>> entries
>> like...
>> 
>> 2020-05-14 13:05:56.155 INFO  (main) [   ]
> o.a.s.h.a.MetricsHistoryHandler
>> No .system collection, keeping metrics history in memory.
>> 
>> I don't know what that entry means, though the date does roughly
 coincide
>> with the last time solr stopped running.
>> 
>> Thank you.
>> 
>> 
>> On Mon, May 18, 2020 at 12:00 PM Erick Erickson <
 erickerick...@gmail.com
>> 
>> wrote:
>> 
>>> Probably, but check that you are running with the oom-killer,
>> it'll
>>> be
> in
>>> your start params.
>>> 
>>> But absent that, something external will be the culprit, Solr
>>> doesn't
> stop
>>> by itself. Do look at the Solr log once things stop, it should
>> show
>>> if
>>> someone or something stopped it.
>>> 
>>> On Mon, May 18, 2020, 10:43 Ryan W  wrote:
>>> 
 I don't see any log file with "oom" in the file name.  Does that
>>> mean
>>> there
 hasn't been an out-of-memory issue?  Thanks.
 
 On Thu, May 14, 2020 at 10:05 AM James Greene <
>>> ja...@jamesaustingreene.com
> 
 wrote:
 
> Check the log for for an OOM crash.  Fatal exceptions will be in
>>> the
>>> main
> solr log and out of memory errors will be in their own -oom log.
> 
> I've encountered quite a few solr crashes and usually it's when
>>> there's a
> threshold of concurrent users and/or indexing happening.
> 
> 
> 
> On Thu, May 14, 2020, 9:23 AM Ryan W  wrote:
> 
>> Hi all,
>> 
>> I manage a site where solr has stopped running a couple times
>> in
 the
 past
>> week. The server hasn't been rebooted, so that's not the
>> reason.
>>> What
> else
>> causes solr to stop running?  How can I investigate why this is
> happening?
>> 
>> Thank you,
>> Ryan
>> 
> 
 
>>> 
> 
> 
 
>>> 
>> 



Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Srinivas Kashyap
Hi Shawn,

It's a vague question and I haven't tried it out yet.

Can I instead write the query as below?

Basically instead of



q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"BAMBOOROSE"&rows=1000&sort=MODIFY_TS 
desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc



pass



q=PHY_KEY2:"HQ012206"+AND+PHY_KEY1:"BAMBOOROSE"&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
*]&rows=1000&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID 
desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 
asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 
asc,FIELD_NAME asc


Instead of q=*:* I pass only those fields which I want to retrieve. Will this 
be faster?

Related to earlier question:
We are using 8.4.1 version
All the fields I'm sorting on are string data types (MODIFY_TS is a date), 
with indexed=true and stored=true.


Thanks,
Srinivas


On 05-Jun-2020 9:50 pm, Shawn Heisey  wrote:
On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO 
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS 
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 
> asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.

What Solr version, and what is the definition of each of the fields
you're sorting on? If the definition doesn't include docValues, then a
large on-heap memory structure will be created for sorting (VERY large
with 500 million docs), and I wouldn't be surprised if it's created even
if it is never used. The definition for any field you use for sorting
should definitely include docValues. In recent Solr versions, docValues
defaults to true for most field types. Some field classes, TextField in
particular, cannot have docValues.
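
For illustration, a sort-friendly definition would look something like this 
(field name taken from your sort parameter; the type is an assumption):

<field name="TRACK_ID" type="string" indexed="true" stored="true" docValues="true"/>

followed by a full reindex so the docValues structures actually get built.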

There's something else to discuss about sort params -- each sort field
will only be used if ALL of the previous sort fields are identical for
two documents in the full numFound result set. Having more than two or
three sort fields is usually pointless. My guess (which I know could be
wrong) is that most queries with this HUGE sort parameter will never use
anything beyond TRACK_ID.

> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
>
> <filterCache size="512"
>              initialSize="512"
>              autowarmCount="0"/>

This is way too big for your index, and a prime candidate for why your
heap requirements are so high. Like I said before, if the filterCache
on your system actually reaches this max size, it will require 30GB of
memory JUST for the filterCache on this core. Can you check the admin
UI to determine what the size is and what hit ratio it's getting? (1.0
is 100% on the hit ratio). I'd probably start with a size of 32 or 64
on this cache. With a size of 64, a little less than 4GB would be the
max heap allocated for the cache. You can experiment... but with 500
million docs, the filterCache size should be pretty small.

You're going to want to carefully digest this part of that wiki page
that I linked earlier. Hopefully email will preserve this link completely:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements

Thanks,
Shawn



Re: Faster Vector Highlight

2020-06-08 Thread Kayak28
Hello, Yasufumi-san and Solr Community:

Thank you for your suggestion.
When I added the parameter hl.maxAnalyzedChars=-1, I could highlight the
long text.

Sincerely,
Kaya Ota

On Sat, Jun 6, 2020 at 20:39, Yasufumi Mizoguchi wrote:

> Hi, Kaya.
>
> How about using hl.maxAnalyzedChars parameter ?
>
> Thanks,
> Yasufumi
>
> > On 2020/06/06 at 5:56 PM, Kayak28 wrote:
> >
> > Hello, Solr Community:
> >
> > I have a question about the FastVectorHighlighter.
> > I know Solr highlight does not return highlighted text if the text in the
> > highlighted field is too long.
> > What is the good way to treat long text highlights?
> >
> >
> > --
> >
> > Sincerely,
> > Kaya
> > github: https://github.com/28kayak
>


-- 

Sincerely,
Kaya
github: https://github.com/28kayak


Re: index join without query criteria

2020-06-08 Thread Mikhail Khludnev
or probably -director_id:[* TO *]

On Mon, Jun 8, 2020 at 10:56 PM Hari Iyer  wrote:

> Hi,
>
> It appears that a query criterion is mandatory for a join. Taking this
> example from the documentation: fq={!join from=id fromIndex=movie_directors
> to=director_id}has_oscar:true. What if I want to find all movies that have
> a director (regardless of whether they have won an Oscar or not)? This
> query: fq={!join from=id fromIndex=movie_directors to=director_id} fails.
> Do I just have to make up a dummy criteria like fq={!join from=id
> fromIndex=movie_directors to=director_id}id:[* TO *]?
>
> Thanks,
> Hari.
>
>

-- 
Sincerely yours
Mikhail Khludnev


index join without query criteria

2020-06-08 Thread Hari Iyer
Hi,

It appears that a query criterion is mandatory for a join. Taking this example 
from the documentation: fq={!join from=id fromIndex=movie_directors 
to=director_id}has_oscar:true. What if I want to find all movies that have a 
director (regardless of whether they have won an Oscar or not)? This query: 
fq={!join from=id fromIndex=movie_directors to=director_id} fails. Do I just 
have to make up a dummy criteria like fq={!join from=id 
fromIndex=movie_directors to=director_id}id:[* TO *]?

Thanks,
Hari.



RE: Script to check if solr is running

2020-06-08 Thread Dunigan, Craig A.
I agree with the systemd guys if you’re unfamiliar with scripting this sort of 
thing.  I’d wind up with piping through awk and grep and the like, which is as 
clear as mud if you don’t already know it.  Might as well learn and use the 
modern tools if you can.  We have an old-school hard division between sysadmins 
and app admins, so we don’t like to play with their toys.  /etc/rc.appstart and 
/etc/rc.appstop owned by the app account is the standard way here, and we build 
our own service monitoring.  And now that I write that all out, it really does 
look kinda clunky, doesn’t it?  Better to do it The Right Way©.

From: Walter Underwood 
Sent: Monday, June 8, 2020 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Script to check if solr is running

I could write a script, too, though I’d do it with straight shell code. But 
then I’d have to test it, check it in somewhere, document it for ops, install 
it, ...

Instead, when we switch from monit, I'll start with one of these systemd 
configs.

https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1
https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup
https://issues.apache.org/jira/browse/SOLR-14410

Why have a cold backup and then switch? Every time I see that config, I wonder 
why people don’t have both servers live behind a load balancer. How do you know 
the cold server will work?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 8, 2020, at 9:20 AM, Dave <hastings.recurs...@gmail.com> wrote:
>
> A simple Perl script would be able to cover this, I have a cron job Perl 
> script that does a search with an expected result, if the result isn’t there 
> it fails over to a backup search server, sends me an email, and I fix what’s 
> wrong. The backup search server is a direct clone of the live server and just 
> as strong, no interruption (aside from the five minute window)
>
> If you need a hand with this I’d gladly help, everything I run is Linux based 
> but it’s a simple curl command and server switch on failure.
>
>> On Jun 8, 2020, at 12:14 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Use the solution described by Walter. This allows you to automatically 
>> restart in case of failure and is also cleaner than defining a cronjob. 
>> Otherwise This would be another dependency one needs to keep in mind - means 
>> if there is an issue and someone does not know the system the person has to 
>> look at different places which never is good
>>
>>> On 04.06.2020 at 18:36, Ryan W <rya...@gmail.com> wrote:
>>>
>>> Does anyone have a script that checks if solr is running and then starts it
>>> if it isn't running? Occasionally my solr stops running even if there has
>>> been no Apache restart. I haven't been able to determine the root cause,
>>> so the next best thing might be to check every 15 minutes or so if it's
>>> running and run it if it has stopped.
>>>
>>> Thanks.


Re: Script to check if solr is running

2020-06-08 Thread David Hastings
>
> Why have a cold backup and then switch?
>

my current set up is:
1. master indexer
2. master slave on a release/commit basis
3. 3 live slave searching nodes in two different data centers


the three live nodes are in front of nginx load balancing and they are
mostly hot but not all of them. I found that having all load go into one made
the performance significantly better, but if one of them goes down there's a
likelihood that the other two went with it. They are also part of a
MySQL Galera cluster and it has a possibility of going down (InnoDB can be
annoying), so the script will go through all three of the live slaves until
it has to fall back to the master slave. I know the cold master will work,
mostly out of faith, but if I lose four servers all at the same time, I
have larger problems to worry about than searching.

just adaptation over time. I can't say it's the best setup, but I can say it
operates pretty well, very well speed-wise, keeping one searcher super hot
with two clones ready to jump in if needed.



On Mon, Jun 8, 2020 at 12:30 PM Walter Underwood 
wrote:

> I could write a script, too, though I’d do it with straight shell code.
> But then I’d have to test it, check it in somewhere, document it for ops,
> install it, ...
>
> Instead, when we switch from monit, I'll start with one of these systemd
> configs.
>
> https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1 <
> https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1>
>
> https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup
> <
> https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup
> >
> https://issues.apache.org/jira/browse/SOLR-14410 <
> https://issues.apache.org/jira/browse/SOLR-14410>
>
> Why have a cold backup and then switch? Every time I see that config, I
> wonder why people don’t have both servers live behind a load balancer. How
> do you know the cold server will work?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 8, 2020, at 9:20 AM, Dave  wrote:
> >
> > A simple Perl script would be able to cover this, I have a cron job Perl
> script that does a search with an expected result, if the result isn’t
> there it fails over to a backup search server, sends me an email, and I fix
> what’s wrong. The backup search server is a direct clone of the live server
> and just as strong, no interruption (aside from the five minute window)
> >
> > If you need a hand with this I’d gladly help, everything I run is Linux
> based but it’s a simple curl command and server switch on failure.
> >
> >> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
> >>
> >> Use the solution described by Walter. This allows you to automatically
> restart in case of failure and is also cleaner than defining a cronjob.
> Otherwise This would be another dependency one needs to keep in mind -
> means if there is an issue and someone does not know the system the person
> has to look at different places which never is good
> >>
> >>> On 04.06.2020 at 18:36, Ryan W wrote:
> >>>
> >>> Does anyone have a script that checks if solr is running and then
> starts it
> >>> if it isn't running?  Occasionally my solr stops running even if there
> has
> >>> been no Apache restart.  I haven't been able to determine the root
> cause,
> >>> so the next best thing might be to check every 15 minutes or so if it's
> >>> running and run it if it has stopped.
> >>>
> >>> Thanks.
>
>


Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Colvin Cowie
Great, thanks Erick

On Mon, 8 Jun 2020 at 13:22, Erick Erickson  wrote:

> It’s _bounded_ by maxDoc/8 + (some overhead). The overhead is
> both the map overhead and the representation of the query.
>
> This is an upper bound, the full bitset is not stored if there
> are few entries that match the filter, in that case the
> doc IDs are stored. Consider if maxDoc is 1M and only 2 docs
> match the query, it’s much more efficient to store two ints
> rather than 1M/8.
>
> You can also limit the RAM used by specifying maxRamMB.
>
> Best,
> Erick
>
> > On Jun 8, 2020, at 4:59 AM, Colvin Cowie 
> wrote:
> >
> > Sorry to hijack this a little bit. Shawn, what's the calculation for the
> > size of the filter cache?
> > Is that 1 bit per document in the core / shard?
> > Thanks
> >
> > On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:
> >
> >> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> >>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
> >> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
> >> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
> >> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
> >> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> >>>
> >>> This was the original query. Since there were lot of sorting fields, we
> >> decided to not do on the solr side, instead fetch the query response
> and do
> >> the sorting outside solr. This eliminated the need of more JVM memory
> which
> >> was allocated. Every time we ran this query, solr would crash exceeding
> the
> >> JVM memory. Now we are only running filter queries.
> >>
> >> What Solr version, and what is the definition of each of the fields
> >> you're sorting on?  If the definition doesn't include docValues, then a
> >> large on-heap memory structure will be created for sorting (VERY large
> >> with 500 million docs), and I wouldn't be surprised if it's created even
> >> if it is never used.  The definition for any field you use for sorting
> >> should definitely include docValues.  In recent Solr versions, docValues
> >> defaults to true for most field types.  Some field classes, TextField in
> >> particular, cannot have docValues.
> >>
> >> There's something else to discuss about sort params -- each sort field
> >> will only be used if ALL of the previous sort fields are identical for
> >> two documents in the full numFound result set.  Having more than two or
> >> three sort fields is usually pointless.  My guess (which I know could be
> >> wrong) is that most queries with this HUGE sort parameter will never use
> >> anything beyond TRACK_ID.
> >>
> >>> And regarding the filter cache, it is in default setup: (we are using
> >> default solrconfig.xml, and we have only added the request handler for
> DIH)
> >>>
> >>> <filterCache size="512"
> >>>              initialSize="512"
> >>>              autowarmCount="0"/>
> >>
> >> This is way too big for your index, and a prime candidate for why your
> >> heap requirements are so high.  Like I said before, if the filterCache
> >> on your system actually reaches this max size, it will require 30GB of
> >> memory JUST for the filterCache on this core.  Can you check the admin
> >> UI to determine what the size is and what hit ratio it's getting? (1.0
> >> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
> >> on this cache.  With a size of 64, a little less than 4GB would be the
> >> max heap allocated for the cache.  You can experiment... but with 500
> >> million docs, the filterCache size should be pretty small.
> >>
> >> You're going to want to carefully digest this part of that wiki page
> >> that I linked earlier.  Hopefully email will preserve this link
> completely:
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
> >>
> >> Thanks,
> >> Shawn
> >>
>
>


Re: Script to check if solr is running

2020-06-08 Thread Walter Underwood
I could write a script, too, though I’d do it with straight shell code. But 
then I’d have to test it, check it in somewhere, document it for ops, install 
it, ...

Instead, when we switch from monit, I'll start with one of these systemd 
configs.

https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1 

https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup 

https://issues.apache.org/jira/browse/SOLR-14410 

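A minimal sketch of such a unit file (install path and user are assumptions, 
adjust to your setup):

[Unit]
Description=Apache Solr
After=network.target

[Service]
Type=forking
User=solr
ExecStart=/opt/solr/bin/solr start
ExecStop=/opt/solr/bin/solr stop
Restart=on-failure

[Install]
WantedBy=multi-user.target

Type=forking because "bin/solr start" daemonizes, and Restart=on-failure is 
what gives you the automatic restart.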

Why have a cold backup and then switch? Every time I see that config, I wonder 
why people don’t have both servers live behind a load balancer. How do you know 
the cold server will work?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 8, 2020, at 9:20 AM, Dave  wrote:
> 
> A simple Perl script would be able to cover this, I have a cron job Perl 
> script that does a search with an expected result, if the result isn’t there 
> it fails over to a backup search server, sends me an email, and I fix what’s 
> wrong. The backup search server is a direct clone of the live server and just 
> as strong, no interruption (aside from the five minute window) 
> 
> If you need a hand with this I’d gladly help, everything I run is Linux based 
> but it’s a simple curl command and server switch on failure. 
> 
>> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
>> 
>> Use the solution described by Walter. This allows you to automatically 
>> restart in case of failure and is also cleaner than defining a cronjob. 
>> Otherwise This would be another dependency one needs to keep in mind - means 
>> if there is an issue and someone does not know the system the person has to 
>> look at different places which never is good 
>> 
>>> On 04.06.2020 at 18:36, Ryan W wrote:
>>> 
>>> Does anyone have a script that checks if solr is running and then starts it
>>> if it isn't running?  Occasionally my solr stops running even if there has
>>> been no Apache restart.  I haven't been able to determine the root cause,
>>> so the next best thing might be to check every 15 minutes or so if it's
>>> running and run it if it has stopped.
>>> 
>>> Thanks.



Re: Script to check if solr is running

2020-06-08 Thread Dave
A simple Perl script would be able to cover this, I have a cron job Perl script 
that does a search with an expected result, if the result isn’t there it fails 
over to a backup search server, sends me an email, and I fix what’s wrong. The 
backup search server is a direct clone of the live server and just as strong, 
no interruption (aside from the five minute window) 

If you need a hand with this I’d gladly help, everything I run is Linux based 
but it’s a simple curl command and server switch on failure. 

> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
> 
> Use the solution described by Walter. This allows you to automatically 
> restart in case of failure and is also cleaner than defining a cronjob. 
> Otherwise This would be another dependency one needs to keep in mind - means 
> if there is an issue and someone does not know the system the person has to 
> look at different places which never is good 
> 
> On 04.06.2020 at 18:36, Ryan W wrote:
>> 
>> Does anyone have a script that checks if solr is running and then starts it
>> if it isn't running?  Occasionally my solr stops running even if there has
>> been no Apache restart.  I haven't been able to determine the root cause,
>> so the next best thing might be to check every 15 minutes or so if it's
>> running and run it if it has stopped.
>> 
>> Thanks.


Re: Script to check if solr is running

2020-06-08 Thread Jörn Franke
Use the solution described by Walter. This allows you to automatically restart 
in case of failure and is also cleaner than defining a cronjob. Otherwise this 
would be another dependency one needs to keep in mind: if there is an issue 
and someone does not know the system, they have to look in different places, 
which is never good.

> On 04.06.2020 at 18:36, Ryan W wrote:
> 
> Does anyone have a script that checks if solr is running and then starts it
> if it isn't running?  Occasionally my solr stops running even if there has
> been no Apache restart.  I haven't been able to determine the root cause,
> so the next best thing might be to check every 15 minutes or so if it's
> running and run it if it has stopped.
> 
> Thanks.


Atomic updates with add-distinct in Solr 7 cloud

2020-06-08 Thread Thomas Corthals
Hi

I'm trying to do atomic updates with an 'add-distinct' modifier in a Solr 7
cloud. It seems to behave like an 'add' and I end up with double values in
my multiValued field. This only happens with multiple values for the field
in an update (cat:{"add-distinct":["a","b","d"]} exhibits this
problem, cat:{"add-distinct":"a"} doesn't). When running the same update
request with a single core, or a Solr 8 cloud, I get the expected result.

This is a minimal test case with Solr 7.7.3 in cloud mode, 2 nodes, a
collection with shard count 1 and replicationFactor 2, using the
techproducts configset.

$ curl -X POST -H 'Content-Type: text/json' '
http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
'[{"id":123,cat:["a","b","c"]}]'
{
  "responseHeader":{
"rf":2,
"status":0,
"QTime":75}}

$ curl -X POST -H 'Content-Type: text/json' '
http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
'[{"id":123,cat:{"add-distinct":["a","b","d"]}}]'
{
  "responseHeader":{
"rf":2,
"status":0,
"QTime":81}}

$ curl '
http://localhost:8983/solr/techproducts/select?q=id%3A123&omitHeader=true'
{
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"123",
"cat":["a",
  "b",
  "c",
  "a",
  "b",
  "d"],
"_version_":1668919799351083008}]
  }}

Is this a known issue or am I missing something here?

Kind regards

Thomas Corthals


Re: Script to check if solr is running

2020-06-08 Thread Ryan W
"A simple cronjob with <solr directory>/bin/solr status and <solr directory>/bin/solr start should do the trick."

I don't know what that would look like.  Wouldn't the job have to check the
status and only give the start command if solr isn't running?  I don't
think it's possible to put logic in a cron job. I think it would have to be
in a script, with a cron job to run the script.  I've never had cause to
write such a script, though, so I don't know how it's done.
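
For what it's worth, the script half of that is only a few lines. A rough 
sketch (assumes Solr installed in /opt/solr and listening on 8983; the script 
name and log path below are made up):

#!/bin/sh
# Probe a lightweight admin endpoint; any failure means Solr is unreachable.
if ! curl -sf "http://localhost:8983/solr/admin/info/system" > /dev/null; then
    /opt/solr/bin/solr start
fi

plus a crontab entry such as "*/15 * * * * /usr/local/bin/check-solr.sh" to 
run it every 15 minutes.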


On Fri, Jun 5, 2020 at 11:19 AM Dunigan, Craig A. <
craig.duni...@landsend.com> wrote:

> A simple cronjob with <solr directory>/bin/solr status and <solr directory>/bin/solr start should do the trick.  There must be a Windows
> equivalent if that’s what you’re using.
>
> From: Ryan W 
> Sent: Thursday, June 4, 2020 11:39 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Script to check if solr is running
>
> Or is it not much overhead to give the command to start solr if it is
> already running? Maybe it's not necessary to check if it's running? Is
> there any downside to giving the start command every 15 minutes or so
> whether it is running or not?
>
> Thanks.
>
> On Thu, Jun 4, 2020 at 12:36 PM Ryan W <rya...@gmail.com> wrote:
>
> > Does anyone have a script that checks if solr is running and then starts
> > it if it isn't running? Occasionally my solr stops running even if there
> > has been no Apache restart. I haven't been able to determine the root
> > cause, so the next best thing might be to check every 15 minutes or so if
> > it's running and run it if it has stopped.
> >
> > Thanks.
> >
>


Re: How to determine why solr stops running?

2020-06-08 Thread Ryan W
"If Solr auto-restarts"

It doesn't auto-restart.  Is there some auto-restart functionality?  I'm
not aware of that.

On Mon, Jun 8, 2020 at 7:10 AM Radu Gheorghe 
wrote:

> Hi Ryan,
>
> If Solr auto-restarts, I suppose it's systemd doing that. When it restarts
> the Solr service, systemd should log this (maybe something like: journalctl
> --no-pager | grep -i solr).
>
> Then you can go in your Solr logs and check what happened right before that
> time. Also, check system logs for what happened before Solr was restarted.
>
> Best regards,
> Radu
>
> https://sematext.com/
>
> On Thu, 4 Jun 2020 at 19:24, Ryan W wrote:
>
> > Happened again today. Solr stopped running. Apache hasn't stopped in 10
> > days, so this is not due to a server reboot.
> >
> > Solr is not being run with the oom-killer.  And when I grep for ERROR in
> > the logs, there is nothing from today.
> >
> > On Mon, May 18, 2020 at 3:15 PM James Greene <
> ja...@jamesaustingreene.com>
> > wrote:
> >
> > > I usually do a combination of grepping for ERROR in solr logs and
> > checking
> > > journalctl to see if an external program may have killed the process.
> > >
> > >
> > > Cheers,
> > >
> > > /
> > > *   James Austin Greene
> > > *  www.jamesaustingreene.com
> > > *  336-lol-nerd
> > > /
> > >
> > >
> > > On Mon, May 18, 2020 at 1:39 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > > > ps aux | grep solr
> > > >
> > > > on a *.nix system will show you all the runtime parameters.
> > > >
> > > > > On May 18, 2020, at 12:46 PM, Ryan W  wrote:
> > > > >
> > > > > Is there a config file containing the start params?  I run solr
> > like...
> > > > >
> > > > > bin/solr start
> > > > >
> > > > > I have not seen anything in the logs that seems informative. When I
> > > grep
> > > > in
> > > > > the logs directory for 'memory', I see nothing besides a couple
> > entries
> > > > > like...
> > > > >
> > > > > 2020-05-14 13:05:56.155 INFO  (main) [   ]
> > > > o.a.s.h.a.MetricsHistoryHandler
> > > > > No .system collection, keeping metrics history in memory.
> > > > >
> > > > > I don't know what that entry means, though the date does roughly
> > > coincide
> > > > > with the last time solr stopped running.
> > > > >
> > > > > Thank you.
> > > > >
> > > > >
> > > > > On Mon, May 18, 2020 at 12:00 PM Erick Erickson <
> > > erickerick...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Probably, but check that you are running with the oom-killer,
> it'll
> > be
> > > > in
> > > > >> your start params.
> > > > >>
> > > > >> But absent that, something external will be the culprit, Solr
> > doesn't
> > > > stop
> > > > >> by itself. Do look at the Solr log once things stop, it should
> show
> > if
> > > > >> someone or something stopped it.
> > > > >>
> > > > >> On Mon, May 18, 2020, 10:43 Ryan W  wrote:
> > > > >>
> > > > >>> I don't see any log file with "oom" in the file name.  Does that
> > mean
> > > > >> there
> > > > >>> hasn't been an out-of-memory issue?  Thanks.
> > > > >>>
> > > > >>> On Thu, May 14, 2020 at 10:05 AM James Greene <
> > > > >> ja...@jamesaustingreene.com
> > > > 
> > > > >>> wrote:
> > > > >>>
> > > >  Check the log for for an OOM crash.  Fatal exceptions will be in
> > the
> > > > >> main
> > > >  solr log and out of memory errors will be in their own -oom log.
> > > > 
> > > >  I've encountered quite a few solr crashes and usually it's when
> > > > >> there's a
> > > >  threshold of concurrent users and/or indexing happening.
> > > > 
> > > > 
> > > > 
> > > >  On Thu, May 14, 2020, 9:23 AM Ryan W  wrote:
> > > > 
> > > > > Hi all,
> > > > >
> > > > > I manage a site where solr has stopped running a couple times
> in
> > > the
> > > > >>> past
> > > > > week. The server hasn't been rebooted, so that's not the
> reason.
> > > > >> What
> > > >  else
> > > > > causes solr to stop running?  How can I investigate why this is
> > > >  happening?
> > > > >
> > > > > Thank you,
> > > > > Ryan
> > > > >
> > > > 
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>


Re: Highlighting values of non stored fields

2020-06-08 Thread Erick Erickson
When highlighting, the stored data for the field is re-analyzed against the 
query based on the field you’re highlighting. My bet is that if you query just 
“q=doc_text:mosh” you will not get a hit. Check your text_ws fieldType, it’s 
probably case sensitive. So if you changed the doc_text type to text_general 
(the same as your dynamic field), I think you’d be fine. re-index your data of 
course….
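
In other words, something like this, assuming the stock text_general type is 
in your schema:

<field name="doc_text" type="text_general" indexed="true" stored="true"/>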

I’ll add by-the-by that text_ws is fairly restricted, and is rarely useful 
for searching on anything humans have to key in. It’ll include punctuation, for 
instance: input like “dog dog.” will produce two tokens, one with a period 
in the token and one without. It’s most useful for heavily-preprocessed data 
where the app normalizes the input or machine-generated input.

There’s no reason, BTW, to index your doc_text for highlighting purposes since 
the stored data is what counts. Unless, of course, you want to search on that 
field specifically.

Best,
Erick

> On Jun 7, 2020, at 11:32 PM, mosh bla  wrote:
> 
> 
> Thanks Erick for the reply. Your answer is eaxctly what I was expecting from 
> the highlight component but it seems like I am getting different behaviour.
> I'll try to give a simple example and I hope you can explain where is my 
> mistake.
> Say I have the following fields configuration:
> <field name="doc_text" type="text_ws" indexed="true" stored="true"/>
> <dynamicField name="*_lw" type="text_general" indexed="true" stored="false"/>
> <copyField source="doc_text" dest="doc_text_lw"/>
> 
> And I indexed the following document:
> {
>"doc_text": "MOSH"
> }
> 
> When executing the following query 
> "http://.../select?q=doc_text_lw:mosh=true=doc_text; - the document 
> "http://.../select?q=doc_text_lw:mosh&hl=true&hl.fl=doc_text" - the document 
> I also tried to change 'hl.method' param to 'unified' and 'fastVector' but no 
> luck either. My conclusion was that 'hl.fl' param should be set to 
> 'doc_text_lw' and it must be also stored...
>  
>  
>  
> 
> Sent: Tuesday, June 02, 2020 at 3:15 PM
> From: "Erick Erickson" 
> To: solr-user@lucene.apache.org
> Subject: Re: Highlighting values of non stored fields
> Why do you think even variants need to be stored/highlighted? Usually
> when you store variants for ranking purposes those extra copies are
> invisible to the user. So most often people store exactly one copy
> of a particular field and highlight _that_ field in the return.
> 
> So say my field is f1 and I have indexed f1_1, f1_2, f1_3. I just store
> f1_1 and return the highlighted text from that one.
> 
> You could even just store the data only once in a field that’s never
> indexed and return/highlight that if you wanted.
> 
> Best,
> Erick
> 
>> On Jun 2, 2020, at 3:24 AM, mosheB  wrote:
>> 
>> Our use case is as follows:
>> We are indexing free text documents. Each document contains metadata fields
>> (such as author, creation date...) which are kinda small, and one "big"
>> field that holds the document's text itself.
>> 
>> For ranking purposes each field is indexed in more than one "variation" and
>> query is executed with edismax query parser. Things are working alright, but
>> now a new feature is requested by the customer - highlighting.
>> To enable highlighting every field must be stored, including all variations
>> of the big text field. This pushes our storage to the limit (and probably
>> the document cache...) and feels a bit redundant, as the stored value is
>> duplicated n times... Is there any way to “reference” stored value from one
>> field to another?
>> For example:
>> Say we have the following config:
>> <field name="doc_text" type="text_general" indexed="true" stored="true"/>
>> <field name="doc_text_bigrams" type="text_bigrams" indexed="true" stored="false"/>
>> <field name="doc_text_phrases" type="text_phrases" indexed="true" stored="false"/>
>> <copyField source="doc_text" dest="doc_text_bigrams"/>
>> <copyField source="doc_text" dest="doc_text_phrases"/>
>> 
>> And we execute the following query:
>> http://.../select?defType=edismax&q=desired_terms&qf=doc_text^2
>> doc_text_bigrams^3
>> doc_text_phrases^4&hl=on&hl.fl=doc_text,doc_text_bigrams,doc_text_phrases
>> 
>> Highlight fragments in response will be blank if match occurred on the
>> non-stored fields (doc_text_bigrams or doc_text_phrases). Is it possible to
>> pass extra parameter to the highlight component, to point it to the stored
>> data of the “original” doc_text field? a kind of “stored value reference
>> field”?
>> 
>> Thanks in advance.
>> 
>> 
>> 
>> --
>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>  



Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Erick Erickson
It’s _bounded_ by maxDoc/8 + (some overhead). The overhead is
both the map overhead and the representation of the query.

This is an upper bound, the full bitset is not stored if there
are few entries that match the filter, in that case the
doc IDs are stored. Consider if maxDoc is 1M and only 2 docs
match the query, it’s much more efficient to store two ints
rather than 1M/8.

You can also limit the RAM used by specifying maxRamMB.
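
To make that concrete with the numbers from this thread (assuming maxDoc is 
roughly 500 million, as mentioned earlier): 500,000,000 / 8 = 62,500,000 
bytes, so each cached filter costs about 60MB, and a filterCache with 
size=512 tops out around 512 * 60MB = roughly 30GB -- which is where Shawn's 
30GB figure comes from.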

Best,
Erick

> On Jun 8, 2020, at 4:59 AM, Colvin Cowie  wrote:
> 
> Sorry to hijack this a little bit. Shawn, what's the calculation for the
> size of the filter cache?
> Is that 1 bit per document in the core / shard?
> Thanks
> 
> On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:
> 
>> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
>>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
>> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
>> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
>> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
>> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>>> 
>>> This was the original query. Since there were lot of sorting fields, we
>> decided to not do on the solr side, instead fetch the query response and do
>> the sorting outside solr. This eliminated the need of more JVM memory which
>> was allocated. Every time we ran this query, solr would crash exceeding the
>> JVM memory. Now we are only running filter queries.
>> 
>> What Solr version, and what is the definition of each of the fields
>> you're sorting on?  If the definition doesn't include docValues, then a
>> large on-heap memory structure will be created for sorting (VERY large
>> with 500 million docs), and I wouldn't be surprised if it's created even
>> if it is never used.  The definition for any field you use for sorting
>> should definitely include docValues.  In recent Solr versions, docValues
>> defaults to true for most field types.  Some field classes, TextField in
>> particular, cannot have docValues.
>> 
>> There's something else to discuss about sort params -- each sort field
>> will only be used if ALL of the previous sort fields are identical for
>> two documents in the full numFound result set.  Having more than two or
>> three sort fields is usually pointless.  My guess (which I know could be
>> wrong) is that most queries with this HUGE sort parameter will never use
>> anything beyond TRACK_ID.
>> 
>>> And regarding the filter cache, it is in default setup: (we are using
>> default solrconfig.xml, and we have only added the request handler for DIH)
>>> 
>>> <filterCache size="512"
>>>              initialSize="512"
>>>              autowarmCount="0"/>
>> 
>> This is way too big for your index, and a prime candidate for why your
>> heap requirements are so high.  Like I said before, if the filterCache
>> on your system actually reaches this max size, it will require 30GB of
>> memory JUST for the filterCache on this core.  Can you check the admin
>> UI to determine what the size is and what hit ratio it's getting? (1.0
>> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
>> on this cache.  With a size of 64, a little less than 4GB would be the
>> max heap allocated for the cache.  You can experiment... but with 500
>> million docs, the filterCache size should be pretty small.
>> 
>> You're going to want to carefully digest this part of that wiki page
>> that I linked earlier.  Hopefully email will preserve this link completely:
>> 
>> 
>> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
>> 
>> Thanks,
>> Shawn
>> 



Re: Getting to grips with auto-scaling

2020-06-08 Thread Radu Gheorghe
Hi Tom,

To your last two questions, I'd like to float an alternative design: have
dedicated "hot" and "warm" nodes. That is, 2020+lists will go to the hot
tier, and 2019, 2018, 2017+lists go to the warm tier.

Then you can scale the hot tier based on your query load. For the warm
tier, I assume there will be less need for scaling, and if it is, I guess
it's less important for shards of each index to be perfectly balanced (so a
simple "make sure cores are evenly distributed" should be enough).
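
(In autoscaling terms that can be as small as a cluster preference along the 
lines of {"minimize": "cores"} -- quoting the syntax from memory, so 
double-check it against the ref guide.)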

Granted, this design isn't as flexible as the one you suggested, but it's
simpler. So simple that I've seen it done without autoscaling (just a few
scripts from when you add nodes in each tier).

Best regards,
Radu

https://sematext.com

On Fri, 5 Jun 2020 at 21:59, Tom Evans wrote:

> Hi
>
> I'm trying to get a handle on the newer auto-scaling features in Solr.
> We're in the process of upgrading an older SolrCloud cluster from 5.5
> to 8.5, and re-architecture it slightly to improve performance and
> automate operations.
>
> If I boil it down slightly, currently we have two collections, "items"
> and "lists". Both collections have just one shard. We publish new data
> to "items" once each day, and our users search and do analysis on
> them, whilst "lists" contains NRT user-specified collections of ids
> from items, which we join to from "items" in order to allow them to
> restrict their searches/analysis to just docs in their curated lists.
>
> Most of our searches have specific date ranges in them, usually only
> from the last 3 years or so, but sometimes we need to do searches
> across all the data. With the new setup, we want to:
>
> * shard by date (year) to make the hottest data available in smaller shards
> * have more nodes with these shards than we do of the older data.
> * be able to add/remove nodes predictably based upon our clients
> (predictable) query load
> * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> indexing load for "items" and have NRT for "lists".
> * spread cores across two AZ
>
> With that in mind, I came up with a bunch of simplified rules for
> testing, with just 4 shards for "items":
>
> * "lists" collection has one NRT replica on each node
> * "items" collection shard 2020 has one TLOG replica on each node
> * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> * "items" collection shards 2018 and 2017 each have one TLOG replica
> on 50% of nodes
> * all shards have at least 2 replicas if number of nodes > 1
> * no node should have 2 replicas of the same shard
> * number of cores should be balanced across nodes
>
> Eg, with 1 node, I want to see this topology:
> A: items: 2020, 2019, 2018, 2017 + lists
>
> with 2 nodes:
> A: items: 2020, 2019, 2018, 2017 + lists
> B: items: 2020, 2019, 2018, 2017 + lists
>
> and if I add two more nodes:
> A: items: 2020, 2019, 2018 + lists
> B: items: 2020, 2019, 2017 + lists
> C: items: 2020, 2019, 2017 + lists
> D: items: 2020, 2018 + lists
>
> To the questions:
>
> * The type of replica created when nodeAdded is triggered can't be set
> per collection. Either everything gets NRT or everything gets TLOG.
> Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> will add NRT replicas if configured that way.
> * I'm having difficulty expressing these rules in terms of a policy -
> I can't seem to figure out a way to specify the number of replicas for
> a shard based upon the total number of nodes.
> * Is this beyond the current scope of autoscaling triggers/policies?
> Should I instead use the trigger with a custom plugin action (or to
> trigger a web hook) to be a bit more intelligent?
> * Am I wasting my time trying to ensure there are more replicas of the
> hotter shards than the colder shards? It seems to add a lot of
> complexity - should I just instead think that they aren't getting
> queried much, so won't be using up cache space that the hot shards
> will be using. Disk space is pretty cheap after all (total size for
> "items" + "lists" is under 60GB).
>
> Cheers
>
> Tom
>


Re: How to determine why solr stops running?

2020-06-08 Thread Radu Gheorghe
Hi Ryan,

If Solr auto-restarts, I suppose it's systemd doing that. When it restarts
the Solr service, systemd should log this (maybe something like: journalctl
--no-pager | grep -i solr).

Then you can go in your Solr logs and check what happened right before that
time. Also, check system logs for what happened before Solr was restarted.
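
Concretely, that check might look like this (assuming a systemd install with 
the service named "solr"; log paths vary by distro):

journalctl -u solr --no-pager | tail -n 50
grep -i "killed process" /var/log/messages

The second command catches the kernel's OOM killer, which logs to the system 
log rather than to Solr's own logs.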

Best regards,
Radu

https://sematext.com/

On Thu, 4 Jun 2020 at 19:24, Ryan W wrote:

> Happened again today. Solr stopped running. Apache hasn't stopped in 10
> days, so this is not due to a server reboot.
>
> Solr is not being run with the oom-killer.  And when I grep for ERROR in
> the logs, there is nothing from today.
>
> On Mon, May 18, 2020 at 3:15 PM James Greene 
> wrote:
>
> > I usually do a combination of grepping for ERROR in solr logs and
> checking
> > journalctl to see if an external program may have killed the process.
> >
> >
> > Cheers,
> >
> > /
> > *   James Austin Greene
> > *  www.jamesaustingreene.com
> > *  336-lol-nerd
> > /
> >
> >
> > On Mon, May 18, 2020 at 1:39 PM Erick Erickson 
> > wrote:
> >
> > > ps aux | grep solr
> > >
> > > on a *.nix system will show you all the runtime parameters.
> > >
> > > > On May 18, 2020, at 12:46 PM, Ryan W  wrote:
> > > >
> > > > Is there a config file containing the start params?  I run solr
> like...
> > > >
> > > > bin/solr start
> > > >
> > > > I have not seen anything in the logs that seems informative. When I
> > grep
> > > in
> > > > the logs directory for 'memory', I see nothing besides a couple
> entries
> > > > like...
> > > >
> > > > 2020-05-14 13:05:56.155 INFO  (main) [   ]
> > > o.a.s.h.a.MetricsHistoryHandler
> > > > No .system collection, keeping metrics history in memory.
> > > >
> > > > I don't know what that entry means, though the date does roughly
> > coincide
> > > > with the last time solr stopped running.
> > > >
> > > > Thank you.
> > > >
> > > >
> > > > On Mon, May 18, 2020 at 12:00 PM Erick Erickson <
> > erickerick...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Probably, but check that you are running with the oom-killer, it'll
> be
> > > in
> > > >> your start params.
> > > >>
> > > >> But absent that, something external will be the culprit, Solr
> doesn't
> > > stop
> > > >> by itself. Do look at the Solr log once things stop, it should show
> if
> > > >> someone or something stopped it.
> > > >>
> > > >> On Mon, May 18, 2020, 10:43 Ryan W  wrote:
> > > >>
> > > >>> I don't see any log file with "oom" in the file name.  Does that
> mean
> > > >> there
> > > >>> hasn't been an out-of-memory issue?  Thanks.
> > > >>>
> > > >>> On Thu, May 14, 2020 at 10:05 AM James Greene <
> > > >> ja...@jamesaustingreene.com
> > > 
> > > >>> wrote:
> > > >>>
> > >  Check the log for for an OOM crash.  Fatal exceptions will be in
> the
> > > >> main
> > >  solr log and out of memory errors will be in their own -oom log.
> > > 
> > >  I've encountered quite a few solr crashes and usually it's when
> > > >> there's a
> > >  threshold of concurrent users and/or indexing happening.
> > > 
> > > 
> > > 
> > >  On Thu, May 14, 2020, 9:23 AM Ryan W  wrote:
> > > 
> > > > Hi all,
> > > >
> > > > I manage a site where solr has stopped running a couple times in
> > the
> > > >>> past
> > > > week. The server hasn't been rebooted, so that's not the reason.
> > > >> What
> > >  else
> > > > causes solr to stop running?  How can I investigate why this is
> > >  happening?
> > > >
> > > > Thank you,
> > > > Ryan
> > > >
> > > 
> > > >>>
> > > >>
> > >
> > >
> >
>


Re: unified highlighter performance in solr 8.5.1

2020-06-08 Thread Michal Hlavac
Hi David,

sorry for my late answer. I created simple test scenarios on github: 
https://github.com/hlavki/solr-unified-highlighter-test
There are 2 documents, both bigger sized.
Test method: 
https://github.com/hlavki/solr-unified-highlighter-test/blob/master/src/test/java/com/example/HighlightTest.java#L60
 

The result is that with hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0, response 
times are similar to Solr 8.4.1.
I didn't expect that default configuration values would change response times 
that drastically.

m.

On Wednesday, May 27, 2020 at 9:14:37 CEST David Smiley wrote:


try setting hl.fragsizeIsMinimum=true
I did some benchmarking and found that this helps quite a bit




BTW I used the highlights.alg benchmark file, with some changes to make it more 
reflective of your scenario -- offsets in postings, and used "enwiki" (english 
wikipedia) docs which are larger than the Reuters ones (so it appears, any 
way).  I had to do a bit of hacking to use the "LengthGoalBreakIterator, which 
wasn't previously used by this framework.


~ David



On Tue, May 26, 2020 at 4:42 PM Michal Hlavac  wrote:


fine, I'll try to write a simple test, thanks
 
On Tuesday, May 26, 2020 at 17:44:52 CEST David Smiley wrote:
> Please create an issue.  I haven't reproduced it yet but it seems unlikely
> to be user-error.
> 
> ~ David
> 
> 
> On Mon, May 25, 2020 at 9:28 AM Michal Hlavac <miso@hlavki.eu> wrote:
> 
> > Hi,
> >
> > I have field:
> >  > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> >
> > and configuration:
> > true
> > unified
> > true
> > content_txt_sk_highlight
> > 2
> > true
> >
> > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > is really slow.
> > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> >
> > is this normal behaviour or should I create issue?
> >
> > thanks, m.
> >
> 







Re: Solr takes time to warm up core with huge data

2020-06-08 Thread Colvin Cowie
Sorry to hijack this a little bit. Shawn, what's the calculation for the
size of the filter cache?
Is that 1 bit per document in the core / shard?
Thanks

On Fri, 5 Jun 2020 at 17:20, Shawn Heisey  wrote:

> On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:
> > q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO
> *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000&sort=MODIFY_TS
> desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1
> asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6
> asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> >
> > This was the original query. Since there were lot of sorting fields, we
> decided to not do on the solr side, instead fetch the query response and do
> the sorting outside solr. This eliminated the need of more JVM memory which
> was allocated. Every time we ran this query, solr would crash exceeding the
> JVM memory. Now we are only running filter queries.
>
> What Solr version, and what is the definition of each of the fields
> you're sorting on?  If the definition doesn't include docValues, then a
> large on-heap memory structure will be created for sorting (VERY large
> with 500 million docs), and I wouldn't be surprised if it's created even
> if it is never used.  The definition for any field you use for sorting
> should definitely include docValues.  In recent Solr versions, docValues
> defaults to true for most field types.  Some field classes, TextField in
> particular, cannot have docValues.
>
> There's something else to discuss about sort params -- each sort field
> will only be used if ALL of the previous sort fields are identical for
> two documents in the full numFound result set.  Having more than two or
> three sort fields is usually pointless.  My guess (which I know could be
> wrong) is that most queries with this HUGE sort parameter will never use
> anything beyond TRACK_ID.
>
> > And regarding the filter cache, it is in default setup: (we are using
> default solrconfig.xml, and we have only added the request handler for DIH)
> >
> > <filterCache size="512"
> >              initialSize="512"
> >              autowarmCount="0"/>
>
> This is way too big for your index, and a prime candidate for why your
> heap requirements are so high.  Like I said before, if the filterCache
> on your system actually reaches this max size, it will require 30GB of
> memory JUST for the filterCache on this core.  Can you check the admin
> UI to determine what the size is and what hit ratio it's getting? (1.0
> is 100% on the hit ratio).  I'd probably start with a size of 32 or 64
> on this cache.  With a size of 64, a little less than 4GB would be the
> max heap allocated for the cache.  You can experiment... but with 500
> million docs, the filterCache size should be pretty small.
>
> You're going to want to carefully digest this part of that wiki page
> that I linked earlier.  Hopefully email will preserve this link completely:
>
>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements
>
> Thanks,
> Shawn
>


RE: CDCR behaviour

2020-06-08 Thread Gell-Holleron, Daniel
Hi Jason,

Thanks for this. Without screenshots this is what I get:
Site A
Last Modified: less than a minute ago
Num Docs: 5455
Max Doc: 5524
Heap Memory Usage: -1
Deleted Docs: 69
Version: 699
Segment Count: 3
Current: Y

Site B
Last Modified: 3 days ago
Num Docs: 5454
Max Doc: 5523
Heap Memory Usage: -1
Deleted Docs: 69
Version: 640
Segment Count: 3
Current: N

I noticed that if I run the command 
http://hostname:8983/solr/SiteB-Collection/update/?commit=true the index would 
then be current. 

I've messed around with auto commit settings in the solrconfig.xml file but had 
no success.
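
For reference, this is the sort of block I've been adjusting (the values here 
are just examples, not our real settings):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

plus an <autoSoftCommit> with a shorter maxTime, so new documents become 
visible without an explicit commit=true.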

Any help would be greatly appreciated. 

Thanks, 

Daniel 

-Original Message-
From: Jason Gerlowski  
Sent: 05 June 2020 12:18
To: solr-user@lucene.apache.org
Subject: Re: CDCR behaviour

Hi Daniel,

Just a heads up that attachments and images are stripped pretty aggressively by 
the mailing list - none of your images made it through.
You might more success linking to the images in Dropbox or some other online 
storage medium.

Best,

Jason

On Thu, Jun 4, 2020 at 10:55 AM Gell-Holleron, Daniel < 
daniel.gell-holle...@gb.unisys.com> wrote:

> Hi,
>
>
>
> Looking for some advice; I've sent a few questions on CDCR over the last couple of 
> days.
>
>
>
> I just want to see if this is expected behavior from Solr or not?
>
>
>
> When a document is added to Site A, it is then supposed to replicate 
> across, however in the statistics page I see the following:
>
>
>
> Site A
>
>
>
>
> Site B
>
>
>
>
>
> When I perform a search on Site B through the Solr admin page, I do 
> get results (which I find strange). The only way for the numb docs 
> parameter to be matching is restart Solr, I then get the below:
>
>
>
>
>
> I just want to know whether this behavior is expected or is a bug. My 
> expectation is that the data will always be current between the two sites.
>
>
>
> Thanks,
>
> Daniel
>
>
>