Re: Highlighting large text fields

2021-01-12 Thread Shaun Campbell
Hi David

Just reindexed everything and it appears to be performing well and giving
me highlights for the matched text.

Thanks for your help.
Shaun

On Tue, 12 Jan 2021, 21:00 David Smiley,  wrote:

> The last update to highlighting that I think is pertinent to
> whether highlights match or not is v7.6 which added that hl.weightMatches
> option.  So I recommend upgrading to at least that if you want to
> experiment further.  But... hl.weightMatches highlights more accurately and
> as such is more likely to not highlight as much as you are highlighting
> now, and highlighting more appears to be your goal right now.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Jan 12, 2021 at 2:45 PM Shaun Campbell 
> wrote:
>
> > That's great David.  So hl.maxAnalyzedChars isn't that critical. I'll
> whack
> > it right up and see what happens.
> >
> > I'm running 7.4 from a few years ago. Should I upgrade?
> >
> > For your info this is what I'm doing with Solr
> > https://dev.fundingawards.nihr.ac.uk/search.
> >
> > Thanks
> > Shaun
> >
> > On Tue, 12 Jan 2021 at 19:33, David Smiley  wrote:
> >
> > > On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell <
> campbell.sh...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi David
> > > >
> > > > Getting closer now.
> > > >
> > > > First of all, a bit of a mistake on my part. I have two cores set up
> > and
> > > I
> > > > was changing the solrconfig.xml on the wrong core doh!!  That's why
> > > > highlighting wasn't being turned off.
> > > >
> > > > I think I've got the unified highlighter working.
> > > > storeOffsetsWithPositions was already configured on my field type
> > > > definition, not the field definition, so that was ok.
> > > >
> > > > What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> > > > highlighting on some records and not others, making it confusing as
> to
> > > > where the match is with my dismax parser.  I increased
> > > > my hl.maxAnalyzedChars to 130 and now it's highlighting more
> > records.
> > > > Two questions:
> > > >
> > > > 1. Have you any guidelines as to what could be a
> > > > maximum hl.maxAnalyzedChars without impacting performance or memory?
> > > >
> > >
> > > With storeOffsetsWithPositions, highlighting is super-fast, and so this
> > > hl.maxAnalyzedChars threshold is of marginal utility, like only to cap the
> > > amount of memory used if you have some truly humongous docs and it's okay
> > > to only highlight the first X megabytes of them.  Maybe set it to 100MB
> > > worth of text, or something like that.
> > >
> > >
> > > > 2. Do you know a way to query the maximum length of text in a field
> so
> > > that
> > > > I can set hl.maxAnalyzedChars accordingly?  Just thinking I can
> > probably
> > > > modify my java indexer to log the maximum content length.  Actually,
> I
> > > > probably don't want the maximum but some value that highlights 90-95%
> > > > records
> > > >
> > >
> > > Eh... not really.  Maybe some approximation hacks involving function
> > > queries on norms but I'd not bother in favor of just using a high
> > threshold
> > > such that this won't be an issue.
> > >
> > > All this said, this threshold is *not* the only reason why you might
> not
> > be
> > > getting highlights that you expect.  If you are using a recent Solr
> > > version, you might try toggling the hl.weightMatches boolean, which
> could
> > > make a difference for certain query arrangements.  There's a JIRA issue
> > > pertaining to this one, and I haven't investigated it yet.
> > >
> > > ~ David
> > >
> > >
> > > >
> > > > Thanks
> > > > Shaun
> > > >
> > > > On Tue, 12 Jan 2021 at 16:30, David Smiley 
> wrote:
> > > >
> > > > > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell <
> > > campbell.sh...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi David
> > > > > >
> > > > > > First of all I wanted to say I'm working off your book!!  Third
> > > > edition,
> > > > > > and I think it's a bit out of date now. I was just going to try
> > > > following
> > > > > > the section on the Postings highlighter, but I see that's been
> > > absorbed
> > > > > > into the Unified highlighter. I find your book easier to follow
> > than
> > > > the
> > > > > > official documentation though.
> > > > > >
> > > > >
> > > > > Thanks :-D.  I do maintain the Solr Reference Guide for the parts
> of
> > > > code I
> > > > > touch, including highlighting, so I hope what's there makes sense
> > too.
> > > > >
> > > > >
> > > > > > I am going to try to configure the unified highlighter, and I
> will
> > > add
> > > > > that
> > > > > > storeOffsetsWithPositions to the schema (which I saw in your
> book)
> > > and
> > > > I
> > > > > > will try indexing again from scratch.  Was getting some funny
> > things
> > > > > going
> > > > > > on where I thought I'd turned highlighting off and it was still
> > > giving
> > > > me
> > > > > > highlights.
> > > > > >
> > > > >
> > > > > 

Re: Highlighting large text fields

2021-01-12 Thread David Smiley
The last update to highlighting that I think is pertinent to
whether highlights match or not is v7.6 which added that hl.weightMatches
option.  So I recommend upgrading to at least that if you want to
experiment further.  But... hl.weightMatches highlights more accurately and
as such is more likely to not highlight as much as you are highlighting
now, and highlighting more appears to be your goal right now.
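
For example (parameter values here are only an illustration, not something I have
tested against your setup; hl.fl names whatever field you highlight), on 7.6+ you
can flip it per request:

  /select?q=...&hl=true&hl.method=unified&hl.fl=content&hl.weightMatches=false

so you can compare what gets highlighted with it on and off before settling on one.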

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jan 12, 2021 at 2:45 PM Shaun Campbell 
wrote:

> That's great David.  So hl.maxAnalyzedChars isn't that critical. I'll whack
> it right up and see what happens.
>
> I'm running 7.4 from a few years ago. Should I upgrade?
>
> For your info this is what I'm doing with Solr
> https://dev.fundingawards.nihr.ac.uk/search.
>
> Thanks
> Shaun
>
> On Tue, 12 Jan 2021 at 19:33, David Smiley  wrote:
>
> > On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell  >
> > wrote:
> >
> > > Hi David
> > >
> > > Getting closer now.
> > >
> > > First of all, a bit of a mistake on my part. I have two cores set up
> and
> > I
> > > was changing the solrconfig.xml on the wrong core doh!!  That's why
> > > highlighting wasn't being turned off.
> > >
> > > I think I've got the unified highlighter working.
> > > storeOffsetsWithPositions was already configured on my field type
> > > definition, not the field definition, so that was ok.
> > >
> > > What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> > > highlighting on some records and not others, making it confusing as to
> > > where the match is with my dismax parser.  I increased
> > > my hl.maxAnalyzedChars to 130 and now it's highlighting more
> records.
> > > Two questions:
> > >
> > > 1. Have you any guidelines as to what could be a
> > > maximum hl.maxAnalyzedChars without impacting performance or memory?
> > >
> >
> > With storeOffsetsWithPositions, highlighting is super-fast, and so this
> > hl.maxAnalyzedChars threshold is of marginal utility, like only to cap the
> > amount of memory used if you have some truly humongous docs and it's okay
> > to only highlight the first X megabytes of them.  Maybe set it to 100MB
> > worth of text, or something like that.
> >
> >
> > > 2. Do you know a way to query the maximum length of text in a field so
> > that
> > > I can set hl.maxAnalyzedChars accordingly?  Just thinking I can
> probably
> > > modify my java indexer to log the maximum content length.  Actually, I
> > > probably don't want the maximum but some value that highlights 90-95%
> > > records
> > >
> >
> > Eh... not really.  Maybe some approximation hacks involving function
> > queries on norms but I'd not bother in favor of just using a high
> threshold
> > such that this won't be an issue.
> >
> > All this said, this threshold is *not* the only reason why you might not
> be
> > getting highlights that you expect.  If you are using a recent Solr
> > version, you might try toggling the hl.weightMatches boolean, which could
> > make a difference for certain query arrangements.  There's a JIRA issue
> > pertaining to this one, and I haven't investigated it yet.
> >
> > ~ David
> >
> >
> > >
> > > Thanks
> > > Shaun
> > >
> > > On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:
> > >
> > > > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell <
> > campbell.sh...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi David
> > > > >
> > > > > First of all I wanted to say I'm working off your book!!  Third
> > > edition,
> > > > > and I think it's a bit out of date now. I was just going to try
> > > following
> > > > > the section on the Postings highlighter, but I see that's been
> > absorbed
> > > > > into the Unified highlighter. I find your book easier to follow
> than
> > > the
> > > > > official documentation though.
> > > > >
> > > >
> > > > Thanks :-D.  I do maintain the Solr Reference Guide for the parts of
> > > code I
> > > > touch, including highlighting, so I hope what's there makes sense
> too.
> > > >
> > > >
> > > > > I am going to try to configure the unified highlighter, and I will
> > add
> > > > that
> > > > > storeOffsetsWithPositions to the schema (which I saw in your book)
> > and
> > > I
> > > > > will try indexing again from scratch.  Was getting some funny
> things
> > > > going
> > > > > on where I thought I'd turned highlighting off and it was still
> > giving
> > > me
> > > > > highlights.
> > > > >
> > > >
> > > > hl=true/false
> > > >
> > > >
> > > > > Actually just re-reading your email again, are you saying that you
> > > can't
> > > > > configure highlighting in solrconfig.xml? That's where I always
> > > configure
> > > > > original highlighting in my dismax search handler. Am I supposed to
> > add
> > > > > highlighting to each request?
> > > > >
> > > >
> > > > You can set highlighting and other *parameters* in solrconfig.xml for
> > > > request handlers.  But the dedicated <highlighting> plugin info is
> only
> > > for
> > > > the original 

Re: Highlighting large text fields

2021-01-12 Thread Shaun Campbell
That's great David.  So hl.maxAnalyzedChars isn't that critical. I'll whack
it right up and see what happens.

I'm running 7.4 from a few years ago. Should I upgrade?

For your info this is what I'm doing with Solr
https://dev.fundingawards.nihr.ac.uk/search.

Thanks
Shaun

On Tue, 12 Jan 2021 at 19:33, David Smiley  wrote:

> On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell 
> wrote:
>
> > Hi David
> >
> > Getting closer now.
> >
> > First of all, a bit of a mistake on my part. I have two cores set up and
> I
> > was changing the solrconfig.xml on the wrong core doh!!  That's why
> > highlighting wasn't being turned off.
> >
> > I think I've got the unified highlighter working.
> > storeOffsetsWithPositions was already configured on my field type
> > definition, not the field definition, so that was ok.
> >
> > What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> > highlighting on some records and not others, making it confusing as to
> > where the match is with my dismax parser.  I increased
> > my hl.maxAnalyzedChars to 130 and now it's highlighting more records.
> > Two questions:
> >
> > 1. Have you any guidelines as to what could be a
> > maximum hl.maxAnalyzedChars without impacting performance or memory?
> >
>
> With storeOffsetsWithPositions, highlighting is super-fast, and so this
> hl.maxAnalyzedChars threshold is of marginal utility, like only to cap the
> amount of memory used if you have some truly humongous docs and it's okay
> to only highlight the first X megabytes of them.  Maybe set it to 100MB worth
> of text, or something like that.
>
>
> > 2. Do you know a way to query the maximum length of text in a field so
> that
> > I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
> > modify my java indexer to log the maximum content length.  Actually, I
> > probably don't want the maximum but some value that highlights 90-95%
> > records
> >
>
> Eh... not really.  Maybe some approximation hacks involving function
> queries on norms but I'd not bother in favor of just using a high threshold
> such that this won't be an issue.
>
> All this said, this threshold is *not* the only reason why you might not be
> getting highlights that you expect.  If you are using a recent Solr
> version, you might try toggling the hl.weightMatches boolean, which could
> make a difference for certain query arrangements.  There's a JIRA issue
> pertaining to this one, and I haven't investigated it yet.
>
> ~ David
>
>
> >
> > Thanks
> > Shaun
> >
> > On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:
> >
> > > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell <
> campbell.sh...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi David
> > > >
> > > > First of all I wanted to say I'm working off your book!!  Third
> > edition,
> > > > and I think it's a bit out of date now. I was just going to try
> > following
> > > > the section on the Postings highlighter, but I see that's been
> absorbed
> > > > into the Unified highlighter. I find your book easier to follow than
> > the
> > > > official documentation though.
> > > >
> > >
> > > Thanks :-D.  I do maintain the Solr Reference Guide for the parts of
> > code I
> > > touch, including highlighting, so I hope what's there makes sense too.
> > >
> > >
> > > > I am going to try to configure the unified highlighter, and I will
> add
> > > that
> > > > storeOffsetsWithPositions to the schema (which I saw in your book)
> and
> > I
> > > > will try indexing again from scratch.  Was getting some funny things
> > > going
> > > > on where I thought I'd turned highlighting off and it was still
> giving
> > me
> > > > highlights.
> > > >
> > >
> > > hl=true/false
> > >
> > >
> > > > Actually just re-reading your email again, are you saying that you
> > can't
> > > > configure highlighting in solrconfig.xml? That's where I always
> > configure
> > > > original highlighting in my dismax search handler. Am I supposed to
> add
> > > > highlighting to each request?
> > > >
> > >
> > > You can set highlighting and other *parameters* in solrconfig.xml for
> > > request handlers.  But the dedicated <highlighting> plugin info is only
> > for
> > > the original and Fast Vector Highlighters.
> > >
> > > ~ David
> > >
> > >
> > > >
> > > > Thanks
> > > > Shaun
> > > >
> > > > On Mon, 11 Jan 2021 at 20:57, David Smiley 
> wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > I worked on the UnifiedHighlighter a lot and want to help you!
> > > > >
> > > > > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell <
> > > campbell.sh...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > I've been using highlighting for a while, using the original
> > > > highlighter,
> > > > > > and just come across a problem with fields that contain a large
> > > amount
> > > > of
> > > > > > text, approx 250k characters. I only have about 2,000 records but
> > > each
> > > > > one
> > > > > > contains a journal publication to search through.
> > > > > >
> > > > > > What I noticed is that some 

Re: Highlighting large text fields

2021-01-12 Thread David Smiley
On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell 
wrote:

> Hi David
>
> Getting closer now.
>
> First of all, a bit of a mistake on my part. I have two cores set up and I
> was changing the solrconfig.xml on the wrong core doh!!  That's why
> highlighting wasn't being turned off.
>
> I think I've got the unified highlighter working.
> storeOffsetsWithPositions was already configured on my field type
> definition, not the field definition, so that was ok.
>
> What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> highlighting on some records and not others, making it confusing as to
> where the match is with my dismax parser.  I increased
> my hl.maxAnalyzedChars to 130 and now it's highlighting more records.
> Two questions:
>
> 1. Have you any guidelines as to what could be a
> maximum hl.maxAnalyzedChars without impacting performance or memory?
>

With storeOffsetsWithPositions, highlighting is super-fast, and so this
hl.maxAnalyzedChars threshold is of marginal utility, like only to cap the
amount of memory used if you have some truly humongous docs and it's okay
to only highlight the first X megabytes of them.  Maybe set it to 100MB worth
of text, or something like that.
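
To make that concrete (an untested sketch; the field and type names are
placeholders for whatever your schema actually uses), the field definition and
request would look roughly like:

  <field name="content" type="text_general" indexed="true" stored="true"
         storeOffsetsWithPositions="true"/>

  /select?q=...&defType=dismax&qf=content
      &hl=true&hl.method=unified&hl.fl=content&hl.maxAnalyzedChars=100000000

With the offsets stored in the index, the unified highlighter doesn't have to
re-analyze the stored text at query time, which is why the threshold can be set
that high.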


> 2. Do you know a way to query the maximum length of text in a field so that
> I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
> modify my java indexer to log the maximum content length.  Actually, I
> probably don't want the maximum but some value that highlights 90-95%
> records
>

Eh... not really.  Maybe some approximation hacks involving function
queries on norms but I'd not bother in favor of just using a high threshold
such that this won't be an issue.

All this said, this threshold is *not* the only reason why you might not be
getting highlights that you expect.  If you are using a recent Solr
version, you might try toggling the hl.weightMatches boolean, which could
make a difference for certain query arrangements.  There's a JIRA issue
pertaining to this one, and I haven't investigated it yet.

~ David


>
> Thanks
> Shaun
>
> On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:
>
> > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell  >
> > wrote:
> >
> > > Hi David
> > >
> > > First of all I wanted to say I'm working off your book!!  Third
> edition,
> > > and I think it's a bit out of date now. I was just going to try
> following
> > > the section on the Postings highlighter, but I see that's been absorbed
> > > into the Unified highlighter. I find your book easier to follow than
> the
> > > official documentation though.
> > >
> >
> > Thanks :-D.  I do maintain the Solr Reference Guide for the parts of
> code I
> > touch, including highlighting, so I hope what's there makes sense too.
> >
> >
> > > I am going to try to configure the unified highlighter, and I will add
> > that
> > > storeOffsetsWithPositions to the schema (which I saw in your book) and
> I
> > > will try indexing again from scratch.  Was getting some funny things
> > going
> > > on where I thought I'd turned highlighting off and it was still giving
> me
> > > highlights.
> > >
> >
> > hl=true/false
> >
> >
> > > Actually just re-reading your email again, are you saying that you
> can't
> > > configure highlighting in solrconfig.xml? That's where I always
> configure
> > > original highlighting in my dismax search handler. Am I supposed to add
> > > highlighting to each request?
> > >
> >
> > You can set highlighting and other *parameters* in solrconfig.xml for
> > request handlers.  But the dedicated <highlighting> plugin info is only
> for
> > the original and Fast Vector Highlighters.
> >
> > ~ David
> >
> >
> > >
> > > Thanks
> > > Shaun
> > >
> > > On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:
> > >
> > > > Hello!
> > > >
> > > > I worked on the UnifiedHighlighter a lot and want to help you!
> > > >
> > > > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell <
> > campbell.sh...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > I've been using highlighting for a while, using the original
> > > highlighter,
> > > > > and just come across a problem with fields that contain a large
> > amount
> > > of
> > > > > text, approx 250k characters. I only have about 2,000 records but
> > each
> > > > one
> > > > > contains a journal publication to search through.
> > > > >
> > > > > What I noticed is that some records didn't return a highlight even
> > > though
> > > > > they matched on the content. I noticed the hl.maxAnalyzedChars
> > > parameter
> > > > > and increased that, but  it allowed some records to be highlighted,
> > but
> > > > not
> > > > > all, and then it caused memory problems on the server.  Performance
> > is
> > > > also
> > > > > very poor.
> > > > >
> > > >
> > > > I've been thinking hl.maxAnalyzedChars should maybe default to no
> limit
> > > --
> > > > it's a performance threshold but perhaps better to opt-in to such a
> > limit
> > > than scratch your head for a long time wondering why a search result
> > > 

RE: disallowing delete through security.json

2021-01-12 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
Does anyone yet have any examples or suggestions for using the "method" section
in lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html ?

Also, if anyone has any other suggestions of how to provide high availability 
while completely dropping and recreating and reloading a large collection (as 
required in order to complete the upgrade to a new release), let me know.

-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C]  
Sent: Tuesday, November 24, 2020 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: disallowing delete through security.json

Thank you for the response

The use case I have in mind is trying to approximate incremental updates (as 
are available in Sybase or MSSQL, to which I am more accustomed).

We are wanting to upgrade a large collection from Solr7.4 to Solr8.5. It turns 
out that Solr8.5 cannot run against the current data, because the collection 
was created under Solr6.6. We want to migrate in such a way that, in a year or 
so, we will be able to migrate to Solr9 without worrying about Solr7.4 let 
alone Solr6.6. We want to create a new collection (of the same name) in a brand 
new Solr8.5 SolrCloud, and then to select everything from the current Solr7.4 
collection in json format and load it into the new Solr8.5 collection. All of 
the fields have stored="true", with the exception of fields populated by 
copyField. The select will be done by ranges of id values, so as to avoid 
OutOfMemory errors. That process will take several days; and in the meanwhile, 
users will be continuing to add data. When all the data will have been copied 
(including that which is described below), we can switch port numbers so that 
the new Solr8.5 SolrCloud takes the place of the old Solr7.4 SolrCloud.

The plan is to find a value of _version_ (call it V1) which was in the Solr7.4 
collection when we started the first select, but which is greater than almost 
all values of _version_ in the collection (we are fine with having an overlap 
of _version_ values, but we want to avoid losing anything by having a gap in 
_version_ values). After the initial selects are complete, we can run other 
selects by ranges of id with the additional criteria that the _version_ will be 
no lower than the V1 value. As we have seen in test runs, this will involve 
less data and will run faster. We will also keep note of a new value of 
_version_ (call it V2) which was in the Solr7.4 collection when we start the V1 
select, but which is greater than almost all values of _version_ in the V1 
select. Following this procedure through various iterations (V3, V4, however 
many it takes), we can load the V1 set of selects when we will have completed 
the loading of the initial set of selects. We can then load the V2 set of 
selects when we will have completed the loading of the V1 set of selects. The 
plan is that the selecting and loading of the last Vn set of selects will 
involve a maintenance window measured in minutes rather than in days.
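
(To make the incremental passes concrete, each select would be something along
these lines, with the id bounds and the V1 value as placeholders:

  /solr/<collection>/select?q=*:*&fq=id:[<low> TO <high>]&fq=_version_:[<V1> TO *]&fl=*&wt=json

i.e. the same ranged selects as the initial pass, just with the extra _version_
range filter added.)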

The users claim that they never do deletes: which is good, because a delete 
would be something which would be missed by this plan. If (as you describe) the 
users were to update a record so that only the id field (and the _version_ 
field) are left, that update would get picked up by one of these incremental 
selects and would be applied to the new collection. A delete, however, would 
not be noticed: and the new Solr8.5 collection would still have the record 
which had been deleted from the old Solr7.4 collection. The users claim that 
they never do deletes: but it would seem safer to actually disallow deletes 
during the maintenance.

Let me know if you have any suggestions.

Thank you again for your reply.


-Original Message-
From: Jason Gerlowski  
Sent: Tuesday, November 24, 2020 12:35 PM
To: solr-user@lucene.apache.org
Subject: Re: disallowing delete through security.json

Hey Craig,

I think this will be tricky to do with the current Rule-Based
Authorization support.  As you pointed out in your initial post -
there are lots of ways to delete documents.  The Rule-Based Auth code
doesn't inspect request bodies (AFAIK), so it's going to have trouble
differentiating among traditional "/update" requests with
method=POST, which are request-body driven.

But to zoom out a bit, does it really make sense to lock down deletes,
but not updates more broadly?  After all, "updates" can remove and add
fields.  Users might submit an update that strips everything but "id"
from your documents.  In many/most usecases that'd be equally
concerning.  Just wondering what your usecase is - if it's generally
applicable this is probably worth a JIRA ticket.

Best,

Jason

On Thu, Nov 19, 2020 at 10:34 AM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
>
> Having not heard back, I thought I would ask again whether anyone else has 
> been able to use security.json to disallow deletes, and/or if anyone has 
> examples of using the "method" section in 
> 

Re: Highlighting large text fields

2021-01-12 Thread Shaun Campbell
Hi David

Getting closer now.

First of all, a bit of a mistake on my part. I have two cores set up and I
was changing the solrconfig.xml on the wrong core doh!!  That's why
highlighting wasn't being turned off.

I think I've got the unified highlighter working.
storeOffsetsWithPositions was already configured on my field type
definition, not the field definition, so that was ok.

What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
highlighting on some records and not others, making it confusing as to
where the match is with my dismax parser.  I increased
my hl.maxAnalyzedChars to 130 and now it's highlighting more records.
Two questions:

1. Have you any guidelines as to what could be a
maximum hl.maxAnalyzedChars without impacting performance or memory?

2. Do you know a way to query the maximum length of text in a field so that
I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
modify my java indexer to log the maximum content length.  Actually, I
probably don't want the maximum, but some value that highlights 90-95% of
records.

Thanks
Shaun

On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:

> On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell 
> wrote:
>
> > Hi David
> >
> > First of all I wanted to say I'm working off your book!!  Third edition,
> > and I think it's a bit out of date now. I was just going to try following
> > the section on the Postings highlighter, but I see that's been absorbed
> > into the Unified highlighter. I find your book easier to follow than the
> > official documentation though.
> >
>
> Thanks :-D.  I do maintain the Solr Reference Guide for the parts of code I
> touch, including highlighting, so I hope what's there makes sense too.
>
>
> > I am going to try to configure the unified highlighter, and I will add
> that
> > storeOffsetsWithPositions to the schema (which I saw in your book) and I
> > will try indexing again from scratch.  Was getting some funny things
> going
> > on where I thought I'd turned highlighting off and it was still giving me
> > highlights.
> >
>
> hl=true/false
>
>
> > Actually just re-reading your email again, are you saying that you can't
> > configure highlighting in solrconfig.xml? That's where I always configure
> > original highlighting in my dismax search handler. Am I supposed to add
> > highlighting to each request?
> >
>
> You can set highlighting and other *parameters* in solrconfig.xml for
> request handlers.  But the dedicated <highlighting> plugin info is only for
> the original and Fast Vector Highlighters.
>
> ~ David
>
>
> >
> > Thanks
> > Shaun
> >
> > On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:
> >
> > > Hello!
> > >
> > > I worked on the UnifiedHighlighter a lot and want to help you!
> > >
> > > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell <
> campbell.sh...@gmail.com
> > >
> > > wrote:
> > >
> > > > I've been using highlighting for a while, using the original
> > highlighter,
> > > > and just come across a problem with fields that contain a large
> amount
> > of
> > > > text, approx 250k characters. I only have about 2,000 records but
> each
> > > one
> > > > contains a journal publication to search through.
> > > >
> > > > What I noticed is that some records didn't return a highlight even
> > though
> > > > they matched on the content. I noticed the hl.maxAnalyzedChars
> > parameter
> > > > and increased that, but  it allowed some records to be highlighted,
> but
> > > not
> > > > all, and then it caused memory problems on the server.  Performance
> is
> > > also
> > > > very poor.
> > > >
> > >
> > > I've been thinking hl.maxAnalyzedChars should maybe default to no limit
> > --
> > > it's a performance threshold but perhaps better to opt-in to such a
> limit
> > > than scratch your head for a long time wondering why a search result
> > isn't
> > > showing highlights.
> > >
> > >
> > > > To try to fix this I've tried  to configure the unified highlighter
> in
> > my
> > > > solrconfig.xml instead.   It seems to be working but again I'm
> missing
> > > some
> > > > highlighted records.
> > > >
> > >
> > > There is no configuration of that highlighter in solrconfig.xml; it's
> > > entirely parameter driven (runtime).
> > >
> > >
> > > > The other thing is I've tried to adjust my unified highlighting
> > settings
> > > in
> > > > solrconfig.xml and they don't  seem to be having any effect even
> after
> > > > restarting Solr.  I was just wondering whether there is any
> > highlighting
> > > > information stored at index time. It's taking over 4hours to index my
> > > > records so it's not easy to keep reindexing my content.
> > > >
> > > > Any ideas on how to handle highlighting of large content  would be
> > > > appreciated.
> > > >
> > > > Shaun
> > > >
> > >
> > > Please read the documentation here thoroughly:
> > >
> > >
> >
> https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
> > > (or earlier version as applicable)
> > > Since you have large bodies of text to 

Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Charlie Hull

Hi Jeremy,

You might find our recent blog on Debugging Solr Performance Issues 
useful 
https://opensourceconnections.com/blog/2021/01/05/a-solr-performance-debugging-toolkit/ 
- also check out Savan Das' blog which is linked within.


Best

Charlie

On 12/01/2021 14:53, Michael Gibney wrote:

Ahh ok. If those are your only fieldType definitions, and most of your
config is copied from the default, then SOLR-13336 is unlikely to be the
culprit. Looking at more general options, off the top of my head:
1. make sure you haven't allocated all physical memory to heap (leave a
decent amount for OS page cache)
2. disable swap, if you can (this is esp. important if using network
storage as swap). There are potential downsides to this (so proceed with
caution); but if part of your heap gets swapped out (and it almost
certainly will, with a sufficiently large heap) full GCs lead to a swap
storm that compounds the problem. (fwiw, this is probably the first thing
I'd recommend looking into and trying, because it's so easy, and can in
some cases yield a dramatic improvement. N.b., I'm talking about `swapoff
-a`, not `sysctl -w vm.swappiness=0` -- I find that the latter does *not*
eliminate swapping in the way that's needed to achieve the desired goal in
this case. Again, exercise caution in doing this, discuss, research, etc.).
Related documentation was added in 8.5, but absolutely applies to 7.3.1 as
well:
https://lucene.apache.org/solr/guide/8_7/taking-solr-to-production.html#avoid-swapping-nix-operating-systems
-- the note there about "lowering swappiness" being an acceptable
alternative contradicts my experience, but I suppose ymmv?
3. if you're faceting on fields -- especially high-cardinality fields (many
values) -- make sure that you have `docValues=true, uninvertible=false`
configured (to ensure that you're not building large on-heap data
structures when there's an alternative that doesn't require it).

These are all recommendations that are explained in more detail by others
elsewhere; I think they should all apply to 7.3.1; fwiw, I would recommend
upgrading if you have the (human) bandwidth to do so. Good luck!

Michael

On Tue, Jan 12, 2021 at 8:39 AM Jeremy Smith  wrote:


Thanks Michael,
  SOLR-13336 seems intriguing.  I'm not a solr expert, but I believe
these are the relevant sections from our schema definition:

  [fieldType/analyzer definitions were pasted here, but the XML tags were
  stripped by the list archiver]

Our other fieldTypes don't have any analyzers attached to them.


If SOLR-13336 is the cause of the issue, is the best remedy to upgrade to
Solr 8?  It doesn't look like the fix was backported to 7.x.

Our schema has some issues arising from not fully understanding Solr and
just copying existing structures from the defaults.  In this case,
stopwords.txt is completely empty and synonyms.txt is just the default
synonyms.txt, which seems not useful at all for us.  Could I just take out
the StopFilterFactory and SynonymGraphFilterFactory from the query section
(and maybe the StopFilterFactory from the index section as well)?

Thanks again,
Jeremy


From: Michael Gibney 
Sent: Monday, January 11, 2021 8:30 PM
To: solr-user@lucene.apache.org 
Subject: Re: Solr using all available CPU and becoming unresponsive

Hi Jeremy,
Can you share your analysis chain configs? (SOLR-13336 can manifest in a
similar way, and would affect 7.3.1 with a susceptible config, given the
right (wrong?) input ...)
Michael

On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:


Hello all,
  We have been struggling with an issue where solr will intermittently
use all available CPU and become unresponsive.  It will remain in this
state until we restart.  Solr will remain stable for some time, usually a
few hours to a few days, before this happens again.  We've tried adjusting
the caches and adding memory to both the VM and JVM, but we haven't been
able to solve the issue yet.

Here is some info about our server:
Solr:
   Solr 7.3.1, running on Java 1.8
   Running in cloud mode, but there's only one core

Host:
   CentOS7
   8 CPU, 56GB RAM
   The only other processes running on this VM are two zookeepers, one for
this Solr instance, one for another Solr instance

Solr Config:
  - One Core
  - 36 Million documents (Max Doc), 28 million (Num Docs)
  - ~15GB
  - 10-20 Requests/second
  - The schema is fairly large (~100 fields) and we allow faceting and
searching on many, but not all, of the fields
  - Data are imported once per minute through the DataImportHandler, with a
hard commit at the end.  We usually index ~100-500 documents per minute,
with many of these being updates to existing documents.

Cache settings:
  [cache definitions stripped by the list archiver]
For the filterCache, we have tried sizes as low as 128, which caused our
CPU usage to go up and didn't solve our issue.  autowarmCount used to be
much 

Re: Highlighting large text fields

2021-01-12 Thread David Smiley
On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell 
wrote:

> Hi David
>
> First of all I wanted to say I'm working off your book!!  Third edition,
> and I think it's a bit out of date now. I was just going to try following
> the section on the Postings highlighter, but I see that's been absorbed
> into the Unified highlighter. I find your book easier to follow than the
> official documentation though.
>

Thanks :-D.  I do maintain the Solr Reference Guide for the parts of code I
touch, including highlighting, so I hope what's there makes sense too.


> I am going to try to configure the unified highlighter, and I will add that
> storeOffsetsWithPositions to the schema (which I saw in your book) and I
> will try indexing again from scratch.  Was getting some funny things going
> on where I thought I'd turned highlighting off and it was still giving me
> highlights.
>

hl=true/false


> Actually just re-reading your email again, are you saying that you can't
> configure highlighting in solrconfig.xml? That's where I always configure
> original highlighting in my dismax search handler. Am I supposed to add
> highlighting to each request?
>

You can set highlighting and other *parameters* in solrconfig.xml for
request handlers.  But the dedicated <highlighting> plugin info is only for
the original and Fast Vector Highlighters.
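
For example (the handler and field names here are placeholders), defaults like
these in solrconfig.xml apply to every request that goes through that handler:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="hl">true</str>
      <str name="hl.method">unified</str>
      <str name="hl.fl">content</str>
    </lst>
  </requestHandler>

but that is still just pre-setting hl.* request parameters, not configuring a
highlighter plugin.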

~ David


>
> Thanks
> Shaun
>
> On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:
>
> > Hello!
> >
> > I worked on the UnifiedHighlighter a lot and want to help you!
> >
> > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell  >
> > wrote:
> >
> > > I've been using highlighting for a while, using the original
> highlighter,
> > > and just come across a problem with fields that contain a large amount
> of
> > > text, approx 250k characters. I only have about 2,000 records but each
> > one
> > > contains a journal publication to search through.
> > >
> > > What I noticed is that some records didn't return a highlight even
> though
> > > they matched on the content. I noticed the hl.maxAnalyzedChars
> parameter
> > > and increased that, but  it allowed some records to be highlighted, but
> > not
> > > all, and then it caused memory problems on the server.  Performance is
> > also
> > > very poor.
> > >
> >
> > I've been thinking hl.maxAnalyzedChars should maybe default to no limit
> --
> > it's a performance threshold but perhaps better to opt-in to such a limit
> > than scratch your head for a long time wondering why a search result
> isn't
> > showing highlights.
> >
> >
> > > To try to fix this I've tried  to configure the unified highlighter in
> my
> > > solrconfig.xml instead.   It seems to be working but again I'm missing
> > some
> > > highlighted records.
> > >
> >
> > There is no configuration of that highlighter in solrconfig.xml; it's
> > entirely parameter driven (runtime).
> >
> >
> > > The other thing is I've tried to adjust my unified highlighting
> settings
> > in
> > > solrconfig.xml and they don't  seem to be having any effect even after
> > > restarting Solr.  I was just wondering whether there is any
> highlighting
> > > information stored at index time. It's taking over 4hours to index my
> > > records so it's not easy to keep reindexing my content.
> > >
> > > Any ideas on how to handle highlighting of large content  would be
> > > appreciated.
> > >
> > > Shaun
> > >
> >
> > Please read the documentation here thoroughly:
> >
> >
> https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
> > (or earlier version as applicable)
> > Since you have large bodies of text to highlight, you would strongly
> > benefit from putting offsets into the search index (and re-index) --
> > storeOffsetsWithPositions.  That's an option on the field/fieldType in
> your
> > schema; it may not be obvious reading the docs.  You have to opt-in to
> > that; Solr doesn't normally store any info in the index for highlighting.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
>


Re: leader election stuck after hosts restarts

2021-01-12 Thread Pierre Salagnac
Sorry I missed this detail.
We are running Solr 8.2.
Thanks

Le mar. 12 janv. 2021 à 16:46, Phill Campbell 
a écrit :

> Which version of Apache Solr?
>
> > On Jan 12, 2021, at 8:36 AM, Pierre Salagnac 
> wrote:
> >
> > Hello,
> > We had a stuck leader election for a shard.
> >
> > We have collections with 2 shards, each shard has 5 replicas. We have
> many
> > collections but the issue happened for a single shard. Once all host
> > restarts completed, this shard was stuck with one replica in "recovering"
> > state and all the others in "down" state.
> >
> > Here is the state of the shard returned by CLUSTERSTATUS command.
> >  "replicas":{
> >"core_node3":{
> >  "core":"_shard1_replica_n1",
> >  "base_url":"https://host1:8983/solr",
> >  "node_name":"host1:8983_solr",
> >  "state":"recovering",
> >  "type":"NRT",
> >  "force_set_state":"false"},
> >"core_node9":{
> >  "core":"_shard1_replica_n6",
> >  "base_url":"https://host2:8983/solr",
> >  "node_name":"host2:8983_solr",
> >  "state":"down",
> >  "type":"NRT",
> >  "force_set_state":"false"},
> >"core_node26":{
> >  "core":"_shard1_replica_n25",
> >  "base_url":"https://host3:8983/solr",
> >  "node_name":"host3:8983_solr",
> >  "state":"down",
> >  "type":"NRT",
> >  "force_set_state":"false"},
> >"core_node28":{
> >  "core":"_shard1_replica_n27",
> >  "base_url":"https://host4:8983/solr",
> >  "node_name":"host4:8983_solr",
> >  "state":"down",
> >  "type":"NRT",
> >  "force_set_state":"false"},
> >"core_node34":{
> >  "core":"_shard1_replica_n33",
> >  "base_url":"https://host5:8983/solr",
> >  "node_name":"host5:8983_solr",
> >  "state":"down",
> >  "type":"NRT",
> >  "force_set_state":"false"}}}
> >
> > The workaround was to shut down server host1, which had the replica stuck in
> > recovering state. This unblocked the leader election, and the 4 other
> > replicas went active.
> >
> > Here is the first error I found in logs related to this shard. It happened
> > while shutting down server host3, which was the leader at that time:
> > (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> > r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> > x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient
> Error
> > consuming and closing http response stream. =>
> > java.nio.channels.AsynchronousCloseException
> > at
> >
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> > java.nio.channels.AsynchronousCloseException: null
> > at
> >
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> > at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> > at
> >
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> > at
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> > at
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> > at
> >
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> > at
> >
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> > ~[?:?]
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> > ~[?:?]
> > at java.lang.Thread.run(Thread.java:834) [?:?]
> >
> > My understanding is that, following this error, each server restart ended
> > with the replica on that server in "down" state, but I'm not sure how to
> > confirm that.
> > We then entered a loop where the term is increased because of failed
> > replication.
> >
> > Is this a known issue? I found no similar ticket in Jira.
> > Could you help me get a better understanding of the issue?
> > Thanks
>
>


Re: leader election stuck after hosts restarts

2021-01-12 Thread Phill Campbell
Which version of Apache Solr?

> On Jan 12, 2021, at 8:36 AM, Pierre Salagnac  
> wrote:
> 
> Hello,
> We had a stuck leader election for a shard.
> 
> We have collections with 2 shards, each shard has 5 replicas. We have many
> collections but the issue happened for a single shard. Once all host
> restarts completed, this shard was stuck with one replica in "recovering"
> state and all the others in "down" state.
> 
> Here is the state of the shard returned by CLUSTERSTATUS command.
>  "replicas":{
>"core_node3":{
>  "core":"_shard1_replica_n1",
>  "base_url":"https://host1:8983/solr",
>  "node_name":"host1:8983_solr",
>  "state":"recovering",
>  "type":"NRT",
>  "force_set_state":"false"},
>"core_node9":{
>  "core":"_shard1_replica_n6",
>  "base_url":"https://host2:8983/solr",
>  "node_name":"host2:8983_solr",
>  "state":"down",
>  "type":"NRT",
>  "force_set_state":"false"},
>"core_node26":{
>  "core":"_shard1_replica_n25",
>  "base_url":"https://host3:8983/solr",
>  "node_name":"host3:8983_solr",
>  "state":"down",
>  "type":"NRT",
>  "force_set_state":"false"},
>"core_node28":{
>  "core":"_shard1_replica_n27",
>  "base_url":"https://host4:8983/solr",
>  "node_name":"host4:8983_solr",
>  "state":"down",
>  "type":"NRT",
>  "force_set_state":"false"},
>"core_node34":{
>  "core":"_shard1_replica_n33",
>  "base_url":"https://host5:8983/solr",
>  "node_name":"host5:8983_solr",
>  "state":"down",
>  "type":"NRT",
>  "force_set_state":"false"}}}
> 
> The workaround was to shut down server host1, which had the replica stuck in
> recovering state. This unblocked the leader election, and the 4 other replicas
> went active.
>
> Here is the first error I found in logs related to this shard. It happened
> while shutting down server host3, which was the leader at that time:
> (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> consuming and closing http response stream. =>
> java.nio.channels.AsynchronousCloseException
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> java.nio.channels.AsynchronousCloseException: null
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
> 
> My understanding is that, following this error, each server restart ended with
> the replica on that server in "down" state, but I'm not sure how to
> confirm that.
> We then entered a loop where the term is increased because of failed
> replication.
>
> Is this a known issue? I found no similar ticket in Jira.
> Could you help me get a better understanding of the issue?
> Thanks



Re: leader election stuck after hosts restarts

2021-01-12 Thread matthew sporleder
When this has happened to me before I have had pretty good luck by
restarting the overseer leader, which can be found in zookeeper under
/overseer_elect/leader

If that doesn't work I've had to do more intrusive and manual recovery
methods, which suck.
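
To find it, something along these lines against your ZooKeeper works (host and
chroot are placeholders, and I'm going from memory on the exact output format):

  zkCli.sh -server zk1:2181 get /overseer_elect/leader

which should print a small JSON blob naming the node id; then restart that Solr
node.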

On Tue, Jan 12, 2021 at 10:36 AM Pierre Salagnac
 wrote:
>
> Hello,
> We had a stuck leader election for a shard.
>
> We have collections with 2 shards, each shard has 5 replicas. We have many
> collections but the issue happened for a single shard. Once all host
> restarts completed, this shard was stuck with one replica in "recovering"
> state and all the others in "down" state.
>
> Here is the state of the shard returned by CLUSTERSTATUS command.
>   "replicas":{
> "core_node3":{
>   "core":"_shard1_replica_n1",
>   "base_url":"https://host1:8983/solr",
>   "node_name":"host1:8983_solr",
>   "state":"recovering",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node9":{
>   "core":"_shard1_replica_n6",
>   "base_url":"https://host2:8983/solr",
>   "node_name":"host2:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node26":{
>   "core":"_shard1_replica_n25",
>   "base_url":"https://host3:8983/solr",
>   "node_name":"host3:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node28":{
>   "core":"_shard1_replica_n27",
>   "base_url":"https://host4:8983/solr",
>   "node_name":"host4:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node34":{
>   "core":"_shard1_replica_n33",
>   "base_url":"https://host5:8983/solr",
>   "node_name":"host5:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"}}}
>
> The workaround was to shut down server host1, which had the replica stuck in
> recovering state. This unblocked the leader election, and the 4 other replicas
> went active.
>
> Here is the first error I found in logs related to this shard. It happened
> while shutting down server host3, which was the leader at that time:
>  (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> consuming and closing http response stream. =>
> java.nio.channels.AsynchronousCloseException
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> java.nio.channels.AsynchronousCloseException: null
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
>
> My understanding is that, following this error, each server restart ended with
> the replica on that server in "down" state, but I'm not sure how to
> confirm that.
> We then entered a loop where the term is increased because of failed
> replication.
>
> Is this a known issue? I found no similar ticket in Jira.
> Could you help me get a better understanding of the issue?
> Thanks


leader election stuck after hosts restarts

2021-01-12 Thread Pierre Salagnac
Hello,
We had a stuck leader election for a shard.

We have collections with 2 shards, each shard has 5 replicas. We have many
collections but the issue happened for a single shard. Once all host
restarts completed, this shard was stuck with one replica in "recovering"
state and all the others in "down" state.

Here is the state of the shard returned by CLUSTERSTATUS command.
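(For reference, this came from the Collections API with a call roughly like
/solr/admin/collections?action=CLUSTERSTATUS&wt=json, trimmed down to the
affected shard.)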
  "replicas":{
"core_node3":{
  "core":"_shard1_replica_n1",
  "base_url":"https://host1:8983/solr;,
  "node_name":"host1:8983_solr",
  "state":"recovering",
  "type":"NRT",
  "force_set_state":"false"},
"core_node9":{
  "core":"_shard1_replica_n6",
  "base_url":"https://host2:8983/solr;,
  "node_name":"host2:8983_solr",
  "state":"down",
  "type":"NRT",
  "force_set_state":"false"},
"core_node26":{
  "core":"_shard1_replica_n25",
  "base_url":"https://host3:8983/solr;,
  "node_name":"host3:8983_solr",
  "state":"down",
  "type":"NRT",
  "force_set_state":"false"},
"core_node28":{
  "core":"_shard1_replica_n27",
  "base_url":"https://host4:8983/solr;,
  "node_name":"host4:8983_solr",
  "state":"down",
  "type":"NRT",
  "force_set_state":"false"},
"core_node34":{
  "core":"_shard1_replica_n33",
  "base_url":"https://host5:8983/solr;,
  "node_name":"host5:8983_solr",
  "state":"down",
  "type":"NRT",
  "force_set_state":"false"}}}

The workaround was to shut down server host1, which had the replica stuck in
recovering state. This unblocked the leader election, and the 4 other replicas
went active.

Here is the first error I found in logs related to this shard. It happened
while shutting down server host3, which was the leader at that time:
 (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
consuming and closing http response stream. =>
java.nio.channels.AsynchronousCloseException
at
org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
java.nio.channels.AsynchronousCloseException: null
at
org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
at java.io.InputStream.read(InputStream.java:205) ~[?:?]
at
org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]

My understanding is that, following this error, each server restart ended with
the replica on that server in "down" state, but I'm not sure how to
confirm that.
We then entered a loop where the term is increased because of failed
replication.

Is this a known issue? I found no similar ticket in Jira.
Could you help me get a better understanding of the issue?
Thanks


Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Michael Gibney
Ahh ok. If those are your only fieldType definitions, and most of your
config is copied from the default, then SOLR-13336 is unlikely to be the
culprit. Looking at more general options, off the top of my head:
1. make sure you haven't allocated all physical memory to heap (leave a
decent amount for OS page cache)
2. disable swap, if you can (this is esp. important if using network
storage as swap). There are potential downsides to this (so proceed with
caution); but if part of your heap gets swapped out (and it almost
certainly will, with a sufficiently large heap) full GCs lead to a swap
storm that compounds the problem. (fwiw, this is probably the first thing
I'd recommend looking into and trying, because it's so easy, and can in
some cases yield a dramatic improvement. N.b., I'm talking about `swapoff
-a`, not `sysctl -w vm.swappiness=0` -- I find that the latter does *not*
eliminate swapping in the way that's needed to achieve the desired goal in
this case. Again, exercise caution in doing this, discuss, research, etc.).
Related documentation was added in 8.5, but absolutely applies to 7.3.1 as
well:
https://lucene.apache.org/solr/guide/8_7/taking-solr-to-production.html#avoid-swapping-nix-operating-systems
-- the note there about "lowering swappiness" being an acceptable
alternative contradicts my experience, but I suppose ymmv?
3. if you're faceting on fields -- especially high-cardinality fields (many
values) -- make sure that you have `docValues=true, uninvertible=false`
configured (to ensure that you're not building large on-heap data
structures when there's an alternative that doesn't require it).
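
A minimal sketch of what I mean for a faceted field (the field name is just an
example, and the uninvertible attribute needs a Solr version that supports it):

  <field name="category" type="string" indexed="true" stored="true"
         docValues="true" uninvertible="false"/>

Changing docValues only takes effect for documents indexed after the change, so
this implies a reindex of that field.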

These are all recommendations that are explained in more detail by others
elsewhere; I think they should all apply to 7.3.1; fwiw, I would recommend
upgrading if you have the (human) bandwidth to do so. Good luck!

Michael

On Tue, Jan 12, 2021 at 8:39 AM Jeremy Smith  wrote:

> Thanks Michael,
>  SOLR-13336 seems intriguing.  I'm not a solr expert, but I believe
> these are the relevant sections from our schema definition:
>
> [Two text fieldType definitions were pasted here, but the list archiver
> stripped the XML element names. What survives: positionIncrementGap="100" on
> both (the second also has multiValued="false"); in the second one, a stop
> filter with words="stopwords.txt" appears in both the index and query
> analyzers, and a synonym filter with synonyms="synonyms.txt" ignoreCase="true"
> expand="true" appears in the query analyzer.]
>
> Our other fieldTypes don't have any analyzers attached to them.
>
>
> If SOLR-13336 is the cause of the issue, is the best remedy to upgrade to
> Solr 8?  It doesn't look like the fix was backported to 7.x.
>
> Our schema has some issues arising from not fully understanding Solr and
> just copying existing structures from the defaults.  In this case,
> stopwords.txt is completely empty and synonyms.txt is just the default
> synonyms.txt, which seems not useful at all for us.  Could I just take out
> the StopFilterFactory and SynonymGraphFilterFactory from the query section
> (and maybe the StopFilterFactory from the index section as well)?
>
> Thanks again,
> Jeremy
>
> 
> From: Michael Gibney 
> Sent: Monday, January 11, 2021 8:30 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Solr using all available CPU and becoming unresponsive
>
> Hi Jeremy,
> Can you share your analysis chain configs? (SOLR-13336 can manifest in a
> similar way, and would affect 7.3.1 with a susceptible config, given the
> right (wrong?) input ...)
> Michael
>
> On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:
>
> > Hello all,
> >  We have been struggling with an issue where solr will intermittently
> > use all available CPU and become unresponsive.  It will remain in this
> > state until we restart.  Solr will remain stable for some time, usually a
> > few hours to a few days, before this happens again.  We've tried
> adjusting
> > the caches and adding memory to both the VM and JVM, but we haven't been
> > able to solve the issue yet.
> >
> > Here is some info about our server:
> > Solr:
> >   Solr 7.3.1, running on Java 1.8
> >   Running in cloud mode, but there's only one core
> >
> > Host:
> >   CentOS7
> >   8 CPU, 56GB RAM
> >   The only other processes running on this VM are two zookeepers, one for
> > this Solr instance, one for another Solr instance
> >
> > Solr Config:
> >  - One Core
> >  - 36 Million documents (Max Doc), 28 million (Num Docs)
> >  - ~15GB
> >  - 10-20 Requests/second
> >  - The schema is fairly large (~100 fields) and we allow faceting and
> > searching on many, but not all, of the fields
> >  - Data are imported once per minute through the DataImportHandler, with
> a
> > hard commit at the end.  We usually index ~100-500 documents per minute,
> > with many of these being updates to existing documents.
> >
> > Cache settings:
> >  >  size="256"
> >  initialSize="256"
> >  

Re: Highlighting large text fields

2021-01-12 Thread Shaun Campbell
Hi David

First of all I wanted to say I'm working off your book!!  Third edition,
and I think it's a bit out of date now. I was just going to try following
the section on the Postings highlighter, but I see that's been absorbed
into the Unified highlighter. I find your book easier to follow than the
official documentation though.

I am going to try to configure the unified highlighter, and I will add the
storeOffsetsWithPositions option to the schema (which I saw in your book) and I
will try indexing again from scratch.  I was getting some funny things going
on where I thought I'd turned highlighting off and it was still giving me
highlights.
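
Something along these lines is what I have in mind -- the field and handler
names below are just placeholders, and the parameter values are only my first
guess at a sensible starting point rather than anything from your book or the
docs:

    <field name="content" type="text_general" indexed="true" stored="true"
           storeOffsetsWithPositions="true"/>

    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="hl">true</str>
        <str name="hl.method">unified</str>
        <str name="hl.fl">content</str>
        <str name="hl.maxAnalyzedChars">1000000</str>
      </lst>
    </requestHandler>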

Actually just re-reading your email again, are you saying that you can't
configure highlighting in solrconfig.xml? That's where I always configure
original highlighting in my dismax search handler. Am I supposed to add
highlighting to each request?

Thanks
Shaun

On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:

> Hello!
>
> I worked on the UnifiedHighlighter a lot and want to help you!
>
> On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell 
> wrote:
>
> > I've been using highlighting for a while, using the original highlighter,
> > and just come across a problem with fields that contain a large amount of
> > text, approx 250k characters. I only have about 2,000 records but each
> one
> > contains a journal publication to search through.
> >
> > What I noticed is that some records didn't return a highlight even though
> > they matched on the content. I noticed the hl.maxAnalyzedChars parameter
> > and increased that, but  it allowed some records to be highlighted, but
> not
> > all, and then it caused memory problems on the server.  Performance is
> also
> > very poor.
> >
>
> I've been thinking hl.maxAnalyzedChars should maybe default to no limit --
> it's a performance threshold but perhaps better to opt-in to such a limit
> than scratch your head for a long time wondering why a search result isn't
> showing highlights.
>
>
> > To try to fix this I've tried  to configure the unified highlighter in my
> > solrconfig.xml instead.   It seems to be working but again I'm missing
> some
> > highlighted records.
> >
>
> There is no configuration of that highlighter in solrconfig.xml; it's
> entirely parameter driven (runtime).
>
>
> > The other thing is I've tried to adjust my unified highlighting settings
> in
> > solrconfig.xml and they don't  seem to be having any effect even after
> > restarting Solr.  I was just wondering whether there is any highlighting
> > information stored at index time. It's taking over 4hours to index my
> > records so it's not easy to keep reindexing my content.
> >
> > Any ideas on how to handle highlighting of large content  would be
> > appreciated.
> >
> > Shaun
> >
>
> Please read the documentation here thoroughly:
>
> https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
> (or earlier version as applicable)
> Since you have large bodies of text to highlight, you would strongly
> benefit from putting offsets into the search index (and re-index) --
> storeOffsetsWithPositions.  That's an option on the field/fieldType in your
> schema; it may not be obvious reading the docs.  You have to opt-in to
> that; Solr doesn't normally store any info in the index for highlighting.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>


Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Jeremy Smith
Thanks Michael,
 SOLR-13336 seems intriguing.  I'm not a solr expert, but I believe these 
are the relevant sections from our schema definition:


  


  
  


  


  



  
  




  


Our other fieldTypes don't have any analyzers attached to them.


If SOLR-13336 is the cause of the issue, is the best remedy to upgrade to Solr 
8?  It doesn't look like the fix was backported to 7.x.

Our schema has some issues arising from not fully understanding Solr and just 
copying existing structures from the defaults.  In this case, stopwords.txt is 
completely empty and synonyms.txt is just the default synonyms.txt, which seems 
not useful at all for us.  Could I just take out the StopFilterFactory and 
SynonymGraphFilterFactory from the query section (and maybe the 
StopFilterFactory from the index section as well)?
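
If I do strip them out, I imagine the pared-down definition would look roughly
like this -- just a sketch with a made-up name rather than our actual schema,
and with the stop and synonym filters gone the index and query chains become
identical, so a single analyzer would do:

    <fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>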

Thanks again,
Jeremy


From: Michael Gibney 
Sent: Monday, January 11, 2021 8:30 PM
To: solr-user@lucene.apache.org 
Subject: Re: Solr using all available CPU and becoming unresponsive

Hi Jeremy,
Can you share your analysis chain configs? (SOLR-13336 can manifest in a
similar way, and would affect 7.3.1 with a susceptible config, given the
right (wrong?) input ...)
Michael

On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:

> Hello all,
>  We have been struggling with an issue where solr will intermittently
> use all available CPU and become unresponsive.  It will remain in this
> state until we restart.  Solr will remain stable for some time, usually a
> few hours to a few days, before this happens again.  We've tried adjusting
> the caches and adding memory to both the VM and JVM, but we haven't been
> able to solve the issue yet.
>
> Here is some info about our server:
> Solr:
>   Solr 7.3.1, running on Java 1.8
>   Running in cloud mode, but there's only one core
>
> Host:
>   CentOS7
>   8 CPU, 56GB RAM
>   The only other processes running on this VM are two zookeepers, one for
> this Solr instance, one for another Solr instance
>
> Solr Config:
>  - One Core
>  - 36 Million documents (Max Doc), 28 million (Num Docs)
>  - ~15GB
>  - 10-20 Requests/second
>  - The schema is fairly large (~100 fields) and we allow faceting and
> searching on many, but not all, of the fields
>  - Data are imported once per minute through the DataImportHandler, with a
> hard commit at the end.  We usually index ~100-500 documents per minute,
> with many of these being updates to existing documents.
>
> Cache settings:
>   size="256"
>  initialSize="256"
>  autowarmCount="8"
>  showItems="64"/>
>
>size="256"
>   initialSize="256"
>   autowarmCount="0"/>
>
> size="1024"
>initialSize="1024"
>autowarmCount="0"/>
>
> For the filterCache, we have tried sizes as low as 128, which caused our
> CPU usage to go up and didn't solve our issue.  autowarmCount used to be
> much higher, but we have reduced it to try to address this issue.
>
>
> The behavior we see:
>
> Solr is normally using ~3-6GB of heap and we usually have ~20GB of free
> memory.  Occasionally, though, solr is not able to free up memory and the
> heap usage climbs.  Analyzing the GC logs shows a sharp incline of usage
> with the GC (the default CMS) working hard to free memory, but not
> accomplishing much.  Eventually, it fills up the heap, maxes out the CPUs,
> and never recovers.  We have tried to analyze the logs to see if there are
> particular queries causing issues or if there are network issues to
> zookeeper, but we haven't been able to find any patterns.  After the issues
> start, we often see session timeouts to zookeeper, but it doesn't appear
> that they are the cause.
>
>
>
> Does anyone have any recommendations on things to try or metrics to look
> into or configuration issues I may be overlooking?
>
> Thanks,
> Jeremy
>
>


Solr 8.7.0 in Cloud mode with Zookeeper 3.4.5 cdh 5.16

2021-01-12 Thread Subhajit Das

Hi,

We are planning to implement Solr Cloud 8.7.0, running in a Kubernetes cluster, 
with an external Zookeeper 3.4.5 (CDH 5.16).
Solr 8.7.0 seems to be matched with Zookeeper 3.6.2. Is there any issue in using 
Zookeeper 3.4.5 (CDH 5.16)?

Thanks in advance.

Regards,
Subhajit



SOLR 7.5 : java.io.IOException

2021-01-12 Thread Akreeti Agarwal
Classification: Internal
Hi,

I am using SOLR 7.5 with a master-slave architecture. I have two slaves connected 
to the master; when the load increases, the CPU on one of my slave servers spikes 
to 100% and that server gets terminated. In the logs and monitoring I can see 
"java.io.IOException" errors.

Please advise on how to handle this problem.

Regards,
Akreeti Agarwal




Question: JavaBinCodec cannot handle BytesRef object

2021-01-12 Thread Boqi Gao
Dear all:

We have recently been facing a problem while using BinaryDocValuesField in 
Solr 7.3.1.
We have created a binary docValues field.
The constructor BinaryDocValuesField(String name, BytesRef value) needs a 
BytesRef object to be set as its fieldData.

However, JavaBinCodec cannot handle a BytesRef object. It writes the 
BytesRef as a string containing the class name and value.
(please see: 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L247)

And the response is like:
"response":{"docs":[
  { fieldName: "org.apache.lucene.util.BytesRef:[3c c1 a1 28 3d ……]”}
]
}

However, if the value of the field could be handled as a binary value by 
JavaBinCodec, the TextResponseWriter would write the response as a Base64 string.
(please see:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java#L190-L192
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L247
)

And the response, which we hope to get, is like:
"response":{"docs":[
  { fieldName: "vApp0zDtHj69e9mq……”}
]
}
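
One workaround we are considering -- just a sketch of our own component code,
so the helper below is illustrative and not an existing Solr API beyond
BytesRef itself -- is to copy the BytesRef into a plain byte[] before the
value is added to the response document, since JavaBinCodec writes byte[]
natively and the text response writers render byte[] as Base64:

    import org.apache.lucene.util.BytesRef;

    public final class BytesRefValues {
      // Copy the valid slice of a BytesRef into a standalone byte[] so that
      // JavaBinCodec serializes it as a native byte array instead of falling
      // back to writing the object out as a class-name-plus-value string.
      public static byte[] toByteArray(BytesRef ref) {
        byte[] bytes = new byte[ref.length];
        System.arraycopy(ref.bytes, ref.offset, bytes, 0, ref.length);
        return bytes;
      }
    }

    // e.g. in our own code, before the value reaches the response:
    // doc.setField(fieldName, BytesRefValues.toByteArray(ref));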

We would like to ask whether you have any idea or suggestion for fixing this 
problem. We hope to get the Base64 string in the response.
Many thanks!

Best wishes,
Gao


RE: Query over migrating a solr database from 7.7.1 to 8.7.0

2021-01-12 Thread Flowerday, Matthew J
Hi Jim

 

Thanks for getting back to me.

 

I checked the schema.xml that we are using and it has the line you
mentioned:

 



 

And this is the only reference (apart from within a comment) to _root_ in
the schema.xml. Does your schema.xml have further references to _root_ that
I might need? I also checked our solrconfig.xml file for any references to
_root_ and there are none.

 

Many Thanks

 

Matthew

 

Matthew Flowerday | Consultant | ULEAF

Unisys | 01908 774830|  
matthew.flower...@unisys.com 

Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17
8LX

 

  

 


From: Dyer, Jim  
Sent: 11 January 2021 22:58
To: solr-user@lucene.apache.org
Subject: RE: Query over migrating a solr database from 7.7.1 to 8.7.0

 

EXTERNAL EMAIL - Be cautious of all links and attachments.

When we upgraded from 7.x to 8.x, I ran into an issue similar to yours:
when updating an existing document in the index, the document would be
duplicated instead of replaced as expected.  The solution was to add a
"_root_" field to schema.xml like this:

 



 

It appeared that when a feature was added for nested documents, this field
somehow became mandatory in order for updates to work properly, at least in
some cases.

 

From: Flowerday, Matthew J <matthew.flower...@gb.unisys.com>
Sent: Saturday, January 9, 2021 4:44 AM
To: solr-user@lucene.apache.org  
Subject: RE: Query over migrating a solr database from 7.7.1 to 8.7.0

 

Hi There

 

As a test I stopped Solr and ran the IndexUpgrader tool on the database to
see if this might fix the issue. It completed OK but unfortunately the issue
still occurs - a new version of the record is created in Solr rather than
the original record being updated.

 

It looks to me as if the records created under 7.7.1 are somehow not being
'marked as deleted' in the way that records created under 8.7.0 are. Is
there a way for these records to be marked as deleted when they are updated?

 

Many Thanks

 

Matthew

 

 

Matthew Flowerday | Consultant | ULEAF

Unisys | 01908 774830|  
matthew.flower...@unisys.com 

Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17
8LX

 

  

 


From: Flowerday, Matthew J <matthew.flower...@gb.unisys.com>
Sent: 07 January 2021 12:25
To: solr-user@lucene.apache.org  
Subject: Query over migrating a solr database from 7.7.1 to 8.7.0

 

Hi There

 

I have recently upgraded a solr database from 7.7.1 to 8.7.0 without wiping
the database and re-indexing (as this would take too long to run on site).

 

On my local windows machine I have a single solr server 7.7.1 installation

 

I upgraded in the following manner

 

*   Installed windows solr 8.7.0 on my machine in a different folder
*   Copied the core related folder (holding conf, data, lib,
core.properties) from 7.7.1 to the new 8.7.0 folder
*   Brought up Solr
*   Checked that queries work through the Solr Admin Tool and our
application

 

This all worked fine until I tried to update a record which had been created
under 7.7.1. Instead of marking the old record as deleted, it effectively
created a new copy of the record with the change in it and left the old image
still visible. When I updated the record again it then correctly updated
the new 8.7.0 version without leaving the old image behind. If I created a
new record and then updated it, the Solr record would be updated correctly.
The issue only seemed to affect the old 7.7.1-created records.

 

An example of the duplication is as follows (the first record is the
7.7.1-created version and the second record is the 8.7.0 version after
carrying out an update):

 

{

  "responseHeader":{

"status":0,

"QTime":4,

"params":{

  "q":"id:9901020319M01-N26",

  "_":"1610016003669"}},

  "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[

  {