Re: full name free text search problem

2018-01-31 Thread Alexandre Rafalovitch
You need to tokenize the full name in several different ways and then
search both (all) tokenization versions with different boosts.

That way you can tokenize the name as one full string (perhaps lowercased), and
also on whitespace, and then maybe even with a phonetic mapping to catch
alternative spellings.

You can see something similar in:
https://gist.github.com/arafalov/5e04884e5aefaf46678c
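
For illustration, a minimal sketch of that kind of multi-analysis setup (the field and type names here are made up, not taken from the gist):

<!-- Whole name as a single lowercased token -->
<fieldType name="name_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Individual name parts, split on whitespace -->
<fieldType name="name_tokens" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Phonetic form, to catch alternative spellings -->
<fieldType name="name_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>

<field name="full_name" type="name_tokens" indexed="true" stored="true"/>
<field name="full_name_exact" type="name_exact" indexed="true" stored="false"/>
<field name="full_name_phonetic" type="name_phonetic" indexed="true" stored="false"/>
<copyField source="full_name" dest="full_name_exact"/>
<copyField source="full_name" dest="full_name_phonetic"/>

The query side then searches all variants with different boosts, for example
defType=edismax&qf=full_name_exact^10 full_name^3 full_name_phonetic.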

Regards,
   Alex.

On 31 January 2018 at 05:48, Deepak Udapudi  wrote:
> Hi all,
>
> I have the below scenario in full name search that we are trying to implement.
>
> Solr configuration :-
>
> <fieldType name="keywords_text" class="solr.TextField">
>   <analyzer>
>     [tokenizer and filter definitions stripped in the archive]
>   </analyzer>
> </fieldType>
>
> <field name="..." type="keywords_text" multiValued="true" />
> [remaining field and copyField definitions stripped in the archive]
>
> Scenario :-
>
> The Solr configuration has the office name, facility name and full name fields as
> shown above.
> We search based on the input name, with the records sorted by distance.
>
> Problem :-
>
> I am getting the records matching the full name, sorted by distance.
> If an input string such as "Dae Kim" is provided, I also get records other than
> Dae Kim (for example "Rodney Kim") at the top of the search results, mixed in with
> the Dae Kim records. Such a record can appear just before the next Dae Kim because
> "Kim" matches all of the fields (full name, facility name and office name), so its
> hit frequency is high and its distance is smaller than that of the next Dae Kim,
> which appears lower in the results with a higher distance.
>
> Expected results :-
>
> I want all the records for Dae Kim to appear at the top of the search
> results, sorted by distance, without any irrelevant results mixed in.
>
> Queries :-
>
> Has anyone faced this problem before, and if so, what was the fix?
> How should I handle it?
>
> Any inputs would be highly appreciated.
>
> Thanks in advance.
>
> Regards,
> Deepak
>
>
>
>
> The information contained in this email message and any attachments is 
> confidential and intended only for the addressee(s). If you are not an 
> addressee, you may not copy or disclose the information, or act upon it, and 
> you should delete it entirely from your email system. Please notify the 
> sender that you received this email in error.


Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: Perhaps you are right that we should not use stemming. Instead, fuzzy
search will cover these cases along with the misspellings.

In the case of symbols, we want the titles matching the symbols ranked higher
than the others. Perhaps we can use this field only for boosting.

Certain movies have around 4-6 different aliases, depending on what our source
gives us, and we do not really know the maximum. Is there no other way in
Lucene/Solr to use a multivalued field for this?
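
For reference, the arrangement described further down this thread (norms kept on the fields used for phrase boosting, disabled on the plain match fields) would look roughly like this in schema.xml; the field names are illustrative and text_general stands in for whatever analyzer chain is used:

<!-- Used in edismax qf for matching: norms off, so the summed length of a
     multivalued title field cannot distort the score -->
<field name="title_tokens" type="text_general" indexed="true" stored="false"
       multiValued="true" omitNorms="true"/>

<!-- Used in edismax pf for phrase boosting: norms on, so shorter and exact
     titles are rewarded -->
<field name="title_phrase" type="text_general" indexed="true" stored="false"
       multiValued="true" omitNorms="false"/>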


On Thu, Feb 1, 2018 at 11:06 AM, Walter Underwood 
wrote:

> I was the first search engineer at Netflix and moved their search from a
> home-grown engine to Solr. It worked very well with a single title field
> and aliases.
>
> I think your schema is too complicated for movie search.
>
> Stemming is not useful. It doesn’t help search and it can hurt. You don’t
> want the movie “Saw” to match the query “see”.
>
> When is it useful to search with symbols? Remove the punctuation.
>
> The only movie titles with symbols that caused any challenge were:
>
> * Frost/Nixon
> * .hack//Sign
> * +/-
>
> For the first two, removing punctuation worked fine. For the last one, I
> hardcoded a translation to “plus/minus” before indexing or querying.
>
> Query completion made a huge difference, taking our clickthrough rate from
> 0.45 to 0.55.
>
> Later, we added fuzzy search to handle misspellings.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jan 31, 2018, at 8:54 PM, Sravan Kumar  wrote:
> >
> > @Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents.
> This
> > is done through the fieldnorm component in the class. The issue is when
> the
> > field is multivalued. Consider the field has two string each of 4 tokens.
> > The fieldNorm from the lucene TFIDFSimilarity class considers the total
> sum
> > of these two values i.e 8 for normalizing instead of 4. Hence, the
> ranking
> > is distorted.
> > Regarding the search evaluation, we do have a curated set.
> >
> >
> > On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey  wrote:
> >
> >> For smaller length documents TFIDFSimilarity will weight towards shorter
> >> documents.  Another way to say this, if your documents are 5-10 terms,
> the
> >> 5 terms are going to win.
> >> You might think about having per token, or token pair, weight.  I would
> be
> >> surprised if there was not something similar out there.  This is a
> common
> >> issue with any short text.
> >> I guess I would think of this as TFICF, where the CF is the corpus
> >> frequency. You also might want to weight inversely proportional to the
> age
> >> of the title, older are less important.  This is assuming people are
> doing
> >> searches within some time cluster, newer is more likely.
> >>
> >> For some obvious advice, things you probably already know.  This kind of
> >> search needs some hard measurement to begin to know how to tune it.  You
> >> need to find a reasonable annotated representation.  So, if you took the
> >> previous months searches where there is a chain of successive
> searches.  If
> >> you weighted things differently would you shorten the length of the
> chain.
> >> Can you get the click throughs to happen sooner.
> >>
> >> Anyway, just my 2 cents
> >>
> >>
> >> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar  wrote:
> >>
> >>>
> >>> @Walter: We have 6 fields declared in schema.xml for title each with
> >>> different type of analyzer. One without processing symbols, other
> stemmed
> >>> and other removing  symbols, etc. So, if we have separate fields for
> each
> >>> alias it will be that many times the number of final fields declared in
> >>> schema.xml. And we exactly do not know what is the maximum number of
> >>> aliases a movie can have.
> >>> @Walter: I will try this but isn’t there any other way  where I can
> >> tweak ?
> >>>
> >>> @eric: will try this. But it will work only for exact matches.
> >>>
> >>>
>  On Jan 31, 2018, at 10:39 PM, Erick Erickson  >
> >>> wrote:
> 
>  Or use a boost for the phrase, something like
>  "beauty and the beast"^5
> 
> > On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
> >>> wun...@wunderwood.org> wrote:
> > You can use a separate field for title aliases. That is what I did
> for
> >>> Netflix search.
> >
> > Why disable idf? Disabling tf for titles can be a good idea, for
> >>> example the movie “New York, New York” is not twice as much about New
> >> York
> >>> as some other film that just lists it once.
> >
> > Also, consider using a popularity score as a boost.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
> >>
> >> Hi,
> >> We are using solr for our movie title search.
> >>
> >>
> >> As 

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
I was the first search engineer at Netflix and moved their search from a 
home-grown engine to Solr. It worked very well with a single title field and 
aliases.

I think your schema is too complicated for movie search.

Stemming is not useful. It doesn’t help search and it can hurt. You don’t want 
the movie “Saw” to match the query “see”.

When is it useful to search with symbols? Remove the punctuation.

The only movie titles with symbols that caused any challenge were:

* Frost/Nixon
* .hack//Sign
* +/-

For the first two, removing punctuation worked fine. For the last one, I 
hardcoded a translation to “plus/minus” before indexing or querying.
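
For what it's worth, if someone wanted to do that translation inside Solr rather than before it, a char filter is one way to express it (purely illustrative, not how it was actually done):

<fieldType name="title_text" class="solr.TextField">
  <analyzer>
    <!-- Rewrite the literal title "+/-" before tokenization -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\+/-" replacement="plus/minus"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>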

Query completion made a huge difference, taking our clickthrough rate from 0.45 
to 0.55.

Later, we added fuzzy search to handle misspellings.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 31, 2018, at 8:54 PM, Sravan Kumar  wrote:
> 
> @Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This
> is done through the fieldnorm component in the class. The issue is when the
> field is multivalued. Consider the field has two string each of 4 tokens.
> The fieldNorm from the lucene TFIDFSimilarity class considers the total sum
> of these two values i.e 8 for normalizing instead of 4. Hence, the ranking
> is distorted.
> Regarding the search evaluation, we do have a curated set.
> 
> 
> On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey  wrote:
> 
>> For smaller length documents TFIDFSimilarity will weight towards shorter
>> documents.  Another way to say this, if your documents are 5-10 terms, the
>> 5 terms are going to win.
>> You might think about having per token, or token pair, weight.  I would be
>> surprised if there was not something similar out there.  This is a common
>> issue with any short text.
>> I guess I would think of this as TFICF, where the CF is the corpus
>> frequency. You also might want to weight inversely proportional to the age
>> of the title, older are less important.  This is assuming people are doing
>> searches within some time cluster, newer is more likely.
>> 
>> For some obvious advice, things you probably already know.  This kind of
>> search needs some hard measurement to begin to know how to tune it.  You
>> need to find a reasonable annotated representation.  So, if you took the
>> previous months searches where there is a chain of successive searches.  If
>> you weighted things differently would you shorten the length of the chain.
>> Can you get the click throughs to happen sooner.
>> 
>> Anyway, just my 2 cents
>> 
>> 
>> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar  wrote:
>> 
>>> 
>>> @Walter: We have 6 fields declared in schema.xml for title each with
>>> different type of analyzer. One without processing symbols, other stemmed
>>> and other removing  symbols, etc. So, if we have separate fields for each
>>> alias it will be that many times the number of final fields declared in
>>> schema.xml. And we exactly do not know what is the maximum number of
>>> aliases a movie can have.
>>> @Walter: I will try this but isn’t there any other way  where I can
>> tweak ?
>>> 
>>> @eric: will try this. But it will work only for exact matches.
>>> 
>>> 
 On Jan 31, 2018, at 10:39 PM, Erick Erickson 
>>> wrote:
 
 Or use a boost for the phrase, something like
 "beauty and the beast"^5
 
> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
>>> wun...@wunderwood.org> wrote:
> You can use a separate field for title aliases. That is what I did for
>>> Netflix search.
> 
> Why disable idf? Disabling tf for titles can be a good idea, for
>>> example the movie “New York, New York” is not twice as much about New
>> York
>>> as some other film that just lists it once.
> 
> Also, consider using a popularity score as a boost.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
>> 
>> Hi,
>> We are using solr for our movie title search.
>> 
>> 
>> As it is "title search", this should be treated different than the
>>> normal
>> document search.
>> Hence, we use a modified version of TFIDFSimilarity with the
>> following
>> changes.
>> -  disabled TF & IDF and will only have 1 as value.
>> -  disabled norms by specifying omitNorms as true for all the fields.
>> 
>> There are 6 fields with different analyzers and we make use of
>>> different
>> weights in edismax's qf & pf parameters to match tokens & boost
>>> phrases.
>> 
>> But, movies could have aliases and have multiple titles. So, we made
>>> the
>> fields multivalued.
>> 
>> Now, consider the following four documents
>> 1>  "Beauty and the Beast"
>> 2>  "The Real Beauty and the 

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This
is done through the fieldNorm component in the class. The issue is when the
field is multivalued. Consider a field that has two strings, each of 4 tokens.
The fieldNorm from the Lucene TFIDFSimilarity class uses the total sum of the
two lengths, i.e. 8, for normalizing instead of 4. Hence, the ranking
is distorted.
Regarding the search evaluation, we do have a curated set.


On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey  wrote:

> For smaller length documents TFIDFSimilarity will weight towards shorter
> documents.  Another way to say this, if your documents are 5-10 terms, the
> 5 terms are going to win.
> You might think about having per token, or token pair, weight.  I would be
> surprised if there was not something similar out there.  This is a common
> issue with any short text.
> I guess I would think of this as TFICF, where the CF is the corpus
> frequency. You also might want to weight inversely proportional to the age
> of the title, older are less important.  This is assuming people are doing
> searches within some time cluster, newer is more likely.
>
> For some obvious advice, things you probably already know.  This kind of
> search needs some hard measurement to begin to know how to tune it.  You
> need to find a reasonable annotated representation.  So, if you took the
> previous months searches where there is a chain of successive searches.  If
> you weighted things differently would you shorten the length of the chain.
> Can you get the click throughs to happen sooner.
>
> Anyway, just my 2 cents
>
>
> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar  wrote:
>
> >
> > @Walter: We have 6 fields declared in schema.xml for title each with
> > different type of analyzer. One without processing symbols, other stemmed
> > and other removing  symbols, etc. So, if we have separate fields for each
> > alias it will be that many times the number of final fields declared in
> > schema.xml. And we exactly do not know what is the maximum number of
> > aliases a movie can have.
> > @Walter: I will try this but isn’t there any other way  where I can
> tweak ?
> >
> > @eric: will try this. But it will work only for exact matches.
> >
> >
> > > On Jan 31, 2018, at 10:39 PM, Erick Erickson 
> > wrote:
> > >
> > > Or use a boost for the phrase, something like
> > > "beauty and the beast"^5
> > >
> > >> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
> > wun...@wunderwood.org> wrote:
> > >> You can use a separate field for title aliases. That is what I did for
> > Netflix search.
> > >>
> > >> Why disable idf? Disabling tf for titles can be a good idea, for
> > example the movie “New York, New York” is not twice as much about New
> York
> > as some other film that just lists it once.
> > >>
> > >> Also, consider using a popularity score as a boost.
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> wun...@wunderwood.org
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
> > >>>
> > >>> Hi,
> > >>> We are using solr for our movie title search.
> > >>>
> > >>>
> > >>> As it is "title search", this should be treated different than the
> > normal
> > >>> document search.
> > >>> Hence, we use a modified version of TFIDFSimilarity with the
> following
> > >>> changes.
> > >>> -  disabled TF & IDF and will only have 1 as value.
> > >>> -  disabled norms by specifying omitNorms as true for all the fields.
> > >>>
> > >>> There are 6 fields with different analyzers and we make use of
> > different
> > >>> weights in edismax's qf & pf parameters to match tokens & boost
> > phrases.
> > >>>
> > >>> But, movies could have aliases and have multiple titles. So, we made
> > the
> > >>> fields multivalued.
> > >>>
> > >>> Now, consider the following four documents
> > >>> 1>  "Beauty and the Beast"
> > >>> 2>  "The Real Beauty and the Beast"
> > >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> > >>> 4>  "Beauty and the Beast"
> > >>>
> > >>> Note: Document 3 has two titles in it.
> > >>>
> > >>> So, for a query "Beauty and the Beast" and with the above
> > configuration all
> > >>> the documents receive same score. But 1,3,4 should have got same
> score
> > and
> > >>> document 2 lesser than others.
> > >>>
> > >>> To solve this, we followed what is suggested in the following thread:
> > >>> http://lucene.472066.n3.nabble.com/Influencing-scores-
> > on-values-in-multiValue-fields-td1791651.html
> > >>>
> > >>> Now, the fields which are used to boost are made to use Norms. And
> for
> > >>> matching norms are disabled. This is to make sure that exact & near
> > exact
> > >>> matches are rewarded.
> > >>>
> > >>> But, for the same query, we get the following results.
> > >>> query: "Beauty & the Beast"
> > >>> Search Results:
> > >>> 1>  "Beauty and the Beast"
> > >>> 4>  "Beauty and the Beast"
> > >>> 

Re: Query fields with data of certain length

2018-01-31 Thread Zheng Lin Edwin Yeo
Hi,

Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法?

Regards,
Edwin


On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo 
wrote:

> Hi Emir,
>
> An example of the string in Chinese is 预支款管理及账务处理办法
>
> The number of characters is 12, but the expected length should be 36.
>
> Regards,
> Edwin
>
>
> On 4 January 2018 at 16:21, Emir Arnautović 
> wrote:
>
>> Hi Edwin,
>> I don’t have enough knowledge in eastern languages to know what is
>> expected number when you as for sting length. Maybe you can try some of
>> regex unicode settings and see if you’ll get what you need: try setting
>> unicode flag with (?U) or try using regex groups and ranges. If you provide
>> example string and expected length, maybe we could provide you regex.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo 
>> wrote:
>> >
>> > Hi Emir,
>> >
>> > So this would likely be different from what the operating system
>> counts, as
>> > the operating system may consider each Chinese characters as 3 to 4
>> bytes.
>> > Which is probably why I could not find any record with
>> subject:/.{255,}.*/
>> >
>> > Is there other tools that we can use to query the length for data that
>> are
>> > already indexed which are not in the standard English language? (Eg:
>> > Chinese, Japanese, etc)
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 3 January 2018 at 23:51, Emir Arnautović <
>> emir.arnauto...@sematext.com>
>> > wrote:
>> >
>> >> Hi Edwin,
>> >> I do not know, but my guess would be that each character is counted as
>> 1
>> >> in regex regardless how many bytes it takes in used encoding.
>> >>
>> >> Regards,
>> >> Emir
>> >> --
>> >> Monitoring - Log Management - Alerting - Anomaly Detection
>> >> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>> >>
>> >>
>> >>
>> >>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo 
>> >> wrote:
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> I am doing the search on existing data that has already been indexed,
>> and
>> >>> it is likely to be a one time thing.
>> >>>
>> >>> This  subject:/.{255,}.*/  works for English characters. However,
>> there
>> >> are
>> >>> Chinese characters in some of the records. The length seems to be more
>> >> than
>> >>> 255, but it does not shows up in the results.
>> >>>
>> >>> Do you know how the length for Chinese characters and other languages
>> are
>> >>> being determined?
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>>
>> >>> On 3 January 2018 at 23:01, Alexandre Rafalovitch > >
>> >>> wrote:
>> >>>
>>  Do that during indexing as Emir suggested. Specifically, use an
>>  UpdateRequestProcessor chain, probably with the Clone and FieldLength
>>  processors: http://www.solr-start.com/javadoc/solr-lucene/org/
>>  apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>> 
>>  Regards,
>>   Alex.
>> 
>>  On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com
>> >>>
>>  wrote:
>> > Hi,
>> >
>> > Would like to check, if it is possible to query a field which has
>> data
>> >> of
>> > more than a certain length?
>> >
>> > Like for example, I want to query the field subject that has more
>> than
>>  255
>> > bytes. Is it possible?
>> >
>> > I am currently using Solr 6.5.1.
>> >
>> > Regards,
>> > Edwin
>> 
>> >>
>> >>
>>
>>
>
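
For reference, a minimal sketch of the Clone + FieldLength update processor chain Alexandre suggests above; the chain and field names are made up, and the stored length is counted in Java characters rather than bytes:

<updateRequestProcessorChain name="store-subject-length">
  <!-- Copy the subject text into a helper field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">subject</str>
    <str name="dest">subject_length</str>
  </processor>
  <!-- Replace the copied text with its length -->
  <processor class="solr.FieldLengthUpdateProcessorFactory">
    <arr name="fieldName">
      <str>subject_length</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With subject_length defined as an integer field, a query such as subject_length:[255 TO *] sidesteps the regex length question, although only for documents indexed through this chain.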


Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
For small documents, TFIDFSimilarity will weight towards the shorter ones.
Another way to say this: if your documents are 5-10 terms, the 5-term
documents are going to win.
You might think about having a per-token, or per-token-pair, weight.  I would be
surprised if there was not something similar out there.  This is a common
issue with any short text.
I guess I would think of this as TFICF, where the CF is the corpus
frequency. You might also want to weight inversely proportionally to the age
of the title, so older titles are less important.  This assumes people are doing
searches within some time cluster, where newer is more likely.

For some obvious advice, things you probably already know: this kind of
search needs some hard measurement before you know how to tune it.  You
need to find a reasonably annotated representation.  So, suppose you took the
previous month's searches where there is a chain of successive searches: if
you weighted things differently, would you shorten the length of the chain?
Could you get the click-throughs to happen sooner?

Anyway, just my 2 cents


On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar  wrote:

>
> @Walter: We have 6 fields declared in schema.xml for title each with
> different type of analyzer. One without processing symbols, other stemmed
> and other removing  symbols, etc. So, if we have separate fields for each
> alias it will be that many times the number of final fields declared in
> schema.xml. And we exactly do not know what is the maximum number of
> aliases a movie can have.
> @Walter: I will try this but isn’t there any other way  where I can tweak ?
>
> @eric: will try this. But it will work only for exact matches.
>
>
> > On Jan 31, 2018, at 10:39 PM, Erick Erickson 
> wrote:
> >
> > Or use a boost for the phrase, something like
> > "beauty and the beast"^5
> >
> >> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
> wun...@wunderwood.org> wrote:
> >> You can use a separate field for title aliases. That is what I did for
> Netflix search.
> >>
> >> Why disable idf? Disabling tf for titles can be a good idea, for
> example the movie “New York, New York” is not twice as much about New York
> as some other film that just lists it once.
> >>
> >> Also, consider using a popularity score as a boost.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
> >>>
> >>> Hi,
> >>> We are using solr for our movie title search.
> >>>
> >>>
> >>> As it is "title search", this should be treated different than the
> normal
> >>> document search.
> >>> Hence, we use a modified version of TFIDFSimilarity with the following
> >>> changes.
> >>> -  disabled TF & IDF and will only have 1 as value.
> >>> -  disabled norms by specifying omitNorms as true for all the fields.
> >>>
> >>> There are 6 fields with different analyzers and we make use of
> different
> >>> weights in edismax's qf & pf parameters to match tokens & boost
> phrases.
> >>>
> >>> But, movies could have aliases and have multiple titles. So, we made
> the
> >>> fields multivalued.
> >>>
> >>> Now, consider the following four documents
> >>> 1>  "Beauty and the Beast"
> >>> 2>  "The Real Beauty and the Beast"
> >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> >>> 4>  "Beauty and the Beast"
> >>>
> >>> Note: Document 3 has two titles in it.
> >>>
> >>> So, for a query "Beauty and the Beast" and with the above
> configuration all
> >>> the documents receive same score. But 1,3,4 should have got same score
> and
> >>> document 2 lesser than others.
> >>>
> >>> To solve this, we followed what is suggested in the following thread:
> >>> http://lucene.472066.n3.nabble.com/Influencing-scores-
> on-values-in-multiValue-fields-td1791651.html
> >>>
> >>> Now, the fields which are used to boost are made to use Norms. And for
> >>> matching norms are disabled. This is to make sure that exact & near
> exact
> >>> matches are rewarded.
> >>>
> >>> But, for the same query, we get the following results.
> >>> query: "Beauty & the Beast"
> >>> Search Results:
> >>> 1>  "Beauty and the Beast"
> >>> 4>  "Beauty and the Beast"
> >>> 2>  "The Real Beauty and the Beast"
> >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> >>>
> >>> Clearly, the changes have solved only a part of the problem. The
> document 3
> >>> should be ranked/scored higher than document 2.
> >>>
> >>> This is because lucene considers the total field length across all the
> >>> values in a multivalued field for normalization.
> >>>
> >>> How do we handle this scenario and make sure that in multivalued
> fields the
> >>> normalization is taken care of?
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Sravan
> >>
>


Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar

@Walter: We have 6 fields declared in schema.xml for the title, each with a different
type of analyzer: one without symbol processing, another stemmed, another with
symbols removed, etc. So if we had a separate set of fields for each alias, the number
of fields declared in schema.xml would be multiplied by that many, and we do not know
exactly what the maximum number of aliases a movie can have is.
@Walter: I will try this, but isn't there any other way I can tweak it?

@Erick: will try this, but it will only work for exact matches.


> On Jan 31, 2018, at 10:39 PM, Erick Erickson  wrote:
> 
> Or use a boost for the phrase, something like
> "beauty and the beast"^5
> 
>> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood  
>> wrote:
>> You can use a separate field for title aliases. That is what I did for 
>> Netflix search.
>> 
>> Why disable idf? Disabling tf for titles can be a good idea, for example the 
>> movie “New York, New York” is not twice as much about New York as some other 
>> film that just lists it once.
>> 
>> Also, consider using a popularity score as a boost.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
>>> 
>>> Hi,
>>> We are using solr for our movie title search.
>>> 
>>> 
>>> As it is "title search", this should be treated different than the normal
>>> document search.
>>> Hence, we use a modified version of TFIDFSimilarity with the following
>>> changes.
>>> -  disabled TF & IDF and will only have 1 as value.
>>> -  disabled norms by specifying omitNorms as true for all the fields.
>>> 
>>> There are 6 fields with different analyzers and we make use of different
>>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>>> 
>>> But, movies could have aliases and have multiple titles. So, we made the
>>> fields multivalued.
>>> 
>>> Now, consider the following four documents
>>> 1>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 4>  "Beauty and the Beast"
>>> 
>>> Note: Document 3 has two titles in it.
>>> 
>>> So, for a query "Beauty and the Beast" and with the above configuration all
>>> the documents receive same score. But 1,3,4 should have got same score and
>>> document 2 lesser than others.
>>> 
>>> To solve this, we followed what is suggested in the following thread:
>>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>>> 
>>> Now, the fields which are used to boost are made to use Norms. And for
>>> matching norms are disabled. This is to make sure that exact & near exact
>>> matches are rewarded.
>>> 
>>> But, for the same query, we get the following results.
>>> query: "Beauty & the Beast"
>>> Search Results:
>>> 1>  "Beauty and the Beast"
>>> 4>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 
>>> Clearly, the changes have solved only a part of the problem. The document 3
>>> should be ranked/scored higher than document 2.
>>> 
>>> This is because lucene considers the total field length across all the
>>> values in a multivalued field for normalization.
>>> 
>>> How do we handle this scenario and make sure that in multivalued fields the
>>> normalization is taken care of?
>>> 
>>> 
>>> --
>>> Regards,
>>> Sravan
>> 


Re: Distributed search cross cluster

2018-01-31 Thread Jan Høydahl
Erick:

> ...one for each cluster and just merged the docs when it got them back


This would be the logical way. I'm afraid that "just merged the docs" is the 
crux here, that would
make this an expensive task. You'd have to merge docs, facets, highlights etc, 
handle the
different search phases (ID fetch, doc fetch, potentially global idf fetch?) 
etc.
It may be that the code necessary to do the merge already exists in the 
project, haven't looked...

Charlie:

Yes it should "just" work. Until someone upgrades the schema in one cloud and 
not the others
of course :) and we still need to handle failure cases such as high latency or 
one cluster down…

Besides, we'll have SSL certs with client auth and probably some sort of 
auth in place in
all clouds, and we'd of course need to make sure that the user exists in all 
clusters and that
cross cluster traffic is allowed in everywhere. PKI auth is not really intended 
for accepting
requests from a foreign node that is not in its ZK etc.
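
For reference, the plain fan-out being discussed (quoted below) is essentially a hand-built shards list spanning all clusters; the host names and collection here are made up, and shards.tolerant=true would at least keep partial results flowing when one cluster is down:

curl "http://dc1-solr1:8983/solr/logs/select" \
  --data-urlencode "q=some query" \
  --data-urlencode "shards=a1:8983/solr/logs|a2:8983/solr/logs,b1:8983/solr/logs|b2:8983/solr/logs" \
  --data-urlencode "shards.tolerant=true" \
  --data-urlencode "rows=10"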

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 31. jan. 2018 kl. 10:06 skrev Charlie Hull :
> 
> On 30/01/2018 16:09, Jan Høydahl wrote:
>> Hi,
>> A customer has 10 separate SolrCloud clusters, with same schema across all, 
>> but different content.
>> Now they want users in each location to be able to federate a search across 
>> all locations.
>> Each location is 100% independent, with separate ZK etc. Bandwidth and 
>> latency between the
>> clusters is not an issue, they are actually in the same physical datacenter.
>> Now my first thought was using a custom shards parameter, and let the 
>> receiving node fan
>> out to all shards of all clusters. We’d need to contact the ZK for each 
>> environment and find
>> all shards and replicas participating in the collection and then construct 
>> the shards=A1|A2,B1|B2…
>> sting which would be quite big, but if we get it right, it should “just 
>> work".
>> Now, my question is whether there are other smarter ways that would leave it 
>> up to existing Solr
>> logic to select shards and load balance, that would also take into account 
>> any shard.keys/_route_
>> info etc. I thought of these
>>   * collection=collA,collB  — but it only supports collections local to one 
>> cloud
>>   * Create a collection ALIAS to point to all 10 — but same here, only local 
>> to one cluster
>>   * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want 
>> it with pure search API
>>   * Write a custom ShardHandler plugin that knows about all clusters — but 
>> this is complex stuff :)
>>   * Write a custom SearchComponent plugin that knows about all clusters and 
>> adds the shards= param
>> Another approach would be for the originating cluster to fan out just ONE 
>> request to each of the other
>> clusters and then write some SearchComponent to merge those responses. That 
>> would let us query
>> the other clusters using one LB IP address instead of requiring full 
>> visibility to all solr nodes
>> of all clusters, but if we don’t need that isolation, that extra merge code 
>> seems fairly complex.
>> So far I opt for the custom SearchComponent and shards= param approach. Any 
>> useful input from
>> someone who tried a similar approach would be priceless!
> 
> Hi Jan,
> 
> We actually looked at this for the BioSolr project - a SolrCloud of 
> SolrClouds. Unfortunately the funding didn't appear for the project so we 
> didn't take it any further than some rough ideas - as you say, if you get it 
> right it should 'just work'. We had some extra complications in terms of 
> shared partial schemas...
> 
> Cheers
> 
> Charlie
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com 
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk 


Re: facet.method=uif not working in solr cloud?

2018-01-31 Thread Wei
Thanks Alessandro. Totally agree that from the logic I can't see why the
requested facet.method=uif is not accepted. I don't see anything in
solr.log also.  However I find that the uif method somehow works with json
facet api in cloud mode,  e.g:

curl http://mysolrcloud:8983/solr/mycollection/select -d
'q=*:*&wt=json&rows=0&json.facet={color: {type: terms, field : color,
method : uif, limit:1000, mincount:1}}&debug=true'

Then in the debug response I see:

"facet-trace": {
  "processor": "FacetQueryProcessor",
  "elapse": 453,
  "query": null,
  "domainSize": 70215,
  "sub-facet": [
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 1,   "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 7166 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 1,   "field": "color", "limit": 1000, "numBuckets": 19, "domainSize": 7004 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 2,   "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 7030 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 80,  "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 6969 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 85,  "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 6953 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 85,  "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 6901 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 93,  "field": "color", "limit": 1000, "numBuckets": 20, "domainSize": 6951 },
    { "processor": "FacetFieldProcessorByArrayUIF", "elapse": 104, "field": "color", "limit": 1000, "numBuckets": 19, "domainSize": 7127 }
  ]
}

A few things puzzle me here.  It looks like when using the JSON facet API,
SimpleFacets is not used and the FacetFieldProcessorByArrayUIF processor is
used instead. Is that the expected behavior? Also, with the uif method applied,
facet latency is greatly increased.  Some shards report a much bigger elapse
time (104 vs 1); I wonder what could cause the discrepancy, as my index is
evenly distributed across the shards.
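
For comparison, the classic facet request where uif is not applied is of this form (parameters reconstructed for illustration):

curl http://mysolrcloud:8983/solr/mycollection/select \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=0" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.field=color" \
  --data-urlencode "facet.method=uif" \
  --data-urlencode "facet.limit=1000" \
  --data-urlencode "facet.mincount=1" \
  --data-urlencode "debug=true"

With debug=true, the response should carry the requestedMethod / appliedMethod values that the SimpleFacets code quoted below populates, which is probably the quickest way to see where the fallback from uif happens.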

Thanks,
Wei


On Wed, Jan 31, 2018 at 2:24 AM, Alessandro Benedetti 
wrote:

> I worked personally on the SimpleFacets class which does the facet method
> selection :
>
> FacetMethod appliedFacetMethod = selectFacetMethod(field,
> sf, requestedMethod, mincount,
> exists);
>
> RTimer timer = null;
> if (fdebug != null) {
>fdebug.putInfoItem("requestedMethod", requestedMethod==null?"not
> specified":requestedMethod.name());
>fdebug.putInfoItem("appliedMethod", appliedFacetMethod.name());
>fdebug.putInfoItem("inputDocSetSize", docs.size());
>fdebug.putInfoItem("field", field);
>timer = new RTimer();
> }
>
> Within the select facet method , the only code block related UIF is (
> another block can apply when facet method arrives null to the Solr Node,
> but
> that should not apply as we see the facet method in the debug):
>
> /* UIF without DocValues can't deal with mincount=0, the reason is because
>  we create the buckets based on the values present in the result
> set.
>  So we are not going to see facet values which are not in the
> result
> set */
>  if (method == FacetMethod.UIF
>  && !field.hasDocValues() && mincount == 0) {
>method = field.multiValued() ? FacetMethod.FC : FacetMethod.FCS;
>  }
>
> So is there anything in the logs?
> Because that seems to me the only point where you can change from UIF to FC
> and you clearly have mincount=1.
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Distributed search cross cluster

2018-01-31 Thread Jan Høydahl
Hi,

I am an ex FAST employee and actually used Unity a lot myself, even hacking the 
code
writing custom mixers etc :)

That is all cool, if you want to write a generic federation layer. In our case 
we only ever
need to talk to Solr instances with exactly the same schema and document types,
compatible scores etc. So that’s why I figure it is out of scope to write 
custom merge
code. It would also be less efficient since you’d get, say 10 hits from 10 
clusters = 100 hits
while if you just let the original node talk to all the shards then you only 
fetch the top docs
across all clusters.

I see many many open OLD JIRAs for federated features, which never got anywhere,
so I take that also as a hint that this is either not needed or very complex :)

Talking about FAST ESP, the "fsearch" process responsible for merging results 
from 
underlying indices was actually used at multiple levels, so to federate two 
FAST clusters
all you had to do was put a top level fsearch process above all of them and 
point it to
the right host:port list, then a QRServer on top of that fsearch again. Those 
were the days.

If there was some class that would delegate an incoming search request to sub 
shards
in a generic way, without writing all the merge and two-phase stuff over again, 
then
that would be ideal.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 31. jan. 2018 kl. 10:41 skrev Bernd Fehling :
> 
> Many years ago, in a different universe, when Federated Search was a buzzword 
> we
> used Unity from FAST FDS (which is now MS ESP). It worked pretty well across
> many systems like FAST FDS, Google, Gigablast, ...
> Very flexible with different mixers, parsers, query transformers.
> Was written in Python and used pylib.medusa.
> Search for "unity federated search", there is a book at Google about this, 
> just
> to get an idea.
> 
> Regards, Bernd
> 
> 
> Am 30.01.2018 um 17:09 schrieb Jan Høydahl:
>> Hi,
>> 
>> A customer has 10 separate SolrCloud clusters, with same schema across all, 
>> but different content.
>> Now they want users in each location to be able to federate a search across 
>> all locations.
>> Each location is 100% independent, with separate ZK etc. Bandwidth and 
>> latency between the
>> clusters is not an issue, they are actually in the same physical datacenter.
>> 
>> Now my first thought was using a custom shards parameter, and let the 
>> receiving node fan
>> out to all shards of all clusters. We’d need to contact the ZK for each 
>> environment and find
>> all shards and replicas participating in the collection and then construct 
>> the shards=A1|A2,B1|B2…
>> sting which would be quite big, but if we get it right, it should “just 
>> work".
>> 
>> Now, my question is whether there are other smarter ways that would leave it 
>> up to existing Solr
>> logic to select shards and load balance, that would also take into account 
>> any shard.keys/_route_
>> info etc. I thought of these
>>  * collection=collA,collB  — but it only supports collections local to one 
>> cloud
>>  * Create a collection ALIAS to point to all 10 — but same here, only local 
>> to one cluster
>>  * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want 
>> it with pure search API
>>  * Write a custom ShardHandler plugin that knows about all clusters — but 
>> this is complex stuff :)
>>  * Write a custom SearchComponent plugin that knows about all clusters and 
>> adds the shards= param
>> 
>> Another approach would be for the originating cluster to fan out just ONE 
>> request to each of the other
>> clusters and then write some SearchComponent to merge those responses. That 
>> would let us query
>> the other clusters using one LB IP address instead of requiring full 
>> visibility to all solr nodes
>> of all clusters, but if we don’t need that isolation, that extra merge code 
>> seems fairly complex.
>> 
>> So far I opt for the custom SearchComponent and shards= param approach. Any 
>> useful input from
>> someone who tried a similar approach would be priceless!
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 



Re: Mixing simple and nested docs in same update?

2018-01-31 Thread Jan Høydahl
Thanks for the reply.

I see that the child doctransformer 
(https://lucene.apache.org/solr/guide/6_6/transforming-result-documents.html#TransformingResultDocuments-_child_-ChildDocTransformerFactory)
 has a childFilter= option which, when used, solves the issue/bug.
But such a childFilter does not exist for the BlockJoin QParsers.
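
For concreteness, the transformer form that does behave, using the collection and type field from the gist, is something like:

curl "http://localhost:8983/solr/nested/query" \
  --data-urlencode "q=id:mother" \
  --data-urlencode "fl=*,[child parentFilter=type:parent childFilter=type:child]"

With the childFilter in place, the stray "friend" document is no longer returned as a child of "mother"; that is exactly the knob missing from the BlockJoin query parsers.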

Still not sure whether it is a bug or not...

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 31. jan. 2018 kl. 00:30 skrev Tomas Fernandez Lobbe :
> 
> I believe the problem is that:
> * BlockJoin queries do not know about your “types”, in the BlockJoin query 
> world, everything that’s not a parent (matches the parentFilter) is a child.
> * All docs indexed before a parent are considered childs of that doc.
> That’s why in your first case it considers “friend” (not a parent, then a 
> child) to be a child of the first parent it can find in the segment (mother). 
> In the second case, the “friend” doc would have no parent. No parent document 
> matches the filter after it, so it’s not considered a match. 
> Maybe if you try your query with parentFilter=-type:child, this particular 
> example works (I haven’t tried it)?
> 
> Note that when you send docs with childs to Solr, Solr will make sure the 
> childs are indexed before the parent. Also note that there are some other 
> open bugs related to child docs, and in particular, with mixing child docs 
> with non-child docs, depending on which features you need this may be a 
> problem.
> 
> Tomás
> 
>> On Jan 30, 2018, at 5:48 AM, Jan Høydahl  wrote:
>> 
>> Pasting the GIST link :-) 
>> https://gist.github.com/45640fe3bad696d53ef8a0930a35d163 
>> 
>> Anyone knows if this is expected behavior?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 15. jan. 2018 kl. 14:08 skrev Jan Høydahl :
>>> 
>>> Radio silence…
>>> 
>>> Here is a GIST for easy reproduction. Is this by design?
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
 11. jan. 2018 kl. 00:42 skrev Jan Høydahl :
 
 Hi,
 
 We index several large nested documents. We found that querying the data 
 behaves differently depending on how the documents are indexed.
 
 To reproduce:
 
 solr start
 solr create -c nested
 # Index one plain document, “friend" and a nested one, “mother” and 
 “daughter”, in same request:
 curl localhost:8983/solr/nested/update -d '
 <add>
   <doc>
     <field name="id">friend</field>
     <field name="type">other</field>
   </doc>
   <doc>
     <field name="id">mother</field>
     <field name="type">parent</field>
     <doc>
       <field name="id">daughter</field>
       <field name="type">child</field>
     </doc>
   </doc>
 </add>'
 
 # Query for mother’s children using either child transformer or child 
 query parser
 curl 
 "localhost:8983/solr/a/query?q=id:mother=%2A%2C%5Bchild%20parentFilter%3Dtype%3Aparent%5D”
 {
 "responseHeader":{
 "zkConnected":true,
 "status":0,
 "QTime":4,
 "params":{
   "q":"id:mother",
   "fl":"*,[child parentFilter=type:parent]"}},
 "response":{"numFound":1,"start":0,"docs":[
   {
 "id":"mother",
 "type":["parent"],
 "_version_":1589249812802306048,
 "type_str":["parent"],
 "_childDocuments_":[
 {
   "id":"friend",
   "type":["other"],
   "_version_":1589249812729954304,
   "type_str":["other"]},
 {
   "id":"daughter",
   "type":["child"],
   "_version_":1589249812802306048,
   "type_str":["child"]}]}]
 }}
 
 As you can see, the “friend” got included as a child of “mother”.
 If you index the exact same request, putting “friend” after “mother” in 
 the xml,
 the query works as expected.
 
 Inspecting the index, everything looks correct, and only “daughter” and 
 “mother” have _root_=mother.
 Is there a rule that you should start a new update request for each type 
 of parent/child relationship
 that you need to index, and not mix them in the same request?
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
>>> 
>> 
> 



Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy,
I was thinking of the query q=method:"x-ray*" "Solution NMR"
This should be equivalent to one with OR between the clauses. If you want AND
between those two, the query would be q=+method:"x-ray*" +"Solution NMR"
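
Written out as plain request parameters (with debugQuery=on to check the parsed query), the two variants are:

# Either clause may match (subject to the mm parameter):
q=method:"x-ray*" "Solution NMR"

# Both clauses required:
q=+method:"x-ray*" +"Solution NMR"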

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 31 Jan 2018, at 19:39, Wendy2  wrote:
> 
> Hi Emir,
> 
> Listed below are the debugQuery outputs from query without "OR" operator. I
> really appreciate your help!  --Wendy
> 
> ===DebugQuery Outputs for case 1f-a, 1f-b  without "OR"
> operator=
> *1f-a (/search?q=+method:"x-ray*" +method:"Solution NMR") result counts = 0:
> *
> 
>  "debug":{
>"rawquerystring":" method:\"x-ray*\"  method:\"Solution NMR\"",
>"querystring":" method:\"x-ray*\"  method:\"Solution NMR\"",
>"parsedquery":"(+(PhraseQuery(method:\"x rai\")
> PhraseQuery(method:\"solut nmr\"))~2)/no_coord",
>"parsedquery_toString":"+((method:\"x rai\" method:\"solut nmr\")~2)",
> 
> 
> *1f-b (/search?q=method:"x-ray*" method:"Solution NMR") result counts = 0: *
> 
> "debug":{
>"rawquerystring":"method:\"x-ray*\" method:\"Solution NMR\"",
>"querystring":"method:\"x-ray*\" method:\"Solution NMR\"",
>"parsedquery":"(+(PhraseQuery(method:\"x rai\")
> PhraseQuery(method:\"solut nmr\"))~2)/no_coord",
>"parsedquery_toString":"+((method:\"x rai\" method:\"solut nmr\")~2)",
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Minimum memory requirement

2018-01-31 Thread Shawn Heisey

On 1/31/2018 1:54 PM, TK Solr wrote:
On my AWS t2.micro instance, which only has 1 GB memory, I installed 
Solr (4.7.1 - please don't ask) and tried to run it in sample 
directory as java -jar start.jar. It exited shortly due to lack of 
memory.


How much memory does Solr require to run, with empty core?


If you use the literal commandline "java -jar start.jar" then Java is 
going to decide how much memory it wants to allocate.  It could end up 
deciding to use a value that your specific OS installation can't 
actually support.  Solr can't do anything to change this. You can reduce 
the amount of heap that Java tries to allocate with the -Xmx option -- 
perhaps "java -Xmx512m -jar start.jar" would be a good starting point.


Solr versions since 4.10 have a startup script that sets many things 
that you don't get when running java directly yourself. The startup 
script is greatly improved in 5.0, and has steadily gotten better since 
then.


On a Windows 7 system, I have a download of Solr 7.0.0, with one core 
created using the default configset.  The core is empty and has an index 
size of 72 bytes.


With this commandline (telling Java to use a 16MB max heap), everything 
I did works:


bin\solr start -m 16m

I did not try indexing.  It is likely that indexing would not work with 
a 16MB heap, because I think the example configset would try to allocate 
a 100MB indexing buffer, and would probably need additional memory 
beyond that.  Indexing tends to increase heap requirements, especially 
heavy indexing.


With this commandline, I got out of memory errors just by navigating 
around the admin UI:


bin\solr start -m 12m

Even though I could get Solr working with a 16MB heap, I think I would 
not try running it "for real" with a heap less than the 512MB default 
that the script chooses by default.  On a machine with 1GB of memory, if 
Solr is the only software it has beyond the OS, a 512MB heap would 
probably work, as long as the OS was something like Linux, which is 
fairly lightweight when there is no GUI.  If the OS has a GUI, 1GB is 
probably not enough memory for a 512MB heap.


Once the index begins to achieve any size, most users end up needing to 
increase the heap beyond 512MB, and the machine will need more than 1GB.


The script that I used to start version 7 is not available in version 
4.7.1.  The older version probably has lower memory requirements than 
the newer one, but it would not be *significantly* lower.


Thanks,
Shawn



RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Hello S.G.

We do not complain about speed improvements at all, it is clear 7.x is faster 
than its predecessor. The problem is stability and not recovering from weird 
circumstances. In general, it is our high load cluster containing user 
interaction logs that suffers the most. Our main text search cluster - 
which receives far fewer queries - seems mostly unaffected, except last Sunday: 
after a very short but high burst of queries it entered the same catatonic state 
the logs cluster usually dies from.

The query burst immediately caused ZK timeouts and high heap consumption (not 
sure which came first of the latter two). The query burst lasted for 30 
minutes, the excessive heap consumption continued for more than 8 hours, before 
Solr finally realized it could relax. Most remarkable was that Solr recovered 
on its own, ZK timeouts stopped, heap went back to normal.

There seems to be a causality between high load and this state.

We really want to get this fixed for ourselves and everyone else that may 
encounter this problem, but i don't know how, so i need much more feedback and 
hints from those who have deep understanding of inner working of Solrcloud and 
changes since 6.x.

To be clear, we don't have the problem of 15 second ZK timeout, we use 30. Is 
30 too low still? Is it even remotely related to this problem? What does load 
have to do with it?
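
(For reference, the timeout in question is the one from the exchange quoted below: it is set by uncommenting ZK_CLIENT_TIMEOUT in solr.in.sh, value in milliseconds.)

# solr.in.sh -- commented out by default, in which case bin/solr's
# hard-coded 15 second default applies
ZK_CLIENT_TIMEOUT="30000"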

We are not able to reproduce it in lab environments. It can take minutes after 
cluster startup for it to occur, but also days. 

I've been slightly annoyed by problems that can occur over a broad time span; it 
is always bad luck for reproduction.

Any help getting further is much appreciated.

Many thanks,
Markus
 
-Original message-
> From:S G 
> Sent: Wednesday 31st January 2018 21:48
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> And that came out all right.
> We saw a performance increase of about 30% in read latencies between 6.6.0
> and 7.1.0
> And then we saw a performance degradation of about 10% between 7.1.0 and
> 7.2.1 in many metrics.
> But overall, it still seems better than 6.6.0.
> 
> I will check for the errors too in the logs but the nodes were responsive
> for all the 23+ hours we did the load test.
> 
> Disclaimer: We do not test facets and pivots or block-joins. And will add
> those features to our load-testing tool sometime this year.
> 
> Thanks
> SG
> 
> 
> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma 
> wrote:
> 
> > Ah thanks, i just submitted a patch fixing it.
> >
> > Anyway, in the end it appears this is not the problem we are seeing as our
> > timeouts were already at 30 seconds.
> >
> > All i know is that at some point nodes start to lose ZK connections due to
> > timeouts (logs say so, but all within 30 seconds), the logs are flooded
> > with those messages:
> > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > 10359ms for sessionid 0x160f9e723c12122
> > o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> > 0x60f9e7234f05bb has expired
> >
> > Then there is a doubling in heap usage and nodes become unresponsive, die
> > etc.
> >
> > We also see those messages in other collections, but not so frequently and
> > they don't cause failure in those less loaded clusters.
> >
> > Ideas?
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From:Michael Braun 
> > > Sent: Monday 29th January 2018 21:09
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Believe this is reported in https://issues.apache.org/
> > jira/browse/SOLR-10471
> > >
> > >
> > > On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Hello SG,
> > > >
> > > > The default in solr.in.sh is commented so it defaults to the value
> > set in
> > > > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > > > solr.in.sh and your timeout will be thirty seconds.
> > > >
> > > > For Solr itself to really default to thirty seconds, Solr's bin/solr
> > needs
> > > > to be patched to use the correct value.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > -Original message-
> > > > > From:S G 
> > > > > Sent: Monday 29th January 2018 20:15
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > Hi Markus,
> > > > >
> > > > > We are in the process of upgrading our clusters to 7.2.1 and I am not
> > > > sure
> > > > > I quite follow the conversation here.
> > > > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> > > > value
> > > > > in the config (and it's just a default value being wrong/overridden
> > > > > somewhere)?
> > > > > Or is it more 

Minimum memory requirement

2018-01-31 Thread TK Solr
On my AWS t2.micro instance, which only has 1 GB memory, I installed Solr (4.7.1 
- please don't ask) and tried to run it in sample directory as java -jar 
start.jar. It exited shortly due to lack of memory.


How much memory does Solr require to run, with empty core?

TK




Re: Long GC Pauses

2018-01-31 Thread S G
Hey Maulin,

I hope you are using some tool to look at your gc.log file (there are a
couple available online) or grepping it for pauses.
Do you mind sharing your G1GC settings and some screenshots from your
gc.log analyzer's output?
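
For example, pauses can be pulled straight from the log with a grep, and the G1 settings usually live in the GC_TUNE variable in solr.in.sh; the flags below are illustrative only, not a recommendation:

# Pull pause lines out of the GC log (exact format depends on the JVM version)
grep -i "pause" solr_gc.log | tail -n 20

# The kind of GC_TUNE block worth sharing (solr.in.sh)
GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:+PerfDisableSharedMem \
  -XX:MaxGCPauseMillis=250"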

-SG


On Wed, Jan 31, 2018 at 9:16 AM, Erick Erickson 
wrote:

> Just to double check, when you san you're seeing 60-200 sec  GC pauses
> are you looking at the GC logs (or using some kind of monitor) or is
> that the time it takes the query to respond to the client? Because a
> single GC pause that long on 40G is unusual no matter what. Another
> take on Jason's question is
> For all the JVMs you're running, how much _total_ heap is allocated?
> And how much physical memory is on the box? I generally start with _at
> least_ half the memory left to the OS
>
> These are fairly horrible, what generates such queries?
> AND * AND *
>
> Best,
> Erick
>
>
>
> On Wed, Jan 31, 2018 at 7:28 AM, Jason Gerlowski 
> wrote:
> > Hi Maulin,
> >
> > To clarify, when you said "...allocated 40 GB RAM to each shard." above,
> > I'm going to assume you meant "to each node" instead.  If you actually
> did
> > mean "to each shard" above, please correct me and anyone who chimes in
> > afterward.
> >
> > Firstly, it's really hard to even take guesses about potential causes or
> > remediations without more details about your load characteristics
> > (average/peak QPS, index size, average document size, etc.).  If no one
> > gives any satisfactory advice, please consider uploading additional
> details
> > to help us help you.
> >
> > Secondly, I don't know anything about the load characteristics you're
> > putting on your Solr cluster, but I'm curious whether you've experimented
> > with lower RAM settings.  Generally speaking, the more RAM you have, the
> > longer your GC pauses are likely to be (even with the tuning that various
> > GC settings provide).  If you can get away with giving the Solr process
> > less RAM, you should see your GC pauses shrink.  Was 40GB chosen after
> some
> > trial-and-error experimentation, or is it something you could
> investigate?
> >
> > For a bit more overview on this, see this slightly outdated (but still
> > useful) wiki page: https://wiki.apache.org/solr/
> SolrPerformanceProblems#RAM
> >
> > Hope that helps, even if just to disqualify some potential
> causes/solutions
> > to close in on a real fix.
> >
> > Best,
> >
> > Jason
> >
> > On Wed, Jan 31, 2018 at 8:17 AM, Maulin Rathod 
> wrote:
> >
> >> Hi,
> >>
> >> We are using solr cloud 6.1. We have around 20 collection on 4 nodes (We
> >> have 2 shards and each shard have 2 replicas). We have allocated 40 GB
> RAM
> >> to each shard.
> >>
> >> Intermittently we found long GC pauses (60 sec to 200 sec) due to which
> >> solr stops responding and hence collections goes in recovering mode. It
> >> takes minimum 5-10 minutes (sometime it takes more and we have to
> restart
> >> the solr node) for recovering all collections. We are using default GC
> >> setting (CMS) as per solr.cmd.
> >>
> >> We tried different G1 GC to see if it help, but still we see long GC
> >> pauses(60 sec to 200 sec) and also found that memory usage is more in in
> >> case G1 GC.
> >>
> >> What could be reason for long GC pauses and how can fix it? Insufficient
> >> memory or problem with GC setting or something else? Any suggestion
> would
> >> be greatly appreciated.
> >>
> >> In our analysis, we also found some inefficient queries (which uses *
> many
> >> times in query) in solr logs. Could it be reason for high memory usage?
> >>
> >> Slow Query
> >> --
> >>
> >> INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1
> >> x:documents] o.a.s.c.S.Request [documents]  webapp=/solr path=/select
> >> params={df=summary=false=id=4&
> >> start=0=true=description+asc,id+desc==
> >> s1.asite.com:8983/solr/documents|s1r1.asite.com:
> >> 8983/solr/documents=250=2=((id:(
> >> REV78364_24705418+REV78364_24471492+REV78364_24471429+
> >> REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+
> >> AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+
> >> HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+
> >> -*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*
> >> HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+
> >> Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+
> >> 78364)+AND+is_active:true+AND+((isLatest:(true)+AND+
> >> isFolderActive:true+AND+isXref:false+AND+-document_
> >> type_id:(3+7)+AND+((is_public:true+OR+distribution_list:
> >> 4858120+OR+folderadmin_list:4858120+OR+author_user_id:
> >> 4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(
> >> 4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+
> >> combinationUsers:(4858120))+AND+-blockedUsers:(4858120
> >> +OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(
> >> 4858120)+AND+-folderadmin_list:(4858120)))=true=
> >> 

Re: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread S G
We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
And that came out all right.
We saw an improvement of about 30% in read latencies between 6.6.0
and 7.1.0.
And then we saw a performance degradation of about 10% between 7.1.0 and
7.2.1 in many metrics.
But overall, it still seems better than 6.6.0.

I will check for the errors too in the logs but the nodes were responsive
for all the 23+ hours we did the load test.

Disclaimer: We do not test facets and pivots or block-joins. And will add
those features to our load-testing tool sometime this year.

Thanks
SG


On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma 
wrote:

> Ah thanks, i just submitted a patch fixing it.
>
> Anyway, in the end it appears this is not the problem we are seeing as our
> timeouts were already at 30 seconds.
>
> All i know is that at some point nodes start to lose ZK connections due to
> timeouts (logs say so, but all within 30 seconds), the logs are flooded
> with those messages:
> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> 10359ms for sessionid 0x160f9e723c12122
> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> 0x60f9e7234f05bb has expired
>
> Then there is a doubling in heap usage and nodes become unresponsive, die
> etc.
>
> We also see those messages in other collections, but not so frequently and
> they don't cause failure in those less loaded clusters.
>
> Ideas?
>
> Thanks,
> Markus
>
> -Original message-
> > From:Michael Braun 
> > Sent: Monday 29th January 2018 21:09
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > Believe this is reported in https://issues.apache.org/
> jira/browse/SOLR-10471
> >
> >
> > On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello SG,
> > >
> > > The default in solr.in.sh is commented so it defaults to the value
> set in
> > > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > > solr.in.sh and your timeout will be thirty seconds.
> > >
> > > For Solr itself to really default to thirty seconds, Solr's bin/solr
> needs
> > > to be patched to use the correct value.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > > > From:S G 
> > > > Sent: Monday 29th January 2018 20:15
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > >
> > > > Hi Markus,
> > > >
> > > > We are in the process of upgrading our clusters to 7.2.1 and I am not
> > > sure
> > > > I quite follow the conversation here.
> > > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> > > value
> > > > in the config (and it's just a default value being wrong/overridden
> > > > somewhere)?
> > > > Or is it more severe in the sense that any config set for
> > > ZK_CLIENT_TIMEOUT
> > > > by the user is just ignored completely by Solr in 7.2.1 ?
> > > >
> > > > Thanks
> > > > SG
> > > >
> > > >
> > > > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> > > markus.jel...@openindex.io>
> > > > wrote:
> > > >
> > > > > Ok, i applied the patch and it is clear the timeout is 15000. Solr.xml
> > > > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
> > > > > solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
> > > > > still 15000, not 30000.
> > > > >
> > > > > But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
> > > > > To be sure, i applied your patch to a production machine, all our
> > > > > collections run with 30000. So how would that explain this log line?
> > > > >
> > > > > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > > > > 22130ms
> > > > >
> > > > > We also see these with smaller values, seven seconds. And, is this
> > > > > actually an indicator of the problems we have?
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Many thanks,
> > > > > Markus
> > > > >
> > > > >
> > > > > -Original message-
> > > > > > From:Markus Jelsma 
> > > > > > Sent: Saturday 27th January 2018 10:03
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I grepped for it yesterday and found nothing but 30000 in the settings,
> > > > > > but judging from the weird time out value, you may be right. Let me apply
> > > > > > your patch early next week and check for spurious warnings.
> > > > > >
> > > > > > Another noteworthy observation for those working on cloud stability and
> > > > > > recovery: whenever this happens, some nodes are also absolutely sure to run
> > > > > > OOM. The leaders usually live longest, the replicas don't, their heap
> > > > > > usage peaks every time, 

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Wendy2
Hi Emir,

Listed below are the debugQuery outputs from query without "OR" operator. I
really appreciate your help!  --Wendy

===DebugQuery Outputs for case 1f-a, 1f-b  without "OR"
operator=
*1f-a (/search?q=+method:"x-ray*" +method:"Solution NMR") result counts = 0:
*

  "debug":{
"rawquerystring":" method:\"x-ray*\"  method:\"Solution NMR\"",
"querystring":" method:\"x-ray*\"  method:\"Solution NMR\"",
"parsedquery":"(+(PhraseQuery(method:\"x rai\")
PhraseQuery(method:\"solut nmr\"))~2)/no_coord",
"parsedquery_toString":"+((method:\"x rai\" method:\"solut nmr\")~2)",


*1f-b (/search?q=method:"x-ray*" method:"Solution NMR") result counts = 0: *

"debug":{
"rawquerystring":"method:\"x-ray*\" method:\"Solution NMR\"",
"querystring":"method:\"x-ray*\" method:\"Solution NMR\"",
"parsedquery":"(+(PhraseQuery(method:\"x rai\")
PhraseQuery(method:\"solut nmr\"))~2)/no_coord",
"parsedquery_toString":"+((method:\"x rai\" method:\"solut nmr\")~2)",





Sorting results for spatial search

2018-01-31 Thread Leila Deljkovic
Hiya,

So I have some nested documents in my index with this kind of structure:
{   
"id": “parent",
"gridcell_rpt": "POLYGON((30 10, 40 40, 20 40, 10 20, 30 10))",
"density": “30"

"_childDocuments_" : [
{
"id":"child1",
"gridcell_rpt":"MULTIPOLYGON(((30 20, 45 40, 10 40, 30 20)))",
"density":"25"
},
{
"id":"child2",
"gridcell_rpt":"MULTIPOLYGON(((15 5, 40 10, 10 20, 5 10, 15 5)))",
"density":"5"
}
]
}

The parent document is a WKT shape, and its children are “grid cells”, which 
are just divisions of the main shape (i.e., cutting up the parent shape to get 
children shapes). The “density" is the feature count in each shape. When I 
query (through the Solr UI) I use “Intersects” to return parents which touch 
the search area (note that if a child is touching, the parent must also be 
touching).

e.g. fq={!field f=gridcell_rpt}Intersects(POLYGON((-20 70, -50 80, -20 
20, 30 60, -10 40, -20 70)))

and I want to sort the results by the sum of the densities of all the children 
touching the search area (so which parent has children that touch the search 
area, and how big the sum of these children’s densities is)
something like {!parent which=is_parent:true score=total 
v='+is_parent:false +{!func}density'} desc

The problem is that this includes children that DON’T touch the search area in 
the sum. How can I only include the shapes from the first query above in my 
sort?

Cheers :)
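
For illustration, a rough SolrJ sketch of one way to express this, repeating the
Intersects clause inside the child part of the block-join so that only children
touching the search area contribute to the summed score. It mirrors the syntax in
the example above and uses parameter dereferencing to avoid quoting problems; the
collection URL and the is_parent field are assumptions taken from the example, not
anything verified:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ChildDensitySort {
    public static void main(String[] args) throws Exception {
        // Search area taken from the example above.
        String area = "Intersects(POLYGON((-20 70, -50 80, -20 20, 30 60, -10 40, -20 70)))";

        SolrQuery q = new SolrQuery();
        // Score each parent by the sum of its children's densities, counting only
        // children that themselves intersect the search area.
        q.setQuery("{!parent which='is_parent:true' score=total v=$childq}");
        q.set("childq", "+{!func}density +{!field f=gridcell_rpt v=$area}");
        q.set("area", area);
        // Keep the original parent-level intersection filter as well.
        q.addFilterQuery("{!field f=gridcell_rpt v=$area}");
        q.setSort("score", SolrQuery.ORDER.desc);

        // Hypothetical core/collection URL.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/grids").build()) {
            QueryResponse rsp = client.query(q);
            rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
        }
    }
}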

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Luigi Caiazza
Hi,

first of all, thank you for your answers.

@ Rick: the reason is that the set of pages that are stored into the disk
represents just a static view of the Web, in order to let my experiments be
fully replicable. My need is to run simulations of different crawlers on
top of it, each working on those pages as if they are coming from the real
Web. During a simulation, the crawler receives a set of unpredictable user
queries from an external module. Then, it changes the visit priorities to
the discovered but uncrawled pages according with the current top-k results
for those queries, given the contents of the "crawled" pages so far.
Moreover, distinct runs explore different parts of the Web graph and
receive different user queries. That's why I need to build a separate index
of crawled contents for each run. The observation is that, since I am
working with a snapshot of the Web, my indexing process could be engineered
such that all the Web pages are already stored in the indexer and a flag
enables the retrievability of the page if it has been crawled in the
current experiment. In this way, I save some time that I could use to
augment the scale of the crawling simulation, and/or to run other
experiments.

@ Alessandro: your approach of using a static and a dynamic index and then
to merge the results by means of query joins was what I had in mind at a
first glance. It could still do the job, but you already highlighted a
performance limitation on the static index. Moreover, even if I store just
the IDs and the crawling cycles, also the dynamic index will still be
populated by some millions of entries as the experiment proceeds. The atomic
updates were another opportunity that I investigated before asking your
help, but since eventually they rewrite the entire document I was hoping to
find a more efficient solution.

@ Diego: your idea of using the NumericDocValues sounds interesting.
Probably this is the solution, but, if I get the point, a NumericDocValues
has some features in common with the IntPoint that I am currently using in
my index [1]. Among them: the storage of primitive data types instead of
strings only, and the storage on a data structure different than the
inverted index. Now I am asking: is there a chance to use the IntPoint in
the same way?

Cheers.

[1]
https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/document/IntPoint.html

2018-01-31 13:45 GMT+01:00 Rick Leir :

> Luigi
> Is there a reason for not indexing all of your on-disk pages? That seems
> to be the first step. But I do not understand what your goal is.
> Cheers -- Rick
>
> On January 30, 2018 1:33:27 PM EST, Luigi Caiazza 
> wrote:
> >Hello,
> >
> >I am working on a project that simulates a selective, large-scale
> >crawling.
> >The system adapts its behaviour according with some external user
> >queries
> >received at crawling time. Briefly, it analyzes the already crawled
> >pages
> >in the top-k results for each query, and prioritizes the visit of the
> >discovered links accordingly. In a generic experiment, I measure the
> >time
> >units as the number of crawling cycles completed so far, i.e., with an
> >integer value. Finally, I evaluate the experiment by analyzing the
> >documents fetched over the crawling cycles. In this work I am using
> >Lucene
> >7.2.1, but this should not be an issue since I need just some
> >conceptual
> >help.
> >
> >In my current implementation, an experiment starts with an empty index.
> >When a Web page is fetched during the crawling cycle *x*, the system
> >builds
> >a document with the URL as StringField, the title and the body as
> >TextFields, and *x* as an IntPoint. When I get an external user query,
> >I
> >submit it  to get the top-k relevant documents crawled so far. When I
> >need
> >to retrieve the documents indexed from cycle *i* to cycle *j*, I
> >execute a
> >range query over this last IntPoint field. This strategy does the job,
> >but
> >of course the write operations take some hours overall for a single
> >experiment, even if I crawl just half a million of Web pages.
> >
> >Since I am not crawling real-time data, but I am working over a static
> >set
> >of many billions of Web pages (whose contents are already stored on
> >disk),
> >I am investigating some opportunities to reduce the number of writes
> >during
> >an experiment. For instance, I could avoid to index everything from
> >scratch
> >for each run. I would be happy to index all the static contents of my
> >dataset (i.e., URL, title and body of a Web page) once and for all.
> >Then,
> >for a single experiment, I would mark a document as crawled at cycle
> >*x* without
> >storing this information permanently, in order both to filter out the
> >documents that in the current simulation have not been crawled when
> >processing the external queries, and to still perform the range queries
> >at
> >evaluation time. Do you have any idea on how to do that?
> >
> >Thank you in advance for your support.

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy,
With spaces around it, OR is interpreted as another search term. Can you try
without OR, with just a space between the two parts? If you need an AND, use +
before each part.

HTH,
Emir
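
For illustration, the two forms suggested above expressed with SolrJ, using the
method field from this thread. Note also that the ~2 visible in the parsed queries
appears to be the edismax mm (minimum should match) being applied, which affects
how the first form behaves:

import org.apache.solr.client.solrj.SolrQuery;

public class MethodQueries {
    public static void main(String[] args) {
        // Either clause may match (subject to the handler's mm setting).
        SolrQuery either = new SolrQuery("method:\"x-ray*\" method:\"Solution NMR\"");

        // Both clauses required.
        SolrQuery both = new SolrQuery("+method:\"x-ray*\" +method:\"Solution NMR\"");

        System.out.println(either.getQuery());
        System.out.println(both.getQuery());
    }
}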

On Jan 31, 2018 6:24 PM, "Wendy2"  wrote:

Hi Emir,

Thank you so much for following up with your ticket.
Listed below are the parts of debugQuery outputs via /search request
handler. The reason I used * in the query term is that there are a couple of
methods starting with "x-ray". When I used space surrounding the "OR"
boolean search operator (refer to 1f) below, I got zero results. If I remove
the space, the result count = 19.

Thank you very much for investigating this issue. I am a happy Solr user. We
implemented Solr to our web site text search (www.rcsb.org) last year and
have improved the search results :-). Now we want to expand our text search
to support Boolean search and I am facing this issue.  Thank you again for
all your help and support! --Wendy

===DebugQuery Outputs for case 1d, 1e, 1f =
*1d(/search?q=method:"x-ray*"):* result counts = 884

"debug":{
"rawquerystring":"method:\"x-ray*\"",
"querystring":"method:\"x-ray*\"",
"parsedquery":"(+PhraseQuery(method:\"x rai\"))/no_coord",
"parsedquery_toString":"+method:\"x rai\"",

*1e (/search?q=method:"Solution NMR"):* result counts = 153

 "debug":{
"rawquerystring":"method:\"Solution NMR\"",
"querystring":"method:\"Solution NMR\"",
"parsedquery":"(+PhraseQuery(method:\"solut nmr\"))/no_coord",
"parsedquery_toString":"+method:\"solut nmr\"",

*1f (/search?q=method:"x-ray*" OR "Solution NMR"):* result counts = 0

 "debug":{
"rawquerystring":"method:\"x-ray*\" OR \"Solution NMR\"",
"querystring":"method:\"x-ray*\" OR \"Solution NMR\"",
"parsedquery":"(+(PhraseQuery(method:\"x rai\")
DisjunctionMaxQuery(((pdb_id:OR)^5.0)) DisjunctionMaxQuery(((pdb_id:Solution
NMR)^5.0 | (entity_name_com.name:\"solut nmr\")^20.0 |
(citation_author.name:\"solut nmr\")^5.0 | (audit_author.name:\"solut
nmr\")^5.0 | rest_fields_stem:\"solut nmr\" | (title_fields_stem:\"solut
nmr\")^3.0 | (classification:\"solut nmr\")^15.0 |
(struct_keywords.text:\"solut nmr\")^12.0 | (entity.pdbx_description:\"solut
nmr\")^10.0 | (pdbx_descriptor_stem:\"solut nmr\")^10.0 |
(citation.title:\"solut nmr\")^25.0 | (struct_keywords.pdbx_keywords:\"solut
nmr\")^15.0 | (entity_src_gen_concat_stem:\"solut nmr\")^15.0 |
(struct.title:\"solut nmr\")^35.0 | (group_id_stem:\"solut
nmr\")^10.0)))~3)/no_coord",
"parsedquery_toString":"+((method:\"x rai\" ((pdb_id:OR)^5.0)
((pdb_id:Solution NMR)^5.0 | (entity_name_com.name:\"solut nmr\")^20.0 |
(citation_author.name:\"solut nmr\")^5.0 | (audit_author.name:\"solut
nmr\")^5.0 | rest_fields_stem:\"solut nmr\" | (title_fields_stem:\"solut
nmr\")^3.0 | (classification:\"solut nmr\")^15.0 |
(struct_keywords.text:\"solut nmr\")^12.0 | (entity.pdbx_description:\"solut
nmr\")^10.0 | (pdbx_descriptor_stem:\"solut nmr\")^10.0 |
(citation.title:\"solut nmr\")^25.0 | (struct_keywords.pdbx_keywords:\"solut
nmr\")^15.0 | (entity_src_gen_concat_stem:\"solut nmr\")^15.0 |
(struct.title:\"solut nmr\")^35.0 | (group_id_stem:\"solut
nmr\")^10.0))~3)",
"explain":{},







Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Wendy2
Hi Emir,

Thank you so much for following up with your ticket.
Listed below are the parts of debugQuery outputs via /search request
handler. The reason I used * in the query term is that there are a couple of
methods starting with "x-ray". When I used space surrounding the "OR"
boolean search operator (refer to 1f) below, I got zero results. If I remove
the space, the result count = 19. 

Thank you very much for investigating this issue. I am a happy Solr user. We
implemented Solr to our web site text search (www.rcsb.org) last year and
have improved the search results :-). Now we want to expand our text search
to support Boolean search and I am facing this issue.  Thank you again for
all your help and support! --Wendy  

===DebugQuery Outputs for case 1d, 1e, 1f =
*1d(/search?q=method:"x-ray*"):* result counts = 884

"debug":{
"rawquerystring":"method:\"x-ray*\"",
"querystring":"method:\"x-ray*\"",
"parsedquery":"(+PhraseQuery(method:\"x rai\"))/no_coord",
"parsedquery_toString":"+method:\"x rai\"",

*1e (/search?q=method:"Solution NMR"):* result counts = 153

 "debug":{
"rawquerystring":"method:\"Solution NMR\"",
"querystring":"method:\"Solution NMR\"",
"parsedquery":"(+PhraseQuery(method:\"solut nmr\"))/no_coord",
"parsedquery_toString":"+method:\"solut nmr\"",

*1f (/search?q=method:"x-ray*" OR "Solution NMR"):* result counts = 0

 "debug":{
"rawquerystring":"method:\"x-ray*\" OR \"Solution NMR\"",
"querystring":"method:\"x-ray*\" OR \"Solution NMR\"",
"parsedquery":"(+(PhraseQuery(method:\"x rai\")
DisjunctionMaxQuery(((pdb_id:OR)^5.0)) DisjunctionMaxQuery(((pdb_id:Solution
NMR)^5.0 | (entity_name_com.name:\"solut nmr\")^20.0 |
(citation_author.name:\"solut nmr\")^5.0 | (audit_author.name:\"solut
nmr\")^5.0 | rest_fields_stem:\"solut nmr\" | (title_fields_stem:\"solut
nmr\")^3.0 | (classification:\"solut nmr\")^15.0 |
(struct_keywords.text:\"solut nmr\")^12.0 | (entity.pdbx_description:\"solut
nmr\")^10.0 | (pdbx_descriptor_stem:\"solut nmr\")^10.0 |
(citation.title:\"solut nmr\")^25.0 | (struct_keywords.pdbx_keywords:\"solut
nmr\")^15.0 | (entity_src_gen_concat_stem:\"solut nmr\")^15.0 |
(struct.title:\"solut nmr\")^35.0 | (group_id_stem:\"solut
nmr\")^10.0)))~3)/no_coord",
"parsedquery_toString":"+((method:\"x rai\" ((pdb_id:OR)^5.0)
((pdb_id:Solution NMR)^5.0 | (entity_name_com.name:\"solut nmr\")^20.0 |
(citation_author.name:\"solut nmr\")^5.0 | (audit_author.name:\"solut
nmr\")^5.0 | rest_fields_stem:\"solut nmr\" | (title_fields_stem:\"solut
nmr\")^3.0 | (classification:\"solut nmr\")^15.0 |
(struct_keywords.text:\"solut nmr\")^12.0 | (entity.pdbx_description:\"solut
nmr\")^10.0 | (pdbx_descriptor_stem:\"solut nmr\")^10.0 |
(citation.title:\"solut nmr\")^25.0 | (struct_keywords.pdbx_keywords:\"solut
nmr\")^15.0 | (entity_src_gen_concat_stem:\"solut nmr\")^15.0 |
(struct.title:\"solut nmr\")^35.0 | (group_id_stem:\"solut
nmr\")^10.0))~3)",
"explain":{},







Re: Long GC Pauses

2018-01-31 Thread Erick Erickson
Just to double check, when you say you're seeing 60-200 sec GC pauses
are you looking at the GC logs (or using some kind of monitor) or is
that the time it takes the query to respond to the client? Because a
single GC pause that long on 40G is unusual no matter what. Another
take on Jason's question is
For all the JVMs you're running, how much _total_ heap is allocated?
And how much physical memory is on the box? I generally start with _at
least_ half the memory left to the OS

These are fairly horrible, what generates such queries?
AND * AND *

Best,
Erick



On Wed, Jan 31, 2018 at 7:28 AM, Jason Gerlowski  wrote:
> Hi Maulin,
>
> To clarify, when you said "...allocated 40 GB RAM to each shard." above,
> I'm going to assume you meant "to each node" instead.  If you actually did
> mean "to each shard" above, please correct me and anyone who chimes in
> afterward.
>
> Firstly, it's really hard to even take guesses about potential causes or
> remediations without more details about your load characteristics
> (average/peak QPS, index size, average document size, etc.).  If no one
> gives any satisfactory advice, please consider uploading additional details
> to help us help you.
>
> Secondly, I don't know anything about the load characteristics you're
> putting on your Solr cluster, but I'm curious whether you've experimented
> with lower RAM settings.  Generally speaking, the more RAM you have, the
> longer your GC pauses are likely to be (even with the tuning that various
> GC settings provide).  If you can get away with giving the Solr process
> less RAM, you should see your GC pauses shrink.  Was 40GB chosen after some
> trial-and-error experimentation, or is it something you could investigate?
>
> For a bit more overview on this, see this slightly outdated (but still
> useful) wiki page: https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
>
> Hope that helps, even if just to disqualify some potential causes/solutions
> to close in on a real fix.
>
> Best,
>
> Jason
>
> On Wed, Jan 31, 2018 at 8:17 AM, Maulin Rathod  wrote:
>
>> Hi,
>>
>> We are using solr cloud 6.1. We have around 20 collection on 4 nodes (We
>> have 2 shards and each shard have 2 replicas). We have allocated 40 GB RAM
>> to each shard.
>>
>> Intermittently we found long GC pauses (60 sec to 200 sec) due to which
>> solr stops responding and hence collections goes in recovering mode. It
>> takes minimum 5-10 minutes (sometime it takes more and we have to restart
>> the solr node) for recovering all collections. We are using default GC
>> setting (CMS) as per solr.cmd.
>>
>> We tried different G1 GC to see if it help, but still we see long GC
>> pauses(60 sec to 200 sec) and also found that memory usage is more in in
>> case G1 GC.
>>
>> What could be reason for long GC pauses and how can fix it? Insufficient
>> memory or problem with GC setting or something else? Any suggestion would
>> be greatly appreciated.
>>
>> In our analysis, we also found some inefficient queries (which uses * many
>> times in query) in solr logs. Could it be reason for high memory usage?
>>
>> Slow Query
>> --
>>
>> INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1
>> x:documents] o.a.s.c.S.Request [documents]  webapp=/solr path=/select
>> params={df=summary=false=id=4&
>> start=0=true=description+asc,id+desc==
>> s1.asite.com:8983/solr/documents|s1r1.asite.com:
>> 8983/solr/documents=250=2=((id:(
>> REV78364_24705418+REV78364_24471492+REV78364_24471429+
>> REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+
>> AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+
>> HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+
>> -*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*
>> HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+
>> Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+
>> 78364)+AND+is_active:true+AND+((isLatest:(true)+AND+
>> isFolderActive:true+AND+isXref:false+AND+-document_
>> type_id:(3+7)+AND+((is_public:true+OR+distribution_list:
>> 4858120+OR+folderadmin_list:4858120+OR+author_user_id:
>> 4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(
>> 4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+
>> combinationUsers:(4858120))+AND+-blockedUsers:(4858120
>> +OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(
>> 4858120)+AND+-folderadmin_list:(4858120)))=true=
>> 1516786982952=true=javabin} hits=0 status=0 QTime=83309
>>
>>
>>
>>
>> Regards,
>>
>> Maulin
>>
>>
>>


Re: How to avoid warning message

2018-01-31 Thread Shawn Heisey

On 1/31/2018 9:07 AM, Tamás Barta wrote:

I'm using Solr 6.6.2 and I use Zookeeper to handle Solr cloud. In Java
client I use SolrJ this way:

*client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();*


In the log I see the followings:

*WARN  [org.apache.zookeeper.SaslClientCallbackHandler] Could not login:
the Client is being asked for a password, but the ZooKeeper Client code


The ZK servers have authentication configured, but you haven't 
configured any credentials for Solr.



After that everything works. What should I do to avoid this message? I
don't want any authentication between the client and Zookeepers as they are
not available from outside.


You're probably going to need to enlist the help of the ZooKeeper user 
mailing list on how to disable their authentication, or at least disable 
it for the Solr servers.


If you do end up using ZK authentication, here's Solr's documentation on it:

https://lucene.apache.org/solr/guide/7_2/zookeeper-access-control.html

Thanks,
Shawn
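
If the ZooKeeper ensemble genuinely requires no authentication from this client,
another option is to switch the client-side SASL attempt off entirely via the
ZooKeeper system property zookeeper.sasl.client. A minimal sketch, assuming the
same zkHost string as in the original mail; whether silencing SASL is acceptable
is of course a decision for your environment:

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class NoSaslCloudClient {
    public static void main(String[] args) throws Exception {
        // Must be set before the first ZooKeeper connection is created.
        System.setProperty("zookeeper.sasl.client", "false");

        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zookeeper1:2181,zookeeper2:2181,zookeeper3:2181")
                .build()) {
            client.connect();  // connects to ZK without attempting SASL
        }
    }
}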



Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Erick Erickson
Or use a boost for the phrase, something like
"beauty and the beast"^5

On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood  wrote:
> You can use a separate field for title aliases. That is what I did for 
> Netflix search.
>
> Why disable idf? Disabling tf for titles can be a good idea, for example the 
> movie “New York, New York” is not twice as much about New York as some other 
> film that just lists it once.
>
> Also, consider using a popularity score as a boost.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
>>
>> Hi,
>> We are using solr for our movie title search.
>>
>>
>> As it is "title search", this should be treated different than the normal
>> document search.
>> Hence, we use a modified version of TFIDFSimilarity with the following
>> changes.
>> -  disabled TF & IDF and will only have 1 as value.
>> -  disabled norms by specifying omitNorms as true for all the fields.
>>
>> There are 6 fields with different analyzers and we make use of different
>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>>
>> But, movies could have aliases and have multiple titles. So, we made the
>> fields multivalued.
>>
>> Now, consider the following four documents
>> 1>  "Beauty and the Beast"
>> 2>  "The Real Beauty and the Beast"
>> 3>  "Beauty and the Beast", "La bella y la bestia"
>> 4>  "Beauty and the Beast"
>>
>> Note: Document 3 has two titles in it.
>>
>> So, for a query "Beauty and the Beast" and with the above configuration all
>> the documents receive same score. But 1,3,4 should have got same score and
>> document 2 lesser than others.
>>
>> To solve this, we followed what is suggested in the following thread:
>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>>
>> Now, the fields which are used to boost are made to use Norms. And for
>> matching norms are disabled. This is to make sure that exact & near exact
>> matches are rewarded.
>>
>> But, for the same query, we get the following results.
>> query: "Beauty & the Beast"
>> Search Results:
>> 1>  "Beauty and the Beast"
>> 4>  "Beauty and the Beast"
>> 2>  "The Real Beauty and the Beast"
>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>
>> Clearly, the changes have solved only a part of the problem. The document 3
>> should be ranked/scored higher than document 2.
>>
>> This is because lucene considers the total field length across all the
>> values in a multivalued field for normalization.
>>
>> How do we handle this scenario and make sure that in multivalued fields the
>> normalization is taken care of?
>>
>>
>> --
>> Regards,
>> Sravan
>
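
For illustration, a hedged SolrJ sketch combining the two suggestions in this
thread (a separate alias field plus qf/pf phrase boosting); the field names and
weights below are made up, not taken from any real schema:

import org.apache.solr.client.solrj.SolrQuery;

public class TitleSearch {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("beauty and the beast");
        q.set("defType", "edismax");
        // Hypothetical fields: title = primary title, title_alias = alternate titles.
        q.set("qf", "title^5 title_alias^2");
        // Phrase matches on the same fields get a much larger boost.
        q.set("pf", "title^20 title_alias^8");
        System.out.println(q);
    }
}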


Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
You can use a separate field for title aliases. That is what I did for Netflix 
search.

Why disable idf? Disabling tf for titles can be a good idea, for example the 
movie “New York, New York” is not twice as much about New York as some other 
film that just lists it once.

Also, consider using a popularity score as a boost.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
> 
> Hi,
> We are using solr for our movie title search.
> 
> 
> As it is "title search", this should be treated different than the normal
> document search.
> Hence, we use a modified version of TFIDFSimilarity with the following
> changes.
> -  disabled TF & IDF and will only have 1 as value.
> -  disabled norms by specifying omitNorms as true for all the fields.
> 
> There are 6 fields with different analyzers and we make use of different
> weights in edismax's qf & pf parameters to match tokens & boost phrases.
> 
> But, movies could have aliases and have multiple titles. So, we made the
> fields multivalued.
> 
> Now, consider the following four documents
> 1>  "Beauty and the Beast"
> 2>  "The Real Beauty and the Beast"
> 3>  "Beauty and the Beast", "La bella y la bestia"
> 4>  "Beauty and the Beast"
> 
> Note: Document 3 has two titles in it.
> 
> So, for a query "Beauty and the Beast" and with the above configuration all
> the documents receive same score. But 1,3,4 should have got same score and
> document 2 lesser than others.
> 
> To solve this, we followed what is suggested in the following thread:
> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
> 
> Now, the fields which are used to boost are made to use Norms. And for
> matching norms are disabled. This is to make sure that exact & near exact
> matches are rewarded.
> 
> But, for the same query, we get the following results.
> query: "Beauty & the Beast"
> Search Results:
> 1>  "Beauty and the Beast"
> 4>  "Beauty and the Beast"
> 2>  "The Real Beauty and the Beast"
> 3>  "Beauty and the Beast", "La bella y la bestia"
> 
> Clearly, the changes have solved only a part of the problem. The document 3
> should be ranked/scored higher than document 2.
> 
> This is because lucene considers the total field length across all the
> values in a multivalued field for normalization.
> 
> How do we handle this scenario and make sure that in multivalued fields the
> normalization is taken care of?
> 
> 
> -- 
> Regards,
> Sravan



How to avoid warning message

2018-01-31 Thread Tamás Barta
Hi,

I'm using Solr 6.6.2 and I use Zookeeper to handle Solr cloud. In Java
client I use SolrJ this way:

*client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();*


In the log I see the followings:

*WARN  [org.apache.zookeeper.SaslClientCallbackHandler] Could not login:
the Client is being asked for a password, but the ZooKeeper Client code
does not currently support obtaining a password from the user. Make sure
that the Client is configured to use a ticket cache (using the JAAS
configuration setting 'useTicketCache=true)' and restart the Client. If you
still get this message after that, the TGT in the ticket cache has expired
and must be manually refreshed. To do so, first determine if you are using
a password or a keytab. If the former, run kinit in a Unix shell in the
environment of the user who is running this Zookeeper Client using the
command 'kinit ' (where  is the name of the Client's Kerberos
principal). If the latter, do 'kinit -k -t  ' (where 
is the name of the Kerberos principal, and  is the location of the
keytab file). After manually refreshing your cache, restart this Client. If
you continue to see this message after manually refreshing your cache,
ensure that your KDC host's clock is in sync with this host's clock.*

*WARN [org.apache.zookeeper.ClientCnxn] SASL configuration failed:
javax.security.auth.login.FailedLoginException: PBOX70: Password
invalid/Password required Will continue connection to Zookeeper server
without SASL authentication, if Zookeeper server allows it.*

*WARN  [org.apache.solr.common.cloud.ConnectionManager] Watcher
org.apache.solr.common.cloud.ConnectionManager@61c7a0c name:
ZooKeeperConnection Watcher:zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
got event WatchedEvent state:AuthFailed type:None path:null path: null
type: None*

*WARN  [org.apache.solr.common.cloud.ConnectionManager] zkClient received
AuthFailed*


After that everything works. What should I do to avoid this message? I
don't want any authentication between the client and Zookeepers as they are
not available from outside.

Thanks, Tamás


Re: Query parser problem, using fuzzy search

2018-01-31 Thread David Frese

Am 29.01.18 um 18:05 schrieb Erick Erickson:

Try searching with lowercase the word and. Somehow you have to allow
the parser to distinguish the two.


Oh yeah, the biggest unsolved problem in the ~80-year history of 
programming languages... NOT ;-)



You _might_ be able to try "AND~2" (with quotes) to see if you can get
that through the parser. Kind of a hack, but


Well, the parser swallows that, but it's not a fuzzy search then anymore.


There's also a parameter (depending on the parser) about lowercasing
operators, so if and~2 doesn't work check thatl


And if both appear?

Well, thanks for your ideas - of course you are not the one to blame.



On Mon, Jan 29, 2018 at 8:32 AM, David Frese
 wrote:

Hello everybody,

how can I formulate a fuzzy query that works for an arbitrary string, or
rather, is there a formal syntax definition somewhere?

I already found out by hand that

field:"val"~2

Is read by the parser, but the fuzziness seems to get lost. So I write

field:val~2

Now if val contains spaces and other special characters, I can escape them:

field:my\ val~2

But now I'm stuck with the term AND:

field:AND~2

Note that I do not want a boolean expression here, but I want to match the
string AND! But the parser complains:

"org.apache.solr.search.SyntaxError: Cannot parse 'field:AND~2': Encountered
\"  \"AND \"\" at line 1, column 4.\nWas expecting one of:\n
 ...\n\"(\" ...\n\"*\" ...\n ...\n
...\n ...\n ...\n  ...\n\"[\"
...\n\"{\" ...\n ...\n \"filter(\" ...\n ...\n
",




--
David Frese
+49 7071 70896 75

Active Group GmbH
Hechinger Str. 12/1, 72072 Tübingen
Registergericht: Amtsgericht Stuttgart, HRB 224404
Geschäftsführer: Dr. Michael Sperber


Re: Long GC Pauses

2018-01-31 Thread Jason Gerlowski
Hi Maulin,

To clarify, when you said "...allocated 40 GB RAM to each shard." above,
I'm going to assume you meant "to each node" instead.  If you actually did
mean "to each shard" above, please correct me and anyone who chimes in
afterward.

Firstly, it's really hard to even take guesses about potential causes or
remediations without more details about your load characteristics
(average/peak QPS, index size, average document size, etc.).  If no one
gives any satisfactory advice, please consider uploading additional details
to help us help you.

Secondly, I don't know anything about the load characteristics you're
putting on your Solr cluster, but I'm curious whether you've experimented
with lower RAM settings.  Generally speaking, the more RAM you have, the
longer your GC pauses are likely to be (even with the tuning that various
GC settings provide).  If you can get away with giving the Solr process
less RAM, you should see your GC pauses shrink.  Was 40GB chosen after some
trial-and-error experimentation, or is it something you could investigate?

For a bit more overview on this, see this slightly outdated (but still
useful) wiki page: https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Hope that helps, even if just to disqualify some potential causes/solutions
to close in on a real fix.

Best,

Jason

On Wed, Jan 31, 2018 at 8:17 AM, Maulin Rathod  wrote:

> Hi,
>
> We are using solr cloud 6.1. We have around 20 collection on 4 nodes (We
> have 2 shards and each shard have 2 replicas). We have allocated 40 GB RAM
> to each shard.
>
> Intermittently we found long GC pauses (60 sec to 200 sec) due to which
> solr stops responding and hence collections goes in recovering mode. It
> takes minimum 5-10 minutes (sometime it takes more and we have to restart
> the solr node) for recovering all collections. We are using default GC
> setting (CMS) as per solr.cmd.
>
> We tried different G1 GC to see if it help, but still we see long GC
> pauses(60 sec to 200 sec) and also found that memory usage is more in in
> case G1 GC.
>
> What could be reason for long GC pauses and how can fix it? Insufficient
> memory or problem with GC setting or something else? Any suggestion would
> be greatly appreciated.
>
> In our analysis, we also found some inefficient queries (which uses * many
> times in query) in solr logs. Could it be reason for high memory usage?
>
> Slow Query
> --
>
> INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1
> x:documents] o.a.s.c.S.Request [documents]  webapp=/solr path=/select
> params={df=summary=false=id=4&
> start=0=true=description+asc,id+desc==
> s1.asite.com:8983/solr/documents|s1r1.asite.com:
> 8983/solr/documents=250=2=((id:(
> REV78364_24705418+REV78364_24471492+REV78364_24471429+
> REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+
> AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+
> HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+
> -*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*
> HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+
> Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+
> 78364)+AND+is_active:true+AND+((isLatest:(true)+AND+
> isFolderActive:true+AND+isXref:false+AND+-document_
> type_id:(3+7)+AND+((is_public:true+OR+distribution_list:
> 4858120+OR+folderadmin_list:4858120+OR+author_user_id:
> 4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(
> 4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+
> combinationUsers:(4858120))+AND+-blockedUsers:(4858120
> +OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(
> 4858120)+AND+-folderadmin_list:(4858120)))=true=
> 1516786982952=true=javabin} hits=0 status=0 QTime=83309
>
>
>
>
> Regards,
>
> Maulin
>
>
>


Solrj + spring data: Indexing file body + own fields

2018-01-31 Thread Joris De Smedt
Hi

I'm using Solrj 6.6.1 as found in spring-data-solr 3.0.3.RELEASE; Solr is 7.2.1.

I'm currently able to upload SolrDocuments via spring-data but would like to
add the equivalent of Tika's
  new AutoDetectParser().parse(stream, new BodyContentHandler(-1),
new Metadata())
as a content field, without handling the parsing at the client side.
How can I do this?
-- 
Joris De Smedt
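
One possible route, sketched with plain SolrJ rather than spring-data: post the raw
file to the extracting request handler (/update/extract) and let Solr run Tika
server side. This assumes the handler is enabled in solrconfig.xml; the URL, file
name, id and field mapping below are made up for illustration:

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractUpload {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/tmp/report.pdf"), "application/pdf");
            req.setParam("literal.id", "doc-1");      // your own fields go in as literal.*
            req.setParam("fmap.content", "content");  // map the extracted body to 'content'
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            client.request(req);
        }
    }
}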


Save the document size in to a new field

2018-01-31 Thread Blackknight
Hello guys,

I want to add an option to search documents by size. For example, find the
top categories with the biggest documents. I thought about creating a new
update processor which will count the bytes of all fields in the document,
but I think it won't work well, because some fields are stored, some are
indexed, some of them have both of these flags, and there are copyFields too
which need to be counted...

So I think the field size counter in the update processor will lie about
the doc size. I don't take the compression of the index on disk into account,
but I want to get real numbers (I can accept about a 10% observational
error).

Does someone know what I should do?

I read some posts about saving the size (in bytes) of a document; all the posts
were relatively old and had no solution. Maybe Solr has new techniques for
counting document size? :)

Thank you, guys! 
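
For what it's worth, a rough sketch of the kind of update processor described
above. It only measures the string length of the values the client sends, so it is
at best an approximation of index or response size, and the class and field names
are made up; it would be wired into an updateRequestProcessorChain in
solrconfig.xml like any other processor:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ApproxDocSizeUpdateProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                long size = 0;
                // Sum the textual length of every incoming value; copyFields and
                // index-time expansion are deliberately ignored here.
                for (String name : doc.getFieldNames()) {
                    for (Object value : doc.getFieldValues(name)) {
                        size += String.valueOf(value).length();
                    }
                }
                doc.setField("approx_size_l", size);  // hypothetical long field
                super.processAdd(cmd);
            }
        };
    }
}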





Long GC Pauses

2018-01-31 Thread Maulin Rathod
Hi,

We are using Solr Cloud 6.1. We have around 20 collections on 4 nodes (we have 2
shards and each shard has 2 replicas). We have allocated 40 GB RAM to each
shard.

Intermittently we see long GC pauses (60 sec to 200 sec) due to which Solr
stops responding and hence collections go into recovery mode. It takes a
minimum of 5-10 minutes (sometimes it takes more and we have to restart the Solr
node) to recover all collections. We are using the default GC settings (CMS) as
per solr.cmd.

We tried different G1 GC settings to see if they help, but we still see long GC
pauses (60 sec to 200 sec) and also found that memory usage is higher in the
case of G1 GC.

What could be the reason for the long GC pauses and how can we fix it? Insufficient
memory, a problem with the GC settings, or something else? Any suggestion would be
greatly appreciated.

In our analysis, we also found some inefficient queries (which use * many
times in the query) in the Solr logs. Could that be a reason for high memory usage?

Slow Query
--

INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1 x:documents] 
o.a.s.c.S.Request [documents]  webapp=/solr path=/select 
params={df=summary=false=id=4=0=true=description+asc,id+desc==s1.asite.com:8983/solr/documents|s1r1.asite.com:8983/solr/documents=250=2=((id:(
 
REV78364_24705418+REV78364_24471492+REV78364_24471429+REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+-*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+78364)+AND+is_active:true+AND+((isLatest:(true)+AND+isFolderActive:true+AND+isXref:false+AND+-document_type_id:(3+7)+AND+((is_public:true+OR+distribution_list:4858120+OR+folderadmin_list:4858120+OR+author_user_id:4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+combinationUsers:(4858120))+AND+-blockedUsers:(4858120+OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(4858120)+AND+-folderadmin_list:(4858120)))=true=1516786982952=true=javabin}
 hits=0 status=0 QTime=83309




Regards,

Maulin




Clusterstatus Action

2018-01-31 Thread Chris Ulicny
Hi all,

According to the documentation, the 'shard' parameter for the CLUSTERSTATUS
action should allow a comma-delimited list of shards. However, passing
'shard1,shard2' as the value results in a shard-not-found error where it
was looking for the literal name 'shard1,shard2', not for 'shard1' and 'shard2' separately.

Is this a known issue? This problem happens on both 7.2.0 and 7.2.1 for us.

Thanks,
Chris


Re: Using SolrJ for digest authentication

2018-01-31 Thread Rick Leir
Eddy
Maybe your request is getting through twice. Check your logs to see.
Cheers -- Rick

On January 31, 2018 5:59:53 AM EST, ddramireddy  wrote:
>We are currently deploying Solr in war mode(Yes, recommendation is not
>war.
>But this is something I can't change now. Planned for future). I am
>setting
>authentication for solr. As Solr provided basic authentication is not
>working in Solr 6.4.2, I am setting up digest authentication in tomcat
>for
>Solr. I am able to login into Solr admin application using credentials.
>
>Now from my Java application, when I try to run a query, which will
>delete
>documents in a core, it's throwing following error.
>
>org.apache.http.client.NonRepeatableRequestException: Cannot retry
>request
>with a non-repeatable request entity
>
>I can see in HttpSolrClient, we are setting only basic authentication.
>But,
>I am using Digest auth. Did anyone faced this error before??
>
>This is my code:
>
>public static void main(String[] args) throws ClassNotFoundException,
>SQLException, InterruptedException, IOException, SolrServerException {
>HttpSolrClient solrClient = getSolrHttpClient("solr",
>"testpassword");
>
>try {
>solrClient.deleteByQuery("account", "*:*");
>solrClient.commit("account");
>} catch (final SolrServerException | IOException exn) {
>throw new IllegalStateException(exn);
>}
>}
>
>private static HttpSolrClient getSolrHttpClient(final String userName,
>final
>String password) {
>
>final HttpSolrClient solrClient = new HttpSolrClient.Builder()
>  .withBaseSolrUrl("http://localhost:9000/solr/index.html")
>.withHttpClient(getHttpClientWithSolrAuth(userName,
>password))
>.build();
>
>return solrClient;
>}
>
>private static HttpClient getHttpClientWithSolrAuth(final String
>userName, final String password) {
>   final CredentialsProvider provider = new BasicCredentialsProvider();
>final UsernamePasswordCredentials credentials
>= new UsernamePasswordCredentials(userName, password);
>provider.setCredentials(AuthScope.ANY, credentials);
>
>
>return HttpClientBuilder.create()
>.addInterceptorFirst(new PreemptiveAuthInterceptor())
>.setDefaultCredentialsProvider(provider)
>.build();
>
>}
>
>
>static class PreemptiveAuthInterceptor implements
>HttpRequestInterceptor
>{
>
>DigestScheme digestAuth = new DigestScheme();
>
>PreemptiveAuthInterceptor() {
>
>}
>
>@Override
>   public void process(final HttpRequest request, final HttpContext
>context)
>throws HttpException, IOException {
>final AuthState authState = (AuthState)
>context.getAttribute(HttpClientContext.TARGET_AUTH_STATE);
>
>  if (authState != null && authState.getAuthScheme() == null) {
>final CredentialsProvider credsProvider =
>(CredentialsProvider)
>context.getAttribute(HttpClientContext.CREDS_PROVIDER);
>final HttpHost targetHost = (HttpHost)
>context.getAttribute(HttpCoreContext.HTTP_TARGET_HOST);
> final Credentials creds = credsProvider.getCredentials(new
>AuthScope(targetHost.getHostName(), targetHost.getPort(), "Solr",
>"DIGEST"));
>if (creds == null) {
>System.out.println("No credentials for preemptive
>authentication");
>}
>digestAuth.overrideParamter("realm", "Solr");
>digestAuth.overrideParamter("nonce", Long.toString(new
>Random().nextLong(), 36));
>AuthCache authCache = new BasicAuthCache();
>authCache.put(targetHost, digestAuth);
>
>// Add AuthCache to the execution context
>   HttpClientContext localContext = HttpClientContext.create();
>localContext.setAuthCache(authCache);
>
>  request.addHeader(digestAuth.authenticate(creds, request,
>localContext));
>} else {
>System.out.println("authState is null. No preemptive
>authentication.");
>}
>}
>}
>
>
>
>--
>Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Rick Leir
Luigi
Is there a reason for not indexing all of your on-disk pages? That seems to be 
the first step. But I do not understand what your goal is.
Cheers -- Rick

On January 30, 2018 1:33:27 PM EST, Luigi Caiazza  wrote:
>Hello,
>
>I am working on a project that simulates a selective, large-scale
>crawling.
>The system adapts its behaviour according with some external user
>queries
>received at crawling time. Briefly, it analyzes the already crawled
>pages
>in the top-k results for each query, and prioritizes the visit of the
>discovered links accordingly. In a generic experiment, I measure the
>time
>units as the number of crawling cycles completed so far, i.e., with an
>integer value. Finally, I evaluate the experiment by analyzing the
>documents fetched over the crawling cycles. In this work I am using
>Lucene
>7.2.1, but this should not be an issue since I need just some
>conceptual
>help.
>
>In my current implementation, an experiment starts with an empty index.
>When a Web page is fetched during the crawling cycle *x*, the system
>builds
>a document with the URL as StringField, the title and the body as
>TextFields, and *x* as an IntPoint. When I get an external user query,
>I
>submit it  to get the top-k relevant documents crawled so far. When I
>need
>to retrieve the documents indexed from cycle *i* to cycle *j*, I
>execute a
>range query over this last IntPoint field. This strategy does the job,
>but
>of course the write operations take some hours overall for a single
>experiment, even if I crawl just half a million of Web pages.
>
>Since I am not crawling real-time data, but I am working over a static
>set
>of many billions of Web pages (whose contents are already stored on
>disk),
>I am investigating some opportunities to reduce the number of writes
>during
>an experiment. For instance, I could avoid to index everything from
>scratch
>for each run. I would be happy to index all the static contents of my
>dataset (i.e., URL, title and body of a Web page) once and for all.
>Then,
>for a single experiment, I would mark a document as crawled at cycle
>*x* without
>storing this information permanently, in order both to filter out the
>documents that in the current simulation have not been crawled when
>processing the external queries, and to still perform the range queries
>at
>evaluation time. Do you have any idea on how to do that?
>
>Thank you in advance for your support.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re:Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Luigi,

What about using an updatable DocValue [1] for the field x? You could
initially set it to -1,
and then update it for the docs in step j. Range queries should still work
and the update should be fast.

Cheers

[1] http://shaierera.blogspot.com/2014/04/updatable-docvalues-under-hood.html
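
A hedged Lucene sketch of that idea, with made-up field names: index a
NumericDocValuesField once with a sentinel value, flip it per run with
updateNumericDocValue, and range-filter on it at evaluation time. The slow
doc-values range query is used here only for illustration; if memory serves it was
added around Lucene 6.5:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;

public class CrawlCycleUpdates {

    // Index-time: static content plus a doc-values field initialised to "not crawled".
    static Document buildDoc(String url, String title, String body) {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.NO));
        doc.add(new TextField("body", body, Field.Store.NO));
        doc.add(new NumericDocValuesField("crawl_cycle", -1L));  // -1 = not crawled yet
        return doc;
    }

    // During a simulation: mark the page identified by its URL as crawled at cycle x.
    static void markCrawled(IndexWriter writer, String url, long x) throws Exception {
        writer.updateNumericDocValue(new Term("url", url), "crawl_cycle", x);
    }

    // Evaluation: documents crawled between cycle i and cycle j (inclusive).
    static Query crawledBetween(long i, long j) {
        return NumericDocValuesField.newSlowRangeQuery("crawl_cycle", i, j);
    }
}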

From: solr-user@lucene.apache.org At: 01/30/18 18:42:01To:  
solr-user@lucene.apache.org
Subject: Searching for an efficient and scalable way to filter query results 
using non-indexed and dynamic range values

Hello,

I am working on a project that simulates a selective, large-scale crawling.
The system adapts its behaviour according with some external user queries
received at crawling time. Briefly, it analyzes the already crawled pages
in the top-k results for each query, and prioritizes the visit of the
discovered links accordingly. In a generic experiment, I measure the time
units as the number of crawling cycles completed so far, i.e., with an
integer value. Finally, I evaluate the experiment by analyzing the
documents fetched over the crawling cycles. In this work I am using Lucene
7.2.1, but this should not be an issue since I need just some conceptual
help.

In my current implementation, an experiment starts with an empty index.
When a Web page is fetched during the crawling cycle *x*, the system builds
a document with the URL as StringField, the title and the body as
TextFields, and *x* as an IntPoint. When I get an external user query, I
submit it  to get the top-k relevant documents crawled so far. When I need
to retrieve the documents indexed from cycle *i* to cycle *j*, I execute a
range query over this last IntPoint field. This strategy does the job, but
of course the write operations take some hours overall for a single
experiment, even if I crawl just half a million of Web pages.

Since I am not crawling real-time data, but I am working over a static set
of many billions of Web pages (whose contents are already stored on disk),
I am investigating some opportunities to reduce the number of writes during
an experiment. For instance, I could avoid to index everything from scratch
for each run. I would be happy to index all the static contents of my
dataset (i.e., URL, title and body of a Web page) once and for all. Then,
for a single experiment, I would mark a document as crawled at cycle
*x* without
storing this information permanently, in order both to filter out the
documents that in the current simulation have not been crawled when
processing the external queries, and to still perform the range queries at
evaluation time. Do you have any idea on how to do that?

Thank you in advance for your support.




Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
Hi,
We are using solr for our movie title search.


As it is "title search", this should be treated different than the normal
document search.
Hence, we use a modified version of TFIDFSimilarity with the following
changes.
-  disabled TF & IDF and will only have 1 as value.
-  disabled norms by specifying omitNorms as true for all the fields.

There are 6 fields with different analyzers and we make use of different
weights in edismax's qf & pf parameters to match tokens & boost phrases.

But, movies could have aliases and have multiple titles. So, we made the
fields multivalued.

Now, consider the following four documents
1>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"
4>  "Beauty and the Beast"

Note: Document 3 has two titles in it.

So, for a query "Beauty and the Beast" and with the above configuration all
the documents receive same score. But 1,3,4 should have got same score and
document 2 lesser than others.

To solve this, we followed what is suggested in the following thread:
http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html

Now, the fields which are used to boost are made to use Norms. And for
matching norms are disabled. This is to make sure that exact & near exact
matches are rewarded.

But, for the same query, we get the following results.
query: "Beauty & the Beast"
Search Results:
1>  "Beauty and the Beast"
4>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"

Clearly, the changes have solved only a part of the problem. The document 3
should be ranked/scored higher than document 2.

This is because lucene considers the total field length across all the
values in a multivalued field for normalization.

How do we handle this scenario and make sure that in multivalued fields the
normalization is taken care of?


-- 
Regards,
Sravan


Re: Save the document size in to a new field

2018-01-31 Thread Emir Arnautović
With any generic solution there will always be the question of what the
document size is: should you count the same field twice if it is indexed in two
different ways? Does the size in the index count, or the size of the response?

If a simplified version works for you - approximating the doc size by the size of the
largest field, e.g. ‘content’ - you can use
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
to obtain that size.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 31 Jan 2018, at 10:36, Blackknight  wrote:
> 
> Hello guys,
> 
> I want to add an option to search documents by size. For example, find the
> top categories with the biggest documents. I thought about creating a new
> update processor which will count the bytes of all fields in the document,
> but I think it won't work well, because some fields are stored, some are
> indexed, some of them have both of these flags, and there are copyfields
> too which need to be counted...
> 
> So I think the size counter of fields in the update processor will lie about
> the doc size. I don't take into account the compression of the index on
> disk, but I want to get real numbers (I can accept a 10% margin of error).
> 
> Does anyone know what I should do?
> 
> I read some posts about saving the size (in bytes) of a document; all the
> posts were relatively old and had no solution. Maybe Solr has new techniques
> for document size counting? :)
> 
> Thank you, guys! 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



OnImportEnd EventListener

2018-01-31 Thread Srinivas Kashyap
Hello,

I'm trying to get the documents which got indexed on calling DIH and I want to 
differentiate such documents with the ones which are added using SolrJ atomic 
update.

Is it possible to get the document primary keys which got indexed thru 
"onImportEnd" Eventlistener?

Any alternative way I can find them?

Thanks and Regards,
Srinivas Kashyap



Save the document size in to a new field

2018-01-31 Thread Blackknight
Hello guys,

I want to add an option to search documents by size. For example, find the
top categories with the biggest documents. I thought about creating a new
update processor which will count the bytes of all fields in the document,
but I think it won't work well, because some fields are stored, some are
indexed, some of them have both of these flags, and there are copyfields
too which need to be counted...
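
(Roughly, the idea was something like the sketch below - it just sums the
character length of the incoming field values into a new field, which is
already not the same as the on-disk size; the class and field names are
invented, and a matching UpdateRequestProcessorFactory plus an entry in the
update chain would still be needed:)

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

/** Sketch: sum the character length of all incoming field values into a "doc_size" field. */
public class DocSizeUpdateProcessor extends UpdateRequestProcessor {

    public DocSizeUpdateProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        long size = 0;
        for (String name : doc.getFieldNames()) {
            for (Object value : doc.getFieldValues(name)) {
                size += String.valueOf(value).length();  // characters of the input, not index/stored bytes
            }
        }
        doc.setField("doc_size", size);
        super.processAdd(cmd);
    }
}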
 
So I think the size counter of fields in the update processor will lie about
the doc size. I don't take into account the compression of the index on
disk, but I want to get real numbers (I can accept a 10% margin of error).
 
Does anyone know what I should do?

I read some posts about saving the size (in bytes) of a document; all the
posts were relatively old and had no solution. Maybe Solr has new techniques
for document size counting? :)

Thank you, guys! 



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Ah thanks, i just submitted a patch fixing it.

Anyway, in the end it appears this is not the problem we are seeing as our 
timeouts were already at 30 seconds.

All i know is that at some point nodes start to lose ZK connections due to 
timeouts (logs say so, but all within 30 seconds), the logs are flooded with 
those messages:
o.a.z.ClientCnxn Client session timed out, have not heard from server in 
10359ms for sessionid 0x160f9e723c12122
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 
0x60f9e7234f05bb has expired

Then there is a doubling in heap usage and nodes become unresponsive, die etc. 

We also see those messages in other collections, but not so frequently and they 
don't cause failure in those less loaded clusters.

Ideas?

Thanks,
Markus

-Original message-
> From:Michael Braun 
> Sent: Monday 29th January 2018 21:09
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471
> 
> 
> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma 
> wrote:
> 
> > Hello SG,
> >
> > The default in solr.in.sh is commented so it defaults to the value set in
> > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > solr.in.sh and your timeout will be thirty seconds.
> >
> > For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> > to be patched to use the correct value.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:S G 
> > > Sent: Monday 29th January 2018 20:15
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Hi Markus,
> > >
> > > We are in the process of upgrading our clusters to 7.2.1 and I am not
> > sure
> > > I quite follow the conversation here.
> > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> > value
> > > in the config (and it's just a default value being wrong/overridden
> > > somewhere)?
> > > Or is it more severe in the sense that any config set for
> > ZK_CLIENT_TIMEOUT
> > > by the user is just ignored completely by Solr in 7.2.1 ?
> > >
> > > Thanks
> > > SG
> > >
> > >
> > > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Ok, i applied the patch and it is clear the timeout is 15000. Solr.xml
> > > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
> > > > solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
> > > > still 15000, not 30000.
> > > >
> > > > But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
> > > > To be sure, i applied your patch to a production machine, all our
> > > > collections run with 30000. So how would that explain this log line?
> > > >
> > > > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > > > 22130ms
> > > >
> > > > We also see these with smaller values, seven seconds. And, is this
> > > > actually an indicator of the problems we have?
> > > >
> > > > Any ideas?
> > > >
> > > > Many thanks,
> > > > Markus
> > > >
> > > >
> > > > -Original message-
> > > > > From:Markus Jelsma 
> > > > > Sent: Saturday 27th January 2018 10:03
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > Hello,
> > > > >
> > > > > I grepped for it yesterday and found nothing but 30000 in the settings,
> > > > > but judging from the weird time out value, you may be right. Let me apply
> > > > > your patch early next week and check for spurious warnings.
> > > > >
> > > > > Another noteworthy observation for those working on cloud stability and
> > > > > recovery: whenever this happens, some nodes are also absolutely sure to run
> > > > > OOM. The leaders usually live longest, the replicas don't; their heap
> > > > > usage peaks every time, consistently.
> > > > >
> > > > > Thanks,
> > > > > Markus
> > > > >
> > > > > -Original message-
> > > > > > From:Shawn Heisey 
> > > > > > Sent: Saturday 27th January 2018 0:49
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > > > >
> > > > > > On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > > > > > > o.a.z.ClientCnxn Client session timed out, have not heard from
> > > > > > > server in 22130ms (although zkClientTimeOut is 30000).
> > > > > >
> > > > > > Are you absolutely certain that there is a setting for zkClientTimeout
> > > > > > that is actually getting applied?  The default value in Solr's example
> > > > > > configs is 30 seconds, but the internal default in the code (when no
> > > > > > configuration is found) is still 15.  I have confirmed this in the code.
> > > > > >
> > > > > 

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Alessandro Benedetti
I am not sure I fully understood your use case, but let me suggest a few
different possible solutions:

1) Query-time join approach: you keep 2 collections, one static with all
the pages, one that just stores lightweight documents describing the crawling
interactions:
1) id, content -> Pages
2) pageId, ExperimentId, CrawlingCycleId -> CrawlingInteractions

Then your query will be something like this (to retrieve pageId):
http://localhost:8983/solr/select?q={!join+from=id+to=pageId}text:query&fq=CrawlingCycleId:[N TO K]

Retrieving the entire page can be more problematic, as you have to reverse
the join and you will join on millions of items. Not sure if it's going to
work.

2) You use atomic updates [1], and for each experiment and iteration you just
add the fields you want (experimentId and CrawlingCycleId). Be careful here:
atomic updates don't mean you are not going to write the entire document
again (avoiding that is only possible under certain conditions, which I don't
think apply to your use case), but at least it gives you a bit of an
advantage, as the POST requests pushing the document will be much more
lightweight (see the SolrJ sketch below).



[1]
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html
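
A minimal SolrJ sketch of approach (2), assuming a 'pages' collection and
invented field names:

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TagCrawledPage {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build();

        // Atomic update: only the listed fields travel over the wire
        // (Solr still rewrites the whole stored document internally).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "page-42");                                      // existing page id
        doc.addField("experimentId", Collections.singletonMap("add", "exp-7"));
        doc.addField("crawlingCycleId", Collections.singletonMap("add", 3));

        client.add(doc);
        client.commit();
        client.close();
    }
}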




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: OnImportEnd EventListener

2018-01-31 Thread Emir Arnautović
So all fields are DIH imported? And you just want to know which are from the 
last run? Can you add a date field, track when DIH started and ended, and 
filter based on that?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 31 Jan 2018, at 11:56, Srinivas Kashyap  
> wrote:
> 
> Hi Emir,
> 
> Thanks for the reply,
> 
> As I'm doing atomic update on the existing documents(already indexed from 
> DIH) as well, with the suggested approach, I might end up doing atomic update 
> on DIH imported document and commit the same.
> 
> So, I wanted to get the document values which were indexed when import was 
> completed("onImportEnd" eventlistener).
> 
> Thanks and Regards,
> Srinivas Kashyap
> 
> 
> 
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
> Sent: 31 January 2018 04:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: OnImportEnd EventListener
> 
> Hi Srinivas,
> I guess you can add some field that will be set in your DIH config - 
> something like:
> 
> 
> And you can use ‘dih’ field to filter out doc that are imported using DIH.
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 31 Jan 2018, at 11:19, Srinivas Kashyap  
>> wrote:
>> 
>> Hello,
>> 
>> I'm trying to get the documents which got indexed on calling DIH and I want 
>> to differentiate such documents with the ones which are added using SolrJ 
>> atomic update.
>> 
>> Is it possible to get the document primary keys which got indexed thru 
>> "onImportEnd" Eventlistener?
>> 
>> Any alternative way I can find them?
>> 
>> Thanks and Regards,
>> Srinivas Kashyap
>> 
>> 
> 
> 



Using SolrJ for digest authentication

2018-01-31 Thread ddramireddy
We are currently deploying Solr in war mode (yes, the recommendation is not to
use a war, but this is something I can't change now; it is planned for the
future). I am setting up authentication for Solr. As the Solr-provided basic
authentication is not working in Solr 6.4.2, I am setting up digest
authentication in Tomcat for Solr. I am able to log into the Solr admin
application using the credentials.

Now, from my Java application, when I try to run a query which deletes
documents in a core, it throws the following error:

org.apache.http.client.NonRepeatableRequestException: Cannot retry request
with a non-repeatable request entity

I can see that HttpSolrClient only sets up basic authentication, but I am
using digest auth. Has anyone faced this error before?

This is my code:

public static void main(String[] args) throws ClassNotFoundException,
SQLException, InterruptedException, IOException, SolrServerException {
HttpSolrClient solrClient = getSolrHttpClient("solr",
"testpassword");

try {
solrClient.deleteByQuery("account", "*:*");
solrClient.commit("account");
} catch (final SolrServerException | IOException exn) {
throw new IllegalStateException(exn);
}
}

private static HttpSolrClient getSolrHttpClient(final String userName, final
String password) {

final HttpSolrClient solrClient = new HttpSolrClient.Builder()
.withBaseSolrUrl("http://localhost:9000/solr/index.html")
.withHttpClient(getHttpClientWithSolrAuth(userName,
password))
.build();

return solrClient;
}

private static HttpClient getHttpClientWithSolrAuth(final String
userName, final String password) {
final CredentialsProvider provider = new BasicCredentialsProvider();
final UsernamePasswordCredentials credentials
= new UsernamePasswordCredentials(userName, password);
provider.setCredentials(AuthScope.ANY, credentials);


return HttpClientBuilder.create()
.addInterceptorFirst(new PreemptiveAuthInterceptor())
.setDefaultCredentialsProvider(provider)
.build();

}


static class PreemptiveAuthInterceptor implements HttpRequestInterceptor
{

DigestScheme digestAuth = new DigestScheme();

PreemptiveAuthInterceptor() {

}

@Override
public void process(final HttpRequest request, final HttpContext
context)
throws HttpException, IOException {
final AuthState authState = (AuthState)
context.getAttribute(HttpClientContext.TARGET_AUTH_STATE);

if (authState != null && authState.getAuthScheme() == null) {
final CredentialsProvider credsProvider =
(CredentialsProvider)
context.getAttribute(HttpClientContext.CREDS_PROVIDER);
final HttpHost targetHost = (HttpHost)
context.getAttribute(HttpCoreContext.HTTP_TARGET_HOST);
final Credentials creds = credsProvider.getCredentials(new
AuthScope(targetHost.getHostName(), targetHost.getPort(), "Solr",
"DIGEST"));
            if (creds == null) {
                System.out.println("No credentials for preemptive authentication");
            }
digestAuth.overrideParamter("realm", "Solr");
digestAuth.overrideParamter("nonce", Long.toString(new
Random().nextLong(), 36));
AuthCache authCache = new BasicAuthCache();
authCache.put(targetHost, digestAuth);

// Add AuthCache to the execution context
HttpClientContext localContext = HttpClientContext.create();
localContext.setAuthCache(authCache);

request.addHeader(digestAuth.authenticate(creds, request,
localContext));
        } else {
            System.out.println("authState is null. No preemptive authentication.");
        }
}
}



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: OnImportEnd EventListener

2018-01-31 Thread Srinivas Kashyap
Hi Emir,

Thanks for the reply,

As I'm doing atomic update on the existing documents(already indexed from DIH) 
as well, with the suggested approach, I might end up doing atomic update on DIH 
imported document and commit the same.

So, I wanted to get the document values which were indexed when import was 
completed("onImportEnd" eventlistener).

Thanks and Regards,
Srinivas Kashyap



-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: 31 January 2018 04:14 PM
To: solr-user@lucene.apache.org
Subject: Re: OnImportEnd EventListener

Hi Srinivas,
I guess you can add some field that will be set in your DIH config - something 
like:


And you can use ‘dih’ field to filter out doc that are imported using DIH.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 31 Jan 2018, at 11:19, Srinivas Kashyap  
> wrote:
> 
> Hello,
> 
> I'm trying to get the documents which got indexed on calling DIH and I want 
> to differentiate such documents with the ones which are added using SolrJ 
> atomic update.
> 
> Is it possible to get the document primary keys which got indexed thru 
> "onImportEnd" Eventlistener?
> 
> Any alternative way I can find them?
> 
> Thanks and Regards,
> Srinivas Kashyap
> 
> 




Re: Computing record score depending on its association with other records

2018-01-31 Thread Gintautas Sulskus
Yes, that is correct. The 'features' collection stores the mapping between
features and their scores.
For simplicity, I tried to keep the level of detail about these collections
to a minimum.

Both collections contain thousands of records and are updated by (lily)
hbase-indexer. Therefore storing scores/weights in the model resource is
not feasible.

Ideally, I would like to keep these data collections separate and perform
cross-collection queries. If such an approach is not feasible, then I could
possibly merge the two collections into one.
That would make matters simpler, but it is not ideal.

Gintas



On Tue, Jan 30, 2018 at 5:49 PM, Alessandro Benedetti 
wrote:

> According to what I understood, the feature weight is present in your
> second collection.
> You should express the feature weight in the model resource (not even in
> the original collection).
> Is it actually necessary for the feature weight to be in a separate Solr
> collection?
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: OnImportEnd EventListener

2018-01-31 Thread Emir Arnautović
Hi Srinivas,
I guess you can add some field that will be set in your DIH config - something 
like:


And you can use ‘dih’ field to filter out doc that are imported using DIH.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 31 Jan 2018, at 11:19, Srinivas Kashyap  
> wrote:
> 
> Hello,
> 
> I'm trying to get the documents which got indexed on calling DIH and I want 
> to differentiate such documents with the ones which are added using SolrJ 
> atomic update.
> 
> Is it possible to get the document primary keys which got indexed thru 
> "onImportEnd" Eventlistener?
> 
> Any alternative way I can find them?
> 
> Thanks and Regards,
> Srinivas Kashyap
> 
> 



Re: facet.method=uif not working in solr cloud?

2018-01-31 Thread Alessandro Benedetti
I personally worked on the SimpleFacets class, which does the facet method
selection:

FacetMethod appliedFacetMethod = selectFacetMethod(field, sf, requestedMethod, mincount, exists);

RTimer timer = null;
if (fdebug != null) {
    fdebug.putInfoItem("requestedMethod", requestedMethod==null?"not specified":requestedMethod.name());
    fdebug.putInfoItem("appliedMethod", appliedFacetMethod.name());
    fdebug.putInfoItem("inputDocSetSize", docs.size());
    fdebug.putInfoItem("field", field);
    timer = new RTimer();
}

Within the select facet method, the only code block related to UIF is this
(another block can apply when the facet method arrives null at the Solr node,
but that should not apply here, as we see the facet method in the debug):

/* UIF without DocValues can't deal with mincount=0, the reason is because
   we create the buckets based on the values present in the result set.
   So we are not going to see facet values which are not in the result set */
if (method == FacetMethod.UIF
    && !field.hasDocValues() && mincount == 0) {
  method = field.multiValued() ? FacetMethod.FC : FacetMethod.FCS;
}

So is there anything in the logs?
Because that seems to me the only point where you can change from UIF to FC
and you clearly have mincount=1.
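
For what it is worth, a quick SolrJ sketch to reproduce that check and read the
requested/applied method back from the debug output (the field name is
invented, and the facet-debug section assumes a Solr version recent enough to
report it):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UifFacetCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");          // invented facet field
        q.setFacetMinCount(1);
        q.set("facet.method", "uif");
        q.set("debugQuery", "true");          // facet debug shows requestedMethod / appliedMethod

        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getDebugMap()); // look for the "facet-debug" entry here
        client.close();
    }
}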





-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


OnImportEnd EventListener

2018-01-31 Thread Srinivas Kashyap
Hello,

I'm trying to get the documents which got indexed on calling DIH and I want to 
differentiate such documents with the ones which are added using SolrJ atomic 
update.

Is it possible to get the document primary keys which got indexed thru 
"onImportEnd" Eventlistener?

Any alternative way I can find them?

Thanks and Regards,
Srinivas Kashyap



Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy,
I see several issues, but I am not sure if any of them is the reason why you are
not getting what you expect:
* there are no spaces around OR, and that results in the query sometimes being
parsed with OR treated as a term, e.g. (pdb_id:OR\”Solution)^5
* wildcard in quotes - it is not handled as you expected: the whole phrase is
analysed, the wildcard is eliminated, and the resulting query is method:”x rai”
* in 1d you search for “x-reays*” - that will search all fields, not just the
method field - maybe that is why you get 844 results.

Can you provide the debug query for all three (1d, 1e and 1f, with spaces around
OR)? Please paste the results as text and not as a picture, and do not update
the original post, since some of us are reading this by mail and do not get the updates.

Thanks,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 30 Jan 2018, at 21:33, Wendy2  wrote:
> 
> Hi Emir,
> 
> Thank you for reading my post and for your reply. I updated my post with
> debug info and a better view of the definition of the /search request handler. 
> 
> Any suggestion on what I should try? 
> 
> Thanks,
> 
> Wendy
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: full name free text search problem

2018-01-31 Thread Alessandro Benedetti
"I am getting the records matching the full name sorted by distance. 
If the input string(for ex Dae Kim) is provided, I am getting the records
other than Dae Kim(for ex Rodney Kim) too at the top of the search results
including Dae Kim 
just before the next Dae Kim because Kim is matching with all the fields
like full name, facility name and the office name. So, the hit frequency is
high and it's 
distance is less compared to the next Dae Kim in the search results with
higher distance. "

This is all quite confusing.
First of all, "sorted by distance": do you mean sorted by string distance?
By a spatial distance?
You are analysing the fields without tokenization and then you put
everything in the same multivalued field.
This means you are only going to get exact matches.
And you lose the semantics of the field source (which could have given a
different score boost depending on the field).

If you want to sort or score by a string distance, you need to use function
query sorting or boosting [1].
In particular you are interested in strdist (you will find the details in the
linked page).
If it is a geographical distance, take a look at the spatial module [2].
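
For example, something along these lines from SolrJ (a sketch only: the field
names are invented, full_name_exact would have to be a single-valued string
copy of the name for the function query to work, and geodist() needs a spatial
field plus a reference point):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class NameSearchSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/providers").build();

        SolrQuery q = new SolrQuery("full_name:\"Dae Kim\"");
        q.set("sfield", "location");        // invented lat/lon field, required by geodist()
        q.set("pt", "37.77,-122.41");       // the user's position
        // string distance first (edit = Levenshtein), ties broken by geographical distance
        q.set("sort", "strdist(\"Dae Kim\",full_name_exact,edit) desc, geodist() asc");

        System.out.println(client.query(q).getResults());
        client.close();
    }
}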

Regards

[1] https://lucene.apache.org/solr/guide/6_6/function-queries.html
[2] https://lucene.apache.org/solr/guide/6_6/spatial-search.html



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Distributed search cross cluster

2018-01-31 Thread Bernd Fehling
Many years ago, in a different universe, when Federated Search was a buzzword we
used Unity from FAST FDS (which is now MS ESP). It worked pretty well across
many systems like FAST FDS, Google, Gigablast, ...
Very flexible with different mixers, parsers, query transformers.
Was written in Python and used pylib.medusa.
Search for "unity federated search", there is a book at Google about this, just
to get an idea.

Regards, Bernd


Am 30.01.2018 um 17:09 schrieb Jan Høydahl:
> Hi,
> 
> A customer has 10 separate SolrCloud clusters, with same schema across all, 
> but different content.
> Now they want users in each location to be able to federate a search across 
> all locations.
> Each location is 100% independent, with separate ZK etc. Bandwidth and 
> latency between the
> clusters is not an issue, they are actually in the same physical datacenter.
> 
> Now my first thought was using a custom shards parameter, and let the 
> receiving node fan
> out to all shards of all clusters. We’d need to contact the ZK for each 
> environment and find
> all shards and replicas participating in the collection and then construct 
> the shards=A1|A2,B1|B2…
> string which would be quite big, but if we get it right, it should “just work".
> 
> Now, my question is whether there are other smarter ways that would leave it 
> up to existing Solr
> logic to select shards and load balance, that would also take into account 
> any shard.keys/_route_
> info etc. I thought of these
>   * collection=collA,collB  — but it only supports collections local to one 
> cloud
>   * Create a collection ALIAS to point to all 10 — but same here, only local 
> to one cluster
>   * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want 
> it with pure search API
>   * Write a custom ShardHandler plugin that knows about all clusters — but 
> this is complex stuff :)
>   * Write a custom SearchComponent plugin that knows about all clusters and 
> adds the shards= param
> 
> Another approach would be for the originating cluster to fan out just ONE 
> request to each of the other
> clusters and then write some SearchComponent to merge those responses. That 
> would let us query
> the other clusters using one LB IP address instead of requiring full 
> visibility to all solr nodes
> of all clusters, but if we don’t need that isolation, that extra merge code 
> seems fairly complex.
> 
> So far I opt for the custom SearchComponent and shards= param approach. Any 
> useful input from
> someone who tried a similar approach would be priceless!
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
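
For reference, the manual shards= fan-out described above would look roughly
like this from SolrJ (a sketch only: host names, ports and core names are
placeholders, and in practice the list would be built from each cluster's
ZooKeeper state):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CrossClusterSearchSketch {
    public static void main(String[] args) throws Exception {
        // any node of the "local" cluster can act as the aggregator
        HttpSolrClient client = new HttpSolrClient.Builder("http://dc1-solr1:8983/solr/docs").build();

        SolrQuery q = new SolrQuery("some query");
        // one comma-separated entry per logical shard, '|' separates equivalent replicas
        q.set("shards",
              "dc1-solr1:8983/solr/docs_shard1_replica1|dc1-solr2:8983/solr/docs_shard1_replica2,"
            + "dc2-solr1:8983/solr/docs_shard1_replica1|dc2-solr2:8983/solr/docs_shard1_replica2");

        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}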


Re: Distributed search cross cluster

2018-01-31 Thread Charlie Hull

On 30/01/2018 16:09, Jan Høydahl wrote:

Hi,

A customer has 10 separate SolrCloud clusters, with same schema across all, but 
different content.
Now they want users in each location to be able to federate a search across all 
locations.
Each location is 100% independent, with separate ZK etc. Bandwidth and latency 
between the
clusters is not an issue, they are actually in the same physical datacenter.

Now my first thought was using a custom shards parameter, and let the 
receiving node fan
out to all shards of all clusters. We’d need to contact the ZK for each 
environment and find
all shards and replicas participating in the collection and then construct the 
shards=A1|A2,B1|B2…
string which would be quite big, but if we get it right, it should “just work".

Now, my question is whether there are other smarter ways that would leave it up 
to existing Solr
logic to select shards and load balance, that would also take into account any 
shard.keys/_route_
info etc. I thought of these
   * collection=collA,collB  — but it only supports collections local to one 
cloud
   * Create a collection ALIAS to point to all 10 — but same here, only local 
to one cluster
   * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want it 
with pure search API
   * Write a custom ShardHandler plugin that knows about all clusters — but 
this is complex stuff :)
   * Write a custom SearchComponent plugin that knows about all clusters and adds 
the shards= param

Another approach would be for the originating cluster to fan out just ONE 
request to each of the other
clusters and then write some SearchComponent to merge those responses. That 
would let us query
the other clusters using one LB IP address instead of requiring full visibility 
to all solr nodes
of all clusters, but if we don’t need that isolation, that extra merge code 
seems fairly complex.

So far I opt for the custom SearchComponent and shards= param approach. Any 
useful input from
someone who tried a similar approach would be priceless!


Hi Jan,

We actually looked at this for the BioSolr project - a SolrCloud of 
SolrClouds. Unfortunately the funding didn't appear for the project so 
we didn't take it any further than some rough ideas - as you say, if you 
get it right it should 'just work'. We had some extra complications in 
terms of shared partial schemas...


Cheers

Charlie


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk