Re: Bi Gram token generation with fuzzy searches

2018-02-07 Thread Sravan Kumar
@Emir: The 'sow' parameter in edismax, along with the nested '_query_' query,
works. Tuning has to be done for the desired relevancy.

@Walter: It would be nice to have SOLR-629 integrated into the project. As
Emir suggested, _query_ caters to my need by applying the fuzzy parameter to
the query. Anyway, I will apply the patch and give it a try.
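For reference, a hedged sketch of how such a combined request could be assembled client-side. The field names `title_bigrams`/`title` and the `qq` parameter name are assumptions for illustration, not confirmed configuration:

```python
from urllib.parse import urlencode

def build_params(user_query: str) -> str:
    """Build request params combining a shingle clause (sow=false, so the
    analyser sees the whole string) with a plain edismax clause, via the
    _query_ mechanism; both clauses share the text through the $qq param."""
    q = ('_query_:"{!edismax sow=false qf=title_bigrams v=$qq}" OR '
         '_query_:"{!edismax qf=title v=$qq}"')
    return urlencode({"q": q, "qq": user_query})

params = build_params("pursuit of happyness")
```

Both clauses are scored together by the main (lucene) query parser, so relevancy tuning reduces to weighting the two nested clauses.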


On Wed, Feb 7, 2018 at 8:42 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> I think you need the feature in SOLR-629 that adds fuzzy to edismax.
>
> https://issues.apache.org/jira/browse/SOLR-629
>
> The patch on that issue is for Solr 4.x, but I believe someone is working
> on a new patch.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 7, 2018, at 2:10 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> >
> > Hi Sravan,
> > Edismax has a ’sow’ parameter that makes edismax pass the whole query to
> field analysis, but I am not sure how it will work with fuzzy search. What you
> might do is use the _query_ syntax to separate shingle and non-shingle
> queries, e.g.
> > q=_query_:"{!edismax sow=false qf=title_bigrams v=$qq}" OR
> _query_:"{!edismax qf=title v=$qq}"&qq=some movie title
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 7 Feb 2018, at 10:55, Sravan Kumar <sra...@caavo.com> wrote:
> >>
> >> We have the following two fields for our movie title search
> >> - title without symbols
> >> a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory
> and
> >> other filters to retain only alphanumeric characters.
> >> - title with word bi grams
> >> a custom analyser with solr.ShingleFilterFactory to generate "bi gram"
> word
> >> tokens with '_' as separator.
> >>
> >> A custom similarity class is used to make tf & idf values as 1.
> >>
> >> Edismax query parser is used to perform all searches. Phrase boosting
> (pf)
> >> is also used.
> >>
> >> There are a couple of issues while searching:
> >> 1>  BiGram field doesn't generate bi grams if the white spaces in the
> query
> >> are not escaped.
> >> - For example, if the query is "pursuit of happyness", then bi grams are
> >> not generated. This is because the edismax query parser
> >> tokenizes on whitespace before passing the string to the
> >> analyser (correct me if I am wrong).
> >> But in the case of "pursuit\ of\ happyness" they are generated, as the
> >> string passed to the analyser retains the whitespace.
> >>
> >> 2>  Fuzzy search doesn't work in whitespace-escaped queries.
> >> Ex: "pursuit~2\ of\ happiness~1"
> >>
> >> 3> Edismax's Phrase boosting doesn't work the way it should in
> >> non-whitespace escaped fuzzy queries.
> >>
> >> If the query is "pursuit~2 of happiness~1" (without escaping
> whitespaces)
> >>
> >> fuzzy queries are generated
> >> (title_name:pursuit~2), (title_name:happiness~1) in the parsed query.
> >> But, edismax pf (phrase boost) generates a query like
> >> title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
> >> This means the analyser received the original query containing the fuzzy
> >> operator for phrase boosting.
> >>
> >>
> >> 1> How should whitespaces be handled in the case of filters like
> >> solr.ShingleFilterFactory that generate bi grams?
> >> 2> If generating bi grams requires whitespaces to be escaped and fuzzy
> >> searches require them not to be, how do we accommodate both in a single
> >> Solr request and score them together?
> >>
> >>
> >>
> >> -
> >> --
> >> Regards,
> >> Sravan
> >
>
>


-- 
Regards,
Sravan


Bi Gram token generation with fuzzy searches

2018-02-07 Thread Sravan Kumar
We have the following two fields for our movie title search
- title without symbols
a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory and
other filters to retain only alphanumeric characters.
- title with word bi grams
a custom analyser with solr.ShingleFilterFactory to generate "bi gram" word
tokens with '_' as separator.
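For illustration, a rough Python approximation of what the shingle field emits (a minimal sketch of ShingleFilterFactory's bigram output with '_' as the token separator, not the actual Lucene filter):

```python
def word_bigrams(text: str, sep: str = "_"):
    """Approximate what solr.ShingleFilterFactory (maxShingleSize=2,
    tokenSeparator='_') emits: adjacent word pairs joined by the separator."""
    tokens = text.split()
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]

bigrams = word_bigrams("pursuit of happyness")
# Two bigram tokens are produced when the analyser sees the whole string.
```

Note this only happens when the analyser receives the full string with its whitespace, which is exactly the issue described below.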

A custom similarity class is used to fix the tf & idf values at 1.

Edismax query parser is used to perform all searches. Phrase boosting (pf)
is also used.

There are a couple of issues while searching:
1>  BiGram field doesn't generate bi grams if the white spaces in the query
are not escaped.
- For example, if the query is "pursuit of happyness", then bi grams are
not generated. This is because the edismax query parser
tokenizes on whitespace before passing the string to the
analyser (correct me if I am wrong).
But in the case of "pursuit\ of\ happyness" they are generated, as the
string passed to the analyser retains the whitespace.
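A client-side sketch of the whitespace escaping described above (a minimal illustration; as noted later in this message, escaped input conflicts with fuzzy syntax):

```python
def escape_whitespace(query: str) -> str:
    """Escape spaces so the edismax query parser does not pre-tokenize on
    whitespace, and the analyser receives the whole string (enabling shingles)."""
    return query.replace(" ", "\\ ")

escaped = escape_whitespace("pursuit of happyness")  # -> pursuit\ of\ happyness
```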

2>  Fuzzy search doesn't work in whitespace-escaped queries.
Ex: "pursuit~2\ of\ happiness~1"

3> Edismax's Phrase boosting doesn't work the way it should in
non-whitespace escaped fuzzy queries.

If the query is "pursuit~2 of happiness~1" (without escaping whitespaces)

fuzzy queries are generated
(title_name:pursuit~2), (title_name:happiness~1) in the parsed query.
But, edismax pf (phrase boost) generates a query like
title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
This means the analyser received the original query containing the fuzzy
operator for phrase boosting.


1> How should whitespaces be handled in the case of filters like
solr.ShingleFilterFactory that generate bi grams?
2> If generating bi grams requires whitespaces to be escaped and fuzzy
searches require them not to be, how do we accommodate both in a single Solr
request and score them together?



-
-- 
Regards,
Sravan


Re: Title Search scoring issues with multivalued field & norm

2018-02-04 Thread Sravan Kumar
Using edismax with different fields for each title will affect the final
scores if the tie parameter is non-zero.

Can we create a separate document for each title? The uniqueness won't be for
movie_id but for each title. In this manner, even while using edismax, the
other titles won't affect the score.

Any other way to handle norms in multivalued field?

On Thu, Feb 1, 2018 at 12:24 PM, Sravan Kumar <sra...@caavo.com> wrote:

> @Walter: Perhaps you are right on not to consider stemming. Instead fuzzy
> search will cover these along with the misspellings.
>
> In case of symbols, we want the titles matching the symbols ranked higher
> than the others. Perhaps we can use this field only for boosting.
>
> Certain movies have around 4-6 different aliases based on what our source
> gives and we do not really know what is the max. Is there no other way from
> lucene/solr to use a multivalued field?
>
>
> On Thu, Feb 1, 2018 at 11:06 AM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> I was the first search engineer at Netflix and moved their search from a
>> home-grown engine to Solr. It worked very well with a single title field
>> and aliases.
>>
>> I think your schema is too complicated for movie search.
>>
>> Stemming is not useful. It doesn’t help search and it can hurt. You don’t
>> want the movie “Saw” to match the query “see”.
>>
>> When is it useful to search with symbols? Remove the punctuation.
>>
>> The only movie titles with symbols that caused any challenge were:
>>
>> * Frost/Nixon
>> * .hack//Sign
>> * +/-
>>
>> For the first two, removing punctuation worked fine. For the last one, I
>> hardcoded a translation to “plus/minus” before indexing or querying.
>>
>> Query completion made a huge difference, taking our clickthrough rate
>> from 0.45 to 0.55.
>>
>> Later, we added fuzzy search to handle misspellings.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Jan 31, 2018, at 8:54 PM, Sravan Kumar <sra...@caavo.com> wrote:
>> >
>> > @Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents.
>> This
>> > is done through the fieldnorm component in the class. The issue is when
>> the
>> > field is multivalued. Consider the field has two strings, each of 4
>> tokens.
>> > The fieldNorm from the lucene TFIDFSimilarity class considers the total
>> sum
>> > of these two values i.e 8 for normalizing instead of 4. Hence, the
>> ranking
>> > is distorted.
>> > Regarding the search evaluation, we do have a curated set.
>> >
>> >
>> > On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey <tca...@gmail.com> wrote:
>> >
>> >> For smaller length documents TFIDFSimilarity will weight towards
>> shorter
>> >> documents.  Another way to say this, if your documents are 5-10 terms,
>> the
>> >> 5 terms are going to win.
>> >> You might think about having per token, or token pair, weight.  I
>> would be
>> >> surprised if there was not something similar out there.  This is a
>> common
>> >> issue with any short text.
>> >> I guess I would think of this as TFICF, where the CF is the corpus
>> >> frequency. You also might want to weight inversely proportional to the
>> age
>> >> of the title, older are less important.  This is assuming people are
>> doing
>> >> searches within some time cluster, newer is more likely.
>> >>
>> >> For some obvious advice, things you probably already know.  This kind
>> of
>> >> search needs some hard measurement to begin to know how to tune it.
>> You
>> >> need to find a reasonable annotated representation.  So, if you took
>> the
>> >> previous months searches where there is a chain of successive
>> searches.  If
>> >> you weighted things differently would you shorten the length of the
>> chain.
>> >> Can you get the click throughs to happen sooner.
>> >>
>> >> Anyway, just my 2 cents
>> >>
>> >>
>> >> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar <sra...@caavo.com>
>> wrote:
>> >>
>> >>>
>> >>> @Walter: We have 6 fields declared in schema.xml for title each with
>> >>> different type of analyzer. One without processing symbols, other
>> stemmed
>> >>> and other removing  symbols, etc. So, if we have separate fields for

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: Perhaps you are right on not to consider stemming. Instead fuzzy
search will cover these along with the misspellings.

In case of symbols, we want the titles matching the symbols ranked higher
than the others. Perhaps we can use this field only for boosting.

Certain movies have around 4-6 different aliases based on what our source
gives and we do not really know what is the max. Is there no other way from
lucene/solr to use a multivalued field?


On Thu, Feb 1, 2018 at 11:06 AM, Walter Underwood <wun...@wunderwood.org>
wrote:

> I was the first search engineer at Netflix and moved their search from a
> home-grown engine to Solr. It worked very well with a single title field
> and aliases.
>
> I think your schema is too complicated for movie search.
>
> Stemming is not useful. It doesn’t help search and it can hurt. You don’t
> want the movie “Saw” to match the query “see”.
>
> When is it useful to search with symbols? Remove the punctuation.
>
> The only movie titles with symbols that caused any challenge were:
>
> * Frost/Nixon
> * .hack//Sign
> * +/-
>
> For the first two, removing punctuation worked fine. For the last one, I
> hardcoded a translation to “plus/minus” before indexing or querying.
>
> Query completion made a huge difference, taking our clickthrough rate from
> 0.45 to 0.55.
>
> Later, we added fuzzy search to handle misspellings.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jan 31, 2018, at 8:54 PM, Sravan Kumar <sra...@caavo.com> wrote:
> >
> > @Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents.
> This
> > is done through the fieldnorm component in the class. The issue is when
> the
> > field is multivalued. Consider the field has two strings, each of 4 tokens.
> > The fieldNorm from the lucene TFIDFSimilarity class considers the total
> sum
> > of these two values i.e 8 for normalizing instead of 4. Hence, the
> ranking
> > is distorted.
> > Regarding the search evaluation, we do have a curated set.
> >
> >
> > On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey <tca...@gmail.com> wrote:
> >
> >> For smaller length documents TFIDFSimilarity will weight towards shorter
> >> documents.  Another way to say this, if your documents are 5-10 terms,
> the
> >> 5 terms are going to win.
> >> You might think about having per token, or token pair, weight.  I would
> be
> >> surprised if there was not something similar out there.  This is a
> common
> >> issue with any short text.
> >> I guess I would think of this as TFICF, where the CF is the corpus
> >> frequency. You also might want to weight inversely proportional to the
> age
> >> of the title, older are less important.  This is assuming people are
> doing
> >> searches within some time cluster, newer is more likely.
> >>
> >> For some obvious advice, things you probably already know.  This kind of
> >> search needs some hard measurement to begin to know how to tune it.  You
> >> need to find a reasonable annotated representation.  So, if you took the
> >> previous month's searches where there is a chain of successive
> searches.  If
> >> you weighted things differently would you shorten the length of the
> chain.
> >> Can you get the click throughs to happen sooner.
> >>
> >> Anyway, just my 2 cents
> >>
> >>
> >> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar <sra...@caavo.com> wrote:
> >>
> >>>
> >>> @Walter: We have 6 fields declared in schema.xml for title each with
> >>> different type of analyzer. One without processing symbols, other
> stemmed
> >>> and other removing  symbols, etc. So, if we have separate fields for
> each
> >>> alias it will be that many times the number of final fields declared in
> >>> schema.xml. And we exactly do not know what is the maximum number of
> >>> aliases a movie can have.
> >>> @Walter: I will try this but isn’t there any other way  where I can
> >> tweak ?
> >>>
> >>> @Erick: will try this. But it will work only for exact matches.
> >>>
> >>>
> >>>> On Jan 31, 2018, at 10:39 PM, Erick Erickson <erickerick...@gmail.com
> >
> >>> wrote:
> >>>>
> >>>> Or use a boost for the phrase, something like
> >>>> "beauty and the beast"^5
> >>>>
> >>>>> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood 

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This
is done through the fieldnorm component in the class. The issue is when the
field is multivalued. Consider the field has two strings, each of 4 tokens.
The fieldNorm from the lucene TFIDFSimilarity class considers the total sum
of these two values i.e 8 for normalizing instead of 4. Hence, the ranking
is distorted.
Regarding the search evaluation, we do have a curated set.


On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey <tca...@gmail.com> wrote:

> For smaller length documents TFIDFSimilarity will weight towards shorter
> documents.  Another way to say this, if your documents are 5-10 terms, the
> 5 terms are going to win.
> You might think about having per token, or token pair, weight.  I would be
> surprised if there was not something similar out there.  This is a common
> issue with any short text.
> I guess I would think of this as TFICF, where the CF is the corpus
> frequency. You also might want to weight inversely proportional to the age
> of the title, older are less important.  This is assuming people are doing
> searches within some time cluster, newer is more likely.
>
> For some obvious advice, things you probably already know.  This kind of
> search needs some hard measurement to begin to know how to tune it.  You
> need to find a reasonable annotated representation.  So, if you took the
> previous month's searches where there is a chain of successive searches.  If
> you weighted things differently would you shorten the length of the chain.
> Can you get the click throughs to happen sooner.
>
> Anyway, just my 2 cents
>
>
> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar <sra...@caavo.com> wrote:
>
> >
> > @Walter: We have 6 fields declared in schema.xml for title each with
> > different type of analyzer. One without processing symbols, other stemmed
> > and other removing  symbols, etc. So, if we have separate fields for each
> > alias it will be that many times the number of final fields declared in
> > schema.xml. And we exactly do not know what is the maximum number of
> > aliases a movie can have.
> > @Walter: I will try this but isn’t there any other way  where I can
> tweak ?
> >
> > @Erick: will try this. But it will work only for exact matches.
> >
> >
> > > On Jan 31, 2018, at 10:39 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> > >
> > > Or use a boost for the phrase, something like
> > > "beauty and the beast"^5
> > >
> > >> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
> > wun...@wunderwood.org> wrote:
> > >> You can use a separate field for title aliases. That is what I did for
> > Netflix search.
> > >>
> > >> Why disable idf? Disabling tf for titles can be a good idea, for
> > example the movie “New York, New York” is not twice as much about New
> York
> > as some other film that just lists it once.
> > >>
> > >> Also, consider using a popularity score as a boost.
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> wun...@wunderwood.org
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sra...@caavo.com> wrote:
> > >>>
> > >>> Hi,
> > >>> We are using solr for our movie title search.
> > >>>
> > >>>
> > >>> As it is "title search", this should be treated different than the
> > normal
> > >>> document search.
> > >>> Hence, we use a modified version of TFIDFSimilarity with the
> following
> > >>> changes.
> > >>> -  disabled TF & IDF and will only have 1 as value.
> > >>> -  disabled norms by specifying omitNorms as true for all the fields.
> > >>>
> > >>> There are 6 fields with different analyzers and we make use of
> > different
> > >>> weights in edismax's qf & pf parameters to match tokens & boost
> > phrases.
> > >>>
> > >>> But, movies could have aliases and have multiple titles. So, we made
> > the
> > >>> fields multivalued.
> > >>>
> > >>> Now, consider the following four documents
> > >>> 1>  "Beauty and the Beast"
> > >>> 2>  "The Real Beauty and the Beast"
> > >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> > >>> 4>  "Beauty and the Beast"

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar

@Walter: We have 6 fields declared in schema.xml for the title, each with a
different type of analyzer: one without processing symbols, another stemmed,
another removing symbols, etc. So, if we had a separate field for each alias,
it would multiply the number of fields declared in schema.xml by that many
times. And we do not know exactly what the maximum number of aliases a movie
can have is.
@Walter: I will try this, but isn’t there any other way I can tweak it?

@Erick: will try this. But it will work only for exact matches.


> On Jan 31, 2018, at 10:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Or use a boost for the phrase, something like
> "beauty and the beast"^5
> 
>> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <wun...@wunderwood.org> 
>> wrote:
>> You can use a separate field for title aliases. That is what I did for 
>> Netflix search.
>> 
>> Why disable idf? Disabling tf for titles can be a good idea, for example the 
>> movie “New York, New York” is not twice as much about New York as some other 
>> film that just lists it once.
>> 
>> Also, consider using a popularity score as a boost.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sra...@caavo.com> wrote:
>>> 
>>> Hi,
>>> We are using solr for our movie title search.
>>> 
>>> 
>>> As it is "title search", this should be treated different than the normal
>>> document search.
>>> Hence, we use a modified version of TFIDFSimilarity with the following
>>> changes.
>>> -  disabled TF & IDF and will only have 1 as value.
>>> -  disabled norms by specifying omitNorms as true for all the fields.
>>> 
>>> There are 6 fields with different analyzers and we make use of different
>>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>>> 
>>> But, movies could have aliases and have multiple titles. So, we made the
>>> fields multivalued.
>>> 
>>> Now, consider the following four documents
>>> 1>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 4>  "Beauty and the Beast"
>>> 
>>> Note: Document 3 has two titles in it.
>>> 
>>> So, for the query "Beauty and the Beast" with the above configuration, all
>>> the documents receive the same score. But 1, 3, and 4 should get the same
>>> score, and document 2 less than the others.
>>> 
>>> To solve this, we followed what is suggested in the following thread:
>>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>>> 
>>> Now, the fields used for boosting have norms enabled, and for matching,
>>> norms are disabled. This is to make sure that exact & near-exact
>>> matches are rewarded.
>>> 
>>> But, for the same query, we get the following results.
>>> query: "Beauty & the Beast"
>>> Search Results:
>>> 1>  "Beauty and the Beast"
>>> 4>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 
>>> Clearly, the changes have solved only a part of the problem. The document 3
>>> should be ranked/scored higher than document 2.
>>> 
>>> This is because lucene considers the total field length across all the
>>> values in a multivalued field for normalization.
>>> 
>>> How do we handle this scenario and make sure that in multivalued fields the
>>> normalization is taken care of?
>>> 
>>> 
>>> --
>>> Regards,
>>> Sravan
>> 


Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
Hi,
We are using solr for our movie title search.


As it is "title search", this should be treated differently from normal
document search.
Hence, we use a modified version of TFIDFSimilarity with the following
changes.
-  disabled TF & IDF so that both always have a value of 1.
-  disabled norms by specifying omitNorms as true for all the fields.

There are 6 fields with different analyzers and we make use of different
weights in edismax's qf & pf parameters to match tokens & boost phrases.

But, movies could have aliases and have multiple titles. So, we made the
fields multivalued.

Now, consider the following four documents
1>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"
4>  "Beauty and the Beast"

Note: Document 3 has two titles in it.

So, for the query "Beauty and the Beast" with the above configuration, all
the documents receive the same score. But 1, 3, and 4 should get the same
score, and document 2 less than the others.

To solve this, we followed what is suggested in the following thread:
http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html

Now, the fields used for boosting have norms enabled, and for matching,
norms are disabled. This is to make sure that exact & near-exact
matches are rewarded.

But, for the same query, we get the following results.
query: "Beauty & the Beast"
Search Results:
1>  "Beauty and the Beast"
4>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"

Clearly, the changes have solved only part of the problem. Document 3
should be ranked/scored higher than document 2.

This is because lucene considers the total field length across all the
values in a multivalued field for normalization.
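A small numeric sketch of the effect described above: the classic Lucene `TFIDFSimilarity` lengthNorm is 1/√(number of terms), and for a multivalued field the term counts of all values are summed (the norm's byte quantization is ignored here for simplicity):

```python
import math

def length_norm(num_terms: int) -> float:
    # Classic Lucene TFIDFSimilarity lengthNorm: 1 / sqrt(numTerms).
    return 1.0 / math.sqrt(num_terms)

# "Beauty and the Beast" alone: 4 terms.
single = length_norm(4)
# Same title plus the alias "La bella y la bestia": 4 + 5 = 9 terms summed,
# so the multivalued document gets a smaller norm (lower score), even though
# one of its values matches the query exactly.
multi = length_norm(4 + 5)
```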

How do we handle this scenario and make sure that in multivalued fields the
normalization is taken care of?


-- 
Regards,
Sravan


Re: SolrCloud Nodes going to recovery state during indexing

2018-01-04 Thread Sravan Kumar
Emir,
   'delete_by_query' is the cause of the replicas going into recovery state.
   I replaced it with delete_by_id as you suggested. Everything works fine
after that. The cluster held for nearly 3 hours without any failures.
  Thanks Emir.
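For reference, a minimal sketch of the workaround (fetch the IDs matching the time condition with a query, then send one bulk delete-by-ID request). The Solr JSON delete format is standard, but the field name `update_time` and the surrounding flow are assumptions:

```python
import json

def build_delete_by_id_payload(ids):
    """Build the JSON body for Solr's update handler that deletes documents
    by ID instead of by query, avoiding the DBQ replica-blocking problem."""
    return json.dumps({"delete": list(ids)})

# A query such as q=update_time:[* TO NOW-1DAY]&fl=id would fetch the IDs;
# they are then removed in a single bulk request:
payload = build_delete_by_id_payload(["doc1", "doc2"])
```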


On Wed, Jan 3, 2018 at 8:41 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Sravan,
> DBQ does not play well with indexing - it causes indexing to be completely
> blocked on replicas while it is running. It is highly likely that it is the
> root cause of your issues. If you can change indexing logic to avoid it,
> you can quickly test it. What you can do as a workaround is to query for
> IDs that need to be deleted and execute a bulk delete by ID - that will not
> cause the issues that DBQ does.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 16:04, Sravan Kumar <sra...@caavo.com> wrote:
> >
> > Emir,
> >Yes there is a delete_by_query on every bulk insert.
> > >This delete_by_query deletes all the documents whose update time is
> > > earlier than a day before the current time.
> >Is bulk delete_by_query the reason?
> >
> > On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Do you have deletes by query while indexing or it is append only index?
> >>
> >> Regards,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
> >>>
> >>> SolrCloud Nodes going to recovery state during indexing
> >>>
> >>>
> >>> We have solr cloud setup with the settings shared below. We have a
> >> collection with 3 shards and a replica for each of them.
> >>>
> >>> Normal State(As soon as the whole cluster is restarted):
> >>>- Status of all the shards is UP.
> >>>- a bulk update request of 50 documents each takes < 100ms.
> >>>- 6-10 simultaneous bulk updates.
> >>>
> >>> Nodes go into recovery state after 15-30 mins of updates.
> >>>- Some shards start giving the following ERRORs:
> >>>- o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
> >> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
> >> exception during distributed update: Read timed out
> >>>- o.a.s.u.StreamingSolrClients error java.net.
> SocketTimeoutException:
> >> Read timed out
> >>>- the following error is seen on the shard which goes to recovery
> >> state.
> >>>- too many updates received since start - startingUpdates no
> >> longer overlaps with our currentUpdates.
> >>>- Sometimes, the same shard even goes to DOWN state and needs a node
> >> restart to come back.
> >>>- a bulk update request of 50 documents takes more than 5 seconds.
> >> Sometimes even >120 secs. This is seen for all the requests if at least
> one
> >> node is in recovery state in the whole cluster.
> >>>
> >>> We have a standalone setup with the same collection schema which is
> able
> >> to take update & query load without any errors.
> >>>
> >>>
> >>> We have the following solrcloud setup.
> >>>- setup in AWS.
> >>>
> >>>- Zookeeper Setup:
> >>>- number of nodes: 3
> >>>- aws instance type: t2.small
> >>>- instance memory: 2gb
> >>>
> >>>- Solr Setup:
> >>>- Solr version: 6.6.0
> >>>- number of nodes: 3
> >>>- aws instance type: m5.xlarge
> >>>- instance memory: 16gb
> >>>- number of cores: 4
> >>>- JAVA HEAP: 8gb
> >>>- JAVA VERSION: oracle java version "1.8.0_151"
> >>>- GC settings: default CMS.
> >>>
> >>>collection settings:
> >>>- number of shards: 3
> >>>- replication factor: 2
> >>>- total 6 replicas.
> >>>- total number of documents in the collection: 12 million
> >>>- total number of do

Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Sravan Kumar
Emir,
Yes, there is a delete_by_query on every bulk insert.
This delete_by_query deletes all the documents whose update time is earlier
than a day before the current time.
Is bulk delete_by_query the reason?

On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Do you have deletes by query while indexing or it is append only index?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
> >
> > SolrCloud Nodes going to recovery state during indexing
> >
> >
> > We have solr cloud setup with the settings shared below. We have a
> collection with 3 shards and a replica for each of them.
> >
> > Normal State(As soon as the whole cluster is restarted):
> > - Status of all the shards is UP.
> > - a bulk update request of 50 documents each takes < 100ms.
> > - 6-10 simultaneous bulk updates.
> >
> > Nodes go into recovery state after 15-30 mins of updates.
> > - Some shards start giving the following ERRORs:
> > - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
> exception during distributed update: Read timed out
> > - o.a.s.u.StreamingSolrClients error 
> > java.net.SocketTimeoutException:
> Read timed out
> > - the following error is seen on the shard which goes to recovery
> state.
> > - too many updates received since start - startingUpdates no
> longer overlaps with our currentUpdates.
> > - Sometimes, the same shard even goes to DOWN state and needs a node
> restart to come back.
> > - a bulk update request of 50 documents takes more than 5 seconds.
> Sometimes even >120 secs. This is seen for all the requests if at least one
> node is in recovery state in the whole cluster.
> >
> > We have a standalone setup with the same collection schema which is able
> to take update & query load without any errors.
> >
> >
> > We have the following solrcloud setup.
> > - setup in AWS.
> >
> > - Zookeeper Setup:
> > - number of nodes: 3
> > - aws instance type: t2.small
> > - instance memory: 2gb
> >
> > - Solr Setup:
> > - Solr version: 6.6.0
> > - number of nodes: 3
> > - aws instance type: m5.xlarge
> > - instance memory: 16gb
> > - number of cores: 4
> > - JAVA HEAP: 8gb
> > - JAVA VERSION: oracle java version "1.8.0_151"
> > - GC settings: default CMS.
> >
> > collection settings:
> > - number of shards: 3
> > - replication factor: 2
> > - total 6 replicas.
> > - total number of documents in the collection: 12 million
> > - total number of documents in each shard: 4 million
> > - Each document has around 25 fields with 12 of them
> containing textual analysers & filters.
> > - Commit Strategy:
> > - No explicit commits from application code.
> > - Hard commit of 15 secs with OpenSearcher as false.
> > - Soft commit of 10 mins.
> > - Cache Strategy:
> > - filter queries
> > - number: 512
> > - autowarmCount: 100
> > - all other caches
> > - number: 512
> > - autowarmCount: 0
> > - maxWarmingSearchers: 2
> >
> >
> > - We tried the following
> > - commit strategy
> > - hard commit - 150 secs
> > - soft commit - 5 mins
> > - with GCG1 garbage collector based on https://wiki.apache.org/solr/
> ShawnHeisey#Java_8_recommendation_for_Solr:
> > - the nodes go to recover state in less than a minute.
> >
> > The issue is seen even when the leaders are balanced across the three
> nodes.
> >
> > Can you help us find the solution to this problem?
>
>


-- 
Regards,
Sravan


SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread sravan

SolrCloud Nodes going to recovery state during indexing


We have solr cloud setup with the settings shared below. We have a 
collection with 3 shards and a replica for each of them.


Normal state (as soon as the whole cluster is restarted):
    - Status of all the shards is UP.
    - a bulk update request of 50 documents each takes < 100ms.
    - 6-10 simultaneous bulk updates.

Nodes go into recovery state after 15-30 mins of updates.
    - Some shards start giving the following ERRORs:
        - o.a.s.h.RequestHandlerBase 
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 
Async exception during distributed update: Read timed out
        - o.a.s.u.StreamingSolrClients error 
java.net.SocketTimeoutException: Read timed out
    - the following error is seen on the shard which goes to recovery 
state.
        - too many updates received since start - startingUpdates no 
longer overlaps with our currentUpdates.
    - Sometimes, the same shard even goes to DOWN state and needs a 
node restart to come back.
    - a bulk update request of 50 documents takes more than 5 seconds. 
Sometimes even >120 secs. This is seen for all the requests if at least 
one node is in recovery state in the whole cluster.


We have a standalone setup with the same collection schema which is able 
to take update & query load without any errors.



We have the following solrcloud setup.
    - setup in AWS.

    - Zookeeper Setup:
        - number of nodes: 3
        - aws instance type: t2.small
        - instance memory: 2gb

    - Solr Setup:
        - Solr version: 6.6.0
        - number of nodes: 3
        - aws instance type: m5.xlarge
        - instance memory: 16gb
        - number of cores: 4
        - JAVA HEAP: 8gb
        - JAVA VERSION: oracle java version "1.8.0_151"
        - GC settings: default CMS.

        collection settings:
            - number of shards: 3
            - replication factor: 2
            - total 6 replicas.
            - total number of documents in the collection: 12 million
            - total number of documents in each shard: 4 million
            - Each document has around 25 fields with 12 of them 
containing textual analysers & filters.

            - Commit Strategy:
                - No explicit commits from application code.
                - Hard commit of 15 secs with OpenSearcher as false.
                - Soft commit of 10 mins.
            - Cache Strategy:
                - filter queries
                    - number: 512
                    - autowarmCount: 100
                - all other caches
                    - number: 512
                    - autowarmCount: 0
            - maxWarmingSearchers: 2
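For reference, the commit strategy described above would look roughly like this in solrconfig.xml (a sketch using the numbers stated; the exact placement inside `<updateHandler>` is assumed):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 15 seconds, without opening a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every 10 minutes to make updates visible -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```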


- We tried the following
    - commit strategy
        - hard commit - 150 secs
        - soft commit - 5 mins
    - with GCG1 garbage collector based on 
https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:

        - the nodes go to recover state in less than a minute.

The issue is seen even when the leaders are balanced across the three 
nodes.


Can you help us find the solution to this problem?