Sort on docValue field is slow.

2019-05-20 Thread Ashwin Ramesh
Hello everybody,

Hoping to get advice on a specific issue - We have a collection of 50M
documents. We recently added a featuredAt field defined as such -

<field name="featuredAt" type="date" required="false" multiValued="false" docValues="true"/>
This field is sparsely populated such that only a small subset (3-5 thousand
currently) have been tagged with that field.

We have a business case where we want to order this content by most
recently featured -> least recently featured -> the rest of the content in
any order. However, adding the `sort=featuredAt desc` param results in QTime
> 5000 (our hard timeout is 5000 ms).
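For reference, the slow request reduces to parameters like the following; a sketch in Python of the query being described (the host and collection name are placeholders, not from the original message):

```python
from urllib.parse import urlencode

# Hypothetical reconstruction of the slow request: an edismax query
# sorted on the sparse featuredAt docValues field.
params = {
    "q": "*:*",
    "defType": "edismax",
    "sort": "featuredAt desc",
    "rows": 10,
    "fl": "id",
}
# Host and collection name are placeholders.
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```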

The request handler processing this request is defined as follows:

[requestHandler XML not preserved by the archive; the recoverable settings
are: defType=edismax, rows=10, an id field/fl reference, and an "elevator"
last-component]

We hydrate content from a separate store.

Any advice on how to improve the performance of this request handler and
sorting would be appreciated.

System/Architecture Specs:
Solr 7.4
8 Shards
TLOG / PULLs

Thank you & Regards,

Ash

-- 
P.S. We've launched a new blog to share the latest ideas and case studies
from our team. Check it out here: product.canva.com
Empowering the world to design. Also, we're hiring. Apply here!








Re: Sort on docValue field is slow.

2019-05-20 Thread Ashwin Ramesh
Hi Shawn,

Thanks for the prompt response.

1. date type def - [fieldType XML not preserved by the archive]

2. The field is brand new. I added it to schema.xml, uploaded to ZK &
reloaded the collection. After that, we started indexing those few thousand documents.
Did we still need to do a full reindex to a fresh collection?

3. It is the only difference. I am testing the raw URL call timing
difference with and without the extra sort.

Hope this helps,

Regards,

Ash



On Mon, May 20, 2019 at 11:17 PM Shawn Heisey  wrote:

> On 5/20/2019 6:25 AM, Ashwin Ramesh wrote:
> > Hoping to get advice on a specific issue - We have a collection of 50M
> > documents. We recently added a featuredAt field defined as such -
> >
> > <field name="featuredAt" type="date" required="false"
> > multiValued="false" docValues="true"/>
>
> What is the fieldType definition for "date"?  We cannot assume that you
> have left this the same as Solr's sample configs.
>
> > This field is sparsely populated such that only a small subset (3-5
> thousand
> > currently) have been tagged with that field.
>
> Did you completely reindex, or just index those few thousand records?
> When changing fields related to docValues, you must completely delete
> the old index and reindex.  That's just how docValues works.
>
> > We have a business case where we want to order this content by most
> > recently featured -> least recently featured -> the rest of the content
> in
> > any order. However adding the `sort=featuredAt desc` param results in
> qTime
> >> 5000 (our hard timeout is 5000).
>
> Is the definition of the sort parameter the ONLY difference?  Are you
> querying on the new field?  Can you share the entire query URL, or the
> code that produced it if you're using a Solr client?  What is the before
> QTime?
>
> Thanks,
> Shawn
>









Are docValues useful for FilterQueries?

2019-07-08 Thread Ashwin Ramesh
Hi everybody,

I can't find concrete evidence whether docValues are indeed useful for
filter queries. One example of a field:

[field XML not preserved by the archive; a numeric docValues field]
This field will have a value between 0 and 1. The only use case for this
field is to filter on a range / subset of values. There will be no scoring
or querying on this field. Is this a good use case for docValues?

Regards, Ash
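For reference, the filter-only use case described above would be queried roughly like this; a sketch (the field name `popularity` is a hypothetical stand-in):

```python
from urllib.parse import urlencode

# Hypothetical range filter on a numeric docValues-only field.
# fq clauses are cached per filter and contribute nothing to scoring,
# which matches the "filter only, no scoring" use case.
params = {
    "q": "*:*",
    "fq": "popularity:[0.2 TO 0.8]",  # inclusive range; field name assumed
    "fl": "id",
}
query_string = urlencode(params)
print(query_string)
```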









Re: Is it possible to skip scoring completely?

2019-09-12 Thread Ashwin Ramesh
Thanks Shawn & Emir,

I just tried a * query with filters and fl=id,score. I noticed that all
scores were 1.0, which I assume means no scoring was done. When I added a
sort after that test, scores were still 1.0.

I guess all I have to do is set q=* & set a sort.

Appreciate your help,

Ash

On Thu, Sep 12, 2019 at 4:40 PM Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Ash,
> I did not check the code, so not sure if your question is based on
> something that you find in the codebase or you are just assuming that
> scoring is called? I would assume differently: if you use only fq, then
> Solr does not have anything to score. Also, if you order by something other
> than score and do not request score to be returned, I would also assume
> that Solr will not calculate score. Again, didn’t have time to check the
> code, so these are just assumptions.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 12 Sep 2019, at 01:27, Ashwin Ramesh  wrote:
> >
> > Hi everybody,
> >
> > I was wondering if there is a way we can tell Solr (7.3+) to run none of
> > its scoring logic. We would like to simply add a set of filter queries
> and
> > order on a specific docValue field.
> >
> > e.g. "Give me all fq=color:red documents ORDER on popularityScore DESC"
> >
> > Thanks in advance,
> >
> > Ash
> >
>
>









Is it possible to skip scoring completely?

2019-09-11 Thread Ashwin Ramesh
Hi everybody,

I was wondering if there is a way we can tell Solr (7.3+) to run none of
its scoring logic. We would like to simply add a set of filter queries and
order on a specific docValue field.

e.g. "Give me all fq=color:red documents ORDER on popularityScore DESC"

Thanks in advance,

Ash









Re: Is it possible to skip scoring completely?

2019-09-12 Thread Ashwin Ramesh
Ah! Thanks so much!

On Thu., 12 Sep. 2019, 11:56 pm Shawn Heisey,  wrote:

> On 9/12/2019 12:43 AM, Ashwin Ramesh wrote:
> > I just tried a * query with filters and fl=id,score. I noticed that all
> > scores were 1.0, which I assume means no scoring was done. When I added a
> > sort after that test, scores were still 1.0.
> >
> > I guess all I have to do is set q=* & set a sort.
>
> Don't use q=* for your query. This is a wildcard query.  What that means
> is that if the field you're querying contains 10 million different
> values, then your actual query will be built with all 10 million of
> those values.  It will be huge, and VERY slow.
>
> Use q=*:* if you mean all documents.  This is special syntax that Lucene
> and Solr understand and translate into a very fast "all documents
> query".  That query will probably also generate 1.0 for scores, though I
> haven't checked.
>
> Thanks,
> Shawn
>
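Shawn's distinction between the two query forms can be sketched as follows (the params are illustrative, not from the original messages):

```python
# q=*:* is Lucene/Solr's special match-all query and is very fast;
# q=* is a wildcard query that expands against every term in the
# default field and can be extremely slow on large indexes.
fast_params = {
    "q": "*:*",                      # match all documents
    "fq": "color:red",               # filtering, no scoring needed
    "sort": "popularityScore desc",  # order by a docValues field
    "fl": "id",
}
slow_params = dict(fast_params, q="*")  # the wildcard form to avoid
```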









Best field type for boosting all documents

2019-09-16 Thread Ashwin Ramesh
Hi everybody,

We have a use case where we want to push a popularity boost for each
document in our collection. When a user searches for any term, we would
like to arbitrarily add an additional boost by this value (which is
different for each document).

E.g. q=foo&boost=def(popularityBoostField,1)

Should we define the field 'popularityBoostField' as a docValues field or a
regular indexed field?

If the field is sparsely filled, will that cause any issues?

Regards,

Ash
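As a sketch of the kind of boost being asked about (the field name is from the question; whether a multiplicative `boost` or an additive `bf` is used is an assumption):

```python
# Sketch of a multiplicative boost on an edismax query. def() supplies
# a fallback of 1 for documents missing the sparse field, so they get
# a neutral multiplier rather than 0.
params = {
    "q": "foo",
    "defType": "edismax",
    "boost": "def(popularityBoostField,1)",
}
```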









Re: Dealing with multi-word keywords and SOW=true

2019-09-30 Thread Ashwin Ramesh
Thanks Erick, that seems to work!

Should I leave it in qf also? For example the query "blue dog" may be
represented as separate tokens in the keyword index.



On Mon, Sep 30, 2019 at 9:32 PM Erick Erickson 
wrote:

> Have you tried taking your keyword field out of the “qf” param and adding
> it explicitly? As keyword:”ice cream”
>
> Best,
> Erick
>
> > On Sep 30, 2019, at 5:27 AM, Ashwin Ramesh  wrote:
> >
> > Hi everybody,
> >
> > I am using the edismax parser and have noticed a very specific behaviour
> > with how sow=true (default) handles multiword keywords.
> >
> > We have a field called 'keywords', which uses the general
> > KeywordTokenizerFactory. There are also other text fields like title and
> > description. etc.
> >
> > When we index a document with a keyword "ice cream", for example, we know
> > it gets indexed into that field as "ice cream".
> >
> > However, at query time, I noticed that if we run an Edismax query:
> > q=ice cream
> > qf=keywords
> >
> > I do not get that document back as a match. This is due to sow=true
> > splitting the user's query and the final tokens not being present in the
> > keywords field.
> >
> > I was wondering what the best practice around this was? Some thoughts I
> > have had:
> >
> > 1. Index multi-word keywords with hyphens or something similar. E.g. "ice
> > cream" -> "ice-cream"
> > 2. Additionally index the separate words as keywords also. E.g. "ice
> cream"
> > -> "ice cream", "ice", "cream". However this method will result in the
> loss
> > of intent (q=ice would return this document).
> > 3. Add a boost query which is an edismax query where we explicitly set
> > sow=false and add a huge boost. E.g. bq={!edismax qf=keywords^1000
> > sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}
> >
> > Is there an industry practice solution to handle this type of problem?
> Keep
> > in mind that the other text fields may also include these terms. E.g.
> > title="This is ice cream", which would match the query. This specific
> > problem affects the keywords field for the obvious reason that the
> indexing
> > pipeline does not tokenize keywords.
> >
> > Thank you for all your amazing help,
> >
> > Regards,
> >
> > Ash
> >
>
>









Dealing with multi-word keywords and SOW=true

2019-09-30 Thread Ashwin Ramesh
Hi everybody,

I am using the edismax parser and have noticed a very specific behaviour
with how sow=true (default) handles multiword keywords.

We have a field called 'keywords', which uses the general
KeywordTokenizerFactory. There are also other text fields like title and
description. etc.

When we index a document with a keyword "ice cream", for example, we know
it gets indexed into that field as "ice cream".

However, at query time, I noticed that if we run an Edismax query:
q=ice cream
qf=keywords

I do not get that document back as a match. This is due to sow=true
splitting the user's query and the final tokens not being present in the
keywords field.

I was wondering what the best practice around this was? Some thoughts I
have had:

1. Index multi-word keywords with hyphens or something similar. E.g. "ice
cream" -> "ice-cream"
2. Additionally index the separate words as keywords also. E.g. "ice cream"
-> "ice cream", "ice", "cream". However this method will result in the loss
of intent (q=ice would return this document).
3. Add a boost query which is an edismax query where we explicitly set
sow=false and add a huge boost. E.g. bq={!edismax qf=keywords^1000
sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}

Is there an industry practice solution to handle this type of problem? Keep
in mind that the other text fields may also include these terms. E.g.
title="This is ice cream", which would match the query. This specific
problem affects the keywords field for the obvious reason that the indexing
pipeline does not tokenize keywords.
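Option 3 above could be sketched roughly like this (the boost weight and qf list are placeholders, not the real configuration):

```python
# Sketch of option 3: a boost query run as an embedded edismax
# subquery with sow=false, so "ice cream" reaches the keywords
# field as a single token instead of being split.
user_query = "ice cream"
params = {
    "q": user_query,
    "defType": "edismax",
    "qf": "title description",  # assumed text fields
    "bq": "{!edismax qf=keywords^1000 sow=false v=$kwq}",
    "kwq": user_query,  # dereferenced by v=$kwq in the bq local params
}
```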

Thank you for all your amazing help,

Regards,

Ash









Re: Best Practises around relevance tuning per query

2020-02-26 Thread Ashwin Ramesh
Hi everybody,

Thank you for all the amazing feedback. I apologize for the formatting of
my question.

I guess if I were to generalize my question: 'What are the most common
approaches to storing query-level features in Solr documents?'

For example, a normalized_click_score is a document-level feature, but how
would you scalably do the same for specific queries? E.g. how do you
define, *for the query 'ipod', this specific document is very relevant*?

Thanks again!

Regards,

Ash

On Wed, Feb 19, 2020 at 6:14 PM Jörn Franke  wrote:

> You are too focused on the solution. If you would describe the business
> case in more detail without including the solution itself more people could
> help.
>
> E.g. it is not clear why you have a scoring model and why this can address
> business needs.
>
> > Am 18.02.2020 um 01:50 schrieb Ashwin Ramesh :
> >
> > Hi,
> >
> > We are in the process of applying a scoring model to our search results.
> In
> > particular, we would like to add scores for documents per query and user
> > context.
> >
> > For example, we want to have a score from 500 to 1 for the top 500
> > documents for the query “dog” for users who speak US English.
> >
> > We believe it becomes infeasible to store these scores in Solr because we
> > want to update the scores regularly, and the number of scores increases
> > rapidly with increased user attributes.
> >
> > One solution we explored was to store these scores in a secondary data
> > store, and use this at Solr query time with a boost function such as:
> >
> > `bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
> > mul(termfreq(id,'ID-500'),1)`
> >
> > We have over a hundred thousand documents in one Solr collection, and
> about
> > fifty million in another Solr collection. We have some queries for which
> > roughly 80% of the results match, although this is an edge case. We
> wanted
> > to know the worst case performance, so we tested with such a query. For
> > both of these collections we found a message similar to the following
> > in the Solr cloud logs (tested on a laptop):
> >
> > Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
> >
> > We then tried using the following boost, which seemed simpler:
> >
> > `boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
> >
> > We then saw the following in the Solr cloud logs:
> >
> > `The request took too long to iterate over terms.`
> >
> > All responses above took over 5000 milliseconds to return.
> >
> > We are considering Solr’s re-ranker, but I don’t know how we would use
> this
> > without pushing all the query-context-document scores to Solr.
> >
> >
> > The alternative solution that we are currently considering involves
> > invoking multiple solr queries.
> >
> > This means we would make a request to solr to fetch the top N results
> (id,
> > score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB:bar,
> limit=N.
> >
> > Another request would be made using a filter query with a set of doc ids
> > that we know are high value for the user’s query. E.g. q=*:*,
> > fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
> >
> > We would then do a reranking phase in our service layer.
> >
> > Do you have any suggestions for known patterns of how we can store and
> > retrieve scores per user context and query?
> >
> > Regards,
> > Ash & Spirit.
> >
>













Best Practises around relevance tuning per query

2020-02-17 Thread Ashwin Ramesh
Hi,

We are in the process of applying a scoring model to our search results. In
particular, we would like to add scores for documents per query and user
context.

For example, we want to have a score from 500 to 1 for the top 500
documents for the query “dog” for users who speak US English.

We believe it becomes infeasible to store these scores in Solr because we
want to update the scores regularly, and the number of scores increases
rapidly with increased user attributes.

One solution we explored was to store these scores in a secondary data
store, and use this at Solr query time with a boost function such as:

`bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
mul(termfreq(id,'ID-500'),1)`
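A boost-function string of that shape can be generated programmatically; a small sketch (the IDs are hypothetical) that also makes clear how large the parameter grows for 500 documents:

```python
def build_bf(scored_ids):
    """Build the bf parameter: the first id gets the highest boost,
    descending to 1 for the last. IDs are hypothetical."""
    n = len(scored_ids)
    return " ".join(
        f"mul(termfreq(id,'{doc_id}'),{n - i})"
        for i, doc_id in enumerate(scored_ids)
    )

bf = build_bf(["ID-1", "ID-2", "ID-3"])
print(bf)
# mul(termfreq(id,'ID-1'),3) mul(termfreq(id,'ID-2'),2) mul(termfreq(id,'ID-3'),1)
```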

We have over a hundred thousand documents in one Solr collection, and about
fifty million in another Solr collection. We have some queries for which
roughly 80% of the results match, although this is an edge case. We wanted
to know the worst case performance, so we tested with such a query. For
both of these collections we found a message similar to the following
in the Solr cloud logs (tested on a laptop):

Elapsed time: 5020. Exceeded allowed search time: 5000 ms.

We then tried using the following boost, which seemed simpler:

`boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`

We then saw the following in the Solr cloud logs:

`The request took too long to iterate over terms.`

All responses above took over 5000 milliseconds to return.

We are considering Solr’s re-ranker, but I don’t know how we would use this
without pushing all the query-context-document scores to Solr.


The alternative solution that we are currently considering involves
invoking multiple solr queries.

This means we would make a request to solr to fetch the top N results (id,
score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB:bar, limit=N.

Another request would be made using a filter query with a set of doc ids
that we know are high value for the user’s query. E.g. q=*:*,
fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.

We would then do a reranking phase in our service layer.
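That service-layer reranking step could be sketched roughly as follows (a simplification; real code would carry scores and document data, not just ids):

```python
def merge_results(organic, curated, limit):
    """Place curated (known high-value) doc ids first, then the
    organic results, de-duplicating while preserving order."""
    seen = set()
    merged = []
    for doc_id in curated + organic:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged[:limit]

print(merge_results(["d9", "d1"], ["d1", "d2"], 3))
# → ['d1', 'd2', 'd9']
```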

Do you have any suggestions for known patterns of how we can store and
retrieve scores per user context and query?

Regards,
Ash & Spirit.














Re: Best Practises around relevance tuning per query

2020-02-18 Thread Ashwin Ramesh
ping on this :)

On Tue, Feb 18, 2020 at 11:50 AM Ashwin Ramesh  wrote:

> Hi,
>
> We are in the process of applying a scoring model to our search results.
> In particular, we would like to add scores for documents per query and user
> context.
>
> For example, we want to have a score from 500 to 1 for the top 500
> documents for the query “dog” for users who speak US English.
>
> We believe it becomes infeasible to store these scores in Solr because we
> want to update the scores regularly, and the number of scores increases
> rapidly with increased user attributes.
>
> One solution we explored was to store these scores in a secondary data
> store, and use this at Solr query time with a boost function such as:
>
> `bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
> mul(termfreq(id,'ID-500'),1)`
>
> We have over a hundred thousand documents in one Solr collection, and
> about fifty million in another Solr collection. We have some queries for
> which roughly 80% of the results match, although this is an edge case. We
> wanted to know the worst case performance, so we tested with such a query.
> For both of these collections we found a message similar to the
> following in the Solr cloud logs (tested on a laptop):
>
> Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
>
> We then tried using the following boost, which seemed simpler:
>
> `boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
>
> We then saw the following in the Solr cloud logs:
>
> `The request took too long to iterate over terms.`
>
> All responses above took over 5000 milliseconds to return.
>
> We are considering Solr’s re-ranker, but I don’t know how we would use
> this without pushing all the query-context-document scores to Solr.
>
>
> The alternative solution that we are currently considering involves
> invoking multiple solr queries.
>
> This means we would make a request to solr to fetch the top N results (id,
> score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB:bar, limit=N.
>
> Another request would be made using a filter query with a set of doc ids
> that we know are high value for the user’s query. E.g. q=*:*,
> fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
>
> We would then do a reranking phase in our service layer.
>
> Do you have any suggestions for known patterns of how we can store and
> retrieve scores per user context and query?
>
> Regards,
> Ash & Spirit.
>













Overseer & Backups - Questions

2020-03-10 Thread Ashwin Ramesh
Hi everybody,

Quick Specs:
- Solr 7.4 Solr Cloud
- 30gb index on 8 shards Tlog/Pull

We run daily backups on our 30gb index and noticed that the overseer does
not process other jobs on its task list while the backup is being taken.
They remain on the pending list (in ZK). Is this expected?

Also I was wondering if there was a safe way to cancel a currently running
task or delete pending tasks?

Regards,

Ash














LTR - FieldValueFeature Question

2020-04-24 Thread Ashwin Ramesh
Hi everybody,

Do we need to have 'indexed=true' to be able to retrieve the value of a
field via FieldValueFeature, or is having docValues=true enough?

Currently, we have some dynamic fields defined as [dynamicField=true,
stored=false, indexed=false, docValues=true]. However, we noticed that the
value extracted is '0.0'.

This is the code I read around FieldValueFeature:
https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
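For reference, a FieldValueFeature is typically registered in the feature store with a JSON body along these lines (the feature and field names here are hypothetical):

```python
import json

# Hypothetical LTR feature-store entry using FieldValueFeature.
# In the 7.x code linked above, the scorer reads the document's
# *stored* fields, so a stored=false, docValues-only field falls
# back to the feature's default value (0.0).
feature = {
    "name": "popularityFeature",                               # assumed name
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": {"field": "popularity_f"},                       # assumed field
}
payload = json.dumps([feature])
print(payload)
```

Worth confirming the stored-fields behaviour against the exact Solr version in use, since later releases changed how FieldValueFeature reads docValues.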

Thanks,

Ash

-- 
Empowering the world to design
Share accurate information on COVID-19 and spread messages of support to
your community. Here are some resources that can help.













Solr 7.4 - LTR reranker not adhering by Elevate Plugin

2020-05-14 Thread Ashwin Ramesh
Hi everybody,

We are running a query with both elevateIds=1,2,3 & a reranker phase using
LTR plugin.

We noticed that the results do not return in the expected order - per the
elevateIds param.
Example LTR rq param: {!ltr model=foo reRankDocs=250 efi.query=$q}

When I used the standard reranker ({!rerank reRankQuery=$titleQuery
reRankDocs=1000 reRankWeight=3}) , it did adhere.

I assumed it's because the elevate plugin runs before the reranker (LTR).
However, I'm finding it hard to confirm. The model is a linear model.

Is this expected behaviour?

Regards,

Ash














Re: Overseer & Backups - Questions

2020-03-10 Thread Ashwin Ramesh
We use the collection API to invoke backups. The tasks we noticed that
stalled are ADDREPLICA. As expected, once the backup completed a few hours
ago, the stalled task then completed. Is there some concurrency setting with
these tasks? Or is a backup a blocking task? We noticed that the index was
still being flushed to segments though.

Regards,

Ash

On Wed, Mar 11, 2020 at 3:18 AM Aroop Ganguly
 wrote:

> May we know how you are invoking backups ?
>
> > On Mar 9, 2020, at 11:53 PM, Ashwin Ramesh 
> wrote:
> >
> > Hi everybody,
> >
> > Quick Specs:
> > - Solr 7.4 Solr Cloud
> > - 30gb index on 8 shards Tlog/Pull
> >
> > We run daily backups on our 30gb index and noticed that the overseer does
> > not process other jobs on its task list while the backup is being taken.
> > They remain on the pending list (in ZK). Is this expected?
> >
> > Also I was wondering if there was a safe way to cancel a currently
> running
> > task or delete pending tasks?
> >
> > Regards,
> >
> > Ash
> >
>
>













Re: Overseer & Backups - Questions

2020-03-10 Thread Ashwin Ramesh
Hey Aroop,

Yes we sent ASYNC=

Backups are taken to an EFS drive (AWS's managed NFS)

I also thought it was async and Solr can process multiple tasks at once.
But the ZK state definitely showed that only the backup task was in
progress while all the other tasks were queued up.

Regards,

Ash

On Wed, Mar 11, 2020 at 9:21 AM Aroop Ganguly
 wrote:

> Backups on hdfs ?
> These should not be blocking if invoked asynchronously; are you doing them
> async by passing the async flag?
>
> > On Mar 10, 2020, at 3:19 PM, Ashwin Ramesh 
> wrote:
> >
> > We use the collection API to invoke backups. The tasks we noticed that
> > stalled are ADDREPLICA. As expected, once the backup completed a few hours
> > ago, the stalled task then completed. Is there some concurrency setting with
> > these tasks? Or is a backup a blocking task? We noticed that the index
> was
> > still being flushed to segments though.
> >
> > Regards,
> >
> > Ash
> >
> > On Wed, Mar 11, 2020 at 3:18 AM Aroop Ganguly
> >  wrote:
> >
> >> May we know how you are invoking backups ?
> >>
> >>> On Mar 9, 2020, at 11:53 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> Quick Specs:
> >>> - Solr 7.4 Solr Cloud
> >>> - 30gb index on 8 shards Tlog/Pull
> >>>
> >>> We run daily backups on our 30gb index and noticed that the overseer
> does
> >>> not process other jobs on its task list while the backup is being
> taken.
> >>> They remain on the pending list (in ZK). Is this expected?
> >>>
> >>> Also I was wondering if there was a safe way to cancel a currently
> >> running
> >>> task or delete pending tasks?
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>
> >>
> >
>
>













Re: Cannot add replica during backup

2020-08-11 Thread Ashwin Ramesh
Hey Matthew,

Unfortunately, our shard leaders are across multiple nodes thus a single
EBS couldn't work. Did you manage to get around this issue yourself?

Regards,

Ash

On Tue, Aug 11, 2020 at 9:00 PM matthew sporleder 
wrote:

> I can already tell you it is EFS that is slow. I had to switch to an EBS
> disk for backups on a different project because EFS couldn't keep up.
>
> > On Aug 10, 2020, at 9:43 PM, Ashwin Ramesh 
> wrote:
> >
> > Hey Aroop, the general process for our backup is:
> > - Connect all machines to an EFS drive (AWS's NFS service)
> > - Call the collections API to backup into EFS
> > - ZIP the directory once the backup is completed
> > - Copy the ZIP into an s3 bucket
> >
> > I'll probably have to see which part of the process is the slowest.
> >
> > On another note, can you simply remove the task from the ZK path to
> > continue the execution of tasks?
> >
> > Regards,
> >
> > Ash
> >
> >> On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
> >>  wrote:
> >>
> >> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> >> using the collection backup api.
> >> How are you taking the backup?
> >>
> >> Do you actually see any backup progress, or are you just seeing the task
> >> linger in the overseer queue?
> >> I have seen restore tasks hanging in the queue forever despite the process
> >> completing in Solr 7.7, so I wouldn't be surprised if this happens with
> >> backup as well. I have also observed that unless that task is removed from
> >> the overseer-collection-queue, the next ones do not proceed.
> >>
> >> Also, adding replicas while a backup is running seems like overkill; why
> >> don't you just have the appropriate replication factor in the first place
> >> and set autoAddReplicas=true as insurance?
> >>
> >>> On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> We are using Solr 7.6 (SolrCloud). We noticed that when the backup is
> >>> running, we cannot add any replicas to the collection. By the looks of it,
> >>> the job to add the replica is put into the Overseer queue, but it is not
> >>> being processed. Is this expected? And are there any workarounds?
> >>>
> >>> Our backups take about 12 hours. Maybe we should try to optimize that too.
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>
> >>
> >
>



Re: Backups in SolrCloud using snapshots of individual cores?

2020-08-10 Thread Ashwin Ramesh
I would love an answer to this too!

On Fri, Aug 7, 2020 at 12:18 AM Bram Van Dam  wrote:

> Hey folks,
>
> Been reading up about the various ways of creating backups. The whole
> "shared filesystem for Solrcloud backups"-thing is kind of a no-go in
> our environment, so I've been looking for ways around that, and here's
> what I've come up with so far:
>
> 1. Stop applications from writing to solr
>
> 2. Commit everything
>
> 3. Identify a single core for each shard in each collection
>
> 4. Snapshot that core using CREATESNAPSHOT in the Collections API
>
> 5. Once complete, re-enable application write access to Solr
>
> 6. Create a backup from these snapshots using the replication handler's
> backup function (replication?command=backup&commitName=mySnapshot)
>
> 7. Put the backups somewhere safe
>
> 8. Clean up snapshots
>
>
> This seems ... too good to be true? I've seen so many threads about how
> hard it is to create backups in SolrCloud on this mailing list over the
> years, but this seems pretty straightforward? Am I missing some
> glaringly obvious reason why this will fail catastrophically?
>
> Using Solr 7.7 in this case.
>
> Feedback much appreciated!
>
> Thanks,
>
>  - Bram
>
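A minimal sketch of steps 4, 6, and 8 above, with placeholder host, collection, core, and snapshot names (none of these values are from the thread). The commands are composed and printed rather than executed, so the script can be reviewed before being pointed at a live cluster.

```shell
#!/bin/sh
set -eu

# Placeholders -- substitute values for your cluster; step 6 must be
# repeated for one chosen core per shard (step 3).
SOLR="http://localhost:8983/solr"
COLL="my-collection"
CORE="my-collection_shard1_replica_n1"
SNAP="mySnapshot"

# Step 4: snapshot every shard of the collection in one call.
CREATE="${SOLR}/admin/collections?action=CREATESNAPSHOT&collection=${COLL}&commitName=${SNAP}"

# Step 6: back up one core from that snapshot via the replication handler.
BACKUP="${SOLR}/${CORE}/replication?command=backup&commitName=${SNAP}"

# Step 8: clean up once the backup is stored somewhere safe.
DELETE="${SOLR}/admin/collections?action=DELETESNAPSHOT&collection=${COLL}&commitName=${SNAP}"

for url in "$CREATE" "$BACKUP" "$DELETE"; do
  echo "curl '${url}'"
done
```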




Re: Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hi Aroop,

We have 16 shards of approx 30GB each, ~480GB total. I'm also pretty sure
it's a network issue. Very interesting that you can back up 20x the data in
15 min!

>> It would also help to ensure your overseer is on a node with a role that
exempts it from any Solr index responsibilities.
How would I ensure this? This is the first I'm hearing of it!

Thanks for all the help!!

On Tue, Aug 11, 2020 at 11:48 AM Aroop Ganguly
 wrote:

> Hi Ashwin
>
> Thanks for sharing this detail.
> Do you mind sharing how big are each of these indices ?
> I am almost sure this is related to network capacity constraints in your
> AWS setup.
>
> Yes: if you can confirm that the backup is complete, or you just want the
> system to move on and discard the backup process, removing the backup task
> from ZooKeeper will let Solr move on to the next task in the queue.
>
> It would also help to ensure your overseer is on a node with a role that
> exempts it from any Solr index responsibilities.
>
>
> > On Aug 10, 2020, at 6:43 PM, Ashwin Ramesh 
> wrote:
> >
> > Hey Aroop, the general process for our backup is:
> > - Connect all machines to an EFS drive (AWS's NFS service)
> > - Call the collections API to backup into EFS
> > - ZIP the directory once the backup is completed
> > - Copy the ZIP into an s3 bucket
> >
> > I'll probably have to see which part of the process is the slowest.
> >
> > On another note, can you simply remove the task from the ZK path to
> > continue the execution of tasks?
> >
> > Regards,
> >
> > Ash
> >
> > On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
> >  wrote:
> >
> >> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> >> using the collection backup api.
> >> How are you taking the backup?
> >>
> >> Do you actually see any backup progress, or are you just seeing the task
> >> linger in the overseer queue?
> >> I have seen restore tasks hanging in the queue forever despite the process
> >> completing in Solr 7.7, so I wouldn't be surprised if this happens with
> >> backup as well. I have also observed that unless that task is removed from
> >> the overseer-collection-queue, the next ones do not proceed.
> >>
> >> Also, adding replicas while a backup is running seems like overkill; why
> >> don't you just have the appropriate replication factor in the first place
> >> and set autoAddReplicas=true as insurance?
> >>
> >>> On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> We are using Solr 7.6 (SolrCloud). We noticed that when the backup is
> >>> running, we cannot add any replicas to the collection. By the looks of it,
> >>> the job to add the replica is put into the Overseer queue, but it is not
> >>> being processed. Is this expected? And are there any workarounds?
> >>>
> >>> Our backups take about 12 hours. Maybe we should try to optimize that too.
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>
> >>
> >
>
>



Re: Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hey Aroop, the general process for our backup is:
- Connect all machines to an EFS drive (AWS's NFS service)
- Call the collections API to backup into EFS
- ZIP the directory once the backup is completed
- Copy the ZIP into an s3 bucket

I'll probably have to see which part of the process is the slowest.
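A rough sketch of those four steps, with placeholder host, collection, mount, and bucket names (none taken from this setup). The commands are composed and echoed for review rather than executed; timing each one when run for real would show which stage dominates the 12 hours.

```shell
#!/bin/sh
set -eu

# Placeholders -- none of these values are from the thread.
SOLR="http://localhost:8983/solr"
COLL="my-collection"
EFS_DIR="/mnt/efs/solr-backups"
BUCKET="s3://my-backup-bucket"
NAME="backup-20200811"

# Step 2: Collections API backup into the shared EFS mount.
BACKUP_URL="${SOLR}/admin/collections?action=BACKUP&collection=${COLL}&name=${NAME}&location=${EFS_DIR}"

# Steps 3-4: zip the finished backup directory and copy it to S3.
echo "curl '${BACKUP_URL}'"
echo "zip -r ${NAME}.zip ${EFS_DIR}/${NAME}"
echo "aws s3 cp ${NAME}.zip ${BUCKET}/"
```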

On another note, can you simply remove the task from the ZK path to
continue the execution of tasks?
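For reference, pending collection tasks sit under /overseer/collection-queue-work in ZooKeeper. A hedged sketch of inspecting and removing one with the bin/solr zk tool follows; the ZK address and task id are made up, and the commands are printed rather than run, since deleting a queue entry by hand is only safe once the underlying operation has finished or been abandoned.

```shell
#!/bin/sh
set -eu

# Hypothetical values -- not from this cluster.
ZKHOST="zk1:2181,zk2:2181,zk3:2181"
QUEUE="/overseer/collection-queue-work"
TASK="qn-0000000123"

# Compose the list and remove commands; run them manually after review.
LIST_CMD="bin/solr zk ls ${QUEUE} -z ${ZKHOST}"
RM_CMD="bin/solr zk rm ${QUEUE}/${TASK} -z ${ZKHOST}"
echo "$LIST_CMD"
echo "$RM_CMD"
```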

Regards,

Ash

On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
 wrote:

> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> using the collection backup api.
> How are you taking the backup?
>
> Do you actually see any backup progress, or are you just seeing the task
> linger in the overseer queue?
> I have seen restore tasks hanging in the queue forever despite the process
> completing in Solr 7.7, so I wouldn't be surprised if this happens with
> backup as well. I have also observed that unless that task is removed from
> the overseer-collection-queue, the next ones do not proceed.
>
> Also, adding replicas while a backup is running seems like overkill; why
> don't you just have the appropriate replication factor in the first place
> and set autoAddReplicas=true as insurance?
>
> > On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> wrote:
> >
> > Hi everybody,
> >
> > We are using Solr 7.6 (SolrCloud). We noticed that when the backup is
> > running, we cannot add any replicas to the collection. By the looks of it,
> > the job to add the replica is put into the Overseer queue, but it is not
> > being processed. Is this expected? And are there any workarounds?
> >
> > Our backups take about 12 hours. Maybe we should try to optimize that too.
> >
> > Regards,
> >
> > Ash
> >
>
>



Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hi everybody,

We are using Solr 7.6 (SolrCloud). We noticed that when the backup is
running, we cannot add any replicas to the collection. By the looks of it,
the job to add the replica is put into the Overseer queue, but it is not
being processed. Is this expected? And are there any workarounds?

Our backups take about 12 hours. Maybe we should try to optimize that too.

Regards,

Ash
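A quick way to confirm that the ADDREPLICA job is queued behind the backup is the Collections API OVERSEERSTATUS action. The host below is a placeholder, and the command is echoed for review rather than executed.

```shell
#!/bin/sh
set -eu

SOLR="http://localhost:8983/solr"   # placeholder host

# OVERSEERSTATUS reports the current Overseer, its work queues, and
# per-operation stats, showing what a stuck task is waiting behind.
URL="${SOLR}/admin/collections?action=OVERSEERSTATUS"
echo "curl '${URL}'"
```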




Solr 7.6.0 - OOM Caused Down Replica. Cannot recover. Please advise

2021-02-24 Thread Ashwin Ramesh
Hi everyone,

We had an OOM event earlier this morning. This has caused one of our shards
to lose all its replicas, and its leader is still in a down state. We have
restarted the Java process (Solr) and it is still in a down state. Logs
below:

```
Feb 25, 2021 @ 11:46:43.000 2021-02-25 00:46:43.268 WARN
 (updateExecutor-3-thread-1-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:40.000 2021-02-25 00:46:40.759 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:35.000 2021-02-25 00:46:35.761 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:33.000 2021-02-25 00:46:33.270 WARN
 (updateExecutor-3-thread-2-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:30.000 2021-02-25 00:46:30.759 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:25.000 2021-02-25 00:46:25.761 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
Feb 25, 2021 @ 11:46:23.000 2021-02-25 00:46:23.279 WARN
 (updateExecutor-3-thread-1-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481]
```

Questions:
1. Is there anything we can do to force this core to go live?
2. If the core is unrecoverable, is there a way to clear the core so that we
can reindex only that shard?

Any other advice would be great too :)

Ash
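On question 2, a hedged sketch: if the shard's data will be reindexed anyway, the down replica could be dropped and replaced via the Collections API. The collection, shard, and replica names below are taken from the log excerpt; the host is a placeholder, the commands are echoed rather than run, and this path destroys whatever index the replica still holds.

```shell
#!/bin/sh
set -eu

SOLR="http://localhost:8983/solr"      # placeholder host
COLL="search-collection-2018-10-30"    # names from the log excerpt above
SHARD="shard2_5"
REPLICA="core_node1481"

# Drop the unrecoverable replica, then add a fresh (empty) one; only
# sensible if you plan to reindex this shard, since no healthy copy exists.
DEL="${SOLR}/admin/collections?action=DELETEREPLICA&collection=${COLL}&shard=${SHARD}&replica=${REPLICA}"
ADD="${SOLR}/admin/collections?action=ADDREPLICA&collection=${COLL}&shard=${SHARD}"
echo "curl '${DEL}'"
echo "curl '${ADD}'"
```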
