Re: optimize boosting parameters

2020-12-08 Thread Derek Poh
We monitor the response time (Pingdom) of the page that uses these 
boosting parameters. Since the addition of these boosting parameters and 
an additional field to search on (which I will create a separate thread 
about in the mailing list), the page's average response time has 
increased by 1-2 seconds.

Management has given feedback on this.


If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field.
I have gone through the list of functions, and the map function is the only 
one that can meet the requirements.

Or is there a less expensive function that I missed?

By 'pre-compute some number', do you mean that at the preparation stage, 
before indexing, I should check the value of P_SupplierResponseRate and, 
if the value is 3, specify 'boost="0.4"' for that field of the document?
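
For illustration, one way that pre-computation could look (a sketch only;
the field name precomputed_boost and the additive combination are my
assumptions, not something from this thread). Since multiple bf parameters
are each added to the score, the per-signal boosts can be summed into a
single numeric field at preparation time, e.g. for a document with
P_SupplierResponseRate=3 and P_MWSScore=85:

  precomputed_boost = 0.4 + 1.6 = 2.0

The whole set of bf parameters then collapses into one:

  bf=field(precomputed_boost)

(field() needs a single-valued numeric field, ideally with docValues, and
changing the boost weights then requires re-indexing, as Erick notes.)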



BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Oh, it is to reduce the score?! Not to increase the score (multiply or 
add) by a value less than 1?



  You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively 
stable.
   in that case you can pre-compute them too.
We do incremental indexing every half an hour on this collection, 
averaging 50K-100K documents in each indexing run. The collection has 7+ 
million documents.

So the entire corpus does not get updated in every indexing.


2> your problem statement has nothing to do with termfreq so why are you
  using it in the first place?
I read up on the termfreq function again. It returns the number of times 
the term appears in the field for that document, so it does not really fit 
the requirements. Thank you for pointing it out.

Should I use map instead?
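
For reference, a map-based equivalent of the termfreq chain, following the
same pattern as the other bf parameters (a sketch, untested):

  bf=map(P_SupplierRanking,3,3,0.3,0)
  bf=map(P_SupplierRanking,4,4,0.6,0)
  bf=map(P_SupplierRanking,5,5,0.9,0)
  bf=map(P_SupplierRanking,6,6,1.2,0)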

Derek

On 8/12/2020 9:48 pm, Erick Erickson wrote:

Before worrying about it too much, exactly _how_ much has
the performance changed?

I’ve just been in too many situations where there’s
no objective measure of performance before and after, just
someone saying “it seems slower” and had those performance
changes disappear when a rigorous test is done. Then spent
a lot of time figuring out that the person reporting the
problem hadn’t had coffee yet. Or the network was slow.
Or….

If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field. BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Of course that means that to change the boosting you need
to re-index.

  You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively 
stable.
   in that case you can pre-compute them too.


2> your problem statement has nothing to do with termfreq so why are you
  using it in the first place?

Best,
Erick


On Dec 8, 2020, at 12:46 AM, Radu Gheorghe  wrote:

Hi Derek,

Ah, then my reply was completely off :)

I don’t really see a better way. Maybe other than changing termfreq to field, 
if the numeric field has docValues? That may be faster, but I don’t know for 
sure.

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support


On 8 Dec 2020, at 06:17, Derek Poh  wrote:

Hi Radu

Apologies for not making myself clear.

I would like to know if there is a simpler or more efficient way to craft the 
boosting parameters based on the requirements.

For example, I am using 'if', 'map' and 'termfreq' functions in the bf 
parameters.

Is there a more efficient or simpler function that can be used instead? Or a 
way to craft the 'formula' more efficiently?

On 7/12/2020 10:05 pm, Radu Gheorghe wrote:

Hi Derek,

It’s hard to tell whether your boosts can be made better without knowing your 
data and what users expect of it. Which is a problem in itself.

I would suggest gathering judgements, like if a user queries for X, what doc 
IDs do you expect to get back?

Once you have enough of these judgements, you can experiment with boosts and 
see how the query results change. There are measures such as nDCG (
https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
) that can help you measure that per query, and you can average this score 
across all your judgements to get an overall measure of how well you’re doing.

Or even better, you can have something like Quaerite play with boost values for 
you:

https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga


Best regards,
Radu
--
Sematext Cloud - Full Stack Observability -
https://sematext.com

Solr and Elasticsearch Consulting, Training and Production Support



On 7 Dec 2020, at 10:51, Derek Poh 
wrote:


Re: Need help to configure automated deletion of shard in solr

2020-12-08 Thread Pushkar Mishra
Hi Erick,

COLSTATUS does not work with implicit-router collections. Is there
any way to get the replica details?

Regards

On Mon, Nov 30, 2020 at 8:48 PM Erick Erickson 
wrote:

> Are you using the implicit router? Otherwise you cannot delete a shard.
> And you won’t have any shards that have zero documents anyway.
>
> It’d be a little convoluted, but you could use the collections COLSTATUS
> Api to
> find the names of all your replicas. Then query _one_ replica of each
> shard with something like
> solr/collection1_shard1_replica_n1/select?q=*:*&distrib=false
>
> that’ll return the number of live docs (i.e. non-deleted docs) and if it’s
> zero
> you can delete the shard.
>
> But the implicit router requires you take complete control of where
> documents
> go, i.e. which shard they land on.
>
> This really sounds like an XY problem. What’s the use case you’re trying
> to support where you expect a shard’s number of live docs to drop to zero?
>
> Best,
> Erick
>
> > On Nov 30, 2020, at 4:57 AM, Pushkar Mishra 
> wrote:
> >
> > Hi Solr team,
> >
> > I am using SolrCloud (version 8.5.x). I have a need to find a
> > configuration where I can delete a shard when the number of documents
> > in the shard reaches zero. Can someone help me out to achieve that?
> >
> >
> > It is urgent , so a quick response will be highly appreciated .
> >
> > Thanks
> > Pushkar
> >
> > --
> > Pushkar Kumar Mishra
> > "Reactions are always instinctive whereas responses are always well
> thought
> > of... So start responding rather than reacting in life"
>
>

-- 
Pushkar Kumar Mishra
"Reactions are always instinctive whereas responses are always well thought
of... So start responding rather than reacting in life"
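
For illustration, a minimal sketch of the check-then-delete flow Erick
describes above (host, collection, and shard names are placeholders; this
assumes the collection uses the implicit router):

  # count live (non-deleted) docs on one replica of the shard
  curl "http://host:8983/solr/collection1_shard1_replica_n1/select?q=*:*&rows=0&distrib=false"

  # if numFound is 0, the shard can be dropped
  curl "http://host:8983/solr/admin/collections?action=DELETESHARD&collection=collection1&shard=shard1"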


solrcloud with EKS kubernetes

2020-12-08 Thread Abhishek Mishra
Hello guys,
We are facing some issues (like timeouts, etc.) which are very
inconsistent. Could they be related to EKS by any chance? We are using Solr 7.7
and ZooKeeper 3.4.13. Should we move to ECS?

Regards,
Abhishek


Re: Can I express this nested query in JSON DSL?

2020-12-08 Thread Mikhail Khludnev
Hi, Mikhail.
Shouldn't be a big deal:

"bool": {
  "must": [
    "x",
    { "bool": { "should": ["y", "z"] } }
  ]
}
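
For context, a complete request along those lines might look like this (the
collection name and the subqueries "x", "y", "z" are placeholders):

  curl -X POST "http://localhost:8983/solr/mycollection/select" \
    -H 'Content-Type: application/json' -d '
  {
    "query": {
      "bool": {
        "must": [
          "x",
          { "bool": { "should": ["y", "z"] } }
        ]
      }
    }
  }'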

On Tue, Dec 8, 2020 at 6:13 AM Mikhail Edoshin 
wrote:

> Hi,
>
> I'm more or less new to Solr. I need to run queries that use joins all
> over the place. (The idea is to index database records pretty much as
> they are and then query them in interesting ways and, most importantly,
> get the rank. Our dataset is not too large so the performance is great.)
>
> I managed to express the logic using the following approach. For
> example, I want to search people by their names or addresses:
>
>q=type:Person^=0 AND ({!edismax qf= v=$p0} OR {!join
>  v=$p1})
>p1={!edismax qf= v=p0}
>p0=
>
> (Here 'type:Person' works as a filter so I zero its score.) This seems
> to work as expected and give the right results and ranking. It also
> seems to scale nicely for two levels of joins, although the queries
> become rather hard to follow in their raw form (I used a custom
> XML-to-query transformer to actually formulate more complex queries).
>
> So my question is: can I express an equivalent query using the
> query DSL? I know I can use 'bool' like that:
>
> {
>   "query": {
>     "bool": {
>       "must": [ ... ],
>       "should": [ ... ]
>     }
>   }
> }
>
> But how do I actually go from 'x AND (y OR z)' to 'bool' in the query
> DSL? I seem to lose the nice compositional properties of the expression.
> Here, for example, the expression implies that at least 'y' or 'z' must
> match; I don't quite see how I can express this in the DSL.
>
> Kind regards,
> Mikhail
>


-- 
Sincerely yours
Mikhail Khludnev


Boost a dynamic field

2020-12-08 Thread Kelv



Hello,

I'm trying to boost a document's score based on the existence of a dynamic 
field. I can't seem to get the syntax right: I get either Solr server 
errors, or it just doesn't change the Solr response at all.


In solrconfig.xml the dynamic fields are defined as...

<dynamicField ... stored="true" multiValued="true"/>


The field I want to check for is called DYNAMIC_rank. If it exists I 
want to boost the score so the document shows up first.
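
Not from this thread, but for illustration: a common pattern for boosting
on field existence is a boost query with an open-ended range over the field
(the factor 10 is arbitrary, and this assumes the dismax/edismax parser):

  bq=DYNAMIC_rank:[* TO *]^10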


Hoping someone can help!

Thanks,

Kelv


Re: No numShards attribute exists in 'core.properties' with the newly added replica

2020-12-08 Thread Erick Erickson
I raised this JIRA: https://issues.apache.org/jira/browse/SOLR-15035

What’s not clear to me is whether numShards should even be in core.properties 
at all, even on the create command. In the state.json file it’s a 
collection-level property and not reflected in the individual replica’s 
information.

However, we should be consistent.

Best,
Erick

> On Dec 8, 2020, at 4:34 AM, Dawn  wrote:
> 
> Hi
> 
>   Solr8.7.0
> 
>   No numShards attribute exists in 'core.properties' with the newly added 
> replica. This causes numShards to be null when using CloudDescriptor.
> 
> 	Since the ADDREPLICA command does not get the numShards property, the 
> coreProps will not save numShards in the constructor that creates the 
> CoreDescriptor, so the 'core.properties' file will be generated without 
> numShards.
> 
> 	Can the numShards attribute be added to the process of adding a 
> replica, so that the replica's 'core.properties' file can contain the 
> numShards attribute?



Can I express this nested query in JSON DSL?

2020-12-08 Thread Mikhail Edoshin

Hi,

I'm more or less new to Solr. I need to run queries that use joins all 
over the place. (The idea is to index database records pretty much as 
they are and then query them in interesting ways and, most importantly, 
get the rank. Our dataset is not too large so the performance is great.)


I managed to express the logic using the following approach. For 
example, I want to search people by their names or addresses:


  q=type:Person^=0 AND ({!edismax qf= v=$p0} OR {!join 
 v=$p1})

  p1={!edismax qf= v=p0}
  p0=

(Here 'type:Person' works as a filter so I zero its score.) This seems 
to work as expected and give the right results and ranking. It also 
seems to scale nicely for two levels of joins, although the queries 
become rather hard to follow in their raw form (I used a custom 
XML-to-query transformer to actually formulate more complex queries).


So my question is: can I express an equivalent query using the 
query DSL? I know I can use 'bool' like that:


{
  "query": {
    "bool": {
      "must": [ ... ],
      "should": [ ... ]
    }
  }
}

But how do I actually go from 'x AND (y OR z)' to 'bool' in the query 
DSL? I seem to lose the nice compositional properties of the expression. 
Here, for example, the expression implies that at least 'y' or 'z' must 
match; I don't quite see how I can express this in the DSL.


Kind regards,
Mikhail


Re: optimize boosting parameters

2020-12-08 Thread Erick Erickson
Before worrying about it too much, exactly _how_ much has
the performance changed?

I’ve just been in too many situations where there’s
no objective measure of performance before and after, just
someone saying “it seems slower” and had those performance
changes disappear when a rigorous test is done. Then spent
a lot of time figuring out that the person reporting the 
problem hadn’t had coffee yet. Or the network was slow.
Or….

If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field. BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Of course that means that to change the boosting you need
to re-index.

 You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively 
stable.
  in that case you can pre-compute them too.


2> your problem statement has nothing to do with termfreq so why are you
 using it in the first place?

Best,
Erick

> On Dec 8, 2020, at 12:46 AM, Radu Gheorghe  wrote:
> 
> Hi Derek,
> 
> Ah, then my reply was completely off :)
> 
> I don’t really see a better way. Maybe other than changing termfreq to field, 
> if the numeric field has docValues? That may be faster, but I don’t know for 
> sure.
> 
> Best regards,
> Radu
> --
> Sematext Cloud - Full Stack Observability - https://sematext.com
> Solr and Elasticsearch Consulting, Training and Production Support
> 
>> On 8 Dec 2020, at 06:17, Derek Poh  wrote:
>> 
>> Hi Radu
>> 
>> Apologies for not making myself clear.
>> 
>> I would like to know if there is a simpler or more efficient way to craft the 
>> boosting parameters based on the requirements.
>> 
>> For example, I am using 'if', 'map' and 'termfreq' functions in the bf 
>> parameters.
>> 
>> Is there a more efficient or simpler function that can be used instead? Or 
>> a way to craft the 'formula' more efficiently?
>> 
>> On 7/12/2020 10:05 pm, Radu Gheorghe wrote:
>>> Hi Derek,
>>> 
>>> It’s hard to tell whether your boosts can be made better without knowing 
>>> your data and what users expect of it. Which is a problem in itself.
>>> 
>>> I would suggest gathering judgements, like if a user queries for X, what 
>>> doc IDs do you expect to get back?
>>> 
>>> Once you have enough of these judgements, you can experiment with boosts 
>>> and see how the query results change. There are measures such as nDCG (
>>> https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
>>> ) that can help you measure that per query, and you can average this score 
>>> across all your judgements to get an overall measure of how well you’re 
>>> doing.
>>> 
>>> Or even better, you can have something like Quaerite play with boost values 
>>> for you:
>>> 
>>> https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga
>>> 
>>> 
>>> Best regards,
>>> Radu
>>> --
>>> Sematext Cloud - Full Stack Observability - 
>>> https://sematext.com
>>> 
>>> Solr and Elasticsearch Consulting, Training and Production Support
>>> 
>>> 
 On 7 Dec 2020, at 10:51, Derek Poh 
 wrote:
 
 Hi
 
 I have added the following boosting requirements to the search query of a 
 page. Feedback from the monitoring team is that the overall response time 
 of the page has increased since then.
 I am trying to find out if the added boosting parameters (below) could 
 have contributed to the increase.
 
 The boosting is working as per requirements.
 
 May I know if the implemented boosting parameters can be enhanced or 
 optimized further?
 Hopefully to improve on the response time of the query and the page.
 
 Requirements:
 1. If P_SupplierResponseRate is:
   a. 3, boost by 0.4
   b. 2, boost by 0.2
 
 2. If P_SupplierResponseTime is:
   a. 4, boost by 0.4
   b. 3, boost by 0.2
 
 3. If P_MWSScore is:
   a. between 80-100, boost by 1.6
   b. between 60-79, boost by 0.8
 
 4. If P_SupplierRanking is:
   a. 3, boost by 0.3
   b. 4, boost by 0.6
   c. 5, boost by 0.9
   d. 6, boost by 1.2
 
 Boosting parameters implemented:
 bf=map(P_SupplierResponseRate,3,3,0.4,0)
 bf=map(P_SupplierResponseRate,2,2,0.2,0)
 
 bf=map(P_SupplierResponseTime,4,4,0.4,0)
 bf=map(P_SupplierResponseTime,3,3,0.2,0)
 
 bf=map(P_MWSScore,80,100,1.6,0)
 bf=map(P_MWSScore,60,79,0.8,0)
 
 bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))
 
 
 I am using Solr 7.7.2
 

Re: Is there a way to search for "..." (three dots)?

2020-12-08 Thread Erick Erickson
Yes, but…

Odds are your analysis configuration for the field is removing the dots.

Go to the admin/analysis page, pick your field type and put examples in
the “index” and “query” boxes and you’ll see what I mean.

You need something like WhitespaceTokenizer, as your tokenizer,
and avoid things like WordDelimiter(Graph)FilterFactory.

You’ll find this is tricky though. For instance, if you index
“…something is here”, WhitespaceTokenizer will split this into
“…something”, “is”, “here” and you won’t be able to search for 
“something” since the _token_ is “…something”.

You could use one of the other tokenizers or use one of the
regular expression tokenizers.
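
For illustration, a minimal field type along those lines (a sketch; the
name is a placeholder, and you would likely add lowercasing and other
filters on top):

  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>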

Best,
Erick

> On Dec 8, 2020, at 5:56 AM, nettadalet  wrote:
> 
> Hi,
> I need to be able to search for "..." (three dots), meaning the query should
> be "..." and the search should return results that have "..." in their
> names.
> Is there a way to do it?
> Thanks in advance.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Commits (with openSearcher = true) are too slow in solr 8

2020-12-08 Thread raj.yadav
matthew sporleder wrote
> I would stick to soft commits and schedule hard-commits as
> spaced-out-as-possible in regular maintenance windows until you can
> find the culprit of the timeout.
> 
> This way you will have very focused windows for intense monitoring
> during the hard-commit runs.
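
For reference, the commit cadence matthew describes is typically configured
in solrconfig.xml; the values below are illustrative only:

  <autoCommit>
    <maxTime>600000</maxTime>   <!-- hard commit every 10 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>30000</maxTime>    <!-- soft commit every 30 seconds -->
  </autoSoftCommit>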

*Little correction:*
In my last post, I mentioned that softCommit is working fine and there is
no delay or error message.
Here is what is happening:

1. Hard commit with openSearcher=true
curl
"http://:solr_port/solr/my_collection/update?openSearcher=true&commit=true&wt=json"

All the cores started processing the commit except the one hosted on ``.
Also, we are getting a timeout error on this.

2. softCommit
curl
"http://:solr_port/solr/my_collection/update?softCommit=true&wt=json"
Same as 1.

3.Hard commit with openSearcher=false
curl
"http://:solr_port/solr/my_collection/update?openSearcher=false&commit=true&wt=json"
All the cores started processing the commit immediately and there is no error.


Solr commands used to set up the system

Solr start command
#/var/solr-8.5.2/bin/solr start -c  -p solr_port  -z
zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -s
/var/node_my_collection_1/solr-8.5.2/server/solr -h  -m 26g
-DzkClientTimeout=3 -force



Create Collection
1. Upload config to ZooKeeper
#var/solr-8.5.2/server/scripts/cloud-scripts/./zkcli.sh -z
zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port  -cmd upconfig -confname
my_collection  -confdir /

2. Created collection with 3 shards (shard1, shard2, shard3),
#curl
"http://:solr_port/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_collection&createNodeSet=solr_node1:solr_port,solr_node2:solr_port,solr_node3:solr_port"

3. Used SPLITSHARD command to split each shard into two halves
(shard1_1, shard1_0, shard2_0, ...)
e.g
 #curl
"http://:solr_port/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1

4. Used DELETESHARD command to delete old shards (shard1, shard2, shard3).
e.g
 #curl
"http://:solr_port/solr/admin/collections?action=DELETESHARD&collection=my_collection&shard=shard1









--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to get the config set name of Solr core

2020-12-08 Thread Andreas Hubold
Hi,

I was able to add the config set to the STATUS response by implementing a
custom extended CoreAdminHandler.

However, it would be nice if this could be added in Solr itself. I've created
a JIRA for this: https://issues.apache.org/jira/browse/SOLR-15034
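
For context, the STATUS call in question is the CoreAdmin API (host and
core names here are placeholders); the custom handler adds the core's
config set name to each core's status entry:

  curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=mycore"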

Kind regards,
Andreas



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Getting Reset cancel_stream_error on solr-8.5.2

2020-12-08 Thread raj.yadav
Hey All,
We have updated our system from solr 5.4 to solr 8.5.2 and we are suddenly
seeing a lot of the below errors in our logs.

HttpChannelState org.eclipse.jetty.io.EofException: Reset
cancel_stream_error

Is this related to some system level or solr level config?

How do I find the cause of this?
How do I solve this?

*Solr Setup Details:*
Solr version => solr-8.5.2

GC setting: GC_TUNE=" -XX:+UseG1GC -XX:+PerfDisableSharedMem
-XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=150
-XX:InitiatingHeapOccupancyPercent=60 -XX:+UseLargePages -XX:+AggressiveOpts
"

Solr collection details (running in SolrCloud mode): it has 6 shards, and
each shard has only one replica (which is also the leader); the replica type is
NRT. Total docs in the collection: 77 million; each shard index size: 11 GB;
avg size/doc: 1.0 KB

Zookeeper => We are using external zookeeper ensemble (3 node cluster)

System Details:
CentOS 7.7; disk size: 250 GB; 8 vCPUs; 64 GiB memory


Solr OPs

Solr start command
#/var/solr-8.5.2/bin/solr start -c  -p solr_port  -z
zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -s
/var/node_my_collection_1/solr-8.5.2/server/solr -h  -m 26g 
-DzkClientTimeout=3 -force



Create Collection
1. Upload config to ZooKeeper
#var/solr-8.5.2/server/scripts/cloud-scripts/./zkcli.sh -z
zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port  -cmd upconfig -confname
my_collection  -confdir /

2. Created collection with 3 shards (shard1, shard2, shard3),
#curl 
"http://:solr_port/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_collection&createNodeSet=solr_node1:solr_port,solr_node2:solr_port,solr_node3:solr_port"

3. Used SPLITSHARD command to split each shard into two halves
(shard1_1, shard1_0, shard2_0, ...)
e.g
 #curl
"http://:solr_port/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1

4. Used DELETESHARD command to delete old shards (shard1, shard2, shard3).
e.g
 #curl
"http://:solr_port/solr/admin/collections?action=DELETESHARD&collection=my_collection&shard=shard1




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Is there a way to search for "..." (three dots)?

2020-12-08 Thread nettadalet
Hi,
I need to be able to search for "..." (three dots), meaning the query should
be "..." and the search should return results that have "..." in their
names.
Is there a way to do it?
Thanks in advance.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


No numShards attribute exists in 'core.properties' with the newly added replica

2020-12-08 Thread Dawn
Hi

Solr8.7.0

No numShards attribute exists in 'core.properties' with the newly added 
replica. This causes numShards to be null when using CloudDescriptor.

	Since the ADDREPLICA command does not get the numShards property, the 
coreProps will not save numShards in the constructor that creates the 
CoreDescriptor, so the 'core.properties' file will be generated without 
numShards.

	Can the numShards attribute be added to the process of adding a 
replica, so that the replica's 'core.properties' file can contain the 
numShards attribute?