Adding solr-core via maven fails

2020-07-01 Thread Ali Akhtar
If I try adding solr-core to an existing project, e.g (SBT):

libraryDependencies += "org.apache.solr" % "solr-core" % "8.5.2"

It fails due to a 404 on its transitive dependencies:

Extracting structure failed
stack trace is suppressed; run last update for the full output
stack trace is suppressed; run last ssExtractDependencies for the full
output
(update) sbt.librarymanagement.ResolveException: Error downloading
org.restlet.jee:org.restlet:2.4.0
Not found
Not found
not found:
/home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
not found:
https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
Not found
Not found
not found:
/home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
not found:
https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom
(ssExtractDependencies) sbt.librarymanagement.ResolveException: Error
downloading org.restlet.jee:org.restlet:2.4.0
Not found
Not found
not found:
/home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
not found:
https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
Not found
Not found
not found:
/home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
not found:
https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom



Any ideas? Do I need to add a specific repository to get it to compile?
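
I suspect the org.restlet.jee artifacts are simply not published to Maven
Central, which would explain the 404s above. Would adding Restlet's own
repository as an extra resolver be the right fix? A sketch (the repository
URL is a guess based on Restlet's docs):

// Restlet artifacts live in Restlet's own repository, not Maven Central.
resolvers += "Restlet Repository" at "https://maven.restlet.talend.com"

libraryDependencies += "org.apache.solr" % "solr-core" % "8.5.2"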


How to use two search strings in a single Solr query

2020-07-01 Thread Tushar Arora
Hi,
I have a scenario with the following entries in the request handler (handler1) in
solrconfig.xml (defType=edismax is used):

<str name="qf">description category title^4 demand^0.3</str>
<str name="mm">2<-1 4<-30%</str>

When I searched 'bags' as a search string, Solr returned 15000 results.
Query Used :
http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=bags&qt=handler1&rows=10&wt=json

And when I searched 'books' as a search string, Solr returned, say, 3348 results.
Query Used :
http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=books&qt=handler1&rows=10&wt=json

I want to use both 'bags' and 'books' as search strings in a single query.
I used the following query:
http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=%22bags%22+OR+%22books%22&qt=handler1&rows=10&wt=json
But the OR operator is not working; it only gives 7 results.


I even tried this:
http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=(bags)+OR+(books)&qt=handler1&rows=10&wt=json
But it also gives only 7 results.

My concern is to include the results of both 'bags' and 'books' in a
single query.
Is there any way to use two search strings in a single query?
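
One guess at what is happening: the handler's mm setting above ("2<-1"), if I
read it right, makes both clauses of a two-clause query mandatory, which would
explain the 7 results (documents matching both terms). Would overriding mm per
request be the right approach? A sketch:

http://localhost:8984/solr/core_name/select?fl=title&indent=on&mm=1&q=bags+OR+books&qt=handler1&rows=10&wt=json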


Re: Suggestion or recommendation for NRT

2020-07-01 Thread Erick Erickson
That seems high. It can be tricky to get tests right. Are you running with
some kind of test runner? Do you have, say, 3-4 thousand queries
you run? Are you running the tests after warming the searchers?

Also, if you have merged down to one segment and _then_ tried
adding docs and measuring, you are not getting accurate results.

See: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

> On Jul 1, 2020, at 5:55 PM, ramyogi  wrote:
> 
> Thanks Erick for the details and the reference; they helped me understand the
> segment merging better.
> When I compare the search performance of an uninterrupted, optimized (segment
> count 1) collection against a collection where indexing and search are going
> on in parallel, the response time is several times higher:
> for example, the first one responds in 100ms on average but the second one is
> around 400ms.
> 
> Is that expected behaviour, a tradeoff we have to accept when we do indexing
> and search in the same collection in parallel? Or can it still be fine-tuned
> with some parameters for better performance? If so, please suggest some.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Suggestion or recommendation for NRT

2020-07-01 Thread ramyogi
Thanks Erick for the details and the reference; they helped me understand the
segment merging better.
When I compare the search performance of an uninterrupted, optimized (segment
count 1) collection against a collection where indexing and search are going
on in parallel, the response time is several times higher:
for example, the first one responds in 100ms on average but the second one is
around 400ms.

Is that expected behaviour, a tradeoff we have to accept when we do indexing
and search in the same collection in parallel? Or can it still be fine-tuned
with some parameters for better performance? If so, please suggest some.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: FunctionScoreQuery how to use it

2020-07-01 Thread Mikhail Khludnev
Hi, Vincenzo.

Discussed earlier
https://www.mail-archive.com/java-user@lucene.apache.org/msg50255.html
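
In short, one common port (a sketch, not a drop-in: the old customScore()
logic has to be re-expressed as a DoubleValuesSource, and the "popularity"
field below is just a placeholder):

import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

// Wraps the original query; each hit's score becomes
// originalScore * value of the "popularity" field.
Query boosted = FunctionScoreQuery.boostByValue(
    originalQuery,
    DoubleValuesSource.fromFloatField("popularity"));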

On Wed, Jul 1, 2020 at 8:36 PM Vincenzo D'Amore  wrote:

> Hi all,
>
> I'm struggling with an old class that extends CustomScoreQuery.
> I was trying to port it to Solr 8.5.2 and I'm looking for an example of how to
> implement it using FunctionScoreQuery.
>
> Do you know if there are examples that explain how to port the code to the
> new implementation?
>
> --
> Vincenzo D'Amore
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Suggestion or recommendation for NRT

2020-07-01 Thread Erick Erickson
Updated documents are marked as deleted in the
old segment and added to a new segment. When
commits happen, merges occur and only then is the
space occupied by the deleted document reclaimed.

Which segments are merged on commit depends
on a number of factors.

Unless you can prove the extra space is a problem,
you should just ignore the issue. The percentage of
deleted documents should max out at around 33%
on Solr 7.5+.

For background on merging, see:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

The third animation (TieredMergePolicy) is the default.
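
If you ever do need to force the space back (rarely a good idea), a hard
commit with expungeDeletes will merge away segments with lots of deleted
docs; a sketch, with collection name and port as placeholders:

curl 'http://localhost:8983/solr/yourCollection/update?commit=true&expungeDeletes=true'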

Best,
Erick

> On Jul 1, 2020, at 3:51 PM, ramyogi  wrote:
> 
> Even though the same documents are indexed over and over again due to
> incremental updates, the index size keeps increasing.
> Am I missing any configuration to make optimization occur internally?
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Suggestion or recommendation for NRT

2020-07-01 Thread ramyogi
Even though the same documents are indexed over and over again due to
incremental updates, the index size keeps increasing.
Am I missing any configuration to make optimization occur internally?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: CDCR stress-test issues

2020-07-01 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
For the record, it is not just Solr 7.4 which has the problem. When I start 
afresh with Solr 8.5.2, both symptoms persist.

With Solr 8.5.2, tlogs accumulate endlessly at the non-Leader nodes of the 
Source SolrCloud and are never released, regardless of the maxNumLogsToKeep
setting.

And with Solr 8.5.2, if four scripts run simultaneously for a few minutes, each 
script running a loop where each iteration adds batches of 6 records to the 
Source SolrCloud, a couple dozen records wind up on the Source without ever 
arriving at the Target SolrCloud (although the Target does have records which 
were added after the missing records).

Does anyone have any suggestions yet on how to get CDCR to work properly?


-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C]  
Sent: Wednesday, June 24, 2020 9:46 AM
To: solr-user@lucene.apache.org
Subject: CDCR stress-test issues

In attempting to stress-test CDCR (running Solr 7.4), I am running into a 
couple of issues.

One is that the tlog files keep accumulating for some nodes in the CDCR system, 
particularly for the non-Leader nodes in the Source SolrCloud. No quantity of 
hard commits seems to cause any of these tlog files to be released. This can 
become a problem upon reboot if there are hundreds of thousands of tlog files, 
and Solr fails to start (complaining that there are too many open files).

The tlogs had been accumulating on all the nodes of the CDCR set of SolrClouds 
until I added these two lines to the solrconfig.xml file (for testing purposes, 
using numbers much lower than in the examples):
<int name="numRecordsToKeep">5</int>
<int name="maxNumLogsToKeep">2</int>
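
Those two lines sit inside the updateLog element of solrconfig.xml; roughly
(assuming, as shown, that the 5 went to numRecordsToKeep and the 2 to
maxNumLogsToKeep):

<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">5</int>
  <int name="maxNumLogsToKeep">2</int>
</updateLog>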
Since then, it is mostly the non-Leader nodes of the Source SolrCloud which 
accumulate tlog files (the Target SolrCloud does seem to have a tendency to 
clean up the tlog files, as does the Leader of the Source SolrCloud). If I use 
ADDREPLICAPROP and REBALANCELEADERS to change which node is the Leader, and if 
I then start adding more data, the tlogs on the new Leader sometimes will go 
away, but then the old Leader begins accumulating tlog files. I am dubious 
whether frequent reassignment of Leadership would be a practical solution.

I also have several times attempted to simulate a production environment by 
running several loops simultaneously, each of which inserts multiple records on 
each iteration of the loop. Several times, I end up with a dozen records on 
(both replicas of) the Source which never make it to (either replica of) the 
Target. The Target has thousands of records which were inserted before the 
missing records, and thousands of records which were inserted after the missing 
records (and all these records, the replicated and the missing, were inserted 
by curl commands which only differed in sequential numbers incorporated into 
the values being inserted).

I also have a question regarding SOLR-13141: the 11/Feb/19 comment says that 
the fix for Solr 7.3 had a problem; and the header says "Affects Version/s: 
7.5, 7.6": does that indicate that Solr 7.4 is not affected?

Are there any suggestions?

Thanks


Re: Downsides to applying WordDelimiterFilter twice in analyzer chain

2020-07-01 Thread Erick Erickson
Consider something other than WhitespaceTokenizer. In this case
the tokenizer would split on the period and it’d work. I don’t know
whether that would fit the rest of your problem space or not though.

But to answer your original question: no, there's no a priori reason you
can’t have WordDelimiter(Graph)FilterFactory twice, but I suspect
better tokenization is a more robust answer.

Best,
Erick

> On Jul 1, 2020, at 3:11 PM, gnandre  wrote:
> 
> Here are links to images for the Analysis tab.
> 
> https://pasteboard.co/JfFTYu6.png
> https://pasteboard.co/JfFUYXf.png
> 
> 
> On Wed, Jul 1, 2020 at 3:03 PM gnandre  wrote:
> I am doing that already but it does not help.
> 
> Here is the complete analyzer chain.
> 
>   <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
>               preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>               catenateWords="1" catenateNumbers="1" catenateAll="0"
>               splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
>               ignoreCase="true" expand="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
>               preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>               catenateWords="1" catenateNumbers="1" catenateAll="0"
>               splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_query.txt"
>               ignoreCase="true" expand="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
>   </fieldType>
> 
> On Wed, Jul 1, 2020 at 12:29 PM Erick Erickson  
> wrote:
> Why not just specify preserveOriginal and follow by a lowerCaseFilter and
> use one wordDelimiterFilterFactory?
> 
> Best,
> Erick
> 
> > On Jul 1, 2020, at 11:05 AM, gnandre  wrote:
> > 
> > Hi,
> > 
> > To satisfy one use-case, I need to apply WordDelimiterFilter with
> > splitOnCaseChange
> > with 0 once and then with 1 again. Are there some downsides to this
> > approach?
> > 
> > Use-case is to be able to match results when indexed content is my.camelCase
> > and search query is camelcase.
> 



Re: Downsides to applying WordDelimiterFilter twice in analyzer chain

2020-07-01 Thread gnandre
Here are links to images for the Analysis tab.

https://pasteboard.co/JfFTYu6.png
https://pasteboard.co/JfFUYXf.png


On Wed, Jul 1, 2020 at 3:03 PM gnandre  wrote:

> I am doing that already but it does not help.
>
> Here is the complete analyzer chain.
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
>             preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
>             preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_query.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
>
> [image: image.png]
>
> [image: image.png]
>
> On Wed, Jul 1, 2020 at 12:29 PM Erick Erickson 
> wrote:
>
>> Why not just specify preserveOriginal and follow by a lowerCaseFilter and
>> use one wordDelimiterFilterFactory?
>>
>> Best,
>> Erick
>>
>> > On Jul 1, 2020, at 11:05 AM, gnandre  wrote:
>> >
>> > Hi,
>> >
>> > To satisfy one use-case, I need to apply WordDelimiterFilter with
>> > splitOnCaseChange
>> > with 0 once and then with 1 again. Are there some downsides to this
>> > approach?
>> >
>> > Use-case is to be able to match results when indexed content is
>> my.camelCase
>> > and search query is camelcase.
>>
>>


Re: Downsides to applying WordDelimiterFilter twice in analyzer chain

2020-07-01 Thread gnandre
I am doing that already but it does not help.

Here is the complete analyzer chain.

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protect.txt"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_query.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

[image: image.png]

[image: image.png]

On Wed, Jul 1, 2020 at 12:29 PM Erick Erickson 
wrote:

> Why not just specify preserveOriginal and follow by a lowerCaseFilter and
> use one wordDelimiterFilterFactory?
>
> Best,
> Erick
>
> > On Jul 1, 2020, at 11:05 AM, gnandre  wrote:
> >
> > Hi,
> >
> > To satisfy one use-case, I need to apply WordDelimiterFilter with
> > splitOnCaseChange
> > with 0 once and then with 1 again. Are there some downsides to this
> > approach?
> >
> > Use-case is to be able to match results when indexed content is
> my.camelCase
> > and search query is camelcase.
>
>


FunctionScoreQuery how to use it

2020-07-01 Thread Vincenzo D'Amore
Hi all,

I'm struggling with an old class that extends CustomScoreQuery.
I was trying to port it to Solr 8.5.2 and I'm looking for an example of how to
implement it using FunctionScoreQuery.

Do you know if there are examples that explain how to port the code to the
new implementation?

-- 
Vincenzo D'Amore


Re: Downsides to applying WordDelimiterFilter twice in analyzer chain

2020-07-01 Thread Erick Erickson
Why not just specify preserveOriginal and follow by a lowerCaseFilter and
use one wordDelimiterFilterFactory?
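
Something like this, i.e. a single WDF followed by the lowercase filter (a
sketch of the shape I mean, your other filters omitted):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
        generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>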

Best,
Erick

> On Jul 1, 2020, at 11:05 AM, gnandre  wrote:
> 
> Hi,
> 
> To satisfy one use-case, I need to apply WordDelimiterFilter with
> splitOnCaseChange
> with 0 once and then with 1 again. Are there some downsides to this
> approach?
> 
> Use-case is to be able to match results when indexed content is my.camelCase
> and search query is camelcase.



Searching document content and mult-valued fields

2020-07-01 Thread Shaun Campbell
Hi

Been using Solr on a project now for a couple of years and it is working well.
It's just a simple index of about 20 - 25 fields and 7,000 project records.

Now there's a requirement to be able to search on the content of documents
(web pages, Word, pdf etc) related to those projects.  My initial thought
was to just create a new index to store the Tika'd content and just search
on that. However, the requirement is to somehow search through both the
project records and the content records at the same time and list the main
project with perhaps some info on the matching content data. I tried to
explain that you may find matching main project records but no content, and
vice versa.

My only solution to this search problem is either to concatenate all the
document content into one field on the main project record, add that to my
dismax search, and use boosting etc., or to use a multi-valued field to store
the content of each project document.  I'm a bit reluctant to do this as the
application is running well and I'm nervous about a change to the schema and
the indexing process.  I just wondered what you thought about adding a lot of
content to an existing schema (single or multi-valued field) that doesn't
normally store big amounts of data.

Or does anyone know of any way I can join two searches like this together
across two separate indexes?
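
I did come across the cross-core join query parser but have not tried it; as I
understand it, something like the following would restrict projects to those
whose content documents match (core and field names here are made up):

q={!join from=project_id to=id fromIndex=content_core}content:(search terms)

though I gather the join brings back only the project side, not the matching
content fields or scores, so it may not fully satisfy the requirement above.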

Thanks
Shaun


Solr Grouping and Unique values

2020-07-01 Thread Reinhardt, Nate

I am trying to find a way to grab unique values based on a group. The idea 
would be to group by an id and then return that group's values.
Query params: fl=valueIwant+myID&group=true&group.field=myId&q=*:*
"grouped": {
 "myID": {
 "matches": 7520236,
 "groups": [{
 "groupValue": "123456",
 "doclist": {
 "numFound": 6583,
 "start": 0,
 "docs": [{
 "myID": 123456,
 "valueIwant": "Hello World"
 }]
 }
 }
 ]
 }
}
This is fine, but what I want to do is select the 'valueIwant' in a distinct 
way. Raising group.limit will return more values in the docs, but they won't be 
unique. Is there a way to restrict group.limit to only return unique fl values? 
With 6583 found for the above example, I would have to expand the limit to 6583 
and then whittle it down to unique values. This gets compounded when I have 700 
unique ids that I want to group by with a total of 44m documents.
For example, if I do
fl=valueIwant+myID&group.limit=3&group=true&group.field=myId&q=*:*
 "grouped": {
 "myID": {
 "matches": 7520236,
 "groups": [{
 "groupValue": "123456",
 "doclist": {
 "numFound": 6583,
 "start": 0,
 "docs": [{
 "myID": 123456,
 "valueIwant": "Hello World"
 },
 {
 "myID": 123456,
 "valueIwant": "Hello World"
 },
 {
 "myID": 123456,
 "valueIwant": "Hello World123456"
 }]
 }
 }
 ]
 }
 }
What I want is the docs to be unique against valueIwant like so
 "grouped": {
 "myID": {
 "matches": 7520236,
 "groups": [{
 "groupValue": "123456",
 "doclist": {
 "numFound": 6583,
 "start": 0,
 "docs": [{
 "myID": 123456,
 "valueIwant": "Hello World"
 },
 {
 "myID": 123456,
 "valueIwant": "Hello Planet"
 },
 {
 "myID": 123456,
 "valueIwant": "Hello World123456"
 }]
 }
 }
 ]
 }
}
Is there a way to do this? I was looking at functions but couldn't find 
anything I needed.
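
The closest thing I have found so far (untested, and assuming valueIwant has
docValues) is nested terms facets in the JSON Facet API, which would at least
give the distinct valueIwant values under each myID without fetching documents:

curl http://localhost:8983/solr/core_name/query -d '{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "byId": {
      "type": "terms", "field": "myID", "limit": 700,
      "facet": {
        "values": { "type": "terms", "field": "valueIwant" }
      }
    }
  }
}'

But that returns counts rather than documents, so I am not sure it fits.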
A copy of this question can be found at 
https://stackoverflow.com/questions/62679939/solr-grouping-and-unique-values


Thanks

Nate


Re: Parallel SQL join on multivalue fields

2020-07-01 Thread Piero Scrima
The reason JOIN works is the Calcite framework. The parallel
SQL feature leverages Calcite, which implements all the SQL features; all
you need to provide is the way for Calcite to get the collection/table. In
Solr this is done by SolrTable.java (package
org.apache.solr.handler.sql), an implementation of AbstractQueryableTable. Once
you implement the Calcite table interface you basically have a Calcite
adapter, and Calcite gives you all the SQL features for free.
You can try it yourself: let's say you have a collection called table1 with
fields name1_s and field1_s (docValues) and a collection called table2 with
fields name2_s and field2_s. We can populate table1 with
{name1_s:"obj1_table1",field1_s:"a"},{name1_s:"obj2_table1",field1_s:"b"}
and table2 with
{name2_s:"obj1_table2",field2_s:"a"},{name2_s:"obj2_table2",field2_s:"d"}
then you can run:

curl --data-urlencode 'stmt=select a.name1_s,b.name2_s from table1 as a
inner join table2 as b on a.field1_s=b.field2_s limit 10'
http://localhost:8999/solr/table1/sql?aggregationMode=facet

the answer will be

{
  "result-set":{
"docs":[{
"name1_s":"obj1_table1",
"name2_s":"obj1_table2"}
  ,{
"EOF":true,
"RESPONSE_TIME":xxx}]}}

it will work.
As I said, it works because of the Calcite process, and I think that the
process is not optimized; moreover, it does not work well with multivalued
fields. I think it would be great if Solr parallel SQL could have an
optimized join process (using the join streaming API) and also support
multivalued fields, which could open several new use cases.

On Wed, Jul 1, 2020 at 3:31 PM Joel Bernstein 
wrote:

> There isn't any real support for joins in Parallel SQL currently. I'm
> surprised that you're having some success doing them. Can you provide a
> sample SQL join that is working for you?
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Jun 26, 2020 at 3:32 AM Piero Scrima  wrote:
>
> > Hi,
> >
> > Although there is no trace of join functionality in the official Solr
> > documentation
> > (https://lucene.apache.org/solr/guide/7_4/parallel-sql-interface.html),
> > joining in parallel sql works in practice. It only works if the field is
> > not a multivalued field. For my project it would be fantastic if it also
> > worked with multivalued fields.
> > Is there any way to do it? working with the streaming expression I
> managed
> > to do it with the following expression:
> >
> > innerJoin(
> > sort(
> > cartesianProduct(
> >
> >
> >
> search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
> > asc",qt="/select",rows="1000"),
> >   supplier
> > ),
> > by="supplier asc"
> > ),
> > sort(
> >   cartesianProduct(
> >
> >
> search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
> > asc",qt="/select",rows="1"),
> > supplier
> > ),
> > by="supplier asc"
> > ),
> >   on="supplier"
> > )
> >
> > supplier, of course, is a multivalued field.
> >
> > Is there a way to do this with parallel sql, and if not can we plan a new
> > feature to add it? I could also work on it .
> >
> > (version 7.4)
> >
> > Thank you
> >
>


Downsides to applying WordDelimiterFilter twice in analyzer chain

2020-07-01 Thread gnandre
Hi,

To satisfy one use-case, I need to apply WordDelimiterFilter with
splitOnCaseChange set to 0 once and then set to 1 again. Are there some
downsides to this approach?

Use-case is to be able to match results when indexed content is my.camelCase
and search query is camelcase.


Re: Parallel SQL join on multivalue fields

2020-07-01 Thread Joel Bernstein
There isn't any real support for joins in Parallel SQL currently. I'm
surprised that you're having some success doing them. Can you provide a
sample SQL join that is working for you?



Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jun 26, 2020 at 3:32 AM Piero Scrima  wrote:

> Hi,
>
> Although there is no trace of join functionality in the official Solr
> documentation
> (https://lucene.apache.org/solr/guide/7_4/parallel-sql-interface.html),
> joining in parallel sql works in practice. It only works if the field is
> not a multivalued field. For my project it would be fantastic if it also
> worked with multivalued fields.
> Is there any way to do it? working with the streaming expression I managed
> to do it with the following expression:
>
> innerJoin(
> sort(
> cartesianProduct(
>
>
> search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
> asc",qt="/select",rows="1000"),
>   supplier
> ),
> by="supplier asc"
> ),
> sort(
>   cartesianProduct(
>
> search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
> asc",qt="/select",rows="1"),
> supplier
> ),
> by="supplier asc"
> ),
>   on="supplier"
> )
>
> supplier, of course, is a multivalued field.
>
> Is there a way to do this with parallel sql, and if not can we plan a new
> feature to add it? I could also work on it .
>
> (version 7.4)
>
> Thank you
>


Re: Supporting multiple indexes in one collection

2020-07-01 Thread Erick Erickson
Sharding always adds overhead, which balances against splitting the 
work up amongst several machines. 

Sharding works like this for queries:

1> node receives query

2> a sub-query is sent to one replica of each shard

3> each replica sends back its top N (rows parameter) with ID and sort data

4> the node in <1> sorts the candidate lists to get the overall top N

5> the node in <1> sends out another query to each replica to get the data 
associated with the final sorted list

6> the node in <1> assembles the results from <5> and returns the true top 10 
to the client.


All that takes time. OTOH, in this scenario all the replicas are only searching 
a subset of the data, so each sub-query can be faster. Until you reach that 
point, querying a single replica is faster. At some point when your index gets 
past a certain size, that overhead is more than made up for by, basically, 
throwing more hardware at the problem (assuming the shards can make use of more 
hardware or CPUs or threads or whatever). “A certain size” is dependent on your 
data, hardware and query patterns; there’s no hard and fast rule.

But you haven’t really told us much. You say you’ve read that SolrCloud 
performance degrades when the number of collections rises. True. But the 
“number of collections” can be in the thousands. Are you talking about 5 
collections? 10 collections? 1,000,000 collections? Details matter.

And how many documents are you talking about per collection? Or in total? 

What are your performance criteria? Do you expect to handle 5 queries/second? 
50? 5,000,000?

When performance differs “by a few milliseconds”, unless you’re dealing with a 
very high total QPS it’s usually a waste of time to worry about it. Almost 
certainly there are much better things to spend your time on that the end users 
will actually notice ;) Plus, performance measurements are very tricky to 
actually get right. Are you measuring with a realistic data set and queries? 
Are you measuring with enough different queries to be hitting the various 
caches in a realistic manner? Are you indexing at the same time in a manner 
that reflects your real world? 

What I’m suggesting is that before making these kinds of decisions (and some of 
the ideas, like composite routing, will require significant engineering 
effort), you be very, very sure that they’re necessary. For instance, 
you’ll have to monitor every replica to see if it gets overloaded. Imagine your 
routing puts 300,000,000 documents for some very large client on a single shard 
(which, again, we have no idea whether that’s something you have to worry about 
since you haven’t told us). Now you’ll have to go in and fix that problem.
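
For reference, composite routing just means structuring the IDs with a shard
key prefix (this assumes the collection uses the default compositeId router;
the IDs below are made up). Both documents land on whatever shard the hash of
"clientA" maps to:

curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"clientA!doc1"},{"id":"clientA!doc2"}]'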

Best,
Erick

> On Jul 1, 2020, at 2:58 AM, Raji N  wrote:
> 
> Did the test a while back. Revisiting this again. But in standalone Solr we
> have experienced queries taking more time if the data exists in 2 shards.
> That's the main reason this test was done. If anyone has experience, I want
> to hear it.
> 
> On Tue, Jun 30, 2020 at 11:50 PM Jörn Franke  wrote:
> 
>> How many documents ?
>> The real difference  was only a couple of ms?
>> 
>>> Am 01.07.2020 um 07:34 schrieb Raji N :
>>> 
>>> Had 2 indexes in 2 separate shards in one collection and had exact same
>>> data published with composite router with a prefix. Disabled all caches.
>>> Issued the same query which is a small query with q parameter and fq
>>> parameter . Number of queries which got executed  (with same threads and
>>> run for same time ) were more in 2  indexes with 2 separate shards case.
>>> 90th percentile response time was also a few ms better.
>>> 
>>> Thanks,
>>> Raji
>>> 
 On Tue, Jun 30, 2020 at 10:06 PM Jörn Franke 
>> wrote:
 
What did you test? Which queries? What were the exact results in terms of
time?
 
>> On 30.06.2020 at 22:47, Raji N wrote:
> 
> Hi ,
> 
> 
> Trying to place multiple smaller indexes in one collection (as we read
> solrcloud performance degrades as number of collections increase). We are
> exploring two ways
> 
> 
> 1) Placing each index on a single shard of a collection
> 
> In this case placing documents for a single index is manual and
> automatic rebalancing not done by solr
> 
> 
> 2) Solr routing composite router with a prefix .
> 
> In this case solr doesn’t place all the docs with same prefix in one
> shard, so searches become distributed. But shard rebalancing is taken
> care of by Solr.
> 
> 
> We did a small perf test with both these set up. We saw the performance for
> the first case (placing an index explicitly on a shard) is better.
> 
> 
> Has anyone done anything similar. Can you please share your experience.
> 
> 
> Thanks,
> 
> Raji
 
>> 



Re: Supporting multiple indexes in one collection

2020-07-01 Thread Raji N
Did the test a while back. Revisiting this again. But in standalone Solr we
have experienced queries taking more time if the data exists in 2 shards.
That's the main reason this test was done. If anyone has experience, I want
to hear it.

On Tue, Jun 30, 2020 at 11:50 PM Jörn Franke  wrote:

> How many documents ?
> The real difference  was only a couple of ms?
>
> > On 01.07.2020 at 07:34, Raji N wrote:
> >
> > Had 2 indexes in 2 separate shards in one collection and had exact same
> > data published with composite router with a prefix. Disabled all caches.
> > Issued the same query which is a small query with q parameter and fq
> > parameter . Number of queries which got executed  (with same threads and
> > run for same time ) were more in 2  indexes with 2 separate shards case.
> > 90th percentile response time was also a few ms better.
> >
> > Thanks,
> > Raji
> >
> >> On Tue, Jun 30, 2020 at 10:06 PM Jörn Franke 
> wrote:
> >>
> >> What did you test? Which queries? What were the exact results in terms of
> >> time?
> >>
>  On 30.06.2020 at 22:47, Raji N wrote:
> >>>
> >>> Hi ,
> >>>
> >>>
> >>> Trying to place multiple smaller indexes in one collection (as we read
> >>> solrcloud performance degrades as number of collections increase). We
> are
> >>> exploring two ways
> >>>
> >>>
> >>> 1) Placing each index on a single shard of a collection
> >>>
> >>>  In this case placing documents for a single index is manual and
> >>> automatic rebalancing not done by solr
> >>>
> >>>
> >>> 2) Solr routing composite router with a prefix .
> >>>
> >>> In this case solr doesn’t place all the docs with same prefix in one
> >>> shard, so searches become distributed. But shard rebalancing is taken
> >>> care of by Solr.
> >>>
> >>>
> >>> We did a small perf test with both these set up. We saw the performance
> >> for
> >>> the first case (placing an index explicitly on a shard ) is better.
> >>>
> >>>
> >>> Has anyone done anything similar. Can you please share your experience.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Raji
> >>
>


Re: Supporting multiple indexes in one collection

2020-07-01 Thread Jörn Franke
How many documents ? 
The real difference  was only a couple of ms?

> On 01.07.2020 at 07:34, Raji N wrote:
> 
> Had 2 indexes in 2 separate shards in one collection and had exact same
> data published with composite router with a prefix. Disabled all caches.
> Issued the same query which is a small query with q parameter and fq
> parameter . Number of queries which got executed  (with same threads and
> run for same time ) were more in 2  indexes with 2 separate shards case.
> 90th percentile response time was also a few ms better.
> 
> Thanks,
> Raji
> 
>> On Tue, Jun 30, 2020 at 10:06 PM Jörn Franke  wrote:
>> 
>> What did you test? Which queries? What were the exact results in terms of
>> time ?
>> 
On 30.06.2020 at 22:47, Raji N wrote:
>>> 
>>> Hi ,
>>> 
>>> 
>>> Trying to place multiple smaller indexes in one collection (as we read
>>> solrcloud performance degrades as number of collections increase). We are
>>> exploring two ways
>>> 
>>> 
>>> 1) Placing each index on a single shard of a collection
>>> 
>>>  In this case placing documents for a single index is manual and
>>> automatic rebalancing not done by solr
>>> 
>>> 
>>> 2) Solr routing composite router with a prefix .
>>> 
>>> In this case solr doesn’t place all the docs with same prefix in one
>>> shard, so searches become distributed. But shard rebalancing is taken
>>> care of by Solr.
>>> 
>>> 
>>> We did a small perf test with both these set up. We saw the performance
>> for
>>> the first case (placing an index explicitly on a shard ) is better.
>>> 
>>> 
>>> Has anyone done anything similar. Can you please share your experience.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Raji
>>