Re: Need help on handling large size of index.

2020-05-21 Thread Modassar Ather
Thanks Shawn for your response.

We have seen a performance increase in optimisation with a bigger number of
IOPs. Without the extra IOPs the optimisation took around 15-20 hours,
whereas the same index took 5-6 hours to optimise with higher IOPs.
That said, the extra IOPs were never fully used apart from a couple of
spikes in usage, so I am not able to understand how the increased IOPs make
so much of a difference.
Can you please help me understand what optimising involves? Is it mostly
RAM/IOPs?

Search response time is very important. Please advise how much effect
increasing the number of shards with extra servers may have on search
response time.

Best,
Modassar

On Thu, May 21, 2020 at 2:16 PM Modassar Ather 
wrote:

> Thanks Phill for your response.
>
> Optimal Index size: Depends on what you are optimizing for. Query Speed?
> Hardware utilization?
> We are optimising for query speed. My understanding is that even if we set
> the merge policy to any number, the same amount of disk will still be
> required for the bigger segment merges. Please correct me if I am wrong.
>
> Optimizing the index is something I never do. We live with about 28%
> deletes. You should check your configuration for your merge policy.
> About 10-20% of our updates are deletes. We have no merge policy set in
> the configuration as we do a full optimisation after the indexing.
>
> Increased sharding has helped reduce query response time, but surely there
> is a point where the collation of results starts to be the bottleneck.
> The query response time is my concern. I understand the aggregation of
> results may increase the search response time.
>
> *What does your schema look like? I index around 120 fields per document.*
> The schema has a combination of text and string fields. None of the fields
> except the Id field are stored. We also have around 120 fields. A few of them
> have docValues enabled.
>
> *What does your queries look like? Mine are so varied that caching never
> helps, the same query rarely comes through.*
> Our search queries are a combination of proximity, nested proximity and
> wildcard terms most of the time. A query can be very complex, with 100s of
> wildcard and proximity terms in it. Different grouping options are also
> enabled on the search results. And the search queries vary a lot.
>
> Oh, another thing, are you concerned about availability? Do you have a
> replication factor > 1? Do you run those replicas in a different region for
> safety?
> How many zookeepers are you running and where are they?
> As of now we do not have any replication factor. We are not using
> zookeeper ensemble but would like to move to it sooner.
>
> Best,
> Modassar
>
> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
>
>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>> > Can you please help me with the following few questions?
>> >
>> > - What is the ideal index size per shard?
>>
>> We have no way of knowing that.  A size that works well for one index
>> use case may not work well for another, even if the index size in both
>> cases is identical.  Determining the ideal shard size requires
>> experimentation.
>>
>>
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> > - The optimisation takes a lot of time and IOPs to complete. Will
>> > increasing the number of shards help in reducing the optimisation
>> time and
>> > IOPs?
>>
>> No, changing the number of shards will not help with the time required
>> to optimize, and might make it slower.  Increasing the speed of the
>> disks won't help either.  Optimizing involves a lot more than just
>> copying data -- it will never use all the available disk bandwidth of
>> modern disks.  SolrCloud optimizes the shard replicas making up
>> a full collection sequentially, not simultaneously.
>>
>> > - We are planning to reduce each shard index size to 30GB and the
>> entire
>> > 3.5 TB index will be distributed across more shards. In this case
>> to almost
>> > 70+ shards. Will this help?
>>
>> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
>> of shards without adding additional servers, I would expect things to
>> get worse, not better.
>>
>> > Kindly share your thoughts on how best we can use Solr with such a large
>> > index size.
>>
>> Something to keep in mind -- memory is the resource that makes the most
>> difference in performance.  Buying enough memory to get decent
>> performance out of an index that big would probably be very expensive.
>> You should probably explore ways to make your index smaller.  Another
>> idea is to split things up so the most frequently accessed search data
>> is in a relatively small index and lives on beefy servers, and data used
>> for less frequent or data-mining queries (where performance doesn't
>> matter as much) can live on less expensive servers.
>>
>> Thanks,
>> Shawn
>>
>


Re: Need help on handling large size of index.

2020-05-21 Thread Modassar Ather
Thanks Phill for your response.

Optimal Index size: Depends on what you are optimizing for. Query Speed?
Hardware utilization?
We are optimising for query speed. My understanding is that even if we set the
merge policy to any number, the same amount of disk will still be required
for the bigger segment merges. Please correct me if I am wrong.

Optimizing the index is something I never do. We live with about 28%
deletes. You should check your configuration for your merge policy.
About 10-20% of our updates are deletes. We have no merge policy
set in the configuration as we do a full optimisation after the indexing.
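For reference, an explicit merge policy would go in the <indexConfig> section
of solrconfig.xml. A minimal sketch (the values shown are the usual defaults,
not settings we actually use):

    <indexConfig>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <int name="maxMergeAtOnce">10</int>
        <int name="segmentsPerTier">10</int>
      </mergePolicyFactory>
    </indexConfig>

Note that no merge policy setting avoids the transient disk space needed while
large segments are merged.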

Increased sharding has helped reduce query response time, but surely there
is a point where the collation of results starts to be the bottleneck.
The query response time is my concern. I understand the aggregation of
results may increase the search response time.

*What does your schema look like? I index around 120 fields per document.*
The schema has a combination of text and string fields. None of the fields
except the Id field are stored. We also have around 120 fields. A few of them
have docValues enabled.

*What does your queries look like? Mine are so varied that caching never
helps, the same query rarely comes through.*
Our search queries are a combination of proximity, nested proximity and
wildcard terms most of the time. A query can be very complex, with 100s of
wildcard and proximity terms in it. Different grouping options are also
enabled on the search results. And the search queries vary a lot.

Oh, another thing, are you concerned about availability? Do you have a
replication factor > 1? Do you run those replicas in a different region for
safety?
How many zookeepers are you running and where are they?
As of now we do not have any replication factor. We are not using a zookeeper
ensemble but would like to move to one soon.
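For reference, pointing Solr at an external ensemble is mostly a matter of the
connection string at startup. A sketch (hostnames and the /solr chroot are
placeholders):

    bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181/solr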

Best,
Modassar

On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:

> On 5/20/2020 11:43 AM, Modassar Ather wrote:
> > Can you please help me with the following few questions?
> >
> > - What is the ideal index size per shard?
>
> We have no way of knowing that.  A size that works well for one index
> use case may not work well for another, even if the index size in both
> cases is identical.  Determining the ideal shard size requires
> experimentation.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> > - The optimisation takes a lot of time and IOPs to complete. Will
> > increasing the number of shards help in reducing the optimisation
> time and
> > IOPs?
>
> No, changing the number of shards will not help with the time required
> to optimize, and might make it slower.  Increasing the speed of the
> disks won't help either.  Optimizing involves a lot more than just
> copying data -- it will never use all the available disk bandwidth of
> modern disks.  SolrCloud optimizes the shard replicas making up
> a full collection sequentially, not simultaneously.
>
> > - We are planning to reduce each shard index size to 30GB and the
> entire
> > 3.5 TB index will be distributed across more shards. In this case to
> almost
> > 70+ shards. Will this help?
>
> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
> of shards without adding additional servers, I would expect things to
> get worse, not better.
>
> > Kindly share your thoughts on how best we can use Solr with such a large
> > index size.
>
> Something to keep in mind -- memory is the resource that makes the most
> difference in performance.  Buying enough memory to get decent
> performance out of an index that big would probably be very expensive.
> You should probably explore ways to make your index smaller.  Another
> idea is to split things up so the most frequently accessed search data
> is in a relatively small index and lives on beefy servers, and data used
> for less frequent or data-mining queries (where performance doesn't
> matter as much) can live on less expensive servers.
>
> Thanks,
> Shawn
>


Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

2020-05-21 Thread Jason Gerlowski
Hi Jay,

I can't speak to why you're seeing a performance change between 6.x
and 8.x.  What I can suggest though is an alternative way of
formulating the query: you might get different performance if you run
your query using Solr's "terms" query parser:
https://lucene.apache.org/solr/guide/8_5/other-parsers.html#terms-query-parser
 It's not guaranteed to help, but there's a chance it'll work for you.
And knowing whether or not it helps might point others here towards
the cause of your slowdown.
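For illustration, a query with a long list of ids that would otherwise become
a huge boolean query can be handed to the terms parser roughly like this (a
sketch; the field name and values are hypothetical):

    q={!terms f=id}101,102,103,104&group=true&group.field=category

The terms parser builds a single constant-score set query instead of thousands
of scored clauses, which is why its performance profile can differ so much.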

Even if "terms" performs better for you, it's probably worth
understanding what's going on here of course.

Are all other queries running comparably?

Jason

On Thu, May 21, 2020 at 10:25 AM jay harkhani  wrote:
>
> Hello,
>
> Please refer below details.
>
> >Did you create Solrconfig.xml for the collection from scratch after 
> >upgrading and reindexing?
> Yes, we have created the collection from scratch and also re-indexed.
>
> >Was it based on the latest template?
> Yes, It was as per latest template.
>
> >What happens if you reexecute the query?
> No visible difference; only a minor change in milliseconds.
>
> >Are there other processes/containers running on the same VM?
> No
>
> >How much heap and how much total memory you have?
> My heap and total memory are the same as with Solr 6.1.0: 5 GB heap and 25 GB
> total memory. As far as I can tell there is no issue related to memory.
>
> >Maybe also you need to increase the corresponding caches in the config.
> We are not using caches in either version.
>
> Both versions have the same configuration.
>
> Regards,
> Jay Harkhani.
>
> 
> From: Jörn Franke 
> Sent: Thursday, May 21, 2020 7:05 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version
>
> Did you create Solrconfig.xml for the collection from scratch after upgrading 
> and reindexing? Was it based on the latest template?
> If not then please try this. Maybe also you need to increase the 
> corresponding caches in the config.
>
> What happens if you reexecute the query?
>
> Are there other processes/containers running on the same VM?
>
> How much heap and how much total memory you have? You should only have a 
> minor fraction of the memory as heap and most of it „free“ (this means it is 
> used for file caches).
>
>
>
> > On 21.05.2020 at 15:24, vishal patel wrote:
> >
> > > Is anyone looking into this issue?
> > > I am seeing the same issue.
> >
> > Regards,
> > Vishal Patel
> >
> >
> >
> > 
> > From: jay harkhani 
> > Sent: Wednesday, May 20, 2020 7:39 PM
> > To: solr-user@lucene.apache.org 
> > Subject: Query takes more time in Solr 8.5.1 compare to 6.1.0 version
> >
> > Hello,
> >
> > I recently upgraded Solr from version 6.1.0 to 8.5.1 and came across one
> > issue. A query which has many ids (around 3000) and grouping applied
> > takes more time to execute. In Solr 6.1.0 it takes 677ms and in Solr 8.5.1
> > it takes 26090ms. While taking readings we had the same Solr schema and the
> > same no. of records in both Solr versions.
> >
> > Please refer to the details below for the query, logs and thread dump
> > (generated from Solr Admin while executing the query).
> >
> > Query : 
> > https://drive.google.com/file/d/1bavCqwHfJxoKHFzdOEt-mSG8N0fCHE-w/view
> >
> > Logs and Thread dump stack trace
> > Solr 8.5.1 : 
> > https://drive.google.com/file/d/149IgaMdLomTjkngKHrwd80OSEa1eJbBF/view
> > Solr 6.1.0 : 
> > https://drive.google.com/file/d/13v1u__fM8nHfyvA0Mnj30IhdffW6xhwQ/view
> >
> > To analyse further, we found that if we remove the grouping field or
> > reduce the no. of ids in the query, it executes fast. Has anything changed
> > in the 8.5.1 version compared to 6.1.0, as in 6.1.0 even a large no. of ids
> > along with grouping works faster?
> >
> > Can someone please help isolate this issue?
> >
> > Regards,
> > Jay Harkhani.


Is it possible to direct queries to replicas in SolrCloud

2020-05-21 Thread Pushkar Raste
Hi,
In master/slave we can send queries to the slaves only. Now that we have tlog
and pull replicas, can we send queries to those replicas to achieve scaling
similar to master/slave for large search volumes?


-- 
— Pushkar Raste


Re: Why Did It Match?

2020-05-21 Thread Doug Turnbull
Is your concern that the Solr explain functionality is slower than Endeca's?
Or harder to understand/interpret?

If the latter, I might recommend http://splainer.io as one solution
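For reference, the debug option being discussed is usually invoked like this
(a sketch; collection, field and query are hypothetical):

    /solr/mycollection/select?q=name:lisa&debug=results&debug.explain.structured=true

debug=results restricts the extra work to the per-document explain output,
which is cheaper than debug=all but still far from free.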

On Thu, May 21, 2020 at 4:52 PM Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> My company is working on a new website. The old/current site is powered by
> Endeca. The site under development is powered by Solr (currently 7.7.2)
>
> Out of the box, Endeca provides the capability to show how a query was
> matched in the search. The business users like this functionality; in Solr
> this functionality is an expensive debug option. Is there another way to
> get this information from a query?
>
> Webster Homer
>
>


-- 
Doug Turnbull | CTO | OpenSource Connections, LLC | 240.476.9983
Author: Relevant Search; Contributor: AI Powered Search


Re: Is it possible to direct queries to replicas in SolrCloud

2020-05-21 Thread Erick Erickson
https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
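In 7.4 and later, the shards.preference parameter described there can steer
queries to particular replica types. A sketch (collection name is a
placeholder):

    /solr/mycollection/select?q=*:*&shards.preference=replica.type:PULL,replica.type:TLOG

With TLOG leaders doing the indexing and PULL replicas serving queries, this
gets you the closest SolrCloud analogue of the master/slave query split.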

> On May 21, 2020, at 5:40 PM, Pushkar Raste  wrote:
> 
> Hi,
> In master/slave we can send queries to the slaves only. Now that we have tlog
> and pull replicas, can we send queries to those replicas to achieve scaling
> similar to master/slave for large search volumes?
> 
> 
> -- 
> — Pushkar Raste



Re: TimestampUpdateProcessorFactory updates the field even if the value is present

2020-05-21 Thread Furkan KAMACI
Hi,

How do you index that document? Do you index it with an empty
*index_time_stamp_create* field the second time too?

Kind Regards,
Furkan KAMACI

On Fri, May 22, 2020 at 12:05 AM gnandre  wrote:

> Hi,
>
> Following is the update request processor chain.
>
> <updateRequestProcessorChain name="..." default="true">
>   <processor class="solr.TimestampUpdateProcessorFactory">
>     <str name="fieldName">index_time_stamp_create</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> And, here is how the field is defined in schema.xml
>
> <field name="index_time_stamp_create" ... stored="true" />
>
> Every time I index the same document, above field changes its value with
> latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
> page, if a document does not contain a value in the timestamp field, a new
> Date will be generated and added as the value of that field. After the
> first indexing this document should always have a value, so why then it
> gets updated later?
>
> I am using Solr Admin UI's Documents tab to index the document for testing.
> I am using Solr 6.3 in master-slave architecture mode.
>


Require java 8 upgrade

2020-05-21 Thread Akhila John
Hi Team,

We use solr 5.3.1 for sitecore 8.2.
We require to upgrade Java version to 'Java 8 Update 251' and remove / Upgrade 
Wireshark to 3.2.3 in our application servers.
Could you please advise if this would have any impact on Solr. Does Solr
5.3.1 support Java 8?

Thanks and regards,

Akhila



Why Did It Match?

2020-05-21 Thread Webster Homer
My company is working on a new website. The old/current site is powered by 
Endeca. The site under development is powered by Solr (currently 7.7.2)

Out of the box, Endeca provides the capability to show how a query was matched
in the search. The business users like this functionality; in Solr this
functionality is an expensive debug option. Is there another way to get this
information from a query?

Webster Homer





Re: TimestampUpdateProcessorFactory updates the field even if the value is present

2020-05-21 Thread gnandre
Hi,

I do not pass that field at all.

Here is the document that I index again and again to test through Solr
Admin UI.
{
asset_id:"x:1",
title:"x"
}

On Thu, May 21, 2020 at 5:25 PM Furkan KAMACI 
wrote:

> Hi,
>
> How do you index that document? Do you index it with an empty
> *index_time_stamp_create* field the second time too?
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, May 22, 2020 at 12:05 AM gnandre  wrote:
>
> > Hi,
> >
> > Following is the update request processor chain.
> >
> > <updateRequestProcessorChain name="..." default="true">
> >   <processor class="solr.TimestampUpdateProcessorFactory">
> >     <str name="fieldName">index_time_stamp_create</str>
> >   </processor>
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> > And, here is how the field is defined in schema.xml
> >
> > <field name="index_time_stamp_create" ... stored="true" />
> >
> > Every time I index the same document, above field changes its value with
> > latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
> > page, if a document does not contain a value in the timestamp field, a
> new
> > Date will be generated and added as the value of that field. After the
> > first indexing this document should always have a value, so why then it
> > gets updated later?
> >
> > I am using Solr Admin UI's Documents tab to index the document for
> testing.
> > I am using Solr 6.3 in master-slave architecture mode.
> >
>


Re: TimestampUpdateProcessorFactory updates the field even if the value is present

2020-05-21 Thread Furkan KAMACI
Hi,

Do you have an id field for your documents? On the other hand, does your
document count increase when you index it again?

Kind Regards,
Furkan KAMACI

On Fri, May 22, 2020 at 1:03 AM gnandre  wrote:

> Hi,
>
> I do not pass that field at all.
>
> Here is the document that I index again and again to test through Solr
> Admin UI.
> {
> asset_id:"x:1",
> title:"x"
> }
>
> On Thu, May 21, 2020 at 5:25 PM Furkan KAMACI 
> wrote:
>
> > Hi,
> >
> > How do you index that document? Do you index it with an empty
> > *index_time_stamp_create* field the second time too?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Fri, May 22, 2020 at 12:05 AM gnandre 
> wrote:
> >
> > > Hi,
> > >
> > > Following is the update request processor chain.
> > >
> > > <updateRequestProcessorChain name="..." default="true">
> > >   <processor class="solr.TimestampUpdateProcessorFactory">
> > >     <str name="fieldName">index_time_stamp_create</str>
> > >   </processor>
> > >   <processor class="solr.LogUpdateProcessorFactory" />
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > >
> > > And, here is how the field is defined in schema.xml
> > >
> > > <field name="index_time_stamp_create" ... stored="true" />
> > >
> > > Every time I index the same document, above field changes its value
> with
> > > latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
> > > page, if a document does not contain a value in the timestamp field, a
> > new
> > > Date will be generated and added as the value of that field. After the
> > > first indexing this document should always have a value, so why then it
> > > gets updated later?
> > >
> > > I am using Solr Admin UI's Documents tab to index the document for
> > testing.
> > > I am using Solr 6.3 in master-slave architecture mode.
> > >
> >
>


Re: Require java 8 upgrade

2020-05-21 Thread Furkan KAMACI
Hi Akhila,

Here is the related documentation:
https://lucene.apache.org/solr/5_3_1/SYSTEM_REQUIREMENTS.html which says:

"Apache Solr runs of Java 7 or greater, Java 8 is verified to be compatible
and may bring some performance improvements. When using Oracle Java 7 or
OpenJDK 7, be sure to not use the GA build 147 or update versions u40, u45
and u51! We recommend using u55 or later."

Kind Regards,
Furkan KAMACI

On Fri, May 22, 2020 at 4:26 AM Akhila John  wrote:

> Hi Team,
>
> We use solr 5.3.1 for sitecore 8.2.
> We require to upgrade Java version to 'Java 8 Update 251' and remove /
> Upgrade Wireshark to 3.2.3 in our application servers.
> Could you please advise if this would have any impact on the solr. Does
> solr 5.3.1 support Java 8.
>
> Thanks and regards,
>
> Akhila
>
> Bupa A email disclaimer: The information contained in this email and
> any attachments is confidential and may be subject to copyright or other
> intellectual property protection. If you are not the intended recipient,
> you are not authorized to use or disclose this information, and we request
> that you notify us by reply mail or telephone and delete the original
> message from your mail system.
>


Re: json faceting - Terms faceting and EnumField

2020-05-21 Thread Ponnuswamy, Poornima (GE Healthcare)
Can anyone shed some light on the issue I am having? Thanks!

On 5/20/20, 4:55 PM, "Ponnuswamy, Poornima (GE Healthcare)" 
 wrote:

Hello,

We have solr 6.6 version.
Below is the field and field type that are defined in the solr schema
(reconstructed here from the error message below; the elided attributes were
not preserved in the archive).

<field name="ServiceRequestTypeCode" type="ServiceRequestTypeCode" indexed="true" stored="true" />
<fieldType name="ServiceRequestTypeCode" class="solr.EnumField" enumsConfig="..." enumName="..." />

Below is the configuration for the enum

<enum name="...">
  <value>servicerequestcorrective</value>
  <value>servicerequestplanned</value>
  <value>servicerequestinstallationandupgrade</value>
  <value>servicerequestrecall</value>
  <value>servicerequestother</value>
  <value>servicerequestinquiry</value>
  <value>servicerequestproactive</value>
  <value>servicerequestsystemupdate</value>
  <value>servicerequesticenteradmin</value>
  <value>servicerequestonwatch</value>
  <value>servicerequestfmi</value>
  <value>servicerequestapplication</value>
</enum>

When I try to invoke using the below call, I am getting an error:

http://localhost:8983/solr/activity01us/select?json.facet={ServiceRequestTypeCode:{type:terms, field:ServiceRequestTypeCode, limit:10}}&facet=on&indent=on&wt=json&q=*

"Expected numeric field type :ServiceRequestTypeCode{type=ServiceRequestTypeCode,properties=indexed,stored,omitNorms,omitTermFreqAndPositions}"

But when I try to do as below it works fine.


http://localhost:8983/solr/activity01us/select?facet.field=ServiceRequestTypeCode&facet=on&indent=on&q=*:*&wt=json

I would like to use json facet as it would help me in subfaceting.

Any help would be appreciated


Thanks,
Poornima





TimestampUpdateProcessorFactory updates the field even if the value is present

2020-05-21 Thread gnandre
Hi,

Following is the update request processor chain.

<updateRequestProcessorChain name="..." default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">index_time_stamp_create</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

And, here is how the field is defined in schema.xml

<field name="index_time_stamp_create" ... stored="true" />

Every time I index the same document, above field changes its value with
latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
page, if a document does not contain a value in the timestamp field, a new
Date will be generated and added as the value of that field. After the
first indexing this document should always have a value, so why then it
gets updated later?

I am using Solr Admin UI's Documents tab to index the document for testing.
I am using Solr 6.3 in master-slave architecture mode.


Use Subquery Parameters to filter main query

2020-05-21 Thread rantonana
Hello, I need to do the following:
I have a main query that defines a subquery called group with "fields":
"*,group:[subquery]".
The group documents have a lot of fields, but I want to filter the main query
based on one of them.
For example:
[
  {
    "PID": 1,
    "type": "doc",
    "group": {"numFound": 1, "start": 0, "docs": [
      {"members": [1, 2, 3]}
    ]}
  },
  {
    "PID": 2,
    "type": "doc",
    "group": {"numFound": 1, "start": 0, "docs": [
      {"members": [4, 5, 6]}
    ]}
  }
]

In the example, I want to filter the documents of type "doc" to those whose
group's members field contains the value 6.

thanks

 



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Atomic update change value and field name

2020-05-21 Thread Jan Høydahl
Try adding -format solr to your bin/post command. By default the post command 
will treat input as arbitrary json, not solr-format json.
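For the files shown below, that would look something like this (a sketch
reusing the same port, collection and file name):

    /usr/local/solr/bin/post -p 8983 -c books -format solr test.json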

Jan Høydahl

> 21. mai 2020 kl. 02:50 skrev Hup Chen :
> 
> I am new to Solr. I tried to do an atomic update using a .json file.
> $SOLR/bin/post is not only changing field values; the field name also becomes
> "fieldname.set", for instance, "price" becomes "price.set". Updating via the
> curl /update handler was working well, but since I have several million
> records, I can't update by calling curl several million times; that would be
> extremely slow.
> 
> Any help will be appreciated.
> 
> 
># /usr/local/solr/bin/solr version
>8.5.1
> 
># curl http://localhost:8983/solr/books/select?q=id%3A0371558727
>"response":{"numFound":1,"start":0,"docs":[
>  {
>"id":"0371558727",
>"price":19.0,
>"_version_":1667214802265571328}]
>}
> 
># cat test.json
>[
>{"id":"0371558727",
> "price":{"set":19.95}
>}
>]
> 
># /usr/local/solr/bin/post -p 8983 -c books test.json
> 
># curl http://localhost:8983/solr/books/select?q=id%3A0371558727
>"response":{"numFound":1,"start":0,"docs":[
>  {
>"id":"0371558727",
>"price.set":[19.95],
>"_version_":1667214933776924672}]
>}
> 
> 


Re: Shingles behavior

2020-05-21 Thread Radu Gheorghe
Turns out, it’s down to setting enableGraphQueries=false in the field 
definition. I completely missed that :(
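For anyone hitting the same thing, the switch sits on the field type. A
minimal sketch (the type name and analysis chain are illustrative, not my
exact schema):

    <fieldType name="text_shingles" class="solr.TextField" enableGraphQueries="false" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

With enableGraphQueries left at its default of true, the overlapping shingle
positions are treated as a token graph and every path must match, which is
what produced the all-MUST parsed query quoted below.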

> On 21 May 2020, at 07:49, Radu Gheorghe  wrote:
> 
> Hi Alex, long time no see :)
> 
> I tried with sow, and that basically invalidates query-time shingles (it only
> matches mona OR lisa OR smile).
> 
> I'm using shingles at both index and query time as a substitute for pf2 and 
> pf3: the more shingles I match, the more relevant the document. Also, higher 
> order shingles naturally get lower frequencies, meaning they get a "natural" 
> boost.
> 
> Best regards,
> Radu
> 
> On Thu, 21 May 2020, 00:28 Alexandre Rafalovitch wrote:
> Did you try it with 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just indexing one. But at least it is something to try and
> is one of the difference areas between Solr and ES.
> 
> Regards,
>Alex.
> 
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe  
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed 
> > looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > <filter class="solr.ShingleFilterFactory" ... maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
> > documents back, in that order. Because the first document matches all the 
> > terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > And the second one matches only some, and the third document only matches 
> > one.
> >
> > Instead, I only get the first document back. That’s because the query 
> > expects all the “words” to match:
> >
> > > "parsedquery":"+DisjunctionMaxQuery+shingle_field:mona 
> > > +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> > > +shingle_field:lisa smile) (+shingle_field:mona lisa 
> > > +shingle_field:smile) shingle_field:mona lisa smile)))”,
> >
> > The query above is generated by the Edismax query parser, when I’m using 
> > “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the 
> > options I can think of:
> > - different query parsers
> > - q.OP=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
> > default, and minimum_should_match works as expected. The only difference I 
> > see between the two, on the analysis side, is that tokens start at 0 in 
> > Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see 
> > that the default “text_en”, for example, also starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is 
> > there a workaround?
> >
> > Thanks and best regards,
> > Radu



Re: How to restore deleted collection from filesystem

2020-05-21 Thread Erick Erickson
See inline.

> On May 21, 2020, at 10:13 AM, Kommu, Vinodh K.  wrote:
> 
> Thanks Eric for quick response.
> 
> Yes, our VMs are equipped with NetBackup which is like file based backup and 
> it can restore any files or directories that were deleted from latest 
> available full backup.
> 
> Can we create an empty collection with the same name which was deleted with 
> same number of shared & replicas and copy the content from restored core to 
> corresponding core?

Kind of. It is NOT necessary that it has the same name. There is no need at all 
(and I do NOT recommend) that you create the same number of replicas to start. 
As I said earlier, create a single-replica (i.e. leader-only) collection with 
the same number of shards. Copy _one_ data dir (not everything under core) to 
that _one_ corresponding replica. It doesn’t matter which replica you copy from

> I mean, copy all contents (directories & files) under 
> Oldcollection_shard1_replica1 core from old collection to corresponding 
> Newcollection_shard1_replica1 core in new collection. Would this approach 
> will work?
> 

As above, do not do this. Just copy the data dir from one of your backup copies 
to the leader-only replica. It doesn’t matter at all if the replica names are 
the same. The only thing that matters is that the shard number is identical. 
For instance, copy blah/blah/collection1_shard1_replica_57/data to
blah/blah/collection1_shard1_replica_1/data if you want.

Once you have a one-replica collection with the data in it and you’ve done a 
bit of verification, use ADDREPLICA to build it out.
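For reference, building it out is one Collections API call per added replica.
A sketch (host, collection and shard names are placeholders):

    http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1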

> Lastly anything needs to be aware in core.properties in newly created 
> collection or any reference pointing to new collection specific?

Do not copy or touch  core.properties, you can mess this up thoroughly by 
hand-editing. The _only_ thing you copy is the data directory, which will 
contain a tlog and index directory. And, the tlog isn’t even necessary.

Best,
Erick

> 
> 
> Thanks & Regards,
> Vinodh
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Thursday, May 21, 2020 6:17 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to restore deleted collection from filesystem
> 
> 
> So what I’m reading here is that you have the _data_ saved somewhere, right? 
> By “data” I just mean the data directories under the replica.
> 
> 1> Go ahead and recreate the collection. It _must_ have the same number of 
> shards. Make it leader-only, i.e. replicationFactor == 1
> 2> The collection will be empty, now shut down the Solr instances hosting any 
> of the replicas.
> 3> Replace the data directory under each replica with the corresponding one 
> from the backup. “Corresponding” means from the same shard, which should be 
> obvious from the replica name.
> 4> Start your Solr instances back up and verify it’s as you expect.
> 5> Use ADDREPLICA to build out your collection to have as many replicas of 
> each shard as you require. NOTE: I’d do this gradually, maybe 2-3 at a time 
> then wait for them to become active before adding more. The point here is 
> that each ADDREPLICA will pull the entire index down from the leader, and
> with that many documents you don’t want to saturate your network.
> 
> Best,
> Erick
> 
>> On May 21, 2020, at 8:17 AM, Kommu, Vinodh K.  wrote:
>> 
>> Hi,
>> 
>> One of our largest collections, which holds 3.2 billion docs, was accidentally
>> deleted in a QA environment. Unfortunately we don't have a recent Solr backup
>> for this collection to restore from. The only option left for us is to
>> restore the deleted replica directories under the data directory using the
>> netbackup restore process.
>> 
>> We haven't done this kind of restore before, so the following things are not clear:
>> 
>> 1. As the collection was deleted (and not created again yet), if the necessary
>> replica directories and files are restored to the same location, will the
>> collection work without creating it again?
>> 2. If the above option doesn't work, we obviously have to create the
>> collection, but the replica names and placement may not be the same as the
>> deleted collection's replica names and placements (we create collections
>> using rule-based replica placement), so in this case what needs to be done to
>> restore the collection smoothly? Or are there any predefined steps available
>> to handle this kind of scenario? Any suggestions are greatly appreciated.
>> 
>> 
>> Thanks & Regards,
>> Vinodh
>> 

+(-...) vs +(*:* -...) vs -(+...)

2020-05-21 Thread Jochen Barth

Dear reader,

why does +(-x_ss:y) find 0 docs,

while -(+x_ss:y) finds many docs?

Ok... +(*:* -x_ss:y) works, too, but I'm a bit surprised.

Kind regards, J. Barth



Re: Need help on handling large size of index.

2020-05-21 Thread Phill Campbell
The optimal size for a shard of the index is by definition whatever works best
on the hardware with the JVM heap that is in use.
More shards mean a smaller index per shard, as you already know.

I spent months changing the sharding, the JVM heap and the GC values before
taking the system live.
RAM is important, and I run with enough to allow Solr to load the entire index
into RAM. From my understanding Solr uses the system page cache to memory-map
the index files. I might be wrong.
I experimented with less RAM and SSD drives and found that was another way to
get the performance I needed. Since RAM is cheaper, I chose that approach.

Again we never optimize. When we have to recover we rebuild the index by 
spinning up new machines and use a massive EMR (Map reduce job) to force the 
data into the system. Takes about 3 hours. Solr can ingest data at an amazing 
rate. Then we do a blue/green switch over.

Query time, from my experience with my environment, is improved with more 
sharding and additional hardware. Not just more sharding on the same hardware.

My fields are not stored either, except ID. There are some fields that are 
indexed and have DocValues and those are used for sorting and facets. My 
queries can have any number of wildcards as well, but my field’s data lengths 
are maybe a maximum of 100 characters so proximity searching is not too bad. I 
tokenize and index everything. I do not expand terms at query time to get 
broader results, I index the alternatives and let the indexer do what it does 
best.

If you are running in SolrCloud mode and you are using the embedded zookeeper I 
would change that. Solr and ZK are very chatty with each other, run ZK on 
machines in proximity to Solr.

Regards

> On May 21, 2020, at 2:46 AM, Modassar Ather  wrote:
> 
> Thanks Phill for your response.
> 
> Optimal Index size: Depends on what you are optimizing for. Query Speed?
> Hardware utilization?
> We are optimising for query speed. My understanding is that even if we set the
> merge policy to any number, the same amount of disk will still be required
> for the bigger segment merges. Please correct me if I am wrong.
> 
> Optimizing the index is something I never do. We live with about 28%
> deletes. You should check your configuration for your merge policy.
> About 10-20% of our updates are deletes. We have no merge policy
> set in the configuration as we do a full optimisation after the indexing.
> 
> Increased sharding has helped reduce query response time, but surely there
> is a point where the collation of results starts to be the bottleneck.
> The query response time is my concern. I understand the aggregation of
> results may increase the search response time.
> 
> *What does your schema look like? I index around 120 fields per document.*
> The schema has a combination of text and string fields. None of the fields
> except the Id field are stored. We also have around 120 fields. A few of them
> have docValues enabled.
> 
> *What does your queries look like? Mine are so varied that caching never
> helps, the same query rarely comes through.*
> Our search queries are a combination of proximity, nested proximity and
> wildcard terms most of the time. A query can be very complex, with 100s of
> wildcard and proximity terms in it. Different grouping options are also
> enabled on the search results. And the search queries vary a lot.
> 
> Oh, another thing, are you concerned about availability? Do you have a
> replication factor > 1? Do you run those replicas in a different region for
> safety?
> How many zookeepers are you running and where are they?
> As of now we do not have any replication factor. We are not using zookeeper
> ensemble but would like to move to it sooner.
> 
> Best,
> Modassar
> 
> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
> 
>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>> Can you please help me with the following few questions?
>>> 
>>>- What is the ideal index size per shard?
>> 
>> We have no way of knowing that.  A size that works well for one index
>> use case may not work well for another, even if the index size in both
>> cases is identical.  Determining the ideal shard size requires
>> experimentation.
>> 
>> 
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> 
>>>- The optimisation takes a lot of time and IOPs to complete. Will
>>>increasing the number of shards help in reducing the optimisation
>> time and
>>>IOPs?
>> 
>> No, changing the number of shards will not help with the time required
>> to optimize, and might make it slower.  Increasing the speed of the
>> disks won't help either.  Optimizing involves a lot more than just
>> copying data -- it will never use all the available disk bandwidth of
>> modern disks.  SolrCloud optimizes the shard replicas making up
>> a full collection sequentially, not simultaneously.
>> 
>>>- We are planning to reduce each shard index size to 30GB and the
>> 

Re: +(-...) vs +(*:* -...) vs -(+...)

2020-05-21 Thread Houston Putman
Jochen,

For the standard query parser, pure negative queries (no positive query in
front of it, such as "*:*") are only allowed as a top level clause, so not
nested within parenthesis.

Check the second bullet point of this section of the Ref Guide page for
the Standard Query Parser.


For the edismax query parser, pure negative queries are allowed to be
nested within parenthesis. Docs can be found in the Ref Guide page for the
eDismax Query Parser.
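For example, a nested pure negative such as the one from the original
question should parse under edismax (a sketch):

    q=%2B(-x_ss:y)&defType=edismax

(the leading + is URL-encoded as %2B).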


- Houston

On Thu, May 21, 2020 at 2:25 PM Jochen Barth 
wrote:

> Dear reader,
>
> why does +(-x_ss:y) find 0 docs,
>
> while -(+x_ss:y) finds many docs?
>
> Ok... +(*:* -x_ss:y) works, too, but I'm a bit surprised.
>
> Kind regards, J. Barth
>
>


Re: +(-...) vs +(*:* -...) vs -(+...)

2020-05-21 Thread Shawn Heisey

On 5/21/2020 12:25 PM, Jochen Barth wrote:

why does +(-x_ss:y) find 0 docs,

while -(+x_ss:y) finds many docs?

Ok... +(*:* -x_ss:y) works, too, but I'm a bit surprised.


Purely negative queries, if that is what ultimately makes it to Lucene, 
do not work.


The basic problem is that if you start with nothing and then subtract 
something, you get nothing.


When a purely negative query that's VERY simple is provided, Solr is 
able to detect the situation and implicitly add a starting point of all 
docs.  You'll probably find that the following query (which is missing 
the parentheses compared to your first example) will work because Solr 
is capable of detecting and fixing the problem itself:


-x_ss:y

With parentheses the query is too complex for the detection I described 
to work, and the constructed Lucene query remains purely negative.


Your third example is the correct way to construct a purely negative 
query so that it is guaranteed to work.


We have created a wiki page about this:

https://cwiki.apache.org/confluence/display/SOLR/NegativeQueryProblems

Thanks,
Shawn


Re: Does Solr master/slave support shard split

2020-05-21 Thread Erick Erickson
In a word, “no”. It’s a whole ’nother architecture to deal
with shards, and stand-alone (i.e. master/slave) has no
concept of that.

You could make a single-shard collection in SolrCloud,
copy the index to the right place (I’d shut down Solr while
I copied it), and then use SPLITSHARD on it, but that implies
you’d be going to SolrCloud.
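For reference, the split itself is a single Collections API call, roughly
(host and collection names are placeholders):

    http://host:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

SPLITSHARD also takes an optional ranges parameter to control how the hash
range is divided, but the document-to-shard mapping stays hash-based; there is
no arbitrary mapping by unique key.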

Best,
Erick

> On May 21, 2020, at 10:35 AM, Pushkar Raste  wrote:
> 
> Hi,
> Does Solr support shard split in the master/slave setup? I understand that
> there is no shard concept in master/slave and we just have cores, but can we
> split a core into two?
> 
> If yes, is there a way to specify the new mapping based on the unique key?
> -- 
> — Pushkar Raste



Re: Unbalanced shard requests

2020-05-21 Thread Phill Campbell
Yes, JVM heap settings.

> On May 19, 2020, at 10:59 AM, Wei  wrote:
> 
> Hi Phill,
> 
> What is the RAM config you are referring to, JVM size? How is that related
> to the load balancing, if each node has the same configuration?
> 
> Thanks,
> Wei
> 
> On Mon, May 18, 2020 at 3:07 PM Phill Campbell
>  wrote:
> 
>> In my previous report I had configured Solr to use as much RAM as possible.
>> With that configuration it seemed it was not load balancing.
>> So, I reconfigured and redeployed to use 1/4 of the RAM. What a difference
>> for the better!
>> 
>> 10.156.112.50   load average: 13.52, 10.56, 6.46
>> 10.156.116.34   load average: 11.23, 12.35, 9.63
>> 10.156.122.13   load average: 10.29, 12.40, 9.69
>> 
>> Very nice.
>> My testing tool records RPS. In the “bad” configuration it was less
>> than 1 RPS.
>> NOW it is showing 21 RPS.
>> 
>> 
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>"status":0,
>>"QTime":161},
>>  "metrics":{
>>"solr.core.BTS.shard1.replica_n2":{
>>  "QUERY./select.requestTimes":{
>>"count":5723,
>>"meanRate":6.8163888639859085,
>>"1minRate":11.557013215119536,
>>"5minRate":8.760356217628159,
>>"15minRate":4.707624230995833,
>>"min_ms":0.131545,
>>"max_ms":388.710848,
>>"mean_ms":30.300492048215947,
>>"median_ms":6.336654,
>>"stddev_ms":51.527164088667035,
>>"p75_ms":35.427943,
>>"p95_ms":140.025957,
>>"p99_ms":230.533099,
>>"p999_ms":388.710848
>> 
>> 
>> 
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>"status":0,
>>"QTime":11},
>>  "metrics":{
>>"solr.core.BTS.shard2.replica_n8":{
>>  "QUERY./select.requestTimes":{
>>"count":6469,
>>"meanRate":7.502581801189549,
>>"1minRate":12.211423085368564,
>>"5minRate":9.445681397767322,
>>"15minRate":5.216209798637846,
>>"min_ms":0.154691,
>>"max_ms":701.657394,
>>"mean_ms":34.2734699171445,
>>"median_ms":5.640378,
>>"stddev_ms":62.27649205954566,
>>"p75_ms":39.016371,
>>"p95_ms":156.997982,
>>"p99_ms":288.883028,
>>"p999_ms":538.368031
>> 
>> 
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>"status":0,
>>"QTime":67},
>>  "metrics":{
>>"solr.core.BTS.shard3.replica_n16":{
>>  "QUERY./select.requestTimes":{
>>"count":7109,
>>"meanRate":7.787524673806184,
>>"1minRate":11.88519763582083,
>>"5minRate":9.893315557386755,
>>"15minRate":5.620178363676527,
>>"min_ms":0.150887,
>>"max_ms":472.826462,
>>"mean_ms":32.184282366621204,
>>"median_ms":6.977733,
>>"stddev_ms":55.729908615189196,
>>"p75_ms":36.655011,
>>"p95_ms":151.12627,
>>"p99_ms":251.440162,
>>"p999_ms":472.826462
>> 
>> 
>> Compare that to the previous report and you can see the improvement.
>> So, note to myself: figure out the sweet spot for RAM usage. Use too much
>> and strange behavior appears. While using too much, all the load focused
>> on one box and query times slowed.
>> I did not see any OOM errors during any of this.
>> 
>> Regards
>> 
>> 
>> 
>>> On May 18, 2020, at 3:23 PM, Phill Campbell
>>  wrote:
>>> 
>>> I have been testing 8.5.2 and it looks like the load has moved but is
>> still on one machine.
>>> 
>>> Setup:
>>> 3 physical machines.
>>> Each machine hosts 8 instances of Solr.
>>> Each instance of Solr hosts one replica.
>>> 
>>> Another way to say it:
>>> Number of shards = 8. Replication factor = 3.
>>> 
>>> Here is the cluster state. You can see that the leaders are well
>> distributed.
>>> 
>>> {"TEST_COLLECTION":{
>>>   "pullReplicas":"0",
>>>   "replicationFactor":"3",
>>>   "shards":{
>>> "shard1":{
>>>   "range":"8000-9fff",
>>>   "state":"active",
>>>   "replicas":{
>>> "core_node3":{
>>>   "core":"TEST_COLLECTION_shard1_replica_n1",
>>>   "base_url":"http://10.156.122.13:10007/solr;,
>>>   "node_name":"10.156.122.13:10007_solr",
>>>   "state":"active",
>>>   "type":"NRT",
>>>   "force_set_state":"false"},
>>> "core_node5":{
>>>   "core":"TEST_COLLECTION_shard1_replica_n2",
>>>   "base_url":"http://10.156.112.50:10002/solr;,
>>>   "node_name":"10.156.112.50:10002_solr",
>>>   "state":"active",
>>>   "type":"NRT",
>>>   "force_set_state":"false",

Re: Does Solr master/slave support shard split

2020-05-21 Thread Pushkar Raste
Thanks Erick. Moving to SolrCloud for splitting is what I imagined too.

On Thu, May 21, 2020 at 1:28 PM Erick Erickson 
wrote:

> In a word, “no”. It’s a whole ’nother architecture to deal
> with shards, and stand-alone (i.e. master/slave) has no
> concept of that.
>
> You could make a single-shard collection in SolrCloud,
> copy the index to the right place (I’d shut down Solr while
> I copied it), and then use SPLITSHARD on it, but that implies
> you’d be going to SolrCloud.
>
> Best,
> Erick
>
> > On May 21, 2020, at 10:35 AM, Pushkar Raste 
> wrote:
> >
> > Hi,
> > Does Solr support shard split in the master/slave setup? I understand that
> > there is no shard concept in master/slave and we just have cores, but can
> > we split a core into two?
> >
> > If yes, is there a way to specify the new mapping based on the unique key?
> > --
> > — Pushkar Raste
>
> --
— Pushkar Raste


Re: How to restore deleted collection from filesystem

2020-05-21 Thread Erick Erickson
So what I’m reading here is that you have the _data_ saved somewhere, right? By 
“data” I just mean the data directories under the replica.

1> Go ahead and recreate the collection. It _must_ have the same number of 
shards. Make it leader-only, i.e. replicationFactor == 1
2> The collection will be empty, now shut down the Solr instances hosting any 
of the replicas.
3> Replace the data directory under each replica with the corresponding one 
from the backup. “Corresponding” means from the same shard, which should be 
obvious from the replica name.
4> Start your Solr instances back up and verify it’s as you expect.
5> Use ADDREPLICA to build out your collection to have as many replicas of each 
shard as you require. NOTE: I’d do this gradually, maybe 2-3 at a time then 
wait for them to become active before adding more. The point here is that each 
ADDREPLICA will pull the entire index down from the leader, and with that
many documents you don’t want to saturate your network.

Best,
Erick

> On May 21, 2020, at 8:17 AM, Kommu, Vinodh K.  wrote:
> 
> Hi,
> 
> One of our largest collections, which holds 3.2 billion docs, was accidentally
> deleted in a QA environment. Unfortunately we don't have a recent Solr backup
> for this collection to restore from. The only option left for us is to
> restore the deleted replica directories under the data directory using the
> netbackup restore process.
> 
> We haven't done this kind of restore before, so the following things are not clear:
> 
> 1. As the collection was deleted (and not created again yet), if the necessary
> replica directories and files are restored to the same location, will the
> collection work without creating it again?
> 2. If the above option doesn't work, we obviously have to create the
> collection, but the replica names and placement may not be the same as the
> deleted collection's replica names and placements (we create collections
> using rule-based replica placement), so in this case what needs to be done to
> restore the collection smoothly? Or are there any predefined steps available
> to handle this kind of scenario? Any suggestions are greatly appreciated.
> 
> 
> Thanks & Regards,
> Vinodh
> 



Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

2020-05-21 Thread Jörn Franke
Did you create Solrconfig.xml for the collection from scratch after upgrading 
and reindexing? Was it based on the latest template?
If not then please try this. Maybe also you need to increase the corresponding 
caches in the config.

What happens if you reexecute the query?

Are there other processes/containers running on the same VM?

How much heap and how much total memory do you have? You should only have a
minor fraction of the memory as heap and most of it "free" (this means it is
used for file caches).
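For example, on a machine with 25 GB of total memory one might keep the heap
modest and leave the rest to the OS page cache. A sketch using the standard
start script (the heap size is illustrative):

    bin/solr start -m 5g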



> On 21.05.2020 at 15:24, vishal patel wrote:
> 
> Is anyone looking into this issue?
> I am seeing the same issue.
> 
> Regards,
> Vishal Patel
> 
> 
> 
> 
> From: jay harkhani 
> Sent: Wednesday, May 20, 2020 7:39 PM
> To: solr-user@lucene.apache.org 
> Subject: Query takes more time in Solr 8.5.1 compare to 6.1.0 version
> 
> Hello,
> 
> I recently upgraded Solr from version 6.1.0 to 8.5.1 and came across one
> issue. A query which has many ids (around 3000) and grouping applied takes
> more time to execute. In Solr 6.1.0 it takes 677ms and in Solr 8.5.1 it takes
> 26090ms. While taking readings we had the same Solr schema and the same no.
> of records in both Solr versions.
> 
> Please refer to the details below for the query, logs and thread dump
> (generated from Solr Admin while executing the query).
> 
> Query : https://drive.google.com/file/d/1bavCqwHfJxoKHFzdOEt-mSG8N0fCHE-w/view
> 
> Logs and Thread dump stack trace
> Solr 8.5.1 : 
> https://drive.google.com/file/d/149IgaMdLomTjkngKHrwd80OSEa1eJbBF/view
> Solr 6.1.0 : 
> https://drive.google.com/file/d/13v1u__fM8nHfyvA0Mnj30IhdffW6xhwQ/view
> 
> To analyse further, we found that if we remove the grouping field or reduce
> the no. of ids in the query, it executes fast. Has anything changed in the
> 8.5.1 version compared to 6.1.0, as in 6.1.0 even a large no. of ids along
> with grouping works faster?
> 
> Can someone please help isolate this issue?
> 
> Regards,
> Jay Harkhani.


How to restore deleted collection from filesystem

2020-05-21 Thread Kommu, Vinodh K.
Hi,

One of our largest collections, which holds 3.2 billion docs, was accidentally
deleted in a QA environment. Unfortunately we don't have a recent Solr backup
for this collection to restore from. The only option left for us is to restore
the deleted replica directories under the data directory using the netbackup
restore process.

We haven't done this kind of restore before, so the following things are not clear:

1. As the collection was deleted (and not created again yet), if the necessary
replica directories and files are restored to the same location, will the
collection work without creating it again?
2. If the above option doesn't work, we obviously have to create the collection,
but the replica names and placement may not be the same as the deleted
collection's replica names and placements (we create collections using
rule-based replica placement), so in this case what needs to be done to restore
the collection smoothly? Or are there any predefined steps available to handle
this kind of scenario? Any suggestions are greatly appreciated.


Thanks & Regards,
Vinodh



Re: Need help on handling large size of index.

2020-05-21 Thread Erick Erickson
Please consider _not_ optimizing. It’s kind of a misleading name anyway, and
depending on the version of Solr you’re using it may have unintended
consequences, see:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
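For reference, an explicit optimize (forced merge) is typically triggered like
this (a sketch; host, collection and the maxSegments value are placeholders):

    curl "http://host:8983/solr/mycollection/update?optimize=true&maxSegments=16"

The articles above explain why merging down to very few segments can backfire
on an index that keeps receiving updates.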

There are situations where optimizing makes sense, but far too often people 
think
it’s A Good Thing (based almost entirely on the name, who _wouldn’t_ want an
optimized index?) without measuring, leading to tons of work to no real benefit.

Best,
Erick

> On May 21, 2020, at 4:58 AM, Modassar Ather  wrote:
> 
> Thanks Shawn for your response.
> 
> We have seen a performance increase in optimisation with a bigger number of
> IOPs. Without the extra IOPs the optimisation took around 15-20 hours,
> whereas the same index took 5-6 hours to optimise with higher IOPs.
> That said, the extra IOPs were never fully used apart from a couple of
> spikes in usage, so I am not able to understand how the increased IOPs make
> so much of a difference.
> Can you please help me understand what optimising involves? Is it mostly
> RAM/IOPs?
> 
> Search response time is very important. Please advise if we increase the
> shard with extra servers how much effect it may have on search response
> time.
> 
> Best,
> Modassar
> 
> On Thu, May 21, 2020 at 2:16 PM Modassar Ather 
> wrote:
> 
>> Thanks Phill for your response.
>> 
>> Optimal Index size: Depends on what you are optimizing for. Query Speed?
>> Hardware utilization?
>> We are optimising it for query speed. What I understand even if we set the
>> merge policy to any number the amount of hard disk will still be required
>> for the bigger segment merges. Please correct me if I am wrong.
>> 
>> Optimizing the index is something I never do. We live with about 28%
>> deletes. You should check your configuration for your merge policy.
>> There is a delete of about 10-20% in our updates. We have no merge policy
>> set in configuration as we do a full optimisation after the indexing.
>> 
>> Increased sharding has helped reduce query response time, but surely there
>> is a point where the colation of results starts to be the bottleneck.
>> The query response time is my concern. I understand the aggregation of
>> results may increase the search response time.
>> 
>> *What does your schema look like? I index around 120 fields per document.*
>> The schema has a combination of text and string fields. None of the field
>> except Id field is stored. We also have around 120 fields. A few of them
>> have docValues enabled.
>> 
>> *What does your queries look like? Mine are so varied that caching never
>> helps, the same query rarely comes through.*
>> Our search queries are combination of proximity, nested proximity and
>> wildcards most of the time. The query can be very complex with 100s of
>> wildcard and proximity terms in it. Different grouping option are also
>> enabled on search result. And the search queries vary a lot.
>> 
>> Oh, another thing, are you concerned about  availability? Do you have a
>> replication factor > 1? Do you run those replicas in a different region for
>> safety?
>> How many zookeepers are you running and where are they?
>> As of now we do not have any replication factor. We are not using
>> zookeeper ensemble but would like to move to it sooner.
>> 
>> Best,
>> Modassar
>> 
>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
>> 
>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
 Can you please help me with following few questions?
 
- What is the ideal index size per shard?
>>> 
>>> We have no way of knowing that.  A size that works well for one index
>>> use case may not work well for another, even if the index size in both
>>> cases is identical.  Determining the ideal shard size requires
>>> experimentation.
>>> 
>>> 
>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>> 
- The optimisation takes lot of time and IOPs to complete. Will
increasing the number of shards help in reducing the optimisation
>>> time and
IOPs?
>>> 
>>> No, changing the number of shards will not help with the time required
>>> to optimize, and might make it slower.  Increasing the speed of the
>>> disks won't help either.  Optimizing involves a lot more than just
>>> copying data -- it will never use all the available disk bandwidth of
>>> modern disks.  SolrCloud optimizes the shard replicas making up
>>> a full collection sequentially, not simultaneously.
>>> 
 - We are planning to reduce each shard index size to 30GB and the entire
 3.5 TB index will be distributed across more shards, in this case to almost
 70+ shards. Will this help?
>>> 
>>> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
>>> of shards without adding additional servers, I would expect things to
>>> get worse, not better.
>>> 
 Kindly share your 

Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

2020-05-21 Thread vishal patel
Is anyone looking into this issue?
I have the same issue.

Regards,
Vishal Patel




From: jay harkhani 
Sent: Wednesday, May 20, 2020 7:39 PM
To: solr-user@lucene.apache.org 
Subject: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

Hello,

I recently upgraded Solr from version 6.1.0 to 8.5.1 and came across one issue. 
Queries which have many ids (around 3000) and have grouping applied take more time 
to execute. In Solr 6.1.0 such a query takes 677ms and in Solr 8.5.1 it takes 
26090ms. While taking these readings we had the same Solr schema and the same 
number of records in both Solr versions.

Please refer to the details below for the query, logs and thread dump (generated 
from the Solr Admin while executing the query).

Query : https://drive.google.com/file/d/1bavCqwHfJxoKHFzdOEt-mSG8N0fCHE-w/view

Logs and Thread dump stack trace
Solr 8.5.1 : 
https://drive.google.com/file/d/149IgaMdLomTjkngKHrwd80OSEa1eJbBF/view
Solr 6.1.0 : 
https://drive.google.com/file/d/13v1u__fM8nHfyvA0Mnj30IhdffW6xhwQ/view

To analyse further, we found that if we remove the grouping field or reduce the 
number of ids in the query, it executes fast. Has anything changed in 8.5.1 
compared to 6.1.0, as in 6.1.0 even a large number of ids along with grouping 
works faster?

Can someone please help us isolate this issue?

Regards,
Jay Harkhani.


Re: Use cases for the graph streams

2020-05-21 Thread Joel Bernstein
Good question. Let me first point to an interesting example in the Visual
Guide to Streaming Expressions and Math Expressions:

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/search-sample.adoc#nodes

This example gets to the heart of the core use case for the nodes
expression, which is to discover the relationships between nodes in a graph.
So it's a discovery tool for learning something new about the data, something
you can't see without this specific ability to walk the nodes of a graph.
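
To make that concrete, here is a minimal sketch of running a nodes expression
from SolrJ against the /stream handler, modeled on the example in the guide
linked above. The emails collection and its from/to fields are hypothetical:

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class NodesSketch {
      public static void main(String[] args) throws Exception {
        // One-hop walk: start from everyone who mentioned "solr" and gather
        // the "to" addresses they sent mail to.
        String expr = "nodes(emails,"
            + " search(emails, q=\"body:solr\", fl=\"from\", sort=\"from asc\"),"
            + " walk=\"from->from\","
            + " gather=\"to\")";
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");
        SolrStream stream =
            new SolrStream("http://localhost:8983/solr/emails", params);
        try {
          stream.open();
          for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
            System.out.println(t.getString("node")); // each gathered node id
          }
        } finally {
          stream.close();
        }
      }
    }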

In the broader context the nodes expression is part of a much wider set of
tools that allow people to use Solr to explore the relationships in their
data. This is described here:

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/math-expressions.adoc

The goal of all this is to move search engines beyond basic aggregations to
study the correlations and relationships within the data set.

Graph traversal is part of this broader goal which will get developed more
over time. I'd be interested in hearing more about specific graph use cases
that you're interested in solving.

Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, May 20, 2020 at 12:32 PM Nightingale, Jonathan A (US) <
jonathan.nighting...@baesystems.com> wrote:

> This is kind of a broad question, but I was playing with the graph streams
> and was having trouble making the tools work for what I wanted to do. I'm
> wondering whether the use case for the graph streams really supports the
> standard graph queries you might use with Gremlin or the like? I ask because
> right now we have two implementations of our data storage to support these two
> ways of looking at it: the standard query and the semantic filtering.
>
> The use cases I usually see for the graph streams always seem to be limited
> to one-link traversal for finding things related to nodes gathered from a
> query. But even with that, it wasn't clear what the best way was to work with
> lists of docValues. For example, to represent a node that had many docValues
> I had to use cross products to make a node for each docValue; the traversal
> didn't seem to inherently allow for that kind of node linking (see the sketch
> below).
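
A hedged sketch of that cross-product workaround, using the cartesianProduct
stream decorator, which emits one tuple per value of a multi-valued field. The
graph collection and its multi-valued tags field are hypothetical:

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CartesianSketch {
      public static void main(String[] args) throws Exception {
        // Each document with N values in "tags" becomes N tuples, so a later
        // walk can treat every (id, tag) pair as its own node.
        String expr = "cartesianProduct("
            + " search(graph, q=\"*:*\", fl=\"id,tags\", sort=\"id asc\"),"
            + " tags)";
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");
        SolrStream stream =
            new SolrStream("http://localhost:8983/solr/graph", params);
        try {
          stream.open();
          for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
            System.out.println(t.getString("id") + " -> " + t.getString("tags"));
          }
        } finally {
          stream.close();
        }
      }
    }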
>
> So my question really is (and maybe this is not the place for it): what is
> the intent of these graph features, and what is the goal for them in the
> future? At one point I was really hoping to use only Solr for our product,
> but it didn't seem feasible, at least not easily.
>
> Thanks for all your help
> Jonathan
>
> Jonathan Nightingale
> GXP Solutions Engineer
> (office) 315 838 2273
> (cell) 315 271 0688
>
>


RE: How to restore deleted collection from filesystem

2020-05-21 Thread Kommu, Vinodh K.
Thanks Erick for the quick response.

Yes, our VMs are equipped with NetBackup, which is a file-based backup and can 
restore any files or directories that were deleted, from the latest available 
full backup.

Can we create an empty collection with the same name as the deleted one, with the 
same number of shards & replicas, and copy the content from each restored core to 
the corresponding new core? I mean, copy all contents (directories & files) under 
the Oldcollection_shard1_replica1 core from the old collection to the 
corresponding Newcollection_shard1_replica1 core in the new collection. Would 
this approach work?

Lastly, is there anything to watch for in core.properties in the newly created 
collection, or any reference that needs to point to the new collection 
specifically?


Thanks & Regards,
Vinodh

-Original Message-
From: Erick Erickson  
Sent: Thursday, May 21, 2020 6:17 PM
To: solr-user@lucene.apache.org
Subject: Re: How to restore deleted collection from filesystem

So what I’m reading here is that you have the _data_ saved somewhere, right? By 
“data” I just mean the data directories under the replica.

1> Go ahead and recreate the collection. It _must_ have the same number of 
shards. Make it leader-only, i.e. replicationFactor == 1
2> The collection will be empty; now shut down the Solr instances hosting any of 
the replicas.
3> Replace the data directory under each replica with the corresponding one 
from the backup. “Corresponding” means from the same shard, which should be 
obvious from the replica name.
4> Start your Solr instances back up and verify it’s as you expect.
5> Use ADDREPLICA to build out your collection to have as many replicas of each 
shard as you require. NOTE: I’d do this gradually, maybe 2-3 at a time, then wait 
for them to become active before adding more. The point here is that each 
ADDREPLICA copies the entire index down from the leader, and with that many 
documents you don’t want to saturate your network. (A SolrJ sketch of steps 1 and 
5 follows.)
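
A minimal SolrJ sketch of steps 1 and 5, assuming a 3-shard collection named
Oldcollection and a configset named myConfig (both illustrative); steps 2-4
(stopping Solr, swapping in the backed-up data directories, restarting,
verifying) happen outside SolrJ:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class RestoreSketch {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
          // Step 1: recreate the collection leader-only (replicationFactor = 1).
          CollectionAdminRequest
              .createCollection("Oldcollection", "myConfig", 3, 1)
              .process(client);
          // Step 5: after the restored data is in place, add replicas back
          // gradually, waiting for each to become active before adding more.
          for (String shard : new String[] {"shard1", "shard2", "shard3"}) {
            CollectionAdminRequest.addReplicaToShard("Oldcollection", shard)
                .process(client);
          }
        }
      }
    }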

Best,
Erick

> On May 21, 2020, at 8:17 AM, Kommu, Vinodh K.  wrote:
>
> Hi,
>
> One of our largest collections, which holds 3.2 billion docs, was accidentally 
> deleted in a QA environment. Unfortunately we don't have a recent Solr backup of 
> this collection to restore from either. The only option left for us is to 
> restore the deleted replica directories under the data directory using the 
> NetBackup restore process.
>
> We haven't done a restore this way before, so the following things are not clear:
>
> 1. As the collection was deleted (and not yet recreated), if the necessary 
> replica directories and files are restored to the same location, will the 
> collection work without creating it again?
> 2. If the above doesn't work, we will obviously have to create the collection, 
> but the replica names and placement may not be the same as the deleted 
> collection's (we create collections using rule-based replica placement), so in 
> that case what needs to be done to restore the collection smoothly? Or are 
> there predefined steps for handling this kind of scenario? Any suggestions are 
> greatly appreciated.
>
>
> Thanks & Regards,
> Vinodh
>



Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

2020-05-21 Thread jay harkhani
Hello,

Please refer below details.

>Did you create Solrconfig.xml for the collection from scratch after upgrading 
>and reindexing?
Yes, we created the collection from scratch and also re-indexed.

>Was it based on the latest template?
Yes, it was based on the latest template.

>What happens if you reexecute the query?
No visible difference; only a minor change in milliseconds.

>Are there other processes/containers running on the same VM?
No

>How much heap and how much total memory do you have?
My heap and total memory are the same as for Solr 6.1.0: 5 GB heap and 25 GB 
total memory. As far as I can tell there is no issue related to memory.

>Maybe you also need to increase the corresponding caches in the config.
We are not using caches in either version.

Both versions have the same configuration.

Regards,
Jay Harkhani.
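
One experiment worth trying, independent of the config questions above: express
the large id list with the terms query parser instead of a 3000-clause boolean
OR. Whether it changes the 8.5.1 grouping slowdown is untested; the collection
and field names below are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class TermsQuerySketch {
      public static void main(String[] args) throws Exception {
        String ids = IntStream.rangeClosed(1, 3000)
            .mapToObj(String::valueOf)
            .collect(Collectors.joining(","));
        // {!terms f=id} takes a comma-separated list and avoids building a
        // huge boolean OR tree, which also sidesteps the maxBooleanClauses
        // limit.
        SolrQuery q = new SolrQuery("{!terms f=id}" + ids);
        q.set("group", true);
        q.set("group.field", "groupField");
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
          System.out.println("QTime=" + client.query(q).getQTime() + "ms");
        }
      }
    }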


From: Jörn Franke 
Sent: Thursday, May 21, 2020 7:05 PM
To: solr-user@lucene.apache.org 
Subject: Re: Query takes more time in Solr 8.5.1 compare to 6.1.0 version

Did you create Solrconfig.xml for the collection from scratch after upgrading 
and reindexing? Was it based on the latest template?
If not, then please try this. Maybe you also need to increase the corresponding 
caches in the config.

What happens if you reexecute the query?

Are there other processes/containers running on the same VM?

How much heap and how much total memory do you have? You should only have a minor 
fraction of the memory as heap and most of it "free" (this means it is used for 
file caches).



> Am 21.05.2020 um 15:24 schrieb vishal patel :
>
> Is anyone looking into this issue?
> I have the same issue.
>
> Regards,
> Vishal Patel
>
>
>
> 
> From: jay harkhani 
> Sent: Wednesday, May 20, 2020 7:39 PM
> To: solr-user@lucene.apache.org 
> Subject: Query takes more time in Solr 8.5.1 compare to 6.1.0 version
>
> Hello,
>
> I recently upgraded Solr from version 6.1.0 to 8.5.1 and came across one 
> issue. Queries which have many ids (around 3000) and have grouping applied take 
> more time to execute. In Solr 6.1.0 such a query takes 677ms and in Solr 8.5.1 
> it takes 26090ms. While taking these readings we had the same Solr schema and 
> the same number of records in both Solr versions.
>
> Please refer to the details below for the query, logs and thread dump 
> (generated from the Solr Admin while executing the query).
>
> Query : https://drive.google.com/file/d/1bavCqwHfJxoKHFzdOEt-mSG8N0fCHE-w/view
>
> Logs and Thread dump stack trace
> Solr 8.5.1 : 
> https://drive.google.com/file/d/149IgaMdLomTjkngKHrwd80OSEa1eJbBF/view
> Solr 6.1.0 : 
> https://drive.google.com/file/d/13v1u__fM8nHfyvA0Mnj30IhdffW6xhwQ/view
>
> To analyse further, we found that if we remove the grouping field or reduce 
> the number of ids in the query, it executes fast. Has anything changed in 8.5.1 
> compared to 6.1.0, as in 6.1.0 even a large number of ids along with grouping 
> works faster?
>
> Can someone please help us isolate this issue?
>
> Regards,
> Jay Harkhani.


Does Solr master/slave support shard split

2020-05-21 Thread Pushkar Raste
Hi,
Does Solr support shard splitting in a master/slave setup? I understand that
there is no shard concept in master/slave and we just have cores, but can we
split a core into two?

If yes, is there a way to specify the new mapping based on the unique key?
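
For reference, standalone Solr exposes a core-level SPLIT action through the
CoreAdmin API, the non-cloud counterpart of SPLITSHARD. A minimal sketch via
SolrJ's generic request; the core names are hypothetical, and whether its
split.key routing covers the unique-key mapping asked about here is an
assumption worth verifying against the CoreAdmin documentation:

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CoreSplitSketch {
      public static void main(String[] args) throws Exception {
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("action", "SPLIT");
        p.set("core", "core0");          // source core to split
        p.add("targetCore", "core1");    // pre-created empty target cores
        p.add("targetCore", "core2");
        try (HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
          client.request(
              new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", p));
        }
      }
    }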
-- 
— Pushkar Raste