Re: Phonetic filters

2019-06-18 Thread Ruslan Dautkhanov
Thank you Walter!

Will have a look at how to do this with edismax.

Ruslan



On Tue, Jun 18, 2019 at 6:26 PM Walter Underwood 
wrote:

> Use two fields, one for exact, one for phonetic. Use the edismax query
> handler and set a higher weight on the exact field.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 18, 2019, at 5:23 PM, Ruslan Dautkhanov 
> wrote:
> >
> > We're using phonetic filters (BMPM), and we want to boost exact matches
> if
> > there are any.
> >
> > For example, for the name "stephen" the BM filter will generate two
> > terms: stifn, stipin.
> > And if it searches for the name "stepheM" (misspelled last letter),
> > it'll match on the same two terms.
> >
> > This makes the match score of "stephe*m*" (misspelled) the same as
> > "stephen" (exact match).
> >
> > We want to boost score for exact matches.
> > What is a good way to do it?
> >
> > A workaround is to duplicate first_name and not apply the phonetic
> > filter to that copy.
> > But then we would need to change how our application queries (add one
> > field to the query).
> >
> > It would be great if we could boost exact matches without adding new
> > fields or changing the application query to explicitly specify two
> > fields - exact and phonetic.
> >
> >
> > Thank you,
> > Ruslan
>
> --

-- 
Ruslan Dautkhanov


Re: Phonetic filters

2019-06-18 Thread Walter Underwood
Use two fields, one for exact, one for phonetic. Use the edismax query
handler and set a higher weight on the exact field.
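
A sketch of what that can look like (field and type names here are made up,
not from your setup). In the schema:

<field name="name_exact" type="string" indexed="true" stored="true"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="name_exact" dest="name_phonetic"/>

Then query with something like:

q=stephen&defType=edismax&qf=name_exact^10 name_phonetic

The boost of 10 is arbitrary; tune it until exact hits reliably sort first.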

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 18, 2019, at 5:23 PM, Ruslan Dautkhanov  wrote:
> 
> We're using phonetic filters (BMPM), and we want to boost exact matches if
> there are any.
> 
> For example, for the name "stephen" the BM filter will generate two
> terms: stifn, stipin.
> And if it searches for the name "stepheM" (misspelled last letter),
> it'll match on the same two terms.
> 
> This makes the match score of "stephe*m*" (misspelled) the same as
> "stephen" (exact match).
> 
> We want to boost score for exact matches.
> What is a good way to do it?
> 
> A workaround is to duplicate first_name and not apply the phonetic
> filter to that copy.
> But then we would need to change how our application queries (add one
> field to the query).
> 
> It would be great if we could boost exact matches without adding new
> fields or changing the application query to explicitly specify two
> fields - exact and phonetic.
> 
> 
> Thank you,
> Ruslan



Phonetic filters

2019-06-18 Thread Ruslan Dautkhanov
We're using phonetic filters (BMPM), and we want to boost exact matches if
there are any.

For example, for the name "stephen" the BM filter will generate two terms:
stifn, stipin.
And if it searches for the name "stepheM" (misspelled last letter), it'll
match on the same two terms.

This makes the match score of "stephe*m*" (misspelled) the same as
"stephen" (exact match).
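
For context, the analysis chain in question looks roughly like this (a
sketch with assumed attribute values, not our exact config):

<fieldType name="text_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC" ruleType="APPROX" concat="true"/>
  </analyzer>
</fieldType>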

We want to boost score for exact matches.
What is a good way to do it?

A workaround is to duplicate first_name and not apply the phonetic filter
to that copy.
But then we would need to change how our application queries (add one
field to the query).

It would be great if we could boost exact matches without adding new
fields or changing the application query to explicitly specify two
fields - exact and phonetic.


Thank you,
Ruslan


Re: SolrCloud: Configured socket timeouts not reflecting

2019-06-18 Thread Rahul Goswami
Hello,

I was looking into the code to try to get to the root of this issue. Looks
like this is an issue after all (as of 7.2.1, which is the version we are
using), but I wanted to confirm on the user list before creating a JIRA. I
found that the soTimeout property of the ConcurrentUpdateSolrClient class
(in the code referenced below) remains null, and hence the default of
600000 ms is set as the timeout in the HttpPost class instance variable
"method".
https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L334


When the call is finally made in the below line, the HttpClient does
contain the configured timeout (as set in solr.xml or via
-DdistribUpdateSoTimeout), but it gets overridden by the hard default of
600000 in the "method" parameter of the execute call.

https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L348


The hard default of 600000 is set here:
https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L333


I tried to create a local patch with the below fix which works fine:
https://github.com/apache/lucene-solr/blob/86fe24cbef238d2042d68494bd94e2362a2d996e/solr/core/src/java/org/apache/solr/update/StreamingSolrClients.java#L69



client = new ErrorReportingConcurrentUpdateSolrClient.Builder(url, req, errors)
    .withHttpClient(httpClient)
    .withQueueSize(100)
    // new: propagate the timeout configured in solr.xml / -DdistribUpdateSoTimeout
    .withSocketTimeout(getSocketTimeout(req))
    .withThreadCount(runnerCount)
    .withExecutorService(updateExecutor)
    .alwaysStreamDeletes()
    .build();

private int getSocketTimeout(SolrCmdDistributor.Req req) {
  if (req == null) {
    return UpdateShardHandlerConfig.DEFAULT_DISTRIBUPDATESOTIMEOUT;
  }
  return req.cmd.req.getCore().getCoreContainer().getConfig()
      .getUpdateShardHandlerConfig().getDistributedSocketTimeout();
}

I found this open JIRA on this issue:

https://issues.apache.org/jira/browse/SOLR-12550?jql=text%20~%20%22distribUpdateSoTimeout%22


Should I update the JIRA with this?

Thanks,
Rahul




On Thu, Jun 13, 2019 at 12:00 AM Rahul Goswami 
wrote:

> Hello,
>
> I am running Solr 7.2.1 in cloud mode. To overcome a setup hardware
> bottleneck, I tried to configure distribUpdateSoTimeout and socketTimeout
> to a value greater than the default 10 mins. I did this by passing these as
> system properties at Solr startup time (-DdistribUpdateSoTimeout and
> -DsocketTimeout). The Solr admin UI shows these values in the Dashboard
> args section. As a test, I tried setting each of them to one hour
> (3600000). However, I start seeing socket read timeouts within a few
> minutes. Looks like the values are not taking effect. What am I missing?
> If this is a known issue, is there a JIRA for it?
>
> Thanks,
> Rahul
>


Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

2019-06-18 Thread Andrew Kettmann
I attached the patch, but attachments aren't sent out on the mailing list -
my mistake. Patch below:



### START

diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
index 24a52eaf97..e018f8a42f 100644
--- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
+++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
@@ -135,7 +135,9 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd {
     }
 
     RTimerTree t = timings.sub("checkDiskSpace");
-    checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    if (splitMethod != SolrIndexSplitter.SplitMethod.LINK) {
+      checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    }
     t.stop();
 
     // let's record the ephemeralOwner of the parent leader node

### END


From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 3:05:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on 
disk usage checks


Looks like the disk check here is the problem. I am no Java developer, but this
patch ignores the check if you are using the link method for splitting.
Attached the patch. This is off of the commit for 7.7.2, d4c30fc285. The
modified version only has to be run on the overseer machine, so there is that
at least.


From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 11:32:43 AM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk 
usage checks


Using Solr 7.7.2 Docker image, testing some of the new autoscale features, huge 
fan so far. Tested with the link method on a 2GB core and found that it took 
less than 1MB of additional space. Filled the core quite a bit larger, 12GB of 
a 20GB PVC, and now splitting the shard fails with the following error message 
on my overseer:


2019-06-18 16:27:41.754 ERROR 
(OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) 
[c:test_autoscale s:shard1  ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Collection: test_autoscale operation: splitshard 
failed:org.apache.solr.common.SolrException: not enough free disk space to 
perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, 
available: 7.811378479003906
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
at 
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
at 
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)



I attempted sending the request to the node itself to see if it did anything 
different, but no luck. My parameters are (Note Python formatting as that is my 
language of choice):



splitparams = {'action': 'SPLITSHARD',
               'collection': 'test_autoscale',
               'shard': 'shard1',
               'splitMethod': 'link',
               'timing': 'true',
               'async': 'shardsplitasync'}


And this is confirmed by the log message from the node itself:


2019-06-18 16:27:41.730 INFO  (qtp1107530534-16) [c:test_autoscale   ]
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections
params={async=shardsplitasync&timing=true&action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link}
status=0 QTime=20
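
For reference, those splitparams translate to a collections API call along
these lines (host and port are placeholders):

http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link&timing=true&async=shardsplitasync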


While it is true that I would not have enough space if I were using the
rewrite method, the link method on a 2GB core used less than 1MB of
additional space. Is there something I am missing here? Is there an option
to disable the disk space check that I need to pass? I can't find anything
in the documentation at this point.


Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836

Filtering children of parent doc from the input domain

2019-06-18 Thread Srijan
Hello,

We're on Solr 6.2.1 and have a requirement where we need to facet on nested
docs. So far we'd been using a two-pass query approach: the first query
searches within the parent domain and gets all the matching nested doc ids
as facets (parent docs keep track of the nested docs they contain), and then
a second query at the nested doc level includes the ids of all the matching
nested docs found by query 1 as boolean clauses. We then go back and
correct the facet counts for query 1. But with this approach we've been
hitting the maxBooleanClauses limit regularly, and with the amount of data
we're seeing in this area, this approach is just not sustainable.

The answer seems to be JSON faceting with blockChildren. The problem with
this approach, at least in 6.2.1, is that the filter clause is not supported
within the facet domain. blockChildren matches all children of every parent
doc matched in the query, but in reality I only want a subset of the
children - only the ones that match some criteria. Is there a way to filter
out children without the need for the filter clause and without having to
move to Solr 6.4?
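
For reference, the 6.4+ syntax I'd like to use looks roughly like this
(field names here are made up; see the page linked below):

json.facet={
  child_cats: {
    type: terms,
    field: category,
    domain: { blockChildren: "doc_type:parent",
              filter: "child_type:sku" }
  }
}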

Reference:
http://yonik.com/solr-nested-objects/#faceting

Thanks,
Srijan


Re: Solr 5.3 to 6.0

2019-06-18 Thread Shawn Heisey

On 6/18/2019 12:16 PM, ilango dhandapani wrote:

Tried several attempts like delete collection/config, take index backup from
5.3, clear index and place them back after upgrade. All attempts resulted in
faceting not working with 5.3 and 6.0 data combined.


Most likely what happened here is that the field classes you're using 
changed so they default to docValues="true" in version 6.  When 
docValues is enabled on a field, Solr will use docValues for faceting 
and sorting ... and the existing data in the index from the old version 
may not actually have any docValues data.



I can clean up the index, upgrade to 6.0, and reindex everything from my DB.
Here faceting works fine.

The problem is I have less data in non-prod, but I have ~280 million records
in prod, and it is growing daily. So cleaning and reindexing all of them
from the DB is going to take weeks.


You could *try* explicitly setting docValues="false" on the fields where 
you need to facet so that Solr will not try to use docValues for 
faceting ... but I can't guarantee that this is going to actually work.
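
A sketch of what that change looks like in the schema (field name
hypothetical):

<field name="category" type="string" indexed="true" stored="true"
       docValues="false"/>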


Really what you need to do is reindex from scratch on the new version.

Thanks,
Shawn


[ANNOUNCE] Apache Solr Reference Guide for 8.1 released

2019-06-18 Thread Cassandra Targett
The Lucene PMC is pleased to announce that the Solr Reference Guide for
Solr 8.1 is now available.

This 1,483-page PDF is the definitive guide to Apache Solr, the search
server built on Apache Lucene.

The PDF can be downloaded from:
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-8.1.pdf

The Guide is also available online, at
https://lucene.apache.org/solr/guide/8_1/.

Regards,
The Lucene PMC


Solr 5.3 to 6.0

2019-06-18 Thread ilango dhandapani
I am trying to upgrade Solr (cloud mode) from 5.3 to 6.0. My ZK version is
3.4.6.

I updated the schema for 6.0 and started Solr back up as 6.0. All the old
data is present. I have a UI where all the files are displayed (via search
from Solr).
When I add new data, faceting does not work and there are issues displaying
files in my UI.

I tried several approaches - take a 5.3 index backup, clean the index,
upgrade, place the index back. Still, when I add new data, faceting does
not work and there are issues displaying files in my UI.

I reindexed all data from my DB into an empty 6.0 index, and there are no
issues here. Faceting works fine with all the new data and the old data
freshly reindexed from my DB.

But in production I have ~280 million records, and it will take several
weeks to reindex them from my DB.

Please let me know if there is a better way to deal with this scenario.

Thanks,
Ilango




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr 5.3 to 6.0

2019-06-18 Thread ilango dhandapani
I am trying to upgrade my Solr (cloud mode) from version 5.3 to 6.0. My ZK
version is 3.4.6.

After updating the schema and starting Solr as 6.0, all the nodes' health
looks fine. When I add new files and they go to all the shards, faceting
stops working. I have a UI where the files are displayed (search from Solr)
and this functionality is impacted. I am still able to search for any
5.3-indexed file, but faceting is not functioning as expected.

I tried several attempts like delete collection/config, take index backup
from 5.3, clear index and place them back after upgrade. All attempts
resulted in faceting not working with 5.3 and 6.0 data combined.

I can clean up the index, upgrade to 6.0, and reindex everything from my DB.
Here faceting works fine.

The problem is I have less data in non-prod, but I have ~280 million records
in prod, and it is growing daily. So cleaning and reindexing all of them
from the DB is going to take weeks.

Is there a better way to handle this scenario?

Thanks,
Ilango



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

2019-06-18 Thread Andrew Kettmann
Looks like the disk check here is the problem. I am no Java developer, but this
patch ignores the check if you are using the link method for splitting.
Attached the patch. This is off of the commit for 7.7.2, d4c30fc285. The
modified version only has to be run on the overseer machine, so there is that
at least.


From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 11:32:43 AM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk 
usage checks


Using Solr 7.7.2 Docker image, testing some of the new autoscale features, huge 
fan so far. Tested with the link method on a 2GB core and found that it took 
less than 1MB of additional space. Filled the core quite a bit larger, 12GB of 
a 20GB PVC, and now splitting the shard fails with the following error message 
on my overseer:


2019-06-18 16:27:41.754 ERROR 
(OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) 
[c:test_autoscale s:shard1  ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Collection: test_autoscale operation: splitshard 
failed:org.apache.solr.common.SolrException: not enough free disk space to 
perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, 
available: 7.811378479003906
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
at 
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
at 
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)



I attempted sending the request to the node itself to see if it did anything 
different, but no luck. My parameters are (Note Python formatting as that is my 
language of choice):



splitparams = {'action': 'SPLITSHARD',
               'collection': 'test_autoscale',
               'shard': 'shard1',
               'splitMethod': 'link',
               'timing': 'true',
               'async': 'shardsplitasync'}


And this is confirmed by the log message from the node itself:


2019-06-18 16:27:41.730 INFO  (qtp1107530534-16) [c:test_autoscale   ]
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections
params={async=shardsplitasync&timing=true&action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link}
status=0 QTime=20


While it is true that I would not have enough space if I were using the
rewrite method, the link method on a 2GB core used less than 1MB of
additional space. Is there something I am missing here? Is there an option
to disable the disk space check that I need to pass? I can't find anything
in the documentation at this point.


Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836

diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
index 24a52eaf97..e018f8a42f 100644
--- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
+++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
@@ -135,7 +135,9 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd {
     }
 
     RTimerTree t = timings.sub("checkDiskSpace");
-    checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    if (splitMethod != SolrIndexSplitter.SplitMethod.LINK) {
+      checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    }
     t.stop();
 
     // let's record the ephemeralOwner of the parent leader node

Re: [CAUTION] CDCR Monitoring - To figure out the latency between source and target replication delay

2019-06-18 Thread Natarajan, Rajeswari
I see the below in the CDCR QUEUES API documentation:

The output is composed of a list “queues” which contains a list of (ZooKeeper) 
Target hosts, themselves containing a list of Target collections. For each 
collection, the current size of the queue and the timestamp of the last update 
operation successfully processed is provided. The timestamp of the update 
operation is the original timestamp, i.e., the time this operation was 
processed on the Source SolrCloud. This allows an estimate of the latency
of the replication process.
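
(For reference, that API is a plain HTTP call against a core of the Source
collection; host and collection name below are placeholders:)

curl "http://localhost:8983/solr/<collection>/cdcr?action=QUEUES"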

Given the timestamp of the update operation on the source SolrCloud, how
does that help figure out the latency of replication? Can someone please
explain - am I missing something obvious? We want to generate an alert if
there is a huge latency, and we are looking to see how this can be done.

Thank you.
Rajeswari

On 5/30/19, 9:47 AM, "Natarajan, Rajeswari"  
wrote:

Hi,

Is there a way to monitor the replication delay between the Primary and
Secondary clusters for CDCR, and raise alerts if it exceeds some threshold?

I see the below APIs for monitoring.

- core/cdcr?action=QUEUES: Fetches statistics about the queue for each
replica and about the update logs.
- core/cdcr?action=OPS: Fetches statistics about the replication
performance (operations per second) for each replica.
- core/cdcr?action=ERRORS: Fetches statistics and other information
about replication errors for each replica.

These report the stats, performance and errors.
Thanks,
Rajeswari





Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

2019-06-18 Thread Andrew Kettmann
Using Solr 7.7.2 Docker image, testing some of the new autoscale features, huge 
fan so far. Tested with the link method on a 2GB core and found that it took 
less than 1MB of additional space. Filled the core quite a bit larger, 12GB of 
a 20GB PVC, and now splitting the shard fails with the following error message 
on my overseer:


2019-06-18 16:27:41.754 ERROR 
(OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) 
[c:test_autoscale s:shard1  ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Collection: test_autoscale operation: splitshard 
failed:org.apache.solr.common.SolrException: not enough free disk space to 
perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, 
available: 7.811378479003906
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
at 
org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
at 
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
at 
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)



I attempted sending the request to the node itself to see if it did anything 
different, but no luck. My parameters are (Note Python formatting as that is my 
language of choice):



splitparams = {'action': 'SPLITSHARD',
               'collection': 'test_autoscale',
               'shard': 'shard1',
               'splitMethod': 'link',
               'timing': 'true',
               'async': 'shardsplitasync'}


And this is confirmed by the log message from the node itself:


2019-06-18 16:27:41.730 INFO  (qtp1107530534-16) [c:test_autoscale   ]
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections
params={async=shardsplitasync&timing=true&action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link}
status=0 QTime=20


While it is true that I would not have enough space if I were using the
rewrite method, the link method on a 2GB core used less than 1MB of
additional space. Is there something I am missing here? Is there an option
to disable the disk space check that I need to pass? I can't find anything
in the documentation at this point.


Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836


Re: bi-directional CDCR

2019-06-18 Thread Natarajan, Rajeswari
We are using bi-directional CDCR with Solr 7.6 and it works for us. Did you
look at the logs to see if there are any errors?

"Both Cluster 1 and Cluster 2 can act as Source and Target at any given
point of time but a cluster cannot be both Source and Target at the same
time."

The above means that publishing can take place on only one cluster at any
point in time. Publishing cannot happen simultaneously on both clusters.
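
For reference, both clusters point a /cdcr handler at the other cluster's
ZooKeeper; a minimal sketch (all values below are placeholders - see the
cdcr-config page linked in your message):

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">other-cluster-zk:2181</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">60000</str>
  </lst>
</requestHandler>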

Hope this helps
Rajeswari

On 6/11/19, 7:13 PM, "Susheel Kumar"  wrote:

Hello,

What does the below mean? How do we set which cluster will act as
source or target at a given time?

Both Cluster 1 and Cluster 2 can act as Source and Target at any given
point of time but a cluster cannot be both Source and Target at the same
time.
Also, following the directions mentioned in this page doesn't make CDCR
work. No data flows from cluster 1 to cluster 2. This is Solr 7.7.1. Is
there something missing?

https://lucene.apache.org/solr/guide/7_7/cdcr-config.html#bi-directional-updates




Re: Solr CPU spiking up on bulk indexing

2019-06-18 Thread Erick Erickson
Dynamic fields don’t make any difference; they’re just like fixed fields as
far as merging is concerned.

So this is almost certainly merging being kicked off by your commits. The
more documents and the more terms, the more work Lucene has to do, so I
suspect this is just how things work.

I’ll add parenthetically that your cache settings, while not adding to this
problem, are suspiciously high. The filterCache in particular can take up
maxDoc/8 bytes _per entry_, with up to 2048 entries in this case. I’d
recommend you think about reducing the size here while monitoring your hit
ratio.

Oh, and if you use NOW in filter clauses, that’s an anti-pattern; see:

https://dzone.com/articles/solr-date-math-now-and-filter
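
The usual fix from that article is to round NOW so the filter is cacheable,
e.g. (field name hypothetical):

fq=timestamp_dt:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]

instead of fq=timestamp_dt:[NOW-7DAYS TO NOW], which produces a new,
uncacheable filter entry on every request.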

Best,
Erick

> On Jun 18, 2019, at 8:20 AM, Venu  wrote:
> 
> Thanks Erick. 
> 
> I see the above pattern only at the time of commit.
> 
> I have many fields (around 250, of which around 100 are dynamic fields,
> plus around 3 n-gram fields and text fields; many of them are stored as
> well as indexed). Will a merge take a lot of time in this kind of case?
> I mean, is it CPU-intensive because of the many dynamic fields or because
> of the huge data volume?
> 
> Also, I am doing a hard commit every 5 minutes with openSearcher=true.
> I am not doing soft commits.
> 
> And below are the configurations for the filter, query, and document
> caches. Should I try reducing initialSize?
> 
> <filterCache size="2048"
>  initialSize="512"
>  autowarmCount="0"/>
> <queryResultCache size="2048"
>  initialSize="512"
>  autowarmCount="0"/>
> <documentCache size="2048"
>  initialSize="512"
>  autowarmCount="0"/>
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr CPU spiking up on bulk indexing

2019-06-18 Thread Venu
Thanks Erick. 

I see the above pattern only at the time of commit.

I have many fields (around 250, of which around 100 are dynamic fields,
plus around 3 n-gram fields and text fields; many of them are stored as
well as indexed). Will a merge take a lot of time in this kind of case?
I mean, is it CPU-intensive because of the many dynamic fields or because
of the huge data volume?

Also, I am doing a hard commit every 5 minutes with openSearcher=true.
I am not doing soft commits.

And below are the configurations for the filter, query, and document
caches. Should I try reducing initialSize?

<filterCache size="2048"
 initialSize="512"
 autowarmCount="0"/>
<queryResultCache size="2048"
 initialSize="512"
 autowarmCount="0"/>
<documentCache size="2048"
 initialSize="512"
 autowarmCount="0"/>


--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


JOIN Execution Plan

2019-06-18 Thread Paresh
Hi,

I have two collections - collection1 and collection2.
I am making an HTTP request on collection2 using
http://localhost:8983/solr/collection1/tcfts/?params={q=col:value AND
_query_:{!join}...AND _query_:{!join}..}

If my query is like - fieldOnCollection1:somevalue AND INNER JOIN (with
collection1) AND ACROSS JOIN (with collection2)

(Q) Drawing an analogy to Oracle, how will Solr execute this query?
Is there any way to see in the logs how Solr executes all the criteria - I
mean the sequence in which they are executed?

When I do JOINs in an RDBMS, I specify the smallest table first.
(Q) Do I also need to specify as the first criterion the one that is
expected to match the fewest documents in the Solr collection?

Any help is appreciated.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Delete with Solrj deleteByQuery - Boolean clauses

2019-06-18 Thread rgummadi
Thanks Erick. I will try the terms query.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Delete with Solrj deleteByQuery - Boolean clauses

2019-06-18 Thread Erick Erickson
If these are the id field (i.e. your <uniqueKey>), then delete by id is
much less painful. That aside:

1> Check how the query is parsed with just one or two Barcodes. If
you are pushing this through edismax or similar, you might be getting
surprising results.

2> Try putting that massive OR clause inside a Terms query parser. It’d
look something like:
CustomerCode:BACU1 AND _query_:”{!terms f=barcode}0001,0002,0003…..”
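
A sketch of how that could be built with SolrJ (assuming the barcodes are
in a List<String> named barcodes; field names as in your query):

String q = "CustomerCode:BACU1 AND _query_:\"{!terms f=Barcode}"
    + String.join(",", barcodes) + "\"";
solrServer.deleteByQuery(q);

And if Barcode were the uniqueKey, solrServer.deleteById(barcodes) would
avoid building a query entirely.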

Best,
Erick

> On Jun 18, 2019, at 2:57 AM, rgummadi  wrote:
> 
> There is a situation where we have to delete a lot of assets for a customer
> from the Solr index. There are occasions where the number of assets runs
> into thousands. So I am constructing the query as below. If the number of
> ‘OR’ clauses crosses a certain limit (like 50), the delete does not work.
> 
> We are using SolrJ for this (solrServer.deleteByQuery(stb.toString())).
> 
> Also, the maxBooleanClauses parameter is set to 3000 in solrconfig.xml.
> 
> Why is the query failing even though the number of clauses is below 3000?
> Is there a better way of doing this?
> 
> CustomerCode:BACU1 AND  ( Barcode:(06743476636 OR 06743482288 OR
> 06881406272 OR 06881406315 OR 06881406343 OR 06881406383 OR
> 06881406432 OR 06857852700 OR 06857852756 OR 06857852783 OR
> 06857852740 OR 06857852768 OR 06857852801 OR 06857852810 OR
> 06857852819 OR 06857852828 OR 06857852844 OR 06857852859 OR
> 06857852873 OR 06857852887 OR 06857852894 OR 06857852904 OR
> 06851761567 OR 06851761572 OR 06851761573 OR 06851761575 OR
> 06851761576 OR 06851761577 OR 06851761578 OR 06851761580 OR
> 06851761582 OR 06851761583 OR 06852758372 OR 06852758354 OR
> 06852758338 OR 06852758318 OR 06852758276 OR 06857789194 OR
> 06857789198 OR 06857789204 OR 06857789200 OR 06857789220 OR
> 06856559248 OR 06856559309 OR 06743482260 OR 06743482254 OR
> 06743482294 OR 06743482232 OR 06743478792 OR 06743482235 OR
> 06743482250 OR 06743482259 OR 06743482262 OR 06743482256 OR
> 06743482237 OR 06743478796 OR 06743482018 OR 06743482033 OR
> 06743482063 OR 06743496937 OR 06743482044 OR 06743481998 OR
> 06743482066 OR 06743482070 OR 06743482037 OR 06743482001 OR
> 06743482004 OR 06743482021 OR 06743482056 OR 06743495615 OR
> 06743482007 OR 06743482028 OR 06743482047 OR 06743496939 OR
> 06743482009 OR 06743482038 OR 06743496736 OR 06743495617 OR
> 06743482025 OR 06743482057 OR 06743482049 OR 06743482075 OR
> 06743482051 OR 06743482031 OR 06743496737 OR 06743496419 OR
> 06743495619 OR 06743482059 OR 06743482041 OR 06743482060 OR
> 06743482062 OR 06743482012 OR 06743482053 OR 06743495620 OR
> 06743496738 OR 06743496940 OR 06743482081 OR 06743482104 OR
> 06743482130 OR 06743482121 OR 06743482107 OR 06743482094 OR
> 06743495622 OR 06743482136 OR 06743482096 OR 06743482078 OR
> 06743482079 OR 06743482085 OR 06743495624 OR 06743482140 OR
> 06743482141 OR 06743482086 OR 06743482123 OR 06838138542 OR
> 06743495626 OR 06743495627 OR 06743495628 OR 06743495630 OR
> 06743482126 OR 06743482089 OR 06743482133 OR 06743482100 OR
> 06743482110 OR 06743495631 OR 06743495632 OR 06743482134 OR
> 06743495634 OR 06743495635 OR 06743482137 OR 06743495636 OR
> 06743495638 OR 06743482128 OR 06743482102 OR 06743482116 OR
> 06743482091 OR 06743482119 OR 06743495640 OR 06743497254 OR
> 06743496421 OR 06743496486 OR 06743496487 OR 06743496488 OR
> 06743496489 OR 06743496422 OR 06743497255 OR 06743496491 OR
> 06743496423 OR 06743497256 OR 06743496424 OR 06743496492 OR
> 06743496494 OR 06743496425))
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Could not find lemmatization model

2019-06-18 Thread Tomoko Uchida
Hi,
maybe this one?
http://opennlp.sourceforge.net/models-1.5/
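
Once you have a model, it can be wired into an analysis chain along these
lines (a sketch; the model file name is a placeholder for whatever you
download, and this needs the analysis-extras contrib):

<filter class="solr.OpenNLPLemmatizerFilterFactory"
        lemmatizerModel="en-lemmatizer.bin"/>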

On Tue, Jun 18, 2019 at 17:13 Vidit Mathur wrote:
>
> Sir/ma'am
> I was trying to integrate OpenNLP with Solr for lemmatizing the search
> text, but I could not find the lemmatization model on
> opennlp.sourceforge.net. Could you please help me with this issue or
> suggest some workaround.
> Regards
> Vidit Mathur
> (Student)


Delete with Solrj deleteByQuery - Boolean clauses

2019-06-18 Thread rgummadi
There is a situation where we have to delete a lot of assets for a customer
from the Solr index. There are occasions where the number of assets runs
into thousands. So I am constructing the query as below. If the number of
‘OR’ clauses crosses a certain limit (like 50), the delete does not work.

We are using SolrJ for this (solrServer.deleteByQuery(stb.toString())).

Also, the maxBooleanClauses parameter is set to 3000 in solrconfig.xml.

Why is the query failing even though the number of clauses is below 3000?
Is there a better way of doing this?

CustomerCode:BACU1 AND  ( Barcode:(06743476636 OR 06743482288 OR
06881406272 OR 06881406315 OR 06881406343 OR 06881406383 OR
06881406432 OR 06857852700 OR 06857852756 OR 06857852783 OR
06857852740 OR 06857852768 OR 06857852801 OR 06857852810 OR
06857852819 OR 06857852828 OR 06857852844 OR 06857852859 OR
06857852873 OR 06857852887 OR 06857852894 OR 06857852904 OR
06851761567 OR 06851761572 OR 06851761573 OR 06851761575 OR
06851761576 OR 06851761577 OR 06851761578 OR 06851761580 OR
06851761582 OR 06851761583 OR 06852758372 OR 06852758354 OR
06852758338 OR 06852758318 OR 06852758276 OR 06857789194 OR
06857789198 OR 06857789204 OR 06857789200 OR 06857789220 OR
06856559248 OR 06856559309 OR 06743482260 OR 06743482254 OR
06743482294 OR 06743482232 OR 06743478792 OR 06743482235 OR
06743482250 OR 06743482259 OR 06743482262 OR 06743482256 OR
06743482237 OR 06743478796 OR 06743482018 OR 06743482033 OR
06743482063 OR 06743496937 OR 06743482044 OR 06743481998 OR
06743482066 OR 06743482070 OR 06743482037 OR 06743482001 OR
06743482004 OR 06743482021 OR 06743482056 OR 06743495615 OR
06743482007 OR 06743482028 OR 06743482047 OR 06743496939 OR
06743482009 OR 06743482038 OR 06743496736 OR 06743495617 OR
06743482025 OR 06743482057 OR 06743482049 OR 06743482075 OR
06743482051 OR 06743482031 OR 06743496737 OR 06743496419 OR
06743495619 OR 06743482059 OR 06743482041 OR 06743482060 OR
06743482062 OR 06743482012 OR 06743482053 OR 06743495620 OR
06743496738 OR 06743496940 OR 06743482081 OR 06743482104 OR
06743482130 OR 06743482121 OR 06743482107 OR 06743482094 OR
06743495622 OR 06743482136 OR 06743482096 OR 06743482078 OR
06743482079 OR 06743482085 OR 06743495624 OR 06743482140 OR
06743482141 OR 06743482086 OR 06743482123 OR 06838138542 OR
06743495626 OR 06743495627 OR 06743495628 OR 06743495630 OR
06743482126 OR 06743482089 OR 06743482133 OR 06743482100 OR
06743482110 OR 06743495631 OR 06743495632 OR 06743482134 OR
06743495634 OR 06743495635 OR 06743482137 OR 06743495636 OR
06743495638 OR 06743482128 OR 06743482102 OR 06743482116 OR
06743482091 OR 06743482119 OR 06743495640 OR 06743497254 OR
06743496421 OR 06743496486 OR 06743496487 OR 06743496488 OR
06743496489 OR 06743496422 OR 06743497255 OR 06743496491 OR
06743496423 OR 06743497256 OR 06743496424 OR 06743496492 OR
06743496494 OR 06743496425))





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Could not find lemmatization model

2019-06-18 Thread Vidit Mathur
Sir/ma'am
I was trying to integrate OpenNLP with Solr for lemmatizing the search
text, but I could not find the lemmatization model on
opennlp.sourceforge.net. Could you please help me with this issue or
suggest some workaround.
Regards
Vidit Mathur
(Student)


Some questions about solr source code

2019-06-18 Thread Guohua Wu
Dear Solr Developers,
I am a Chinese software developer and I have been using Solr for nearly 4
years. First, thank you for your continuous effort on improving Solr.
Recently I began to read the source code because I am very curious about
how it works, but I encountered several questions that I spent much time
thinking about without success. I hope you can explain them for me; I list
the questions below. I am using Solr 6.6.0. Please pardon my English.
1. In the method boolean handleVersions(ShardResponse srsp) of the class
org.apache.solr.update.PeerSync, I cannot understand the variable 'boolean
completeList = otherVersions.size() < nUpdates;'. I think it should be just
the opposite. Maybe I misunderstand the whole passage.
2. Also in the class org.apache.solr.update.PeerSync, in the method
'handleVersionsWithRanges', I think there is a bug at line 555:
while ((otherUpdatesIndex < otherVersions.size()) &&
    (Math.abs(otherVersions.get(otherUpdatesIndex)) <
     Math.abs(ourUpdates.get(ourUpdatesIndex)))) {
  ...
}
otherUpdatesIndex is always < otherVersions.size(), because it starts at
otherVersions.size() - 1 and never gets bigger, so there is no point in
checking this. I think it should be replaced with 'otherUpdatesIndex >= 0'.
   I am not an expert on Solr. Maybe I misunderstand the whole class or
method, so please tell me directly if I am wrong.
   Please pardon my English again. I am looking forward to your reply.
Thank you again for your great effort to develop and improve Solr.


Best Regards
Guohua Wu

Re: Does anyone want to put in a few hours at their rate?

2019-06-18 Thread Jörn Franke
Can you please describe the steps you have done so far?


> On 18.06.2019 at 02:22, Val D wrote:
> 
> To whom it may concern:
> 
> I have a Windows based system running Java 8.  I have installed SOLR 7.7.2
> (I also tried version 8.1.1, with the same results). I have
> SQL Server 2018 with 1 table that contains 22+ columns and a few thousand 
> rows.  I am attempting to index the SQL Server table using SOLR as the 
> indexing mechanism.  Once indexed I need to be able to search the table using 
> the SOLR ADMIN module.  That is it.  I have tried every on-line example, 
> sample or explanation and none of them have helped.  I can only assume that I 
> am doing something wrong.  I do not have the required expertise to determine 
> where I am going wrong.  I would like to know if you can help, what it would 
> cost to have someone log into my system to get it working and how long you 
> think it might take.  I will be working very late today.
> 
> Thank you in advance,
> Vincent
> 
> (650) 334-2925
> US UTC-5   Feel free to call at any hour.


Does anyone want to put in a few hours at their rate?

2019-06-18 Thread Val D
To whom it may concern:

I have a Windows based system running Java 8.  I have installed SOLR 7.7.2
(I also tried version 8.1.1, with the same results). I have SQL
Server 2018 with 1 table that contains 22+ columns and a few thousand rows.  I 
am attempting to index the SQL Server table using SOLR as the indexing 
mechanism.  Once indexed I need to be able to search the table using the SOLR 
ADMIN module.  That is it.  I have tried every on-line example, sample or 
explanation and none of them have helped.  I can only assume that I am doing 
something wrong.  I do not have the required expertise to determine where I am 
going wrong.  I would like to know if you can help, what it would cost to have 
someone log into my system to get it working and how long you think it might 
take.  I will be working very late today.

Thank you in advance,
Vincent

(650) 334-2925
US UTC-5   Feel free to call at any hour.