Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Noble Paul
Looking at the code I see 692 occurrences of the word "slave".
Mostly variable names and ref guide docs.
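
A count like this can be reproduced with a short script. The following is
only a sketch: the function name count_occurrences is made up for
illustration, it matches the substring case-insensitively (so "slaves" and
"isSlave" both count), and it is not claimed to reproduce the 692 figure.

```python
import re
from collections import Counter
from pathlib import Path

# Case-insensitive substring match; an assumption, the thread does not say
# how the 692 occurrences were counted.
WORD = re.compile(r"slave", re.IGNORECASE)


def count_occurrences(root, pattern=WORD):
    """Return (total, per-extension Counter) of pattern matches under root."""
    per_ext = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        hits = len(pattern.findall(text))
        if hits:
            per_ext[path.suffix or "(none)"] += hits
    return sum(per_ext.values()), per_ext
```

Grouping by file extension gives a rough split between source code and ref
guide pages, which is the split that matters for deciding what can be
renamed without touching request parameters or response payloads.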

The word "slave" is present in the responses as well. Any change in
the request param/response payload is backward incompatible.

I have no objection to changing the names in ref guide and other
internal variables. Going ahead with backward incompatible changes is
painful. If somebody has the appetite to take it up, it's OK.

If we must change, master/follower can be a good enough option.

master (noun): A man in charge of an organization or group.
master (adj): having or showing very great skill or proficiency.
master (verb): acquire complete knowledge or skill in (a subject,
technique, or art).
master (verb): gain control of; overcome.

I hope nobody has a problem with the term "master"

On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg  wrote:
>
> Would master/follower work?
>
> Half the rename work while still getting rid of the slavery connotation...
>
>
> On Thu 18 Jun 2020 at 07:13, Walter Underwood  wrote:
>
> > > On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> > >
> > > It has been interesting watching this discussion play out on multiple
> > open source mailing lists.  On other projects, I have seen a VERY high
> > level of resistance to these changes, which I find disturbing and
> > surprising.
> >
> > Yes, it is nice to see everyone just pitch in and do it on this list.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >



-- 
-
Noble Paul


Re: Solr cloud backup/restore not working

2020-06-17 Thread Shawn Heisey

On 6/17/2020 8:55 PM, yaswanth kumar wrote:

> Caused by: javax.crypto.BadPaddingException: RSA private key operation
> failed


Something appears to be wrong with the private key that Solr is 
attempting to use for a certificate.


Best guess, incorporating everything I can see in the stacktrace, is 
that you have enabled certificate-based authentication, and the private 
key for the client certificate is malformed in some way.  The error 
message originated in Java code, not Solr.


https://docs.oracle.com/javase/8/docs/api/javax/crypto/BadPaddingException.html

It sounds like the keystore has a problem.  You would need to consult 
with someone who is an expert at Java crypto mechanisms.


Thanks,
Shawn


Re: Solr cloud backup/restore not working

2020-06-17 Thread Shawn Heisey

On 6/16/2020 8:44 AM, yaswanth kumar wrote:

> I don't see anything related in the solr.log file for the same error. Not
> sure if there is any other place where I can check for this.


The underlying request that failed might be happening on one of the 
other nodes in the cloud.  It might be necessary to check the solr.log 
file on multiple machines.


The response here does NOT contain any information about what caused the 
problem.  All it says is that an ADDREPLICA action necessary to complete 
the restore failed.  You'll need to locate the node where the ADDREPLICA 
failed, and we will need to see the FULL error message.  It is probably 
dozens of lines in length.
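
As a rough illustration of checking several nodes at once, the sketch below
scans log text collected from each node and keeps an excerpt around every
line that mentions ADDREPLICA. The function name find_failures, the node
names, and the 40-line excerpt size are assumptions made for this example;
how the solr.log contents are gathered from each machine is left out.

```python
def find_failures(logs_by_node, keyword="ADDREPLICA", max_lines=40):
    """Map each node name to the log excerpts that mention keyword.

    An excerpt starts at a line containing keyword and runs for up to
    max_lines lines, so a stack trace following the message stays with it.
    """
    results = {}
    for node, text in logs_by_node.items():
        lines = text.splitlines()
        excerpts = []
        i = 0
        while i < len(lines):
            if keyword in lines[i]:
                excerpts.append("\n".join(lines[i:i + max_lines]))
                i += max_lines  # skip past the excerpt just captured
            else:
                i += 1
        if excerpts:
            results[node] = excerpts
    return results
```

Nodes with no match are left out of the result, so the keys show which
machines are worth a closer look.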


I see that you've opened an issue in Jira.  That is premature.  The Solr 
project does NOT use Jira as a support portal.  If we determine that 
you're running into a bug, then it would be appropriate to open an issue.


Thanks,
Shawn


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Ilan Ginzburg
Would master/follower work?

Half the rename work while still getting rid of the slavery connotation...


On Thu 18 Jun 2020 at 07:13, Walter Underwood  wrote:

> > On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> >
> > It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Yes, it is nice to see everyone just pitch in and do it on this list.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
> On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> 
> It has been interesting watching this discussion play out on multiple open 
> source mailing lists.  On other projects, I have seen a VERY high level of 
> resistance to these changes, which I find disturbing and surprising.

Yes, it is nice to see everyone just pitch in and do it on this list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
Master/slave is not going away in our company. That cluster has had zero downtime
in five years. I can’t say that about our Solr Cloud clusters.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 9:36 PM, Noble Paul  wrote:
> 
> I really do not see a reason why a master/slave terminology is a problem.
> We do not have slavery anywhere in the world. Should we also remove it from
> the dictionary?
> 
> The old mode is going to go away anyway. Why waste time bikeshedding on
> this?
> 
> On Thu, Jun 18, 2020, 12:04 PM Trey Grainger  wrote:
> 
>> @Shawn,
>> 
>> Ok, yeah, apologies, my semantics were wrong.
>> 
>> I was thinking that a TLog replica is a follower role only and becomes an
>> NRT replica if it gets elected leader. From a pure semantics standpoint,
>> though, I guess technically the TLog replica doesn't "become" an NRT
>> replica, but just "acts the same" as if it was an NRT replica when it gets
>> elected as leader. From the docs regarding TLog replicas: "This type of
>> replica maintains a transaction log but does not index document changes
>> locally... When this type of replica needs to update its index, it does so
>> by replicating the index from the leader... If it does become a leader, it
>> will behave the same as if it was a NRT type of replica."
>> 
>> The Tlog replicas are a bit of a red herring to the point I was making,
>> though, which is that Pull Replicas in SolrCloud mode and Slaves in
>> non-SolrCloud mode both just pull the index from the leader/master, as
>> opposed to updates being pushed the other way. As such, I don't see a
>> meaningful distinction between master/slave and leader/follower behavior in
>> non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
>> talking about renaming (Solr cores that pull indices from other Solr
>> cores).
>> 
>> At any rate, this is not a hill I care to die on. My belief is that it's
>> better to have consistent terminology for what I see as essentially the
>> same functionality. I respect that others disagree and would rather
>> introduce new terminology to clearly distinguish between modes. Regardless
>> of the naming decided on, I'm in support of removing the master/slave
>> nomenclature.
>> 
>> Trey Grainger
>> Founder, Searchkernel
>> https://searchkernel.com
>> 
>> On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:
>> 
>>> On 6/17/2020 2:36 PM, Trey Grainger wrote:
>>>> 2) TLOG - which can only serve in the role of follower
>>> 
>>> This is inaccurate.  TLOG can become leader.  If that happens, then it
>>> functions exactly like an NRT leader.
>>> 
>>> I'm aware that saying the following is bikeshedding ... but I do think
>>> it would be a mistake to use any existing SolrCloud terminology for
>>> non-cloud deployments, including the word "replica".  The top contenders
>>> I have seen to replace master/slave in Solr are primary/secondary and
>>> publisher/subscriber.
>>> 
>>> It has been interesting watching this discussion play out on multiple
>>> open source mailing lists.  On other projects, I have seen a VERY high
>>> level of resistance to these changes, which I find disturbing and
>>> surprising.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Noble Paul
I really do not see a reason why a master/slave terminology is a problem.
We do not have slavery anywhere in the world. Should we also remove it from
the dictionary?

The old mode is going to go away anyway. Why waste time bikeshedding on
this?

On Thu, Jun 18, 2020, 12:04 PM Trey Grainger  wrote:

> @Shawn,
>
> Ok, yeah, apologies, my semantics were wrong.
>
> I was thinking that a TLog replica is a follower role only and becomes an
> NRT replica if it gets elected leader. From a pure semantics standpoint,
> though, I guess technically the TLog replica doesn't "become" an NRT
> replica, but just "acts the same" as if it was an NRT replica when it gets
> elected as leader. From the docs regarding TLog replicas: "This type of
> replica maintains a transaction log but does not index document changes
> locally... When this type of replica needs to update its index, it does so
> by replicating the index from the leader... If it does become a leader, it
> will behave the same as if it was a NRT type of replica."
>
> The Tlog replicas are a bit of a red herring to the point I was making,
> though, which is that Pull Replicas in SolrCloud mode and Slaves in
> non-SolrCloud mode both just pull the index from the leader/master, as
> opposed to updates being pushed the other way. As such, I don't see a
> meaningful distinction between master/slave and leader/follower behavior in
> non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
> talking about renaming (Solr cores that pull indices from other Solr
> cores).
>
> At any rate, this is not a hill I care to die on. My belief is that it's
> better to have consistent terminology for what I see as essentially the
> same functionality. I respect that others disagree and would rather
> introduce new terminology to clearly distinguish between modes. Regardless
> of the naming decided on, I'm in support of removing the master/slave
> nomenclature.
>
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
>
> On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:
>
> > On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > > 2) TLOG - which can only serve in the role of follower
> >
> > This is inaccurate.  TLOG can become leader.  If that happens, then it
> > functions exactly like an NRT leader.
> >
> > I'm aware that saying the following is bikeshedding ... but I do think
> > it would be a mistake to use any existing SolrCloud terminology for
> > non-cloud deployments, including the word "replica".  The top contenders
> > I have seen to replace master/slave in Solr are primary/secondary and
> > publisher/subscriber.
> >
> > It has been interesting watching this discussion play out on multiple
> > open source mailing lists.  On other projects, I have seen a VERY high
> > level of resistance to these changes, which I find disturbing and
> > surprising.
> >
> > Thanks,
> > Shawn
> >
>


Re: Solr cloud backup/restore not working

2020-06-17 Thread yaswanth kumar
Hi Vinodh,

Here is what I see when I tried with requestid,

Collection: test operation: restore
failed:org.apache.solr.common.SolrException: ADDREPLICA failed to create
replica
at
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler$ShardRequestTracker.processResponses(OverseerCollectionMessageHandler.java:1030)
at
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler$ShardRequestTracker.processResponses(OverseerCollectionMessageHandler.java:1013)
at
org.apache.solr.cloud.api.collections.AddReplicaCmd.lambda$addReplica$1(AddReplicaCmd.java:177)
at
org.apache.solr.cloud.api.collections.AddReplicaCmd$$Lambda$746/.run(Unknown
Source)
at
org.apache.solr.cloud.api.collections.AddReplicaCmd.addReplica(AddReplicaCmd.java:199)
at
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.addReplica(OverseerCollectionMessageHandler.java:708)
at
org.apache.solr.cloud.api.collections.RestoreCmd.call(RestoreCmd.java:286)
at
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264)
at
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$142/.run(Unknown
Source)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException:
javax.crypto.BadPaddingException: RSA private key operation failed
at org.apache.solr.util.CryptoKeys$RSAKeyPair.encrypt(CryptoKeys.java:325)
at
org.apache.solr.security.PKIAuthenticationPlugin.generateToken(PKIAuthenticationPlugin.java:305)
at
org.apache.solr.security.PKIAuthenticationPlugin.access$200(PKIAuthenticationPlugin.java:61)
at
org.apache.solr.security.PKIAuthenticationPlugin$2.onQueued(PKIAuthenticationPlugin.java:239)
at
org.apache.solr.client.solrj.impl.Http2SolrClient.decorateRequest(Http2SolrClient.java:468)
at
org.apache.solr.client.solrj.impl.Http2SolrClient.makeRequest(Http2SolrClient.java:455)
at
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:364)
at
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:746)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274)
at
org.apache.solr.handler.component.HttpShardHandler.request(HttpShardHandler.java:238)
at
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:199)
at
org.apache.solr.handler.component.HttpShardHandler$$Lambda$529/.call(Unknown
Source)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
... 5 more
Caused by: javax.crypto.BadPaddingException: RSA private key operation
failed
at
java.base/sun.security.rsa.NativeRSACore.crtCrypt_Native(NativeRSACore.java:149)
at java.base/sun.security.rsa.NativeRSACore.rsa(NativeRSACore.java:91)
at java.base/sun.security.rsa.RSACore.rsa(RSACore.java:149)
at java.base/com.sun.crypto.provider.RSACipher.doFinal(RSACipher.java:355)
at
java.base/com.sun.crypto.provider.RSACipher.engineDoFinal(RSACipher.java:392)
at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2260)
at org.apache.solr.util.CryptoKeys$RSAKeyPair.encrypt(CryptoKeys.java:323)
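
For long traces like the one above, the "Caused by:" lines are usually the
quickest way to the root cause. The sketch below pulls them out of a trace
pasted as text, including messages that the mail client wrapped onto a
second line. The function name caused_by_chain is made up for this example,
and the parsing is a heuristic, not a full stack-trace parser.

```python
def caused_by_chain(trace):
    """Return the 'Caused by:' messages of a Java stack trace, in order.

    The last element is the innermost (root) cause. The first exception in
    the trace is not included because it has no 'Caused by:' prefix.
    """
    causes = []
    current = None  # message being assembled, possibly across wrapped lines
    for raw in trace.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_frame = line == "at" or line.startswith("at ") or line.startswith("...")
        if line.startswith("Caused by:"):
            if current is not None:
                causes.append(current)
            current = line[len("Caused by:"):].strip()
        elif is_frame:
            # A stack frame ends the message that was being assembled.
            if current is not None:
                causes.append(current)
                current = None
        elif current is not None:
            current += " " + line  # continuation of a wrapped message
    if current is not None:
        causes.append(current)
    return causes
```

Applied to the trace above, the last entry would be the
javax.crypto.BadPaddingException raised inside the JDK's RSA code.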

Thanks,

On Wed, Jun 17, 2020 at 8:08 AM Kommu, Vinodh K.  wrote:

> Hi,
>
> What is the log level defined for solr nodes? Did you use requestid in
> restore command? If so, check the status of the requestid if that points to
> any errors.
>
> Thanks & Regards,
> Vinodh
>
> -Original Message-
> From: yaswanth kumar 
> Sent: Wednesday, June 17, 2020 4:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr cloud backup/restore not working
>
> Can someone please guide me on where can I get more detailed error of the
> above exception while doing restore?? All that I see in solr.log was pasted
> above
>
> Thanks,
>
> On Tue, Jun 16, 2020 at 10:44 AM yaswanth kumar 
> wrote:
>
> > I don't see anything related in the solr.log file for the same error.
> > Not sure if there is any other place where I can check for this.
> >
> > Thanks,
> >
> > On Tue, Jun 16, 2020 at 10:21 AM Shawn Heisey 
> wrote:
> >
> >> On 6/12/2020 8:38 AM, yaswanth kumar wrote:
> >> > Using solr 8.2.0 and setup a cloud with 2 nodes. (2 replica's for
> >> > each
> >> > collection)

Re: Log4J Logging to Http

2020-06-17 Thread Shawn Heisey

On 6/17/2020 1:33 AM, Krönert Florian wrote:
> 2020-06-17T07:06:55.121856339Z java.lang.NoClassDefFoundError: Failed to 
> initialize Apache Solr: Could not find necessary SLF4j logging jars. If 
> using Jetty, the SLF4j logging jars need to go in the jetty lib/ext 
> directory. For other containers, the corresponding directory should be 
> used. For more information, see: http://wiki.apache.org/solr/SolrLogging


> It seems that only when using the http appender these jars are needed, 
> without this appender everything works.


There must be some aspect of your log4j2.xml configuration that requires 
a jar that is not included with Solr.


> Can you point me in the right direction, where I need to place the 
> needed jars? Seems to be a little special since I only access the 
> /var/solr mount directly, the rest is running in docker.


If there are extra jars needed for your logging config, they should go 
in the server/lib/ext directory, which should already exist and contain 
several jars related to logging.
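
A quick way to see what is actually in that directory is sketched below.
The helper names list_jars and missing_jars are made up for this example,
and the list of required jar-name prefixes is something you would have to
supply yourself; the thread does not establish which jar the http appender
configuration needs.

```python
from pathlib import Path


def list_jars(ext_dir):
    """Return the sorted names of the .jar files directly inside ext_dir."""
    return sorted(p.name for p in Path(ext_dir).glob("*.jar") if p.is_file())


def missing_jars(ext_dir, required_prefixes):
    """Return the prefixes from required_prefixes that no jar name starts with."""
    jars = list_jars(ext_dir)
    return [prefix for prefix in required_prefixes
            if not any(name.startswith(prefix) for name in jars)]
```

In a Docker setup the directory to inspect is the one inside the container's
Solr install, which may not be visible from the /var/solr mount.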


Thanks,
Shawn


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
@Shawn,

Ok, yeah, apologies, my semantics were wrong.

I was thinking that a TLog replica is a follower role only and becomes an
NRT replica if it gets elected leader. From a pure semantics standpoint,
though, I guess technically the TLog replica doesn't "become" an NRT
replica, but just "acts the same" as if it was an NRT replica when it gets
elected as leader. From the docs regarding TLog replicas: "This type of
replica maintains a transaction log but does not index document changes
locally... When this type of replica needs to update its index, it does so
by replicating the index from the leader... If it does become a leader, it
will behave the same as if it was a NRT type of replica."

The Tlog replicas are a bit of a red herring to the point I was making,
though, which is that Pull Replicas in SolrCloud mode and Slaves in
non-SolrCloud mode both just pull the index from the leader/master, as
opposed to updates being pushed the other way. As such, I don't see a
meaningful distinction between master/slave and leader/follower behavior in
non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
talking about renaming (Solr cores that pull indices from other Solr cores).
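
The separation of replica TYPE from leader/follower role described above can
be written down as a small model. This is only a sketch of the behavior as
stated in this thread and the quoted docs (NRT and TLOG may lead and get
updates pushed to them, PULL replicas copy the index); the names ReplicaType,
can_lead and update_path are made up for illustration and are not Solr APIs.

```python
from enum import Enum


class ReplicaType(Enum):
    NRT = "nrt"
    TLOG = "tlog"
    PULL = "pull"


# Types that may be elected leader; a TLOG leader behaves like an NRT one.
LEADER_ELIGIBLE = {ReplicaType.NRT, ReplicaType.TLOG}
# Follower types the leader pushes updates to (TLOG logs them without
# indexing locally and still replicates the index itself).
RECEIVES_PUSHED_UPDATES = {ReplicaType.NRT, ReplicaType.TLOG}


def can_lead(replica_type):
    """Return True if a replica of this type may take the leader role."""
    return replica_type in LEADER_ELIGIBLE


def update_path(replica_type, is_leader):
    """Describe how a replica of the given type and role gets its updates."""
    if is_leader:
        if not can_lead(replica_type):
            raise ValueError(f"{replica_type.name} replicas cannot be leader")
        return "indexes external updates"
    if replica_type in RECEIVES_PUSHED_UPDATES:
        return "receives updates pushed by the leader"
    return "pulls the index from the leader"
```

In this model a non-SolrCloud slave corresponds to the last branch: a
follower that only pulls the index.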

At any rate, this is not a hill I care to die on. My belief is that it's
better to have consistent terminology for what I see as essentially the
same functionality. I respect that others disagree and would rather
introduce new terminology to clearly distinguish between modes. Regardless
of the naming decided on, I'm in support of removing the master/slave
nomenclature.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:

> On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > 2) TLOG - which can only serve in the role of follower
>
> This is inaccurate.  TLOG can become leader.  If that happens, then it
> functions exactly like an NRT leader.
>
> I'm aware that saying the following is bikeshedding ... but I do think
> it would be a mistake to use any existing SolrCloud terminology for
> non-cloud deployments, including the word "replica".  The top contenders
> I have seen to replace master/slave in Solr are primary/secondary and
> publisher/subscriber.
>
> It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Thanks,
> Shawn
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Michael Gibney
I agree with Shawn that the top contenders so far (from my
perspective) are "primary/secondary" and "publisher/subscriber", and
agree with Walter that whatever term pair is used should ideally be
usable *as a pair* (to identify a cluster type) in addition to
individually (to identify the individual roles in that cluster).

To take the "bikeshedding" metaphor in another direction, I'd submit
"hub/spoke"? It's a little overloaded, but afaict mainly in domains
other than cluster architecture. It's very usable as a pair; it
manages to convey the singular nature of the "hub" and the
equivalent/final nature of the "spokes" in a way that
"primary/secondary" doesn't really; and it avoids implying an active
role in cluster maintenance for the "hub" (cf. "publisher", which
could be misleading in this regard).

Michael

On Wed, Jun 17, 2020 at 9:12 PM Scott Cote  wrote:
>
> Perhaps  Apache could provide a nomenclature suggestion that the projects 
> could adopt.   This would stand well for the whole Apache  community in 
> regards to BLM.
> My two cents as a “user”
> Good luck.
>
>
> On Wednesday, June 17, 2020, 6:00 PM, Shawn Heisey  
> wrote:
>
> On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > 2) TLOG - which can only serve in the role of follower
>
> This is inaccurate.  TLOG can become leader.  If that happens, then it
> functions exactly like an NRT leader.
>
> I'm aware that saying the following is bikeshedding ... but I do think
> it would be a mistake to use any existing SolrCloud terminology for
> non-cloud deployments, including the word "replica".  The top contenders
> I have seen to replace master/slave in Solr are primary/secondary and
> publisher/subscriber.
>
> It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Thanks,
> Shawn
>
>
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Scott Cote
Perhaps  Apache could provide a nomenclature suggestion that the projects could 
adopt.   This would stand well for the whole Apache  community in regards to 
BLM.
My two cents as a “user” 
Good luck.


On Wednesday, June 17, 2020, 6:00 PM, Shawn Heisey  wrote:

On 6/17/2020 2:36 PM, Trey Grainger wrote:
> 2) TLOG - which can only serve in the role of follower

This is inaccurate.  TLOG can become leader.  If that happens, then it 
functions exactly like an NRT leader.

I'm aware that saying the following is bikeshedding ... but I do think 
it would be a mistake to use any existing SolrCloud terminology for 
non-cloud deployments, including the word "replica".  The top contenders 
I have seen to replace master/slave in Solr are primary/secondary and 
publisher/subscriber.

It has been interesting watching this discussion play out on multiple 
open source mailing lists.  On other projects, I have seen a VERY high 
level of resistance to these changes, which I find disturbing and 
surprising.

Thanks,
Shawn





Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Shawn Heisey

On 6/17/2020 2:36 PM, Trey Grainger wrote:

> 2) TLOG - which can only serve in the role of follower


This is inaccurate.  TLOG can become leader.  If that happens, then it 
functions exactly like an NRT leader.


I'm aware that saying the following is bikeshedding ... but I do think 
it would be a mistake to use any existing SolrCloud terminology for 
non-cloud deployments, including the word "replica".  The top contenders 
I have seen to replace master/slave in Solr are primary/secondary and 
publisher/subscriber.


It has been interesting watching this discussion play out on multiple 
open source mailing lists.  On other projects, I have seen a VERY high 
level of resistance to these changes, which I find disturbing and 
surprising.


Thanks,
Shawn


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
Master/slave is not just two roles, but a kind of cluster. I really don’t think
“Standalone” captures the non-Cloud cluster. Nobody in Chegg would 
have any idea that “standalone” meant “no Zookeeper”.

I’ve never thought that master/slave accurately described the traditional
replication model, but I can’t remember what terms I preferred because 
that was ten years ago. A master gives commands. That isn’t how Solr
masters work. It is closer to how an NRT or TLOG leader works, actually.

A Solr master just sits there and waits for other nodes to copy the index.
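
That passive, pull-based model can be shown with a toy sketch. It borrows
the publisher/subscriber names proposed elsewhere in this thread; the
classes and the version counter are assumptions made for illustration and
do not mirror Solr's actual replication code.

```python
class Publisher:
    """Holds committed index data and a version; never contacts subscribers."""

    def __init__(self):
        self._pending = {}
        self.snapshot = {}
        self.version = 0

    def add(self, doc_id, doc):
        self._pending[doc_id] = doc  # not visible to subscribers until commit

    def commit(self):
        self.snapshot = {**self.snapshot, **self._pending}
        self._pending = {}
        self.version += 1


class Subscriber:
    """Polls a publisher and copies its snapshot when the version changed."""

    def __init__(self, publisher):
        self.publisher = publisher  # fixed up front, not elected
        self.version = 0
        self.index = {}

    def poll(self):
        """Copy the publisher's snapshot if it is newer; return True if copied."""
        if self.publisher.version == self.version:
            return False
        self.index = dict(self.publisher.snapshot)
        self.version = self.publisher.version
        return True
```

The publisher has no list of subscribers at all; every transfer starts with
a subscriber asking.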

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 3:03 PM, Trey Grainger  wrote:
> 
> Hi Walter,
> 
>> In Solr Cloud, the leader knows about each follower and updates them.
> Respectfully, I think you're mixing the "TYPE" of replica with the role of
> the "leader" and "follower"
> 
> In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the leader
> push updates to those followers.
> 
> When the TYPE of a follower is PULL, then it does not.  In Standalone mode,
> the type of a (currently) master would be NRT, and the type of the
> (currently) slaves is always PULL.
> 
> As such, this behavior is consistent across both SolrCloud and Standalone
> mode. It is true that Standalone mode does not currently have support for
> two of the replica TYPES that SolrCloud mode does, but I maintain that
> leader vs. follower behavior is inconsistent here.
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> 
> 
> On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
> wrote:
> 
>> But they are not the same. In Solr Cloud, the leader knows about each
>> follower and updates them. In standalone, the master has no idea that
>> slaves exist until a replication request arrives.
>> 
>> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
>> config load time.
>> 
>> Looking ahead in my email inbox, publisher/subscriber is an excellent
>> choice.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
>>> 
>>> I guess I don't see it as polysemous, but instead simplifying.
>>> 
>>> In my proposal, the terms "leader" and "follower" would have the exact
>> same
>>> meaning in both SolrCloud and standalone mode. The only difference would
>> be
>>> that SolrCloud automatically manages the leaders and followers, whereas
>> in
>>> standalone mode you have to manage them manually (as is the case with
>> most
>>> things in SolrCloud vs. Standalone).
>>> 
>>> My view is that having an entirely different set of terminology
>> describing
>>> the same thing is way more cognitive overhead than having consistent
>>> terminology.
>>> 
>>> Trey Grainger
>>> Founder, Searchkernel
>>> https://searchkernel.com
>>> 
>>> On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
>>> wrote:
>>> 
>>>> I strongly disagree with using the Solr Cloud leader/follower
>> terminology
>>>> for non-Cloud clusters. People in my company are confused enough without
>>>> using polysemous terminology.
>>>> 
>>>> “This node is the leader, but it means something different than the
>> leader
>>>> in this other cluster.” I’m dreading that conversation.
>>>> 
>>>> I like “principal”. How about “clone” for the slave role? That suggests
>>>> that
>>>> it does not accept updates and that it is loosely-coupled, only
>> depending
>>>> on the state of the no-longer-called-master.
>>>> 
>>>> Chegg has five production Solr Cloud clusters and one production
>>>> master/slave
>>>> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
>> in
>>>> production.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
>>>>> 
>>>>> Proposal:
>>>>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
>> one
>>>>> or more REPLICAS. Each replica can have a ROLE of either:
>>>>> 1) A LEADER, which can process external updates for the shard
>>>>> 2) A FOLLOWER, which receives updates from another replica"
>>>>> 
>>>>> (Note: I prefer "role" but if others think it's too overloaded due to
>> the
>>>>> overseer role, we could replace it with "mode" or something similar)
>>>>> ---
>>>>> 
>>>>> To be explicit with the above definitions:
>>>>> 1) In SolrCloud, the roles of leaders and followers can dynamically
>>>> change
>>>>> based upon the status of the cluster. In standalone mode, they can be
>>>>> changed by manual intervention.
>>>>> 2) A leader does not have to have any followers (i.e. only one active
>>>>> replica)
>>>>> 3) Each shard always has one leader.
>>>>> 4) A follower can also pull updates from another follower instead of a
>>>>> leader (traditionally known as a REPEATER). A repeater is still a
>>>> follower,
>>>>> but would 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Sorry:
>
> but I maintain that leader vs. follower behavior is inconsistent here.


Sorry, that should have said "I maintain that leader vs. follower behavior
is consistent here."

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 6:03 PM Trey Grainger  wrote:

> Hi Walter,
>
> >In Solr Cloud, the leader knows about each follower and updates them.
> Respectfully, I think you're mixing the "TYPE" of replica with the role of
> the "leader" and "follower"
>
> In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the
> leader push updates to those followers.
>
> When the TYPE of a follower is PULL, then it does not.  In Standalone
> mode, the type of a (currently) master would be NRT, and the type of the
> (currently) slaves is always PULL.
>
> As such, this behavior is consistent across both SolrCloud and Standalone
> mode. It is true that Standalone mode does not currently have support for
> two of the replica TYPES that SolrCloud mode does, but I maintain that
> leader vs. follower behavior is inconsistent here.
>
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
>
>
>
> On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
> wrote:
>
>> But they are not the same. In Solr Cloud, the leader knows about each
>> follower and updates them. In standalone, the master has no idea that
>> slaves exist until a replication request arrives.
>>
>> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
>> config load time.
>>
>> Looking ahead in my email inbox, publisher/subscriber is an excellent
>> choice.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
>> >
>> > I guess I don't see it as polysemous, but instead simplifying.
>> >
>> > In my proposal, the terms "leader" and "follower" would have the exact
>> same
>> > meaning in both SolrCloud and standalone mode. The only difference
>> would be
>> > that SolrCloud automatically manages the leaders and followers, whereas
>> in
>> > standalone mode you have to manage them manually (as is the case with
>> most
>> > things in SolrCloud vs. Standalone).
>> >
>> > My view is that having an entirely different set of terminology
>> describing
>> > the same thing is way more cognitive overhead than having consistent
>> > terminology.
>> >
>> > Trey Grainger
>> > Founder, Searchkernel
>> > https://searchkernel.com
>> >
>> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood
>> > wrote:
>> >
>> >> I strongly disagree with using the Solr Cloud leader/follower
>> terminology
>> >> for non-Cloud clusters. People in my company are confused enough
>> without
>> >> using polysemous terminology.
>> >>
>> >> “This node is the leader, but it means something different than the
>> leader
>> >> in this other cluster.” I’m dreading that conversation.
>> >>
>> >> I like “principal”. How about “clone” for the slave role? That suggests
>> >> that
>> >> it does not accept updates and that it is loosely-coupled, only
>> depending
>> >> on the state of the no-longer-called-master.
>> >>
>> >> Chegg has five production Solr Cloud clusters and one production
>> >> master/slave
>> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
>> in
>> >> production.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger 
>> wrote:
>> >>>
>> >>> Proposal:
>> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
>> one
>> >>> or more REPLICAS. Each replica can have a ROLE of either:
>> >>> 1) A LEADER, which can process external updates for the shard
>> >>> 2) A FOLLOWER, which receives updates from another replica"
>> >>>
>> >>> (Note: I prefer "role" but if others think it's too overloaded due to
>> the
>> >>> overseer role, we could replace it with "mode" or something similar)
>> >>> ---
>> >>>
>> >>> To be explicit with the above definitions:
>> >>> 1) In SolrCloud, the roles of leaders and followers can dynamically
>> >> change
>> >>> based upon the status of the cluster. In standalone mode, they can be
>> >>> changed by manual intervention.
>> >>> 2) A leader does not have to have any followers (i.e. only one active
>> >>> replica)
>> >>> 3) Each shard always has one leader.
>> >>> 4) A follower can also pull updates from another follower instead of a
>> >>> leader (traditionally known as a REPEATER). A repeater is still a
>> >> follower,
>> >>> but would not be considered a leader because it can't process external
>> >>> updates.
>> >>> 5) A replica cannot be both a leader and a follower.
>> >>>
>> >>> In addition to the above roles, each replica can have a TYPE of one
>> of:
>> >>> 1) NRT - which can serve in the role of leader or follower
>> >>> 2) TLOG - which can only serve in the role of follower
>> >>> 3) PULL 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Hi Walter,

>In Solr Cloud, the leader knows about each follower and updates them.
Respectfully, I think you're mixing the "TYPE" of replica with the role of
the "leader" and "follower".

In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the leader
push updates to those followers.

When the TYPE of a follower is PULL, then it does not.  In Standalone mode,
the type of a (currently) master would be NRT, and the type of the
(currently) slaves is always PULL.

As such, this behavior is consistent across both SolrCloud and Standalone
mode. It is true that Standalone mode does not currently have support for
two of the replica TYPES that SolrCloud mode does, but I maintain that
leader vs. follower behavior is consistent here.
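
To make the ROLE vs. TYPE distinction above concrete, here is a minimal
sketch in Python. It is only an illustration of the argument, not Solr code:
the names `Role`, `ReplicaType`, `Replica`, and `leader_pushes_to` are
hypothetical, and the push/pull rule simply restates the paragraphs above.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"


class ReplicaType(Enum):
    NRT = "NRT"
    TLOG = "TLOG"
    PULL = "PULL"


@dataclass(frozen=True)
class Replica:
    """Hypothetical model of a replica: a role plus a type."""
    name: str
    role: Role
    type: ReplicaType


def leader_pushes_to(follower: Replica) -> bool:
    """Return True if the leader pushes updates to this follower.

    Per the description above: NRT and TLOG followers receive pushed
    updates, while a PULL follower fetches updates itself.
    """
    if follower.role is not Role.FOLLOWER:
        raise ValueError(f"{follower.name} is not a follower")
    return follower.type in (ReplicaType.NRT, ReplicaType.TLOG)


# Standalone mode expressed in the proposed terminology (an assumption of
# this sketch): the current master is an NRT leader, each slave a PULL follower.
standalone = [
    Replica("master", Role.LEADER, ReplicaType.NRT),
    Replica("slave1", Role.FOLLOWER, ReplicaType.PULL),
]
```

Under this model the only thing that differs between the two modes is which
TYPES are available, not what "leader" and "follower" mean.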

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
wrote:

> But they are not the same. In Solr Cloud, the leader knows about each
> follower and updates them. In standalone, the master has no idea that
> slaves exist until a replication request arrives.
>
> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
> config load time.
>
> Looking ahead in my email inbox, publisher/subscriber is an excellent
> choice.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
> >
> > I guess I don't see it as polysemous, but instead simplifying.
> >
> > In my proposal, the terms "leader" and "follower" would have the exact
> same
> > meaning in both SolrCloud and standalone mode. The only difference would
> be
> > that SolrCloud automatically manages the leaders and followers, whereas
> in
> > standalone mode you have to manage them manually (as is the case with
> most
> > things in SolrCloud vs. Standalone).
> >
> > My view is that having an entirely different set of terminology
> describing
> > the same thing is way more cognitive overhead than having consistent
> > terminology.
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
> > wrote:
> >
> >> I strongly disagree with using the Solr Cloud leader/follower
> terminology
> >> for non-Cloud clusters. People in my company are confused enough without
> >> using polysemous terminology.
> >>
> >> “This node is the leader, but it means something different than the
> leader
> >> in this other cluster.” I’m dreading that conversation.
> >>
> >> I like “principal”. How about “clone” for the slave role? That suggests
> >> that
> >> it does not accept updates and that it is loosely-coupled, only
> depending
> >> on the state of the no-longer-called-master.
> >>
> >> Chegg has five production Solr Cloud clusters and one production
> >> master/slave
> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
> in
> >> production.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >>>
> >>> Proposal:
> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
> one
> >>> or more REPLICAS. Each replica can have a ROLE of either:
> >>> 1) A LEADER, which can process external updates for the shard
> >>> 2) A FOLLOWER, which receives updates from another replica"
> >>>
> >>> (Note: I prefer "role" but if others think it's too overloaded due to
> the
> >>> overseer role, we could replace it with "mode" or something similar)
> >>> ---
> >>>
> >>> To be explicit with the above definitions:
> >>> 1) In SolrCloud, the roles of leaders and followers can dynamically
> >> change
> >>> based upon the status of the cluster. In standalone mode, they can be
> >>> changed by manual intervention.
> >>> 2) A leader does not have to have any followers (i.e. only one active
> >>> replica)
> >>> 3) Each shard always has one leader.
> >>> 4) A follower can also pull updates from another follower instead of a
> >>> leader (traditionally known as a REPEATER). A repeater is still a
> >> follower,
> >>> but would not be considered a leader because it can't process external
> >>> updates.
> >>> 5) A replica cannot be both a leader and a follower.
> >>>
> >>> In addition to the above roles, each replica can have a TYPE of one of:
> >>> 1) NRT - which can serve in the role of leader or follower
> >>> 2) TLOG - which can only serve in the role of follower
> >>> 3) PULL - which can only serve in the role of follower
> >>>
> >>> A replica's type may be changed automatically in the event that its
> role
> >>> changes.
> >>>
> >>> I think this terminology is consistent with the current Leader/Follower
> >>> usage while also being able to easily accommodate a rename of the
> >> historical
> >>> master/slave terminology without mental gymnastics or the introduction
> of
> >>> more cognitive load through new terminology. I 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Sameer Maggon
+1 for simplifying and using the Leader/Follower terminology. Our company
operates SolrCloud, Standalone Solr, and Master/Slave configurations.
Outside of the Solr developer community, it's painful and confusing to talk
about Master/Slave and Leader/Replica. It would be easier if we had the
following:

The internal differences between manual configuration or SolrCloud being
smart about managing and assigning roles are just the evolution of the
design and details of a particular mode/implementation and shouldn't matter
to the end-user.

Today, when someone not involved in Solr development looks at the
terminology, it looks like new terminology was introduced without thinking
about existing customers or thinking through the system as a whole and how
to best evolve it (not saying that's what happened, but just a perception).
New terminology should be introduced carefully, and +1 on reducing the
cognitive load on an average guy like me.

- There are leaders and there are followers
- Solr Clusters can be configured in two modes/implementations (SolrCloud or
Master/Slave). This one is hard because you don't want to introduce yet
another name here as people are now already familiar with it.
- These modes happen to have different designs and depending upon the mode,
you can go into the design differences of these two modes.

Cheers!
-- 

*Sameer Maggon*
*SearchStax* | www.searchstax.com


On Wed, Jun 17, 2020 at 2:22 PM gnandre  wrote:

> +1 for Leader-Follower. How about Publisher-Subscriber?
>
> On Wed, Jun 17, 2020 at 5:19 PM Rahul Goswami 
> wrote:
>
> > +1 on avoiding SolrCloud terminology. In the interest of keeping it
> obvious
> > and simple, may I please suggest primary/secondary?
> >
> > On Wed, Jun 17, 2020 at 5:14 PM Atita Arora 
> wrote:
> >
> > > I agree with avoiding the use of Solr Cloud terminology too.
> > >
> > > I may suggest going for "prime" and "clone"
> > > (Short and precise as Master and Slave).
> > >
> > > Best,
> > > Atita
> > >
> > >
> > >
> > >
> > >
> > > On Wed, 17 Jun 2020, 22:50 Walter Underwood, 
> > > wrote:
> > >
> > > > I strongly disagree with using the Solr Cloud leader/follower
> > terminology
> > > > for non-Cloud clusters. People in my company are confused enough
> > without
> > > > using polysemous terminology.
> > > >
> > > > “This node is the leader, but it means something different than the
> > > leader
> > > > in this other cluster.” I’m dreading that conversation.
> > > >
> > > > I like “principal”. How about “clone” for the slave role? That
> suggests
> > > > that
> > > > it does not accept updates and that it is loosely-coupled, only
> > depending
> > > > on the state of the no-longer-called-master.
> > > >
> > > > Chegg has five production Solr Cloud clusters and one production
> > > > master/slave
> > > > cluster, so this is not a hypothetical for us. We have 100+ Solr
> hosts
> > in
> > > > production.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > > > On Jun 17, 2020, at 1:36 PM, Trey Grainger 
> > wrote:
> > > > >
> > > > > Proposal:
> > > > > "A Solr COLLECTION is composed of one or more SHARDS, which each
> have
> > > one
> > > > > or more REPLICAS. Each replica can have a ROLE of either:
> > > > > 1) A LEADER, which can process external updates for the shard
> > > > > 2) A FOLLOWER, which receives updates from another replica"
> > > > >
> > > > > (Note: I prefer "role" but if others think it's too overloaded due
> to
> > > the
> > > > > overseer role, we could replace it with "mode" or something
> similar)
> > > > > ---
> > > > >
> > > > > To be explicit with the above definitions:
> > > > > 1) In SolrCloud, the roles of leaders and followers can dynamically
> > > > change
> > > > > based upon the status of the cluster. In standalone mode, they can
> be
> > > > > changed by manual intervention.
> > > > > 2) A leader does not have to have any followers (i.e. only one
> active
> > > > > replica)
> > > > > 3) Each shard always has one leader.
> > > > > 4) A follower can also pull updates from another follower instead
> of
> > a
> > > > > leader (traditionally known as a REPEATER). A repeater is still a
> > > > follower,
> > > > > but would not be considered a leader because it can't process
> > external
> > > > > updates.
> > > > > 5) A replica cannot be both a leader and a follower.
> > > > >
> > > > > In addition to the above roles, each replica can have a TYPE of one
> > of:
> > > > > 1) NRT - which can serve in the role of leader or follower
> > > > > 2) TLOG - which can only serve in the role of follower
> > > > > 3) PULL - which can only serve in the role of follower
> > > > >
> > > > > A replica's type may be changed automatically in the event that its
> > > role
> > > > > changes.
> > > > >
> > > > > I think this terminology is consistent with the current
> > Leader/Follower
> > > > > usage while also being able 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
But they are not the same. In Solr Cloud, the leader knows about each
follower and updates them. In standalone, the master has no idea that
slaves exist until a replication request arrives.

In Solr Cloud, the leader is elected. In standalone, that role is fixed at
config load time.
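
A toy sketch of the difference described above, in Python. This is not Solr
code: the class and method names (`StandaloneMaster`, `CloudLeader`,
`handle_replication_request`) are hypothetical, version numbers stand in for
index state, and both modes are heavily simplified.

```python
class StandaloneMaster:
    """Pull model: the master learns a follower exists only when it polls."""

    def __init__(self):
        self.version = 0
        self.seen_followers = set()  # empty until a replication request arrives

    def index(self, doc):
        # Indexing changes local state only; nothing is sent anywhere.
        self.version += 1

    def handle_replication_request(self, follower_name, follower_version):
        # First contact is the first time the master hears of this follower.
        self.seen_followers.add(follower_name)
        # Return the newer version to fetch, or None if already up to date.
        return self.version if follower_version < self.version else None


class CloudLeader:
    """Push model: the leader knows its followers up front and updates them."""

    def __init__(self, followers):
        self.version = 0
        self.followers = dict(followers)  # follower name -> version

    def index(self, doc):
        self.version += 1
        for name in self.followers:
            self.followers[name] = self.version  # forward the update


master = StandaloneMaster()
master.index("doc1")
known_before_poll = set(master.seen_followers)   # nobody known yet
fetched = master.handle_replication_request("f1", 0)

leader = CloudLeader({"r1": 0, "r2": 0})
leader.index("doc1")
```

The sketch leaves out leader election entirely; it only shows who initiates
the transfer and when the upstream node becomes aware of the downstream one.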

Looking ahead in my email inbox, publisher/subscriber is an excellent choice.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
> 
> I guess I don't see it as polysemous, but instead simplifying.
> 
> In my proposal, the terms "leader" and "follower" would have the exact same
> meaning in both SolrCloud and standalone mode. The only difference would be
> that SolrCloud automatically manages the leaders and followers, whereas in
> standalone mode you have to manage them manually (as is the case with most
> things in SolrCloud vs. Standalone).
> 
> My view is that having an entirely different set of terminology describing
> the same thing is way more cognitive overhead than having consistent
> terminology.
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
> wrote:
> 
>> I strongly disagree with using the Solr Cloud leader/follower terminology
>> for non-Cloud clusters. People in my company are confused enough without
>> using polysemous terminology.
>> 
>> “This node is the leader, but it means something different than the leader
>> in this other cluster.” I’m dreading that conversation.
>> 
>> I like “principal”. How about “clone” for the slave role? That suggests
>> that
>> it does not accept updates and that it is loosely-coupled, only depending
>> on the state of the no-longer-called-master.
>> 
>> Chegg has five production Solr Cloud clusters and one production
>> master/slave
>> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
>> production.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
>>> 
>>> Proposal:
>>> "A Solr COLLECTION is composed of one or more SHARDS, which each have one
>>> or more REPLICAS. Each replica can have a ROLE of either:
>>> 1) A LEADER, which can process external updates for the shard
>>> 2) A FOLLOWER, which receives updates from another replica"
>>> 
>>> (Note: I prefer "role" but if others think it's too overloaded due to the
>>> overseer role, we could replace it with "mode" or something similar)
>>> ---
>>> 
>>> To be explicit with the above definitions:
>>> 1) In SolrCloud, the roles of leaders and followers can dynamically
>> change
>>> based upon the status of the cluster. In standalone mode, they can be
>>> changed by manual intervention.
>>> 2) A leader does not have to have any followers (i.e. only one active
>>> replica)
>>> 3) Each shard always has one leader.
>>> 4) A follower can also pull updates from another follower instead of a
>>> leader (traditionally known as a REPEATER). A repeater is still a
>> follower,
>>> but would not be considered a leader because it can't process external
>>> updates.
>>> 5) A replica cannot be both a leader and a follower.
>>> 
>>> In addition to the above roles, each replica can have a TYPE of one of:
>>> 1) NRT - which can serve in the role of leader or follower
>>> 2) TLOG - which can only serve in the role of follower
>>> 3) PULL - which can only serve in the role of follower
>>> 
>>> A replica's type may be changed automatically in the event that its role
>>> changes.
>>> 
>>> I think this terminology is consistent with the current Leader/Follower
>>> usage while also being able to easily accommodate a rename of the
>> historical
>>> master/slave terminology without mental gymnastics or the introduction of
>>> more cognitive load through new terminology. I think adopting the
>>> Primary/Replica terminology will be incredibly confusing given the
>> already
>>> specific and well established meaning of "replica" within Solr.
>>> 
>>> All the Best,
>>> 
>>> Trey Grainger
>>> Founder, Searchkernel
>>> https://searchkernel.com
>>> 
>>> 
>>> 
>>> On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
>> wrote:
>>> 
>>>> Hi everyone,
>>>>
>>>> Moving a conversation that was happening on the PMC list to the public
>>>> forum. Most of the following is just me recapping the conversation that
>> has
>>>> happened so far.
>>>>
>>>> Some members of the community have been discussing getting rid of the
>>>> master/slave nomenclature from Solr.
>>>>
>>>> While this may require a non-trivial effort, a general consensus so far
>>>> seems to be to start this process and switch over incrementally, if a
>>>> single change ends up being too big.
>>>>
>>>> There have been a lot of suggestions around what the new nomenclature
>> might
>>>> look like, a few people don’t want to overlap the naming here with what
>>>> already exists 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread gnandre
+1 for Leader-Follower. How about Publisher-Subscriber?

On Wed, Jun 17, 2020 at 5:19 PM Rahul Goswami  wrote:

> +1 on avoiding SolrCloud terminology. In the interest of keeping it obvious
> and simple, may I please suggest primary/secondary?
>
> On Wed, Jun 17, 2020 at 5:14 PM Atita Arora  wrote:
>
> > I agree with avoiding the use of Solr Cloud terminology too.
> >
> > I may suggest going for "prime" and "clone"
> > (Short and precise as Master and Slave).
> >
> > Best,
> > Atita
> >
> >
> >
> >
> >
> > On Wed, 17 Jun 2020, 22:50 Walter Underwood, 
> > wrote:
> >
> > > I strongly disagree with using the Solr Cloud leader/follower
> terminology
> > > for non-Cloud clusters. People in my company are confused enough
> without
> > > using polysemous terminology.
> > >
> > > “This node is the leader, but it means something different than the
> > leader
> > > in this other cluster.” I’m dreading that conversation.
> > >
> > > I like “principal”. How about “clone” for the slave role? That suggests
> > > that
> > > it does not accept updates and that it is loosely-coupled, only
> depending
> > > on the state of the no-longer-called-master.
> > >
> > > Chegg has five production Solr Cloud clusters and one production
> > > master/slave
> > > cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
> in
> > > production.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > > > On Jun 17, 2020, at 1:36 PM, Trey Grainger 
> wrote:
> > > >
> > > > Proposal:
> > > > "A Solr COLLECTION is composed of one or more SHARDS, which each have
> > one
> > > > or more REPLICAS. Each replica can have a ROLE of either:
> > > > 1) A LEADER, which can process external updates for the shard
> > > > 2) A FOLLOWER, which receives updates from another replica"
> > > >
> > > > (Note: I prefer "role" but if others think it's too overloaded due to
> > the
> > > > overseer role, we could replace it with "mode" or something similar)
> > > > ---
> > > >
> > > > To be explicit with the above definitions:
> > > > 1) In SolrCloud, the roles of leaders and followers can dynamically
> > > change
> > > > based upon the status of the cluster. In standalone mode, they can be
> > > > changed by manual intervention.
> > > > 2) A leader does not have to have any followers (i.e. only one active
> > > > replica)
> > > > 3) Each shard always has one leader.
> > > > 4) A follower can also pull updates from another follower instead of
> a
> > > > leader (traditionally known as a REPEATER). A repeater is still a
> > > follower,
> > > > but would not be considered a leader because it can't process
> external
> > > > updates.
> > > > 5) A replica cannot be both a leader and a follower.
> > > >
> > > > In addition to the above roles, each replica can have a TYPE of one
> of:
> > > > 1) NRT - which can serve in the role of leader or follower
> > > > 2) TLOG - which can only serve in the role of follower
> > > > 3) PULL - which can only serve in the role of follower
> > > >
> > > > A replica's type may be changed automatically in the event that its
> > role
> > > > changes.
> > > >
> > > > I think this terminology is consistent with the current
> Leader/Follower
> > > > > usage while also being able to easily accommodate a rename of the
> > > > historical
> > > > > master/slave terminology without mental gymnastics or the
> introduction
> > of
> > > > more cognitive load through new terminology. I think adopting the
> > > > Primary/Replica terminology will be incredibly confusing given the
> > > already
> > > > specific and well established meaning of "replica" within Solr.
> > > >
> > > > All the Best,
> > > >
> > > > Trey Grainger
> > > > Founder, Searchkernel
> > > > https://searchkernel.com
> > > >
> > > >
> > > >
> > > > > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta
> > > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> Moving a conversation that was happening on the PMC list to the
> public
> > > >> forum. Most of the following is just me recapping the conversation
> > that
> > > has
> > > >> happened so far.
> > > >>
> > > >> Some members of the community have been discussing getting rid of
> the
> > > >> master/slave nomenclature from Solr.
> > > >>
> > > >> While this may require a non-trivial effort, a general consensus so
> > far
> > > >> seems to be to start this process and switch over incrementally, if
> a
> > > >> single change ends up being too big.
> > > >>
> > > >> There have been a lot of suggestions around what the new
> nomenclature
> > > might
> > > >> look like, a few people don’t want to overlap the naming here with
> > what
> > > >> already exists in SolrCloud i.e. leader/follower.
> > > >>
> > > >> Primary/Replica was an option that was suggested based on what other
> > > >> vendors are moving towards based on Wikipedia:
> > > >> https://en.wikipedia.org/wiki/Master/slave_(technology)
> > > >> , however there 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
I guess I don't see it as polysemous, but instead simplifying.

In my proposal, the terms "leader" and "follower" would have the exact same
meaning in both SolrCloud and standalone mode. The only difference would be
that SolrCloud automatically manages the leaders and followers, whereas in
standalone mode you have to manage them manually (as is the case with most
things in SolrCloud vs. Standalone).

My view is that having an entirely different set of terminology describing
the same thing is way more cognitive overhead than having consistent
terminology.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
wrote:

> I strongly disagree with using the Solr Cloud leader/follower terminology
> for non-Cloud clusters. People in my company are confused enough without
> using polysemous terminology.
>
> “This node is the leader, but it means something different than the leader
> in this other cluster.” I’m dreading that conversation.
>
> I like “principal”. How about “clone” for the slave role? That suggests
> that
> it does not accept updates and that it is loosely-coupled, only depending
> on the state of the no-longer-called-master.
>
> Chegg has five production Solr Cloud clusters and one production
> master/slave
> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> production.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >
> > Proposal:
> > "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> > or more REPLICAS. Each replica can have a ROLE of either:
> > 1) A LEADER, which can process external updates for the shard
> > 2) A FOLLOWER, which receives updates from another replica"
> >
> > (Note: I prefer "role" but if others think it's too overloaded due to the
> > overseer role, we could replace it with "mode" or something similar)
> > ---
> >
> > To be explicit with the above definitions:
> > 1) In SolrCloud, the roles of leaders and followers can dynamically
> change
> > based upon the status of the cluster. In standalone mode, they can be
> > changed by manual intervention.
> > 2) A leader does not have to have any followers (i.e. only one active
> > replica)
> > 3) Each shard always has one leader.
> > 4) A follower can also pull updates from another follower instead of a
> > leader (traditionally known as a REPEATER). A repeater is still a
> follower,
> > but would not be considered a leader because it can't process external
> > updates.
> > 5) A replica cannot be both a leader and a follower.
> >
> > In addition to the above roles, each replica can have a TYPE of one of:
> > 1) NRT - which can serve in the role of leader or follower
> > 2) TLOG - which can only serve in the role of follower
> > 3) PULL - which can only serve in the role of follower
> >
> > A replica's type may be changed automatically in the event that its role
> > changes.
> >
> > I think this terminology is consistent with the current Leader/Follower
> > usage while also being able to easily accommodate a rename of the
> historical
> > master/slave terminology without mental gymnastics or the introduction of
> > more cognitive load through new terminology. I think adopting the
> > Primary/Replica terminology will be incredibly confusing given the
> already
> > specific and well established meaning of "replica" within Solr.
> >
> > All the Best,
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> >
> >
> > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Moving a conversation that was happening on the PMC list to the public
> >> forum. Most of the following is just me recapping the conversation that
> has
> >> happened so far.
> >>
> >> Some members of the community have been discussing getting rid of the
> >> master/slave nomenclature from Solr.
> >>
> >> While this may require a non-trivial effort, a general consensus so far
> >> seems to be to start this process and switch over incrementally, if a
> >> single change ends up being too big.
> >>
> >> There have been a lot of suggestions around what the new nomenclature
> might
> >> look like, a few people don’t want to overlap the naming here with what
> >> already exists in SolrCloud i.e. leader/follower.
> >>
> >> Primary/Replica was an option that was suggested based on what other
> >> vendors are moving towards based on Wikipedia:
> >> https://en.wikipedia.org/wiki/Master/slave_(technology)
> >> , however there were concerns around the use of “replica” as that
> denotes a
> >> very specific concept in SolrCloud. Current terminology clearly
> >> differentiates the use of the traditional replication model from
> SolrCloud
> >> and reusing the names would make it difficult for that to happen.
> >>
> >> There were similar concerns around using Leader/follower.

Re: RankLib model output format to Solr LTR model format

2020-06-17 Thread gnandre
Thanks Doug, this is very helpful.

On Wed, Jun 17, 2020 at 1:11 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> There are several scripts for doing this.
>
> I might encourage you to check out our Hello LTR library of notebooks, which
> has a RankLib training driver and helpers to log training data, train a
> model w/ RankLib, and search with it. I am using this code for my LTR
> contributions to AI Powered Search.
>
> http://github.com/o19s/hello-ltr
>
> But if you just care about the conversion, check out this code. It's
> adapted / inspired by code written by Christine Poerschke with her Ltr For
> Bees demo / talk
>
> https://github.com/o19s/hello-ltr/blob/master/ltr/helpers/convert.py
>
> Best
> -Doug
>
>
>
>
> On Wed, Jun 17, 2020 at 12:46 PM gnandre  wrote:
>
> > Hi,
> >
> > Before I start writing my own implementation for converting RankLib's
> model
> > output format to Solr LTR model format for my own use cases, I just
> wanted
> > to check if there is any work done on this front already. Any references
> > are welcome.
> >
>
>
> --
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search ; Contributor: *AI
> Powered Search *
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
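
For readers who only want the shape of the conversion asked about above, here
is a minimal sketch in Python. It is not the convert.py script linked in the
thread. It assumes a RankLib LambdaMART model file (an <ensemble> of <tree>
elements preceded by "##" comment lines) and targets the JSON layout that, as
far as I understand the Solr LTR documentation, MultipleAdditiveTreesModel
expects. The function names, the model name, and the mapping from RankLib
feature ids to Solr feature names are hypothetical and must match your own
feature store.

```python
import xml.etree.ElementTree as ET


def convert_split(split, feature_names):
    """Recursively convert one RankLib <split> element into a Solr tree node."""
    output = split.find("output")
    if output is not None:
        # Leaf node: RankLib <output> becomes a Solr "value".
        return {"value": output.text.strip()}
    feature_id = int(split.find("feature").text)
    node = {
        "feature": feature_names[feature_id],
        "threshold": split.find("threshold").text.strip(),
    }
    # Child splits carry pos="left" or pos="right".
    for child in split.findall("split"):
        node[child.get("pos")] = convert_split(child, feature_names)
    return node


def ranklib_to_solr(ranklib_text, model_name, feature_names):
    """Build a Solr LTR model dict from RankLib LambdaMART model text."""
    # Drop the "##" header lines so that only the XML ensemble remains.
    xml_text = "\n".join(
        line for line in ranklib_text.splitlines()
        if not line.lstrip().startswith("##")
    )
    ensemble = ET.fromstring(xml_text.strip())
    trees = [
        {
            "weight": tree.get("weight"),
            "root": convert_split(tree.find("split"), feature_names),
        }
        for tree in ensemble.findall("tree")
    ]
    return {
        "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
        "name": model_name,
        "features": [{"name": name} for name in feature_names.values()],
        "params": {"trees": trees},
    }


SAMPLE = """## LambdaMART
## No. of trees = 1
<ensemble>
  <tree id="1" weight="0.1">
    <split>
      <feature> 1 </feature>
      <threshold> 0.45867884 </threshold>
      <split pos="left"><output> -2.0 </output></split>
      <split pos="right"><output> 2.0 </output></split>
    </split>
  </tree>
</ensemble>
"""

# Hypothetical mapping: RankLib feature id 1 is the Solr feature "titleScore".
model = ranklib_to_solr(SAMPLE, "my_lambdamart_model", {1: "titleScore"})
```

Serializing `model` with `json.dumps` gives a document that could be uploaded
to the model store; verify the result against the Solr version in use before
relying on it.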


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Rahul Goswami
+1 on avoiding SolrCloud terminology. In the interest of keeping it obvious
and simple, may I please suggest primary/secondary?

On Wed, Jun 17, 2020 at 5:14 PM Atita Arora  wrote:

> I agree with avoiding the use of Solr Cloud terminology too.
>
> I may suggest going for "prime" and "clone"
> (Short and precise as Master and Slave).
>
> Best,
> Atita
>
>
>
>
>
> On Wed, 17 Jun 2020, 22:50 Walter Underwood, 
> wrote:
>
> > I strongly disagree with using the Solr Cloud leader/follower terminology
> > for non-Cloud clusters. People in my company are confused enough without
> > using polysemous terminology.
> >
> > “This node is the leader, but it means something different than the
> leader
> > in this other cluster.” I’m dreading that conversation.
> >
> > I like “principal”. How about “clone” for the slave role? That suggests
> > that
> > it does not accept updates and that it is loosely-coupled, only depending
> > on the state of the no-longer-called-master.
> >
> > Chegg has five production Solr Cloud clusters and one production
> > master/slave
> > cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> > production.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> > >
> > > Proposal:
> > > "A Solr COLLECTION is composed of one or more SHARDS, which each have
> one
> > > or more REPLICAS. Each replica can have a ROLE of either:
> > > 1) A LEADER, which can process external updates for the shard
> > > 2) A FOLLOWER, which receives updates from another replica"
> > >
> > > (Note: I prefer "role" but if others think it's too overloaded due to
> the
> > > overseer role, we could replace it with "mode" or something similar)
> > > ---
> > >
> > > To be explicit with the above definitions:
> > > 1) In SolrCloud, the roles of leaders and followers can dynamically
> > change
> > > based upon the status of the cluster. In standalone mode, they can be
> > > changed by manual intervention.
> > > 2) A leader does not have to have any followers (i.e. only one active
> > > replica)
> > > 3) Each shard always has one leader.
> > > 4) A follower can also pull updates from another follower instead of a
> > > leader (traditionally known as a REPEATER). A repeater is still a
> > follower,
> > > but would not be considered a leader because it can't process external
> > > updates.
> > > 5) A replica cannot be both a leader and a follower.
> > >
> > > In addition to the above roles, each replica can have a TYPE of one of:
> > > 1) NRT - which can serve in the role of leader or follower
> > > 2) TLOG - which can only serve in the role of follower
> > > 3) PULL - which can only serve in the role of follower
> > >
> > > A replica's type may be changed automatically in the event that its
> role
> > > changes.
> > >
> > > I think this terminology is consistent with the current Leader/Follower
> > > usage while also being able to easily accommodate a rename of the
> > historical
> > > master/slave terminology without mental gymnastics or the introduction
> of
> > > more cognitive load through new terminology. I think adopting the
> > > Primary/Replica terminology will be incredibly confusing given the
> > already
> > > specific and well established meaning of "replica" within Solr.
> > >
> > > All the Best,
> > >
> > > Trey Grainger
> > > Founder, Searchkernel
> > > https://searchkernel.com
> > >
> > >
> > >
> > > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> > wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> Moving a conversation that was happening on the PMC list to the public
> > >> forum. Most of the following is just me recapping the conversation
> that
> > has
> > >> happened so far.
> > >>
> > >> Some members of the community have been discussing getting rid of the
> > >> master/slave nomenclature from Solr.
> > >>
> > >> While this may require a non-trivial effort, a general consensus so
> far
> > >> seems to be to start this process and switch over incrementally, if a
> > >> single change ends up being too big.
> > >>
> > >> There have been a lot of suggestions around what the new nomenclature
> > might
> > >> look like, a few people don’t want to overlap the naming here with
> what
> > >> already exists in SolrCloud i.e. leader/follower.
> > >>
> > >> Primary/Replica was an option that was suggested based on what other
> > >> vendors are moving towards based on Wikipedia:
> > >> https://en.wikipedia.org/wiki/Master/slave_(technology)
> > >> , however there were concerns around the use of “replica” as that
> > denotes a
> > >> very specific concept in SolrCloud. Current terminology clearly
> > >> differentiates the use of the traditional replication model from
> > SolrCloud
> > >> and reusing the names would make it difficult for that to happen.
> > >>
> > >> There were similar concerns around using Leader/follower.
> > >>
> > >> Let’s continue this 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Atita Arora
I agree with avoiding the use of SolrCloud terminology too.

I may suggest going for "prime" and "clone"
(Short and precise as Master and Slave).

Best,
Atita





On Wed, 17 Jun 2020, 22:50 Walter Underwood,  wrote:

> I strongly disagree with using the Solr Cloud leader/follower terminology
> for non-Cloud clusters. People in my company are confused enough without
> using polysemous terminology.
>
> “This node is the leader, but it means something different than the leader
> in this other cluster.” I’m dreading that conversation.
>
> I like “principal”. How about “clone” for the slave role? That suggests
> that
> it does not accept updates and that it is loosely-coupled, only depending
> on the state of the no-longer-called-master.
>
> Chegg has five production Solr Cloud clusters and one production
> master/slave
> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> production.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >
> > Proposal:
> > "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> > or more REPLICAS. Each replica can have a ROLE of either:
> > 1) A LEADER, which can process external updates for the shard
> > 2) A FOLLOWER, which receives updates from another replica"
> >
> > (Note: I prefer "role" but if others think it's too overloaded due to the
> > overseer role, we could replace it with "mode" or something similar)
> > ---
> >
> > To be explicit with the above definitions:
> > 1) In SolrCloud, the roles of leaders and followers can dynamically
> change
> > based upon the status of the cluster. In standalone mode, they can be
> > changed by manual intervention.
> > 2) A leader does not have to have any followers (i.e. only one active
> > replica)
> > 3) Each shard always has one leader.
> > 4) A follower can also pull updates from another follower instead of a
> > leader (traditionally known as a REPEATER). A repeater is still a
> follower,
> > but would not be considered a leader because it can't process external
> > updates.
> > 5) A replica cannot be both a leader and a follower.
> >
> > In addition to the above roles, each replica can have a TYPE of one of:
> > 1) NRT - which can serve in the role of leader or follower
> > 2) TLOG - which can only serve in the role of follower
> > 3) PULL - which can only serve in the role of follower
> >
> > A replica's type may be changed automatically in the event that its role
> > changes.
> >
> > I think this terminology is consistent with the current Leader/Follower
> > usage while also being able to easily accommodate a rename of the
> > historical master/slave terminology without mental gymnastics or the
> > introduction of more cognitive load through new terminology. I think
> > adopting the Primary/Replica terminology will be incredibly confusing
> > given the already specific and well-established meaning of "replica"
> > within Solr.
> >
> > All the Best,
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> >
> >
> > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Moving a conversation that was happening on the PMC list to the public
> >> forum. Most of the following is just me recapping the conversation that
> has
> >> happened so far.
> >>
> >> Some members of the community have been discussing getting rid of the
> >> master/slave nomenclature from Solr.
> >>
> >> While this may require a non-trivial effort, a general consensus so far
> >> seems to be to start this process and switch over incrementally, if a
> >> single change ends up being too big.
> >>
> >> There have been a lot of suggestions around what the new nomenclature
> might
> >> look like, a few people don’t want to overlap the naming here with what
> >> already exists in SolrCloud i.e. leader/follower.
> >>
> >> Primary/Replica was an option that was suggested based on what other
> >> vendors are moving towards based on Wikipedia:
> >> https://en.wikipedia.org/wiki/Master/slave_(technology)
> >> , however there were concerns around the use of “replica” as that
> denotes a
> >> very specific concept in SolrCloud. Current terminology clearly
> >> differentiates the use of the traditional replication model from
> SolrCloud
> >> and reusing the names would make it difficult for that to happen.
> >>
> >> There were similar concerns around using Leader/follower.
> >>
> >> Let’s continue this conversation here while making sure that we converge
> >> without much bike-shedding.
> >>
> >> -Anshum
> >>
>
>


Re: Autocommit in SolrCloud with many shards

2020-06-17 Thread Erick Erickson
Please raise a JIRA and attach your patch to that….

Best,
Erick

P.S. Buy me some beers sometime if we’re even in the same place...

> On Jun 17, 2020, at 5:00 PM, Bram Van Dam  wrote:
> 
> Thanks for pointing that out. I'm attaching a patch for the ref-guide
> which summarizes what you said. Maybe other people will find this useful
> as well?
> 
> Oh and Erick, thanks for your ever thoughtful replies. Given all the
> hours of your time I've soaked up over the years, you should probably
> start invoicing me :-)
> 
> - Bram
> 
> On 17/06/2020 13:55, Erick Erickson wrote:
>> Each node has its own timer that starts when it receives an update.
>> So in your situation, 60 seconds after any given replica gets its first
>> update, all documents that have been received in the interval will
>> be committed.
>> 
>> But note several things:
>> 
>> 1> commits will tend to cluster for a given shard. By that I mean
>>   they’ll tend to happen within a few milliseconds of each other
>>   ‘cause it doesn’t take that long for an update to get from the
>>   leader to all the followers.
>> 
>> 2> this is per replica. So if you host replicas from multiple collections
>>   on some node, their commits have no relation to each other. And
>>   say for some reason you transmit exactly one document that lands
>>   on shard1. Further, say nodeA contains replicas for shard1 and shard2.
>>   Only the replica for shard1 would commit.
>> 
>> 3> Solr promises eventual consistency. In this case, due to all the
>>   timing variables it is not guaranteed that every replica of a single
>>   shard has the same document available for search at any given time.
>>   Say doc1 hits the leader at time T and a follower at time T+10ms.
>>   Say doc2 hits the leader and gets indexed 5ms before the 
>>   commit is triggered, but for some reason it takes 15ms for it to get
>>   to the follower. The leader will be able to search doc2, but the
>>   follower won’t until 60 seconds later.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 17, 2020, at 5:36 AM, Bram Van Dam  wrote:
>>> 
>>> 'morning :-)
>>> 
>>> I'm wondering how autocommits work in Solr.
>>> 
>>> Say I have a cluster with many nodes and many collections with many
>>> shards. If each collection's config has a hard autocommit configured
>>> every minute, does that mean that SolrCloud (presumably the leader?)
>>> will dish out commit requests to each node on that schedule? Or does
>>> each node have its own timed trigger?
>>> 
>>> If it's the former, doesn't that mean the load will spike dramatically
>>> across the whole cluster every minute?
>>> 
>>> I tried reading the code, but I don't quite understand the way
>>> CommitTracker and the UpdateHandlers interact with SolrCloud.
>>> 
>>> Thanks,
>>> 
>>> - Bram
>> 
> 
> 



Re: Autocommit in SolrCloud with many shards

2020-06-17 Thread Bram Van Dam
Thanks for pointing that out. I'm attaching a patch for the ref-guide
which summarizes what you said. Maybe other people will find this useful
as well?

Oh and Erick, thanks for your ever thoughtful replies. Given all the
hours of your time I've soaked up over the years, you should probably
start invoicing me :-)

 - Bram

On 17/06/2020 13:55, Erick Erickson wrote:
> Each node has its own timer that starts when it receives an update.
> So in your situation, 60 seconds after any given replica gets its first
> update, all documents that have been received in the interval will
> be committed.
> 
> But note several things:
> 
> 1> commits will tend to cluster for a given shard. By that I mean
>    they’ll tend to happen within a few milliseconds of each other
>    ‘cause it doesn’t take that long for an update to get from the
>    leader to all the followers.
> 
> 2> this is per replica. So if you host replicas from multiple collections
>    on some node, their commits have no relation to each other. And
>    say for some reason you transmit exactly one document that lands
>    on shard1. Further, say nodeA contains replicas for shard1 and shard2.
>    Only the replica for shard1 would commit.
> 
> 3> Solr promises eventual consistency. In this case, due to all the
>    timing variables it is not guaranteed that every replica of a single
>    shard has the same document available for search at any given time.
>    Say doc1 hits the leader at time T and a follower at time T+10ms.
>    Say doc2 hits the leader and gets indexed 5ms before the
>    commit is triggered, but for some reason it takes 15ms for it to get
>    to the follower. The leader will be able to search doc2, but the
>    follower won’t until 60 seconds later.
> 
> Best,
> Erick
> 
>> On Jun 17, 2020, at 5:36 AM, Bram Van Dam  wrote:
>>
>> 'morning :-)
>>
>> I'm wondering how autocommits work in Solr.
>>
>> Say I have a cluster with many nodes and many collections with many
>> shards. If each collection's config has a hard autocommit configured
>> every minute, does that mean that SolrCloud (presumably the leader?)
>> will dish out commit requests to each node on that schedule? Or does
>> each node have its own timed trigger?
>>
>> If it's the former, doesn't that mean the load will spike dramatically
>> across the whole cluster every minute?
>>
>> I tried reading the code, but I don't quite understand the way
>> CommitTracker and the UpdateHandlers interact with SolrCloud.
>>
>> Thanks,
>>
>> - Bram
> 
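
Erick's timing example quoted above can be sketched in a few lines of Java.
This is only a toy model, not Solr's CommitTracker: it assumes each replica
starts its own timer on the first uncommitted update, fires the commit
maxTime later, and that an update arriving at the exact instant the commit
fires misses it. The class and method names are made up for illustration.

```java
final class AutoCommitSketch {

    /**
     * For each update arrival time (ascending, in ms), returns the time at
     * which the hard commit covering that update fires on one replica.
     * Assumption: the timer starts on the first uncommitted update and fires
     * maxTimeMs later; an update arriving exactly at fire time misses it.
     */
    static long[] commitTimes(long[] arrivalsMs, long maxTimeMs) {
        long[] commitAt = new long[arrivalsMs.length];
        boolean timerRunning = false;
        long fireAt = 0;
        for (int i = 0; i < arrivalsMs.length; i++) {
            if (!timerRunning || arrivalsMs[i] >= fireAt) {
                // No pending commit covers this update: start a new timer.
                fireAt = arrivalsMs[i] + maxTimeMs;
                timerRunning = true;
            }
            commitAt[i] = fireAt;
        }
        return commitAt;
    }

    public static void main(String[] args) {
        long maxTime = 60_000; // hard autocommit every minute
        // doc1 at T (leader) and T+10ms (follower); doc2 reaches the leader
        // 5ms before its commit and the follower 15ms after that.
        long[] leader = commitTimes(new long[] {0, 59_995}, maxTime);
        long[] follower = commitTimes(new long[] {10, 60_010}, maxTime);
        System.out.println("doc2 committed on leader at   T+" + leader[1] + "ms");
        System.out.println("doc2 committed on follower at T+" + follower[1] + "ms");
    }
}
```

Under these assumptions doc2 is covered by the leader's first commit but only
by the follower's second one, roughly a minute later, which is the lag the
quoted explanation describes.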

From 858406e5c322a96c82934a6477518f65c5c605cc Mon Sep 17 00:00:00 2001
From: Bram 
Date: Wed, 17 Jun 2020 22:54:46 +0200
Subject: [PATCH] Add a blurb about commit timings to the SolrCloud
 documentation

---
 .../src/shards-and-indexing-data-in-solrcloud.adoc  | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc b/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
index 3aa07cbdae7..43828048383 100644
--- a/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
+++ b/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
@@ -122,6 +122,8 @@ More details on how to use shard splitting is in the section on the Collection A
 
 In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with `openSearcher=false` and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster.
 
+TIP: Each node has its own auto commit timer which starts upon receipt of an update. While Solr promises eventual consistency, leaders will generally receive updates *before* replicas; it is therefore possible for replicas to lag behind somewhat.
+
 To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the `IgnoreCommitOptimizeUpdateProcessorFactory`, which allows you to ignore explicit commits and/or optimize requests from client applications without having refactor your client application code.
 
 To activate this request processor you'll need to add the following to your `solrconfig.xml`:
-- 
2.20.1



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Walter Underwood
I strongly disagree with using the Solr Cloud leader/follower terminology
for non-Cloud clusters. People in my company are confused enough without
using polysemous terminology.

“This node is the leader, but it means something different than the leader
in this other cluster.” I’m dreading that conversation.

I like “principal”. How about “clone” for the slave role? That suggests that
it does not accept updates and that it is loosely-coupled, only depending 
on the state of the no-longer-called-master.

Chegg has five production Solr Cloud clusters and one production master/slave
cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in 
production.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> 
> Proposal:
> "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> or more REPLICAS. Each replica can have a ROLE of either:
> 1) A LEADER, which can process external updates for the shard
> 2) A FOLLOWER, which receives updates from another replica"
> 
> (Note: I prefer "role" but if others think it's too overloaded due to the
> overseer role, we could replace it with "mode" or something similar)
> ---
> 
> To be explicit with the above definitions:
> 1) In SolrCloud, the roles of leaders and followers can dynamically change
> based upon the status of the cluster. In standalone mode, they can be
> changed by manual intervention.
> 2) A leader does not have to have any followers (i.e. only one active
> replica)
> 3) Each shard always has one leader.
> 4) A follower can also pull updates from another follower instead of a
> leader (traditionally known as a REPEATER). A repeater is still a follower,
> but would not be considered a leader because it can't process external
> updates.
> 5) A replica cannot be both a leader and a follower.
> 
> In addition to the above roles, each replica can have a TYPE of one of:
> 1) NRT - which can serve in the role of leader or follower
> 2) TLOG - which can only serve in the role of follower
> 3) PULL - which can only serve in the role of follower
> 
> A replica's type may be changed automatically in the event that its role
> changes.
> 
> I think this terminology is consistent with the current Leader/Follower
> usage while also being able to easily accommodate a rename of the historical
> master/slave terminology without mental gymnastics or the introduction of
> more cognitive load through new terminology. I think adopting the
> Primary/Replica terminology will be incredibly confusing given the already
> specific and well established meaning of "replica" within Solr.
> 
> All the Best,
> 
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
> 
> 
> 
> On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta  wrote:
> 
>> Hi everyone,
>> 
>> Moving a conversation that was happening on the PMC list to the public
>> forum. Most of the following is just me recapping the conversation that has
>> happened so far.
>> 
>> Some members of the community have been discussing getting rid of the
>> master/slave nomenclature from Solr.
>> 
>> While this may require a non-trivial effort, a general consensus so far
>> seems to be to start this process and switch over incrementally, if a
>> single change ends up being too big.
>> 
>> There have been a lot of suggestions around what the new nomenclature might
>> look like, a few people don’t want to overlap the naming here with what
>> already exists in SolrCloud i.e. leader/follower.
>> 
>> Primary/Replica was an option that was suggested based on what other
>> vendors are moving towards based on Wikipedia:
>> https://en.wikipedia.org/wiki/Master/slave_(technology)
>> , however there were concerns around the use of “replica” as that denotes a
>> very specific concept in SolrCloud. Current terminology clearly
>> differentiates the use of the traditional replication model from SolrCloud
>> and reusing the names would make it difficult for that to happen.
>> 
>> There were similar concerns around using Leader/follower.
>> 
>> Let’s continue this conversation here while making sure that we converge
>> without much bike-shedding.
>> 
>> -Anshum
>> 



Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Proposal:
"A Solr COLLECTION is composed of one or more SHARDS, which each have one
or more REPLICAS. Each replica can have a ROLE of either:
1) A LEADER, which can process external updates for the shard
2) A FOLLOWER, which receives updates from another replica"

(Note: I prefer "role" but if others think it's too overloaded due to the
overseer role, we could replace it with "mode" or something similar)
---

To be explicit with the above definitions:
1) In SolrCloud, the roles of leaders and followers can dynamically change
based upon the status of the cluster. In standalone mode, they can be
changed by manual intervention.
2) A leader does not have to have any followers (i.e. only one active
replica)
3) Each shard always has one leader.
4) A follower can also pull updates from another follower instead of a
leader (traditionally known as a REPEATER). A repeater is still a follower,
but would not be considered a leader because it can't process external
updates.
5) A replica cannot be both a leader and a follower.

In addition to the above roles, each replica can have a TYPE of one of:
1) NRT - which can serve in the role of leader or follower
2) TLOG - which can only serve in the role of follower
3) PULL - which can only serve in the role of follower

A replica's type may be changed automatically in the event that its role
changes.
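
For readers who think in code, the definitions above can be sketched as a
small Java model. It mirrors the proposal as worded here, not necessarily
SolrCloud's actual election rules. All class and method names are
hypothetical, and the "fall back to NRT" rule in typeAfterRoleChange is an
assumption added purely to illustrate an automatic type change.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ReplicaModelSketch {

    /** A replica has exactly one role, so it is never both (point 5). */
    enum Role { LEADER, FOLLOWER }

    /** Replica types and the roles each may serve in, as worded in the proposal. */
    enum Type {
        NRT(EnumSet.of(Role.LEADER, Role.FOLLOWER)),
        TLOG(EnumSet.of(Role.FOLLOWER)),
        PULL(EnumSet.of(Role.FOLLOWER));

        private final Set<Role> allowedRoles;

        Type(Set<Role> allowedRoles) {
            this.allowedRoles = allowedRoles;
        }

        boolean canServeAs(Role role) {
            return allowedRoles.contains(role);
        }
    }

    /** Point 3: a shard has exactly one leader. Point 2: followers are optional. */
    static boolean isValidShard(List<Role> replicaRoles) {
        long leaders = replicaRoles.stream().filter(r -> r == Role.LEADER).count();
        return leaders == 1;
    }

    /**
     * Hypothetical rule for an automatic type change: keep the current type
     * if it may serve in the new role, otherwise fall back to NRT.
     */
    static Type typeAfterRoleChange(Type current, Role newRole) {
        return current.canServeAs(newRole) ? current : Type.NRT;
    }

    public static void main(String[] args) {
        System.out.println(isValidShard(Arrays.asList(Role.LEADER)));
        System.out.println(isValidShard(Arrays.asList(Role.FOLLOWER, Role.FOLLOWER)));
        System.out.println(typeAfterRoleChange(Type.TLOG, Role.LEADER));
    }
}
```

A REPEATER needs no extra role in this model: it is simply a FOLLOWER whose
upstream happens to be another follower.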

I think this terminology is consistent with the current Leader/Follower
usage while also being able to easily accommodate a rename of the historical
master/slave terminology without mental gymnastics or the introduction of
more cognitive load through new terminology. I think adopting the
Primary/Replica terminology will be incredibly confusing given the already
specific and well established meaning of "replica" within Solr.

All the Best,

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta  wrote:

> Hi everyone,
>
> Moving a conversation that was happening on the PMC list to the public
> forum. Most of the following is just me recapping the conversation that has
> happened so far.
>
> Some members of the community have been discussing getting rid of the
> master/slave nomenclature from Solr.
>
> While this may require a non-trivial effort, a general consensus so far
> seems to be to start this process and switch over incrementally, if a
> single change ends up being too big.
>
> There have been a lot of suggestions around what the new nomenclature might
> look like, a few people don’t want to overlap the naming here with what
> already exists in SolrCloud i.e. leader/follower.
>
> Primary/Replica was an option that was suggested based on what other
> vendors are moving towards based on Wikipedia:
> https://en.wikipedia.org/wiki/Master/slave_(technology)
> , however there were concerns around the use of “replica” as that denotes a
> very specific concept in SolrCloud. Current terminology clearly
> differentiates the use of the traditional replication model from SolrCloud
> and reusing the names would make it difficult for that to happen.
>
> There were similar concerns around using Leader/follower.
>
> Let’s continue this conversation here while making sure that we converge
> without much bike-shedding.
>
> -Anshum
>


Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Anshum Gupta
Hi everyone,

Moving a conversation that was happening on the PMC list to the public
forum. Most of the following is just me recapping the conversation that has
happened so far.

Some members of the community have been discussing getting rid of the
master/slave nomenclature from Solr.

While this may require a non-trivial effort, a general consensus so far
seems to be to start this process and switch over incrementally, if a
single change ends up being too big.

There have been a lot of suggestions around what the new nomenclature might
look like, a few people don’t want to overlap the naming here with what
already exists in SolrCloud i.e. leader/follower.

Primary/Replica was an option that was suggested based on what other
vendors are moving towards based on Wikipedia:
https://en.wikipedia.org/wiki/Master/slave_(technology)
, however there were concerns around the use of “replica” as that denotes a
very specific concept in SolrCloud. Current terminology clearly
differentiates the use of the traditional replication model from SolrCloud
and reusing the names would make it difficult for that to happen.

There were similar concerns around using Leader/follower.

Let’s continue this conversation here while making sure that we converge
without much bike-shedding.

-Anshum


Re: RankLib model output format to Solr LTR model format

2020-06-17 Thread Doug Turnbull
There are several scripts for doing this.

I'd encourage you to check out our Hello LTR library of notebooks, which
has a RankLib training driver and helpers to log training data, train a
model with RankLib, and search with it. I am using this code for my LTR
contributions to AI Powered Search.

http://github.com/o19s/hello-ltr

But if you just care about the conversion, check out this code. It's
adapted from / inspired by code written by Christine Poerschke for her Ltr
For Bees demo / talk.

https://github.com/o19s/hello-ltr/blob/master/ltr/helpers/convert.py

Best
-Doug




On Wed, Jun 17, 2020 at 12:46 PM gnandre  wrote:

> Hi,
>
> Before I start writing my own implementation for converting RankLib's model
> output format to Solr LTR model format for my own use cases, I just wanted
> to check if there is any work done on this front already. Any references
> are welcome.
>


-- 
*Doug Turnbull* | CTO | OpenSource Connections, LLC | 240.476.9983
Author: Relevant Search; Contributor: AI Powered Search
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Java - setting multi-valued fields

2020-06-17 Thread kumar gaurav
Hi,

Your example:

String[] values = new String[] {"value 1", "value 2"};
inputDoc.setField(multiFieldName, values);

Can you try changing the array to a list?

// needs java.util.List and java.util.ArrayList
List<String> values = new ArrayList<>();
values.add("value 1");
values.add("value 2");
inputDoc.setField(multiFieldName, values);



regards

Kumar Gaurav







On Wed, Jun 17, 2020 at 8:33 PM Eivind Hodneland <
eivind.hodnel...@uptimeconsulting.no> wrote:

> Hi,
>
>
>
> My customer has a Solr index with a large number of fields, many of which
> are multivalued (type="string", multiValued="true").
>
>
>
> I am having problems with setting the values for these fields in my Java
> update processors.
>
> Example:
>
> String[] values = new String[] {"value 1", "value 2"};
>
> inputDoc.setField (multiFieldName, values);
>
>
>
> However, only “value 1” is present in the index after updating.
>
> What is the best / correct way to make this work?
>
>
>
>
>
>
>
> Uptime Consulting | Eivind Hodneland | Senior Consultant | Munchs gate 7,
> NO-0165 Oslo, Norway
>
> Tel: +47 22 33 71 00 | Mob: +47 971 76 083 |
> eivind.hodnel...@uptimeconsulting.no  | www.uptimeconsulting.no
>
> --
>
> Search and Big Data solutions
>
> Software Development
>
> IT outsourcing services and consultancy
>
>
>
>
>
>


Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Erick Erickson
What Walter said. Although with Solr 7.6, unless you specify maxSegments
explicitly, you won’t create segments over the default 5G maximum.

And if you have in the past specified maxSegments so you have segments over
5G, optimize (again without specifying maxSegments) will do a “singleton
merge” on them, i.e. it’ll rewrite each large segment into a single new
segment with all the deleted data removed, thus gradually shrinking it. This
happens automatically if you delete documents (update is a delete + add so
counts), but you may have a significant percentage of deleted docs in your
index.

Best,
Erick

> On Jun 17, 2020, at 12:39 PM, Walter Underwood  wrote:
> 
> From that short description, you should not be running optimize at all.
> 
> Just stop doing it. It doesn’t make that big a difference.
> 
> It may take your indexes a few weeks to get back to a normal state after the 
> forced merges.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 17, 2020, at 4:12 AM, Raveendra Yerraguntla 
>>  wrote:
>> 
>> Thank you David, Walt, Erick.
>> 1. The first time a bloated index was generated, there was no disk space
>>    issue; one copy of the index is 1/6 of disk capacity. We ran out of disk
>>    capacity after more than 2 bloated copies.
>> 2. Solr was upgraded from 5.*. In 5.*, more than 5 segments caused
>>    performance issues. Performance in 7.* has not been measured for an
>>    increasing number of segments. I will plan a PT to get the optimum
>>    number. The application does incremental indexing multiple times in a
>>    work week.
>> I will keep you updated on the resolution.
>> Thanks again
>>   On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
>>  wrote:  
>> 
>> It Depends (tm).
>> 
>> As of Solr 7.5, optimize is different. See: 
>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> 
>> So, assuming you have _not_ specified maxSegments=1, any very large
>> segment (near 5G) that has _zero_ deleted documents won’t be merged.
>> 
>> So there are two scenarios:
>> 
>> 1> What Walter mentioned. The optimize process runs out of disk space
>>    and leaves lots of crud around
>> 
>> 2> your “older segments” are just max-sized segments with zero deletions.
>> 
>> 
>> All that said… do you have demonstrable performance improvements after
>> optimizing? The entire name “optimize” is misleading, of course who
>> wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
>> it made quite a difference. In more recent Solr releases, it’s not as clear
>> cut. So before worrying about making optimize work, I’d recommend that
>> you do some performance tests on optimized and un-optimized indexes. 
>> If there are significant improvements, that’s one thing. Otherwise, it’s
>> a waste.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
>>> 
>>> For a full forced merge (mistakenly named “optimize”), the worst case disk 
>>> space
>>> is 3X the size of the index. It is common to need 2X the size of the index.
>>> 
>>> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
>>> behavior.
>>> I implemented a disk space check that would refuse to merge if there wasn’t 
>>> enough
>>> free space. It would log an error and send an email to the admin.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Jun 16, 2020, at 1:58 PM, David Hastings
>>>> wrote:
>>>>
>>>> I can't give you a 100% true answer but I've experienced this, and what
>>>> "seemed" to happen to me was that the optimize would start, and that would
>>>> drive the size up threefold, and if you run out of disk space in the
>>>> process the optimize will quit, since it can't optimize, and leave the
>>>> live index pieces intact, so now you have the "current" index as well as
>>>> the "optimized" fragments.
>>>>
>>>> I can't say for certain that's what you ran into, but we found that if you
>>>> get an expanding disk it will keep growing and prevent this from happening,
>>>> then the index will contract and the disk will shrink back to only what it
>>>> needs. Saved me a lot of headaches not needing to ever worry about disk
>>>> space.
>>>>
>>>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>>> wrote:
>>>>
>>>>>
>>>>> When the optimize command is issued, the expectation after the completion
>>>>> of the optimization process is that the index size either decreases or at
>>>>> most remains the same. In a Solr 7.6 cluster with 50-plus shards, when the
>>>>> optimize command is issued, some of the shards' transient or older segment
>>>>> files are not deleted. This is happening randomly across all shards. When
>>>>> unnoticed, these transient files fill the disk. Currently it is handled
>>>>> through monitors, but the question is what causes the transient/older
>>>>> files to remain there. Are there any specific race conditions which leave
>>>>> the older files undeleted?
> 
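
The disk space check Walter describes earlier in this thread (refuse to
merge when there isn't enough free space) might look roughly like the Java
sketch below. It is not Ultraseek's or Solr's code. The class and method
names are hypothetical, the 3X worst case is the figure quoted in the
thread, and reading "3X the size of the index" as peak usage (so 2X extra
on top of the existing index) is an assumption.

```java
import java.io.File;

final class MergeSpaceCheck {

    /** Worst case quoted in the thread: peak disk use of 3x the index size. */
    static final long WORST_CASE_FACTOR = 3;

    /**
     * True if usableBytes can absorb the growth from indexBytes to the
     * worst-case peak, i.e. (factor - 1) * indexBytes of extra space.
     */
    static boolean hasRoomForForcedMerge(long indexBytes, long usableBytes) {
        long extraNeeded = (WORST_CASE_FACTOR - 1) * indexBytes;
        return usableBytes >= extraNeeded;
    }

    /** Sums the sizes of the regular files directly inside dir. */
    static long directorySize(File dir) {
        File[] files = dir.listFiles(); // null if dir is not a directory
        long total = 0;
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    total += f.length();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Pass the index directory as the first argument; defaults to ".".
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long indexBytes = directorySize(indexDir);
        long usable = indexDir.getUsableSpace();
        if (hasRoomForForcedMerge(indexBytes, usable)) {
            System.out.println("Enough free space for a forced merge.");
        } else {
            System.err.println("Refusing to merge: not enough free space.");
        }
    }
}
```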

Re: Facet Performance

2020-06-17 Thread Erick Erickson
queryResultCache doesn’t really help with faceting, even if it’s hit for the
main query. That cache only stores a subset of the hits, and to facet
properly you need the entire result set….
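
A toy illustration in Java of why a partial hit list is not enough. This is
not how Solr computes facets internally; the class name, field values, and
doc ids are all made up. It only shows that counting over a window of hits
gives different numbers than counting over every matching document.

```java
import java.util.Map;
import java.util.TreeMap;

final class FacetCountSketch {

    /** Counts field values over the given doc ids (a stand-in for a DocSet). */
    static Map<String, Integer> facetCounts(int[] docIds, String[] valueByDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : docIds) {
            counts.merge(valueByDoc[doc], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // valueByDoc[i] is the facet field value of document i.
        String[] color = {"red", "blue", "red", "green", "red", "blue"};
        int[] allMatches = {0, 1, 2, 3, 4, 5}; // every doc matching the query
        int[] cachedWindow = {0, 1};           // only the top hits a result cache might hold
        System.out.println("full result set: " + facetCounts(allMatches, color));
        System.out.println("cached window:   " + facetCounts(cachedWindow, color));
    }
}
```

The counts from the cached window undercount every value and miss "green"
entirely, so the facet calculation has to start from the complete set of
matches.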

> On Jun 17, 2020, at 12:47 PM, James Bodkin  
> wrote:
> 
> We've noticed that the filterCache uses a significant amount of memory, as 
> we've assigned 8GB Heap per instance.
> In total, we have 32 shards with 2 replicas, hence (8*32*2) 512G Heap space 
> alone, further memory is required to ensure the index is always memory mapped 
> for performance reasons.
> 
> Ideally I would like to be able to reduce the amount of memory assigned to 
> the heap by using docValues instead of indexed but it doesn't seem possible.
> The QTime (after warming) for facet.method=enum is around 150-250ms whereas 
> the QTime for facet.method=fc is around 1000-1200ms.
> As we require the results in real-time for customers searching on our 
> website, the latter QTime of 1000-1200ms is too slow for us to be able to use.
> 
> Our facet queries change as the customer selects different search criteria, 
> and hence the possible number of potential queries makes it very difficult 
> for the query result cache.
> We already have a custom implementation in which we check our redis cache for 
> queries before they are sent to our aggregators which runs at 30% hit rate.
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 16:21, "Michael Gibney"  wrote:
> 
>    To expand a bit on what Erick said regarding performance: my sense is
>    that the RefGuide assertion that "docValues=true" makes faceting
>    "faster" could use some qualification/clarification. My take, fwiw:
> 
>    First, to reiterate/paraphrase what Erick said: the "faster" assertion
>    is not comparing to "facet.method=enum". For low-cardinality fields,
>    if you have the heap space, and are very intentional about configuring
>    your filterCache (and monitoring it as access patterns might change),
>    "facet.method=enum" will likely be as fast as you can get (at least
>    for "legacy" facets or whatever -- not sure about "enum" method in
>    JSON facets).
> 
>    Even where "docValues=true" arguably does make faceting "faster", the
>    main benefit is that the "uninverted" data structures are serialized
>    on disk, so you're avoiding the need to uninvert each facet field
>    on-heap for every new indexSearcher, which is generally high-latency
>    -- user perception of this latency can be mitigated using warming
>    queries, but it can still be problematic, esp. for frequent index
>    updates. On-heap uninversion also inherently consumes a lot of heap
>    space, which has general implications wrt GC, etc ... so in that
>    respect even if faceting per se might not be "faster" with
>    "docValues=true", your overall system may in many cases perform
>    better.
> 
>    (and Anthony, I'm pretty sure that tag/ex on facets should be
>    orthogonal to the "facet.method=enum"/filterCache discussion, as
>    tag/ex only affects the DocSet domain over which facets are calculated
>    ... I think that step is pretty cleanly separated from the actual
>    calculation of the facets. I'm not 100% sure on that, so proceed with
>    caution, but it could definitely be worth evaluating for your use
>    case!)
> 
>    Michael
>On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
> wrote:
>> 
>> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
>> use a docValues=false
>> field for faceting/grouping/sorting/function queries. The primary point of 
>> docValues=true is twofold:
>> 
>> 1> reduce Java heap requirements by using the OS memory to hold it
>> 
>> 2> uninverting can be expensive CPU wise too, although not with just a few
>>unique values (for each term, read the list of docs that have it and flip 
>> a bit).
>> 
>> It doesn’t really make sense to set it on an indexed=false field, since 
>> uninverting only happens on
>> indexed=true docValues=false. OTOH, I don’t think it would do any harm either. 
>> That said, I frankly
>> don’t know how that interacts with facet.method=enum.
>> 
>> As far as speed… yeah, you’re in the edge cases. All things being equal, 
>> stuffing these into the
>> filterCache is the fastest way to facet if you have the memory. I’ve seen 
>> very few installations
>> where people have that luxury though. Each entry in the filterCache can 
>> occupy maxDoc/8 + some overhead
>> bytes. If maxDoc is very large, this’ll chew up an enormous amount of 
>> memory. I’m cheating
>> a bit here, since an entry is smaller if only a few docs match it.
>> But that’s the worst case you have to allow for, ‘cause you 
>> could theoretically hit
>> the perfect storm where, due to some particular sequence of queries, your 
>> entire filter
>> cache fills up with entries that size.
>> 
>> You’ll have some overhead to keep the cache at that size, but it sounds like 
>> it’s worth it.

Re: Facet Performance

2020-06-17 Thread James Bodkin
We've noticed that the filterCache uses a significant amount of memory, as 
we've assigned 8GB Heap per instance.
In total, we have 32 shards with 2 replicas, hence (8*32*2) 512G of heap space 
alone. Further memory is required to ensure the index is always memory-mapped 
for performance reasons.
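
As a quick check of the arithmetic above, here is a minimal sketch in Java. It assumes one Solr instance (JVM) per replica, which is how the 8*32*2 figure reads; the class and method names are made up for illustration.

```java
// Hypothetical helper; names are illustrative and not part of Solr.
final class ClusterHeapEstimate {

    /** Total heap across the cluster, assuming one JVM per replica of every shard. */
    static long totalHeapGb(long heapPerInstanceGb, int shards, int replicasPerShard) {
        return heapPerInstanceGb * shards * replicasPerShard;
    }

    public static void main(String[] args) {
        // Figures from the message: 8 GB heap, 32 shards, 2 replicas each.
        System.out.println(totalHeapGb(8, 32, 2) + " GB of heap in total");
    }
}
```

Any memory for the OS page cache (the memory-mapped index) comes on top of this figure.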

Ideally I would like to be able to reduce the amount of memory assigned to the 
heap by using docValues instead of indexed but it doesn't seem possible.
The QTime (after warming) for facet.method=enum is around 150-250ms whereas the 
QTime for facet.method=fc is around 1000-1200ms.
As we require the results in real-time for customers searching on our website, 
the latter QTime of 1000-1200ms is too slow for us to be able to use.

Our facet queries change as the customer selects different search criteria, and 
the sheer number of possible queries makes it very difficult for the query 
result cache to help.
We already have a custom implementation in which we check our Redis cache for 
queries before they are sent to our aggregators; it runs at a 30% hit rate.

Kind Regards,

James Bodkin

On 17/06/2020, 16:21, "Michael Gibney"  wrote:

To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_
> use a docValues=false field for faceting/grouping/sorting/function queries.
> The primary point of docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip
> a bit).
>
> It doesn’t really make sense to set it on an indexed=false field, since
> uninverting only happens on indexed=true docValues=false. OTOH, I don’t think
> it would do any harm either. That said, I frankly don’t know how that
> interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal,
> stuffing these into the filterCache is the fastest way to facet if you have
> the memory. I’ve seen very few installations where people have that luxury
> though. Each entry in the filterCache can occupy maxDoc/8 bytes plus some
> overhead. If maxDoc is very large, this’ll chew up an enormous amount of
> memory. I’m cheating a bit here, since an entry is smaller if only a few docs
> match it. But that’s the worst case you have to allow for, ‘cause you could
> theoretically hit the perfect storm where, due to some particular sequence of
> queries, your entire filter cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values.
> > We have two fields over that, with 150 unique values and 5300 unique values
> > respectively.
> > At the moment, 

RankLib model output format to Solr LTR model format

2020-06-17 Thread gnandre
Hi,

Before I start writing my own implementation for converting RankLib's model
output format to Solr LTR model format for my own use cases, I just wanted
to check if there is any work done on this front already. Any references
are welcome.


Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Walter Underwood
From that short description, you should not be running optimize at all.

Just stop doing it. It doesn’t make that big a difference.

It may take your indexes a few weeks to get back to a normal state after the 
forced merges.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 4:12 AM, Raveendra Yerraguntla 
>  wrote:
> 
> Thank you David, Walt, Erick.
> 1. The first time the bloated index was generated, there was no disk space 
> issue; one copy of the index is 1/6 of disk capacity. We ran out of disk 
> capacity after more than 2 bloated copies had accumulated.
> 2. Solr was upgraded from 5.*. In 5.*, more than 5 segments caused 
> performance issues. Performance in 7.* has not been measured for an 
> increasing number of segments. I will plan a PT to find the optimum number. 
> The application does incremental indexing multiple times in a work week.
> I will keep you updated on the resolution.
> Thanks again 
>On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
>  wrote:  
> 
> It Depends (tm).
> 
> As of Solr 7.5, optimize is different. See: 
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> 
> So, assuming you have _not_ specified maxSegments=1, any very large
> segment (near 5G) that has _zero_ deleted documents won’t be merged.
> 
> So there are two scenarios:
> 
> 1> What Walter mentioned. The optimize process runs out of disk space
> and leaves lots of crud around
> 
> 2> your “older segments” are just max-sized segments with zero deletions.
> 
> 
> All that said… do you have demonstrable performance improvements after
> optimizing? The entire name “optimize” is misleading, of course who
> wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
> it made quite a difference. In more recent Solr releases, it’s not as clear
> cut. So before worrying about making optimize work, I’d recommend that
> you do some performance tests on optimized and un-optimized indexes. 
> If there are significant improvements, that’s one thing. Otherwise, it’s
> a waste.
> 
> Best,
> Erick
> 
>> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
>> 
>> For a full forced merge (mistakenly named “optimize”), the worst case disk 
>> space
>> is 3X the size of the index. It is common to need 2X the size of the index.
>> 
>> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
>> behavior.
>> I implemented a disk space check that would refuse to merge if there wasn’t 
>> enough
>> free space. It would log an error and send an email to the admin.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>>> wrote:
>>> 
>>> I can't give you a 100% true answer, but I've experienced this, and what
>>> "seemed" to happen to me was that the optimize would start, and that would
>>> drive the size up threefold. If you run out of disk space in the process,
>>> the optimize quits, since it can't optimize, and leaves the live index
>>> pieces intact, so now you have the "current" index as well as the
>>> "optimized" fragments.
>>> 
>>> I can't say for certain that's what you ran into, but we found that if you
>>> get an expanding disk, it will keep growing and prevent this from happening;
>>> then the index will contract and the disk will shrink back to only what it
>>> needs. That saved me a lot of headaches, not needing to ever worry about disk
>>> space.
>>> 
>>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>>  wrote:
>>> 
 
>>>> When the optimize command is issued, the expectation after the completion of
>>>> the optimization process is that the index size either decreases or at most
>>>> remains the same. In a Solr 7.6 cluster with 50-plus shards, when the optimize
>>>> command is issued, some of the shards' transient or older segment files are
>>>> not deleted. This happens randomly across all shards. When unnoticed, these
>>>> transient files fill the disk. Currently it is handled through monitors,
>>>> but the question is what causes the transient/older files to remain there.
>>>> Are there any specific race conditions which leave the older files
>>>> undeleted?
>>>> Any pointers around this will be helpful.
>>>> TIA
>> 
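
The pre-merge disk check Walter describes in the quoted message above could look roughly like the following. This is only a sketch under stated assumptions, not Ultraseek's or Solr's actual code: the class and method names are hypothetical, the index directory is assumed to be flat, and the factor of 3 is read conservatively from the 3X worst case mentioned above as the free space to require.

```java
import java.io.File;

// Hypothetical guard; not an existing Solr or Lucene API.
final class MergeDiskCheck {

    /** Free space to insist on before a forced merge: a multiple of the index size. */
    static long requiredFreeBytes(long indexBytes, int safetyFactor) {
        return Math.multiplyExact(indexBytes, (long) safetyFactor);
    }

    /** True when the usable space covers the assumed worst case. */
    static boolean enoughSpace(long usableBytes, long indexBytes, int safetyFactor) {
        return usableBytes >= requiredFreeBytes(indexBytes, safetyFactor);
    }

    /** Sums the sizes of the files directly inside indexDir (assumed flat). */
    static long indexSizeBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    total += f.length();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long size = indexSizeBytes(indexDir);
        long usable = indexDir.getUsableSpace();
        if (enoughSpace(usable, size, 3)) {
            System.out.println("Enough free space; the merge may proceed.");
        } else {
            // A real system would also log this and notify the admin.
            System.err.println("Refusing to merge: need " + requiredFreeBytes(size, 3)
                    + " bytes free, have " + usable);
        }
    }
}
```

Refusing up front avoids the situation described earlier in the thread, where a merge that runs out of disk leaves both the live index and partial merged segments behind.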



Re: Master Slave Terminology

2020-06-17 Thread Walter Underwood
I’ve long thought that master/slave was not the right metaphor for a pull model 
anyway.

We probably should not use “replica” since that already has a use in Solr Cloud.

Where is the discussion?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 16, 2020, at 11:51 PM, Jan Høydahl  wrote:
> 
> Hi Kaya,
> 
> Thanks for bringing it up. The topic is already being discussed by 
> developers, so expect to see some change in this area; Not over-night, but 
> incremental.
> Also, if you want to lend a helping hand, patches are more than welcome as 
> always.
> 
> Jan
> 
>> On 17 Jun 2020, at 04:22, Kayak28 wrote:
>> 
>> Hello, Community:
>> 
>> As GitHub and Python are replacing terminology related to
>> slavery,
>> why don't we replace master-slave in Solr as well?
>> 
>> https://developers.srad.jp/story/18/09/14/0935201/
>> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
>> 
>> -- 
>> 
>> Sincerely,
>> Kaya
>> github: https://github.com/28kayak
> 



Re: ChildDocTransformer and export handler

2020-06-17 Thread Munendra S N
Currently, Doc transformers are not supported while exporting the results.
The document covers the field requirements for the export handler. I hope
this helps.
https://lucene.apache.org/solr/guide/8_5/exporting-result-sets.html#field-requirements


Regards,
Munendra S N



On Wed, Jun 17, 2020 at 8:49 PM Ludger Steens 
wrote:

> Dear Community,
>
> We are using the /export handler with Solr 7.7 to fetch a large number of
> documents from Solr.
>
> Recently we have extended our schema with Child Documents and now we are
> wondering if/how it is possible to export parent documents together with
> their corresponding Child Documents.
>
> When using the /select handler this can be done with the
> ChildDocTransformer (
>
> https://lucene.apache.org/solr/guide/7_7/transforming-result-documents.html#child-childdoctransformerfactory
> ).
>
> However, when using the export handler we get an error from Solr.
>
>
>
> Our request:
>
> {
>   "query" : "*:*",
>   "sort" : "id asc",
>   "fields" : "id,[child parentFilter='-child_type:* *:*']"
> }
>
>
>
> The response from Solr:
>
> {
>   "responseHeader":{"status":400},
>   "response":{
>     "numFound":0,
>     "docs":[{"EXCEPTION":"org.apache.solr.common.SolrException: undefined field: \"[child parentFilter=\"-child_type:* *:*\"]\""}]}
> }
>
>
>
> Is it possible to get parent documents together with their corresponding
> child documents?
>
> If it is possible: What is the correct query?
>
> If it is not possible: Can Streaming Expressions be used together with
> child documents? As far as I understand they internally use the export
> handler.
>
>
>
> Thanks in advance for your help
>
>
>
> Ludger
>
>
> --
>
> *"Best Employers in ICT 2020" - 1st place for QAware*
> awarded by Great Place to Work
> <
> https://www.qaware.de/news/great-place-to-work-deutschlands-beste-arbeitgeber-2020/
> >
> --
>
> Ludger Steens
> Software Architect
>
> QAware GmbH
> Aschauer Straße 32
> 81549 München, Germany
> Mobile +49 175 7973969
> ludger.ste...@qaware.de
> www.qaware.de
> --
>
> Managing Directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
> Register court: München
> Commercial register number: HRB 163761
>


Re: Facet Performance

2020-06-17 Thread Michael Gibney
To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
> use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point of 
> docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip 
> a bit).
>
> It doesn’t really make sense to set it on an indexed=false field, since 
> uninverting only happens on
> indexed=true docValues=false. OTOH, I don’t think it would do any harm either. 
> That said, I frankly
> don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
> stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
> very few installations
> where people have that luxury though. Each entry in the filterCache can 
> occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
> I’m cheating
> a bit here, since an entry is smaller if only a few docs match it.
> But that’s the worst case you have to allow for, ‘cause you 
> could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
> entire filter
> cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like 
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin  
> > wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. 
> > We have two fields over that, with 150 unique values and 5300 unique values 
> > respectively.
> > At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > From the DocValues documentation 
> > (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> > this approach promises to make lookups for faceting, sorting and grouping 
> > much faster.
> > Hence I thought that using DocValues would be better than using Indexed and 
> > in turn improve our response times and possibly lower memory requirements. 
> > It sounds like this isn't the case if you are able to allocate enough 
> > memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting, I was looking at the 
> > documentation for this field earlier today.
> > Should we be setting uninvertible="false" if docValues="true" regardless of 
> > whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> >
> >facet.method=enum works by executing a query (against indexed values)
> >for each indexed value in a given field (which, for indexed=false, is
> >"no values"). So that explains why facet.method=enum no longer works.
> >I was going to suggest that you might not want to set indexed=false on
> >the docValues facet fields anyway, since the indexed values are still
> >used for facet refinement (assuming your index is distributed).
> >

Re: Master Slave Terminology

2020-06-17 Thread Doug Turnbull
+1 to name change. Also 'overseer' which doesn't go well with Master/Slave!

On Wed, Jun 17, 2020 at 11:16 AM David Smiley 
wrote:

> priv...@lucene.apache.org but it should have been public and expect it to
> spill out to the dev list today.
>
> ~ David
>
>
> On Wed, Jun 17, 2020 at 11:14 AM Mike Drob  wrote:
>
> > Hi Jan,
> >
> > Can you link to the discussion? I searched the dev list and didn’t see
> > anything, is it on slack or a jira or somewhere else?
> >
> > Mike
> >
> > On Wed, Jun 17, 2020 at 1:51 AM Jan Høydahl 
> wrote:
> >
> > > Hi Kaya,
> > >
> > > Thanks for bringing it up. The topic is already being discussed by
> > > developers, so expect to see some change in this area; Not over-night,
> > but
> > > incremental.
> > > Also, if you want to lend a helping hand, patches are more than welcome
> > as
> > > always.
> > >
> > > Jan
> > >
> > > > On 17 Jun 2020, at 04:22, Kayak28 wrote:
> > > >
> > > > Hello, Community:
> > > >
> > > > As GitHub and Python are replacing terminology related to
> > > > slavery,
> > > > why don't we replace master-slave in Solr as well?
> > > >
> > > > https://developers.srad.jp/story/18/09/14/0935201/
> > > >
> > >
> >
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> > > >
> > > > --
> > > >
> > > > Sincerely,
> > > > Kaya
> > > > github: https://github.com/28kayak
> > >
> > >
> >
>


-- 
*Doug Turnbull* | *CTO* | OpenSource Connections, LLC | 240.476.9983
Author: Relevant Search; Contributor: *AI Powered Search*
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


ChildDocTransformer and export handler

2020-06-17 Thread Ludger Steens
Dear Community,

We are using the /export handler with Solr 7.7 to fetch a large number of
documents from Solr.

Recently we have extended our schema with Child Documents and now we are
wondering if/how it is possible to export parent documents together with
their corresponding Child Documents.

When using the /select handler this can be done with the
ChildDocTransformer (
https://lucene.apache.org/solr/guide/7_7/transforming-result-documents.html#child-childdoctransformerfactory
).

However, when using the export handler we get an error from Solr.



Our request:

{
  "query" : "*:*",
  "sort" : "id asc",
  "fields" : "id,[child parentFilter='-child_type:* *:*']"
}



The response from Solr:

{
  "responseHeader":{"status":400},
  "response":{
    "numFound":0,
    "docs":[{"EXCEPTION":"org.apache.solr.common.SolrException: undefined field: \"[child parentFilter=\"-child_type:* *:*\"]\""}]}
}



Is it possible to get parent documents together with their corresponding
child documents?

If it is possible: What is the correct query?

If it is not possible: Can Streaming Expressions be used together with
child documents? As far as I understand they internally use the export
handler.



Thanks in advance for your help



Ludger


--

*"Best Employers in ICT 2020" - 1st place for QAware*
awarded by Great Place to Work

--

Ludger Steens
Software Architect

QAware GmbH
Aschauer Straße 32
81549 München, Germany
Mobile +49 175 7973969
ludger.ste...@qaware.de
www.qaware.de
--

Managing Directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
Register court: München
Commercial register number: HRB 163761


Re: Master Slave Terminology

2020-06-17 Thread David Smiley
priv...@lucene.apache.org but it should have been public and expect it to
spill out to the dev list today.

~ David


On Wed, Jun 17, 2020 at 11:14 AM Mike Drob  wrote:

> Hi Jan,
>
> Can you link to the discussion? I searched the dev list and didn’t see
> anything, is it on slack or a jira or somewhere else?
>
> Mike
>
> On Wed, Jun 17, 2020 at 1:51 AM Jan Høydahl  wrote:
>
> > Hi Kaya,
> >
> > Thanks for bringing it up. The topic is already being discussed by
> > developers, so expect to see some change in this area; Not over-night,
> but
> > incremental.
> > Also, if you want to lend a helping hand, patches are more than welcome
> as
> > always.
> >
> > Jan
> >
> > > On 17 Jun 2020, at 04:22, Kayak28 wrote:
> > >
> > > Hello, Community:
> > >
> > > As GitHub and Python are replacing terminology related to
> > > slavery,
> > > why don't we replace master-slave in Solr as well?
> > >
> > > https://developers.srad.jp/story/18/09/14/0935201/
> > >
> >
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> > >
> > > --
> > >
> > > Sincerely,
> > > Kaya
> > > github: https://github.com/28kayak
> >
> >
>


Re: Master Slave Terminology

2020-06-17 Thread Mike Drob
Hi Jan,

Can you link to the discussion? I searched the dev list and didn’t see
anything, is it on slack or a jira or somewhere else?

Mike

On Wed, Jun 17, 2020 at 1:51 AM Jan Høydahl  wrote:

> Hi Kaya,
>
> Thanks for bringing it up. The topic is already being discussed by
> developers, so expect to see some change in this area; Not over-night, but
> incremental.
> Also, if you want to lend a helping hand, patches are more than welcome as
> always.
>
> Jan
>
> > On 17 Jun 2020, at 04:22, Kayak28 wrote:
> >
> > Hello, Community:
> >
> > As GitHub and Python are replacing terminology related to
> > slavery,
> > why don't we replace master-slave in Solr as well?
> >
> > https://developers.srad.jp/story/18/09/14/0935201/
> >
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> >
> > --
> >
> > Sincerely,
> > Kaya
> > github: https://github.com/28kayak
>
>


Java - setting multi-valued fields

2020-06-17 Thread Eivind Hodneland
Hi,

My customer has a Solr index with a large number of fields, many of which are 
multivalued (type="string", multiValued="true").

I am having problems with setting the values for these fields in my Java update 
processors.
Example:
String[] values = new String[] {"value 1", "value 2" };
inputDoc.setField (multiFieldName, values);

However, only "value 1" is present in the index after updating.
What is the best / correct way to make this work?



Uptime Consulting | Eivind Hodneland | Senior Consultant | Munchs gate 7, 
NO-0165 Oslo, Norway
Tel: +47 22 33 71 00 | Mob: +47 971 76 083 | 
eivind.hodnel...@uptimeconsulting.no
  | www.uptimeconsulting.no
--
Search and Big Data solutions
Software Development
IT outsourcing services and consultancy




Re: Facet Performance

2020-06-17 Thread Erick Erickson
Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
use a docValues=false
field for faceting/grouping/sorting/function queries. The primary point of 
docValues=true is twofold:

1> reduce Java heap requirements by using the OS memory to hold it

2> uninverting can be expensive CPU wise too, although not with just a few
unique values (for each term, read the list of docs that have it and flip a 
bit).
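
To make "uninverting" in point 2 concrete, here is a toy Java sketch. It is not Lucene's implementation: it assumes a single-valued field, represents the inverted index as a sorted map from term to doc ids, and records a term ordinal per document rather than flipping bits. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Toy model of uninversion; names and data layout are illustrative only.
final class UninvertSketch {

    /**
     * Turns term -> doc ids into a per-document array of term ordinals.
     * Entry d holds the ordinal of the term in document d, or -1 if the
     * document has no value. Assumes at most one term per document.
     */
    static int[] uninvert(TreeMap<String, int[]> postings, int maxDoc) {
        int[] docToOrd = new int[maxDoc];
        Arrays.fill(docToOrd, -1);
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            // For each term, read the list of docs that have it.
            for (int doc : entry.getValue()) {
                docToOrd[doc] = ord;
            }
            ord++;
        }
        return docToOrd;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("blue", new int[] {0, 3});
        postings.put("red", new int[] {1});
        // prints [0, 1, -1, 0, -1]
        System.out.println(Arrays.toString(uninvert(postings, 5)));
    }
}
```

The work is proportional to the total number of postings and the result lives on the heap, which is the cost that docValues=true avoids by storing the column-oriented form at index time.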

It doesn’t really make sense to set it on an indexed=false field, since
uninverting only happens on indexed=true docValues=false fields. OTOH, I don’t
think it would do any harm either. That said, I frankly don’t know how that
interacts with facet.method=enum.

As far as speed… yeah, you’re in the edge cases. All things being equal,
stuffing these into the filterCache is the fastest way to facet if you have the
memory. I’ve seen very few installations where people have that luxury though.
Each entry in the filterCache can occupy maxDoc/8 bytes plus some overhead. If
maxDoc is very large, this’ll chew up an enormous amount of memory. I’m cheating
a bit here, since an entry is smaller if only a few docs match it. But that’s
the worst case you have to allow for, ‘cause you could theoretically hit the
perfect storm where, due to some particular sequence of queries, your entire
filter cache fills up with entries that size.

You’ll have some overhead to keep the cache at that size, but it sounds like 
it’s worth it.
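
For a feel of the worst case, here is a small Java sketch of the maxDoc/8 estimate. The maxDoc of 10 million is a made-up figure; 8192 is the filterCache size from the quoted message below; per-entry overhead is ignored and the division is rounded up, which is my own choice. With those numbers a full cache comes to roughly 10 GB per core.

```java
// Hypothetical estimator; names are illustrative and not part of Solr.
final class FilterCacheEstimate {

    /** Worst-case size of one entry: one bit per document, rounded up to whole bytes. */
    static long worstCaseEntryBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Worst case for a cache in which every entry is a full bitset. */
    static long worstCaseCacheBytes(long maxDoc, int cacheSize) {
        return worstCaseEntryBytes(maxDoc) * cacheSize;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // assumed for illustration
        int cacheSize = 8192;      // from the quoted message
        System.out.println(worstCaseEntryBytes(maxDoc) + " bytes per entry");
        System.out.println(worstCaseCacheBytes(maxDoc, cacheSize) + " bytes for a full cache");
    }
}
```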

Best,
Erick



> On Jun 17, 2020, at 10:05 AM, James Bodkin  
> wrote:
> 
> The large majority of the relevant fields have fewer than 20 unique values. 
> We have two fields over that, with 150 unique values and 5300 unique values 
> respectively.
> At the moment, our filterCache is configured with a maximum size of 8192.
> 
> From the DocValues documentation 
> (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> this approach promises to make lookups for faceting, sorting and grouping 
> much faster.
> Hence I thought that using DocValues would be better than using Indexed and 
> in turn improve our response times and possibly lower memory requirements. It 
> sounds like this isn't the case if you are able to allocate enough memory to 
> the filterCache.
> 
> I haven't yet tried changing the uninvertible setting, I was looking at the 
> documentation for this field earlier today.
> Should we be setting uninvertible="false" if docValues="true" regardless of 
> whether indexed is true or false?
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> 
>facet.method=enum works by executing a query (against indexed values)
>for each indexed value in a given field (which, for indexed=false, is
>"no values"). So that explains why facet.method=enum no longer works.
>I was going to suggest that you might not want to set indexed=false on
>the docValues facet fields anyway, since the indexed values are still
>used for facet refinement (assuming your index is distributed).
> 
>What's the number of unique values in the relevant fields? If it's low
>enough, setting docValues=false and indexed=true and using
>facet.method=enum (with a sufficiently large filterCache) is
>definitely a viable option, and will almost certainly be faster than
>docValues-based faceting. (As an aside, noting for future reference:
>high-cardinality facets over high-cardinality DocSet domains might be
>able to benefit from a term facet count cache:
>https://issues.apache.org/jira/browse/SOLR-13807)
> 
>I think you didn't specifically mention whether you acted on Erick's
>suggestion of setting "uninvertible=false" (I think Erick accidentally
>said "uninvertible=true") to fail fast. I'd also recommend doing that,
>perhaps even above all else -- it shouldn't actually *do* anything,
>but will help ensure that things are behaving as you expect them to!
> 
>Michael
> 
>On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> wrote:
>> 
>> Thanks, I've implemented some queries that improve the first-hit execution 
>> for faceting.
>> 
>> Since turning off indexed on those fields, we've noticed that 
>> facet.method=enum no longer returns the facets when used.
>> Using facet.method=fc/fcs is significantly slower compared to 
>> facet.method=enum for us. Why do these two differences exist?
>> 
>> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>> 
>>Ok, I see the disconnect... Necessary parts of the index are read from 
>> disk
>>lazily. So your newSearcher or firstSearcher query needs to do whatever
>>operation causes the relevant parts of the index to be read. In this case,
>>probably just facet on all the fields you care about. I'd add sorting too
>>if you sort on different fields.
>> 
>>The *:* query without facets or sorting does virtually nothing due to some
>>special handling...
>> 
>>On Tue, Jun 16, 

Re: Facet Performance

2020-06-17 Thread James Bodkin
The large majority of the relevant fields have fewer than 20 unique values. We 
have two fields over that, with 150 unique values and 5300 unique values 
respectively.
At the moment, our filterCache is configured with a maximum size of 8192.

From the DocValues documentation 
(https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
this approach promises to make lookups for faceting, sorting and grouping much 
faster.
Hence I thought that using DocValues would be better than using Indexed and in 
turn improve our response times and possibly lower memory requirements. It 
sounds like this isn't the case if you are able to allocate enough memory to 
the filterCache.

I haven't yet tried changing the uninvertible setting; I was only looking at the 
documentation for this field earlier today.
Should we be setting uninvertible="false" if docValues="true" regardless of 
whether indexed is true or false?

Kind Regards,

James Bodkin

On 17/06/2020, 14:02, "Michael Gibney"  wrote:

facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit 
execution for faceting.
>
> Since turning off indexed on those fields, we've noticed that 
facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to 
facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>
> Ok, I see the disconnect... Necessary parts if the index are read 
from disk
> lazily. So your newSearcher or firstSearcher query needs to do 
whatever
> operation causes the relevant parts of the index to be read. In this 
case,
> probably just facet on all the fields you care about. I'd add sorting 
too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to 
some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin 

> wrote:
>
> > I've been trying to build a query that I can use in newSearcher 
based off
> > the information in your previous e-mail. I thought you meant to 
build a *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of 
the
> > fields as part of the fl query parameters or a *:* query with each 
of the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I 
would
> > see the first-execution penalty disappear by the time I got to 
query 4, as
> > I thought this would replicate the actions of the newSeacher.
> > Unfortunately we can't use the autowarm count that is available as 
part of
> > the filterCache/filterCache due to the custom deployment mechanism 
we use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson"  
wrote:
> >
> > Did you try the autowarming like I mentioned in my previous 
e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields 
and
> > this led to an improvement in the response time. We found a further
> > 

Re: Facet Performance

2020-06-17 Thread Anthony Groves
Ah, interesting! So if the number of possible values is low (like <= 10),
it is faster to *not* use docValues on that (indexed) faceted field?
Does this hold true even when using faceting techniques like tag and
exclusion?

Thanks,
Anthony


On Wed, Jun 17, 2020 at 9:37 AM David Smiley 
wrote:

> I strongly recommend setting indexed=true on a field you facet on for the
> purposes of efficient refinement (fq=field:value).  But it strictly isn't
> required, as you have discovered.
>
> ~ David
>
>
> On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
> wrote:
>
> > facet.method=enum works by executing a query (against indexed values)
> > for each indexed value in a given field (which, for indexed=false, is
> > "no values"). So that explains why facet.method=enum no longer works.
> > I was going to suggest that you might not want to set indexed=false on
> > the docValues facet fields anyway, since the indexed values are still
> > used for facet refinement (assuming your index is distributed).
> >
> > What's the number of unique values in the relevant fields? If it's low
> > enough, setting docValues=false and indexed=true and using
> > facet.method=enum (with a sufficiently large filterCache) is
> > definitely a viable option, and will almost certainly be faster than
> > docValues-based faceting. (As an aside, noting for future reference:
> > high-cardinality facets over high-cardinality DocSet domains might be
> > able to benefit from a term facet count cache:
> > https://issues.apache.org/jira/browse/SOLR-13807)
> >
> > I think you didn't specifically mention whether you acted on Erick's
> > suggestion of setting "uninvertible=false" (I think Erick accidentally
> > said "uninvertible=true") to fail fast. I'd also recommend doing that,
> > perhaps even above all else -- it shouldn't actually *do* anything,
> > but will help ensure that things are behaving as you expect them to!
> >
> > Michael
> >
> > On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> >  wrote:
> > >
> > > Thanks, I've implemented some queries that improve the first-hit
> > execution for faceting.
> > >
> > > Since turning off indexed on those fields, we've noticed that
> > facet.method=enum no longer returns the facets when used.
> > > Using facet.method=fc/fcs is significantly slower compared to
> > facet.method=enum for us. Why do these two differences exist?
> > >
> > > On 16/06/2020, 17:52, "Erick Erickson" 
> wrote:
> > >
> > > Ok, I see the disconnect... Necessary parts if the index are read
> > from disk
> > > lazily. So your newSearcher or firstSearcher query needs to do
> > whatever
> > > operation causes the relevant parts of the index to be read. In
> this
> > case,
> > > probably just facet on all the fields you care about. I'd add
> > sorting too
> > > if you sort on different fields.
> > >
> > > The *:* query without facets or sorting does virtually nothing due
> > to some
> > > special handling...
> > >
> > > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> > james.bod...@loveholidays.com>
> > > wrote:
> > >
> > > > I've been trying to build a query that I can use in newSearcher
> > based off
> > > > the information in your previous e-mail. I thought you meant to
> > build a *:*
> > > > query as per Query 1 in my previous e-mail but I'm still seeing
> the
> > > > first-hit execution.
> > > > Now I'm wondering if you meant to create a *:* query with each of
> > the
> > > > fields as part of the fl query parameters or a *:* query with
> each
> > of the
> > > > fields and values as part of the fq query parameters.
> > > >
> > > > At the moment I've been running these manually as I expected that
> > I would
> > > > see the first-execution penalty disappear by the time I got to
> > query 4, as
> > > > I thought this would replicate the actions of the newSeacher.
> > > > Unfortunately we can't use the autowarm count that is available
> as
> > part of
> > > > the filterCache/filterCache due to the custom deployment
> mechanism
> > we use
> > > > to update our index.
> > > >
> > > > Kind Regards,
> > > >
> > > > James Bodkin
> > > >
> > > > On 16/06/2020, 15:30, "Erick Erickson"  >
> > wrote:
> > > >
> > > > Did you try the autowarming like I mentioned in my previous
> > e-mail?
> > > >
> > > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > > james.bod...@loveholidays.com> wrote:
> > > > >
> > > > > We've changed the schema to enable docValues for these
> > fields and
> > > > this led to an improvement in the response time. We found a
> further
> > > > improvement by also switching off indexed as these fields are
> used
> > for
> > > > faceting and filtering only.
> > > > > Since those changes, we've found that the first-execution
> for
> > > > queries is really noticeable. I thought this would be the
> > filterCache based
> > > > on what I saw in 

Re: Facet Performance

2020-06-17 Thread David Smiley
I strongly recommend setting indexed=true on a field you facet on for the
purposes of efficient refinement (fq=field:value).  But it strictly isn't
required, as you have discovered.

~ David


On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
wrote:

> facet.method=enum works by executing a query (against indexed values)
> for each indexed value in a given field (which, for indexed=false, is
> "no values"). So that explains why facet.method=enum no longer works.
> I was going to suggest that you might not want to set indexed=false on
> the docValues facet fields anyway, since the indexed values are still
> used for facet refinement (assuming your index is distributed).
>
> What's the number of unique values in the relevant fields? If it's low
> enough, setting docValues=false and indexed=true and using
> facet.method=enum (with a sufficiently large filterCache) is
> definitely a viable option, and will almost certainly be faster than
> docValues-based faceting. (As an aside, noting for future reference:
> high-cardinality facets over high-cardinality DocSet domains might be
> able to benefit from a term facet count cache:
> https://issues.apache.org/jira/browse/SOLR-13807)
>
> I think you didn't specifically mention whether you acted on Erick's
> suggestion of setting "uninvertible=false" (I think Erick accidentally
> said "uninvertible=true") to fail fast. I'd also recommend doing that,
> perhaps even above all else -- it shouldn't actually *do* anything,
> but will help ensure that things are behaving as you expect them to!
>
> Michael
>
> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
>  wrote:
> >
> > Thanks, I've implemented some queries that improve the first-hit
> execution for faceting.
> >
> > Since turning off indexed on those fields, we've noticed that
> facet.method=enum no longer returns the facets when used.
> > Using facet.method=fc/fcs is significantly slower compared to
> facet.method=enum for us. Why do these two differences exist?
> >
> > On 16/06/2020, 17:52, "Erick Erickson"  wrote:
> >
> > Ok, I see the disconnect... Necessary parts if the index are read
> from disk
> > lazily. So your newSearcher or firstSearcher query needs to do
> whatever
> > operation causes the relevant parts of the index to be read. In this
> case,
> > probably just facet on all the fields you care about. I'd add
> sorting too
> > if you sort on different fields.
> >
> > The *:* query without facets or sorting does virtually nothing due
> to some
> > special handling...
> >
> > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> james.bod...@loveholidays.com>
> > wrote:
> >
> > > I've been trying to build a query that I can use in newSearcher
> based off
> > > the information in your previous e-mail. I thought you meant to
> build a *:*
> > > query as per Query 1 in my previous e-mail but I'm still seeing the
> > > first-hit execution.
> > > Now I'm wondering if you meant to create a *:* query with each of
> the
> > > fields as part of the fl query parameters or a *:* query with each
> of the
> > > fields and values as part of the fq query parameters.
> > >
> > > At the moment I've been running these manually as I expected that
> I would
> > > see the first-execution penalty disappear by the time I got to
> query 4, as
> > > I thought this would replicate the actions of the newSeacher.
> > > Unfortunately we can't use the autowarm count that is available as
> part of
> > > the filterCache/filterCache due to the custom deployment mechanism
> we use
> > > to update our index.
> > >
> > > Kind Regards,
> > >
> > > James Bodkin
> > >
> > > On 16/06/2020, 15:30, "Erick Erickson" 
> wrote:
> > >
> > > Did you try the autowarming like I mentioned in my previous
> e-mail?
> > >
> > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > james.bod...@loveholidays.com> wrote:
> > > >
> > > > We've changed the schema to enable docValues for these
> fields and
> > > this led to an improvement in the response time. We found a further
> > > improvement by also switching off indexed as these fields are used
> for
> > > faceting and filtering only.
> > > > Since those changes, we've found that the first-execution for
> > > queries is really noticeable. I thought this would be the
> filterCache based
> > > on what I saw in NewRelic however it is probably trying to read the
> > > docValues from disk. How can we use the autowarming to improve
> this?
> > > >
> > > > For example, I've run the following queries in sequence and
> each
> > > query has a first-execution penalty.
> > > >
> > > > Query 1:
> > > >
> > > > q=*:*
> > > > facet=true
> > > > facet.field=D_DepartureAirport
> > > > facet.field=D_Destination
> > > > facet.limit=-1
> > > > rows=0
> > >

Re: Facet Performance

2020-06-17 Thread Michael Gibney
facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)
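
For illustration, here is a minimal sketch of the request parameters I have in
mind, reusing the field names from earlier in this thread. It only builds the
query string; the helper name is made up for the example, and the schema change
(docValues=false, indexed=true) is assumed to have been made separately.

```python
from urllib.parse import urlencode


def enum_facet_query(fields, filters=()):
    """Build a Solr query string that facets on `fields` with facet.method=enum.

    `fields` and `filters` are plain strings; nothing here talks to Solr.
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.method", "enum"),  # one filter query per indexed term
        ("facet.limit", "-1"),
    ]
    # Each fq is cached separately in the filterCache.
    params += [("fq", fq) for fq in filters]
    # Repeating facet.field is how Solr accepts several facet fields.
    params += [("facet.field", field) for field in fields]
    return urlencode(params)


print(enum_facet_query(["D_DepartureAirport", "D_Destination"]))
```

Since facet.method=enum leans on the filterCache, the cache size would need to
comfortably exceed the total number of unique values across the faceted fields
for this to pay off.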

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit execution 
> for faceting.
>
> Since turning off indexed on those fields, we've noticed that 
> facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to 
> facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>
> Ok, I see the disconnect... Necessary parts if the index are read from 
> disk
> lazily. So your newSearcher or firstSearcher query needs to do whatever
> operation causes the relevant parts of the index to be read. In this case,
> probably just facet on all the fields you care about. I'd add sorting too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin 
> wrote:
>
> > I've been trying to build a query that I can use in newSearcher based 
> off
> > the information in your previous e-mail. I thought you meant to build a 
> *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of the
> > fields as part of the fl query parameters or a *:* query with each of 
> the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I 
> would
> > see the first-execution penalty disappear by the time I got to query 4, 
> as
> > I thought this would replicate the actions of the newSeacher.
> > Unfortunately we can't use the autowarm count that is available as part 
> of
> > the filterCache/filterCache due to the custom deployment mechanism we 
> use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson"  wrote:
> >
> > Did you try the autowarming like I mentioned in my previous e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields and
> > this led to an improvement in the response time. We found a further
> > improvement by also switching off indexed as these fields are used for
> > faceting and filtering only.
> > > Since those changes, we've found that the first-execution for
> > queries is really noticeable. I thought this would be the filterCache 
> based
> > on what I saw in NewRelic however it is probably trying to read the
> > docValues from disk. How can we use the autowarming to improve this?
> > >
> > > For example, I've run the following queries in sequence and each
> > query has a first-execution penalty.
> > >
> > > Query 1:
> > >
> > > q=*:*
> > > facet=true
> > > facet.field=D_DepartureAirport
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 2:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2660)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 3:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2661)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > 

RE: Solr cloud backup/restore not working

2020-06-17 Thread Kommu, Vinodh K.
Hi,

What is the log level defined for the Solr nodes? Did you use a requestid in the 
restore command? If so, check the status of that requestid to see whether it 
points to any errors.
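
As a rough sketch of what I mean, assuming the restore was submitted with the 
Collections API async parameter: the status can be fetched with the 
REQUESTSTATUS action. The base URL and request id below are placeholders, not 
values from your setup, and the helper name is made up for the example.

```python
from urllib.parse import urlencode


def request_status_url(base_url, request_id):
    """Build a Collections API REQUESTSTATUS URL for an async request id."""
    params = {"action": "REQUESTSTATUS", "requestid": request_id, "wt": "json"}
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder base URL and id; substitute your own.
print(request_status_url("http://localhost:8983/solr", "restore-1"))
```

With basic authentication enabled, the request to that URL also needs the 
admin user's credentials.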

Thanks & Regards,
Vinodh

-Original Message-
From: yaswanth kumar  
Sent: Wednesday, June 17, 2020 4:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud backup/restore not working

Can someone please guide me on where can I get more detailed error of the above 
exception while doing restore?? All that I see in solr.log was pasted above

Thanks,

On Tue, Jun 16, 2020 at 10:44 AM yaswanth kumar 
wrote:

> I don't see anything related in the solr.log file for the same error. 
> Not sure if there is anyother place where I can check for this.
>
> Thanks,
>
> On Tue, Jun 16, 2020 at 10:21 AM Shawn Heisey  wrote:
>
>> On 6/12/2020 8:38 AM, yaswanth kumar wrote:
>> > Using solr 8.2.0 and setup a cloud with 2 nodes. (2 replica's for 
>> > each
>> > collection)
>> > Enabled basic authentication and gave all access to the admin user
>> >
>> > Now trying to use solr cloud backup/restore API, backup is working
>> great,
>> > but when trying to invoke restore API its throwing the below error
>> 
>> >  "msg":"ADDREPLICA failed to create replica",
>> >  "trace":"org.apache.solr.common.SolrException: ADDREPLICA 
>> > failed to create replica\n\tat
>> >
>> org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.j
>> ava:53)\n\tat
>> >
>> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(Collect
>> ionsHandler.java:280)\n\tat
>>
>> The underlying cause of this exception is not recorded here.  Are 
>> there other entries in the Solr log with more detailed information 
>> from the ADDREPLICA attempt?
>>
>> Thanks,
>> Shawn
>>
>
>
> --
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com
>


--
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Re: Migration for total noob?

2020-06-17 Thread Erick Erickson
Yeah, there’s a lot to get your head around with Solr, I wish it could be 
simpler…

If at all possible, I recommend you just re-index the data from the system of 
record.

That aside, you say you “copied the cores”. Is this still stand-alone and did 
that include the conf directory? What this seems to indicate is that the 
configuration (your solrconfig.xml file) is incompatible, particularly the 
suggesters. Suggesters are tricky. There should be more detailed error messages 
in the solr log file, have you looked there?

What I’d do is go into the solrconfig.xml files and comment out all of the 
suggester-related configurations. See if your cores start (or reload) then take 
a look at suggesters….

Best,
Erick

> On Jun 16, 2020, at 12:03 PM, Hammer, Erich F  wrote:
> 
> Disclaimer:  My background is Windows desktop and AD management and 
> PowerShell.  I have no experience with Solr and only very limited experience 
> with Java, so please be patient with me.  
> 
> I have inherited a Solr 7.2.1 setup (on Windows), and I'm trying to figure it 
> out so that it can be migrated to a newer system and to Solr 8.5.2.  I feel 
> like the documentation assumes an awful lot of prior knowledge that I'm 
> clearly lacking and especially in how to upgrade versions of Solr.  As a 
> "Windows guy" I'm used to binaries and configuration files in separate 
> locations and upgrading is generally and easy replacement of the binaries and 
> (sometimes automated) adjustments to the config.  With Solr, it's all jumbled 
> into the same folder structure, and I am trying to track down where all the 
> important info is set.
> 
> The old setup appears to be a stand-alone system with 5 cores (some of which 
> may be test/experiments) and what I believe are pretty small indexes and not 
> using any configsets (although there are a few in there).  I compared the 
> "Solr.in.cmd" files from the old to the default, new and adjusted as seemed 
> fitting.  I was able to successfully start an empty Solr 8.5.2 and view the 
> admin interface.
> 
> Then, I stopped the service on the old server (it's not a critical system) 
> and copied the folders for the cores over to the new system.  When I start it 
> up, one of the cores is running and I get errors on the other four.  Two each 
> of:
> 
>Plugin init failure for [schema.xml] fieldType "textSuggest"   
>Plugin init failure for [schema.xml] fieldType "textSpell"
> 
> I'm not having any luck finding information on how to resolve this.  Am I 
> missing a plugin java library?  Where might I get it and/or load it?  Is 
> there some config file I missed from some other location?
> 
> I appreciate any suggestions you can offer.
> 
> Erich
> 



Re: Autocommit in SolrCloud with many shards

2020-06-17 Thread Erick Erickson
Each node has its own timer that starts when it receives an update.
So in your situation, 60 seconds after any given replica gets its first
update, all documents that have been received in the interval will
be committed.

But note several things:

1> commits will tend to cluster for a given shard. By that I mean
they’ll tend to happen within a few milliseconds of each other
   ‘cause it doesn’t take that long for an update to get from the
   leader to all the followers.

2> this is per replica. So if you host replicas from multiple collections
   on some node, their commits have no relation to each other. And
   say for some reason you transmit exactly one document that lands
   on shard1. Further, say nodeA contains replicas for shard1 and shard2.
   Only the replica for shard1 would commit.

3> Solr promises eventual consistency. In this case, due to all the
   timing variables it is not guaranteed that every replica of a single
   shard has the same document available for search at any given time.
   Say doc1 hits the leader at time T and a follower at time T+10ms.
   Say doc2 hits the leader and gets indexed 5ms before the 
   commit is triggered, but for some reason it takes 15ms for it to get
   to the follower. The leader will be able to search doc2, but the
  follower won’t until 60 seconds later.
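
To make point 3 concrete, here is a toy model of the per-replica timer. It is
not Solr code, just a sketch under simplifying assumptions: one timer per
replica, started by the first uncommitted update and firing maxTime later. The
function name and the arrival times are invented for the example.

```python
def visible_at(arrivals_ms, max_time_ms=60_000):
    """Map each doc to the time it becomes committed on ONE replica.

    arrivals_ms: {doc_id: arrival time in ms on this replica}.
    The timer starts at the first uncommitted update and fires
    max_time_ms later, committing everything received by then.
    """
    result = {}
    deadline = None
    pending = []
    for doc, t in sorted(arrivals_ms.items(), key=lambda kv: kv[1]):
        if deadline is not None and t > deadline:
            # The timer fired before this doc arrived: commit what was pending.
            for d in pending:
                result[d] = deadline
            pending, deadline = [], None
        if deadline is None:
            deadline = t + max_time_ms  # first update after a commit starts the timer
        pending.append(doc)
    for d in pending:
        result[d] = deadline
    return result


# doc2 reaches the leader just before its commit, but reaches the
# follower just after the follower's commit has already fired.
leader = visible_at({"doc1": 0, "doc2": 59_995})
follower = visible_at({"doc1": 10, "doc2": 60_015})
print(leader)
print(follower)
```

In this model the leader commits both documents together, while the follower
commits doc1 on its first timer and doc2 only when the next timer fires, which
is the roughly 60 second gap described above.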

Best,
Erick

> On Jun 17, 2020, at 5:36 AM, Bram Van Dam  wrote:
> 
> 'morning :-)
> 
> I'm wondering how autocommits work in Solr.
> 
> Say I have a cluster with many nodes and many colections with many
> shards. If each collection's config has a hard autocommit configured
> every minute, does that mean that SolrCloud (presumably the leader?)
> will dish out commit requests to each node on that schedule? Or does
> each node have its own timed trigger?
> 
> If it's the former, doesn't that mean the load will spike dramatically
> across the whole cluster every minute?
> 
> I tried reading the code, but I don't quite understand the way
> CommitTracker and the UpdateHandlers interact with SolrCloud.
> 
> Thanks,
> 
> - Bram



Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Raveendra Yerraguntla
Thank you David, Walt, Erick.
1. The first time the bloated index was generated, there was no disk space 
issue; one copy of the index is 1/6 of disk capacity. We ran out of disk 
capacity after more than 2 bloated copies had accumulated.
2. Solr was upgraded from 5.*. In 5.*, more than 5 segments caused performance 
issues. Performance in 7.* has not been measured for an increasing number of 
segments. I will plan a PT to find the optimum number. The application does 
incremental indexing multiple times in a work week.
I will keep you updated on the resolution.
Thanks again 
On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
 wrote:  
 
 It Depends (tm).

As of Solr 7.5, optimize is different. See: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

So, assuming you have _not_ specified maxSegments=1, any very large
segment (near 5G) that has _zero_ deleted documents won’t be merged.

So there are two scenarios:

1> What Walter mentioned. The optimize process runs out of disk space
    and leaves lots of crud around

2> your “older segments” are just max-sized segments with zero deletions.


All that said… do you have demonstrable performance improvements after
optimizing? The entire name “optimize” is misleading, of course who
wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
it made quite a difference. In more recent Solr releases, it’s not as clear
cut. So before worrying about making optimize work, I’d recommend that
you do some performance tests on optimized and un-optimized indexes. 
If there are significant improvements, that’s one thing. Otherwise, it’s
a waste.

Best,
Erick

> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
> 
> For a full forced merge (mistakenly named “optimize”), the worst case disk 
> space
> is 3X the size of the index. It is common to need 2X the size of the index.
> 
> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
> behavior.
> I implemented a disk space check that would refuse to merge if there wasn’t 
> enough
> free space. It would log an error and send an email to the admin.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>> wrote:
>> 
>> I cant give you a 100% true answer but ive experienced this, and what
>> "seemed" to happen to me was that the optimize would start, and that will
>> drive the size up by 3 fold, and if you out of disk space in the process
>> the optimize will quit since, it cant optimize, and leave the live index
>> pieces in tact, so now you have the "current" index as well as the
>> "optimized" fragments
>> 
>> i cant say for certain thats what you ran into, but we found that if you
>> get an expanding disk it will keep growing and prevent this from happening,
>> then the index will contract and the disk will shrink back to only what it
>> needs.  saved me a lot of headaches not needing to ever worry about disk
>> space
>> 
>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>  wrote:
>> 
>>> 
>>> when optimize command is issued, the expectation after the completion of
>>> optimization process is that the index size either decreases or at most
>>> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
>>> is issued, some of the shard's transient or older segment files are not
>>> deleted. This is happening randomly across all shards. When unnoticed these
>>> transient files makes disk full. Currently it is handled through monitors,
>>> but question is what is causing the transient/older files remains there.
>>> Are there any specific race conditions which laves the older files not
>>> being deleted?
>>> Any pointers around this will be helpful.
>>> TIA
> 
  

Re: Solr cloud backup/restore not working

2020-06-17 Thread yaswanth kumar
Can someone please guide me on where I can get a more detailed error for the
above exception while doing a restore? All that I see in solr.log was pasted
above.

Thanks,

On Tue, Jun 16, 2020 at 10:44 AM yaswanth kumar 
wrote:

> I don't see anything related in the solr.log file for the same error. Not
> sure if there is anyother place where I can check for this.
>
> Thanks,
>
> On Tue, Jun 16, 2020 at 10:21 AM Shawn Heisey  wrote:
>
>> On 6/12/2020 8:38 AM, yaswanth kumar wrote:
>> > Using solr 8.2.0 and setup a cloud with 2 nodes. (2 replica's for each
>> > collection)
>> > Enabled basic authentication and gave all access to the admin user
>> >
>> > Now trying to use solr cloud backup/restore API, backup is working
>> great,
>> > but when trying to invoke restore API its throwing the below error
>> 
>> >  "msg":"ADDREPLICA failed to create replica",
>> >  "trace":"org.apache.solr.common.SolrException: ADDREPLICA failed to
>> > create replica\n\tat
>> >
>> org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
>> >
>> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:280)\n\tat
>>
>> The underlying cause of this exception is not recorded here.  Are there
>> other entries in the Solr log with more detailed information from the
>> ADDREPLICA attempt?
>>
>> Thanks,
>> Shawn
>>
>
>
> --
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com
>


-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Autocommit in SolrCloud with many shards

2020-06-17 Thread Bram Van Dam
'morning :-)

I'm wondering how autocommits work in Solr.

Say I have a cluster with many nodes and many collections with many
shards. If each collection's config has a hard autocommit configured
every minute, does that mean that SolrCloud (presumably the leader?)
will dish out commit requests to each node on that schedule? Or does
each node have its own timed trigger?

If it's the former, doesn't that mean the load will spike dramatically
across the whole cluster every minute?
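
To make the second possibility concrete, here is a small model of "each core
has its own timed trigger". It is only a sketch and not Solr code: it assumes
that a core's timer starts at its first uncommitted update and fires maxTime
later, which is an assumption about CommitTracker and not something
confirmed in this message. The function name is hypothetical.

```python
def commit_times(update_times_ms, max_time_ms):
    """Return the times at which one core would hard-commit.

    Model: the first update after a commit arms a timer; the commit fires
    max_time_ms later and covers every update that arrived in between.
    """
    commits = []
    deadline = None
    for t in sorted(update_times_ms):
        # The pending timer fired before this update arrived.
        if deadline is not None and t >= deadline:
            commits.append(deadline)
            deadline = None
        # No timer pending: this update arms a new one.
        if deadline is None:
            deadline = t + max_time_ms
    if deadline is not None:
        commits.append(deadline)
    return commits


# Two cores receiving updates at different moments, with maxTime = 60 s.
core_a = commit_times([0, 10_000, 70_000], 60_000)
core_b = commit_times([25_000, 30_000], 60_000)
print(core_a)  # [60000, 130000]
print(core_b)  # [85000]
```

Under that assumption the commits of different cores are offset by when
their own updates arrive, so they would not all land on the same tick.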

I tried reading the code, but I don't quite understand the way
CommitTracker and the UpdateHandlers interact with SolrCloud.

Thanks,

 - Bram


Re: Facet Performance

2020-06-17 Thread James Bodkin
Thanks, I've implemented some queries that improve the first-hit execution for 
faceting.

Since turning off indexed on those fields, we've noticed that facet.method=enum 
no longer returns the facets when used.
Using facet.method=fc/fcs is significantly slower compared to facet.method=enum 
for us. Why do these two differences exist?

On 16/06/2020, 17:52, "Erick Erickson"  wrote:

Ok, I see the disconnect... Necessary parts of the index are read from disk
lazily. So your newSearcher or firstSearcher query needs to do whatever
operation causes the relevant parts of the index to be read. In this case,
probably just facet on all the fields you care about. I'd add sorting too
if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some
special handling...
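
As a sketch of the kind of warming query described above, the snippet below
builds request parameters that facet on the two fields named elsewhere in
this thread and optionally add a sort. It only builds the parameters; wiring
them into a newSearcher or firstSearcher listener in solrconfig.xml is not
shown, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def warming_params(facet_fields, sort_fields=()):
    """Parameters for a warming query that touches facet and sort fields."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),          # no documents needed, only the facet work
        ("facet", "true"),
        ("facet.limit", "-1"),
    ]
    # One facet.field parameter per field that should be read from disk.
    params += [("facet.field", field) for field in facet_fields]
    if sort_fields:
        params.append(("sort", ",".join(f"{field} asc" for field in sort_fields)))
    return params


query_string = urlencode(warming_params(["D_DepartureAirport", "D_Destination"]))
print(query_string)
```

The same list of name/value pairs maps one-to-one onto the entries of a
warming query definition, so it can serve as a checklist when writing one.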

On Tue, Jun 16, 2020, 10:48 James Bodkin 
wrote:

> I've been trying to build a query that I can use in newSearcher based off
> the information in your previous e-mail. I thought you meant to build a *:*
> query as per Query 1 in my previous e-mail but I'm still seeing the
> first-hit execution.
> Now I'm wondering if you meant to create a *:* query with each of the
> fields as part of the fl query parameters or a *:* query with each of the
> fields and values as part of the fq query parameters.
>
> At the moment I've been running these manually as I expected that I would
> see the first-execution penalty disappear by the time I got to query 4, as
> I thought this would replicate the actions of the newSearcher.
> Unfortunately we can't use the autowarm count that is available as part of
> the filterCache/filterCache due to the custom deployment mechanism we use
> to update our index.
>
> Kind Regards,
>
> James Bodkin
>
> On 16/06/2020, 15:30, "Erick Erickson"  wrote:
>
> Did you try the autowarming like I mentioned in my previous e-mail?
>
> > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> james.bod...@loveholidays.com> wrote:
> >
> > We've changed the schema to enable docValues for these fields and
> this led to an improvement in the response time. We found a further
> improvement by also switching off indexed as these fields are used for
> faceting and filtering only.
> > Since those changes, we've found that the first-execution for
> > queries is really noticeable. I thought this would be the filterCache based
> on what I saw in NewRelic however it is probably trying to read the
> docValues from disk. How can we use the autowarming to improve this?
> >
> > For example, I've run the following queries in sequence and each
> query has a first-execution penalty.
> >
> > Query 1:
> >
> > q=*:*
> > facet=true
> > facet.field=D_DepartureAirport
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 2:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 3:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 4:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660+OR+2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > We've kept the field type as a string, as the value is mapped by
> application that accesses Solr. In the examples above, the values are
> mapped to airports and destinations.
> > Is it possible to prewarm the above queries without having to define
> all the potential filters manually in the auto warming?
> >
> > At the moment, we update and optimise our index in a different
> environment and then copy the index to our production instances by using a
> rolling deployment in Kubernetes.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 12/06/2020, 18:58, "Erick Erickson" 
> wrote:
> >
> >I question whether filterCache has anything to do with it, I
> suspect what’s really happening is that first time you’re reading the
> relevant bits from disk into memory. And to double check you should have
> docValues enabled for all these fields. The “uninverting” process can be
> very expensive, and docValues bypasses that.
> >
> >As of Solr 7.6, you can define “uninvertible=false” on your
> field(Type) to “fail fast” if Solr needs to uninvert the field.
> >
> >But that’s an aside. In either case, my claim is that 

Log4J Logging to Http

2020-06-17 Thread Krönert Florian
Hello everyone,

We want to log our queries to a HTTP endpoint and tried configuring our log4j 
settings accordingly.
We are using Solr inside Docker with the official Solr image (version 
solr:8.3.1).

As soon as we add a http appender, we receive errors on startup and solr fails 
to start completely:

2020-06-17T07:06:54.976390509Z DEBUG StatusLogger 
JsonLayout$Builder(propertiesAsList="null", objectMessageAsJsonObject="null", 
={}, eventEol="null", compact="null", complete="null", locationInfo="null", 
properties="true", includeStacktrace="null", stacktraceAsString="null", 
includeNullDelimiter="null", ={}, charset="null", footerSerializer=null, 
headerSerializer=null, Configuration(/var/solr/log4j2.xml), footer="null", 
header="null")
2020-06-17T07:06:55.121825039Z 2020-06-17 
07:06:55.104:WARN:oejw.WebAppContext:main: Failed startup of context 
o.e.j.w.WebAppContext@611df6e3{/solr,file:///opt/solr-8.3.1/server/solr-webapp/webapp/,UNAVAILABLE}{/opt/solr-8.3.1/server/solr-webapp/webapp}
2020-06-17T07:06:55.121856339Z java.lang.NoClassDefFoundError: Failed to 
initialize Apache Solr: Could not find necessary SLF4j logging jars. If using 
Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For 
other containers, the corresponding directory should be used. For more 
information, see: http://wiki.apache.org/solr/SolrLogging

It seems these jars are only needed when using the HTTP appender; without 
this appender everything works.
Can you point me in the right direction as to where I need to place the needed 
jars? This seems to be a little special since I only access the /var/solr mount 
directly; the rest is running in Docker.
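
For testing an HTTP appender outside of Solr, a throwaway receiver can help
separate appender problems from classpath problems. The sketch below is not
part of the setup described above: it assumes each POST body is a single
JSON log event with a "message" field, and the port and the names
parse_event and LogHandler are made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_event(body):
    """Decode one JSON log event; keep undecodable bodies as raw text."""
    try:
        event = json.loads(body.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return {"raw": body.decode("utf-8", errors="replace")}
    return event if isinstance(event, dict) else {"raw": event}


class LogHandler(BaseHTTPRequestHandler):
    """Print the message of every log event posted to this server."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = parse_event(self.rfile.read(length))
        print(event.get("message", event))
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for local testing.
    HTTPServer(("0.0.0.0", 8080), LogHandler).serve_forever()
```

Pointing the appender's URL at this receiver shows whether events arrive at
all before any real endpoint is involved.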

Kind Regards,

Florian Krönert
Senior Software Developer

ORBIS AG | Planckstraße 10 | D-88677 Markdorf
Phone: +49 7544 50398 21 | Mobile: +49 162 3065972 | E-Mail: 
florian.kroen...@orbis.de
www.orbis.de


Registered Seat: Saarbrücken
Commercial Register Court: Amtsgericht Saarbrücken, HRB 12022
Board of Management: Thomas Gard (Chairman), Michael Jung, Stefan Mailänder, 
Frank Schmelzer
Chairman of the Supervisory Board: Ulrich Holzer


Re: Master Slave Terminology

2020-06-17 Thread Jan Høydahl
Hi Kaya,

Thanks for bringing it up. The topic is already being discussed by developers, 
so expect to see some change in this area; not overnight, but incrementally.
Also, if you want to lend a helping hand, patches are more than welcome as 
always.

Jan

> On 17 Jun 2020, at 04:22, Kayak28 wrote:
> 
> Hello, Community:
> 
> As GitHub and Python will replace terminology related to slavery,
> why don't we replace master-slave in Solr as well?
> 
> https://developers.srad.jp/story/18/09/14/0935201/
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> 
> -- 
> 
> Sincerely,
> Kaya
> github: https://github.com/28kayak