Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-06 Thread Noble Paul
It can't be considered a bug, it's just that there are too many
calculations involved as there is a very large number of nodes. Any further
speed-up would require a change in the way it's calculated

On Thu, Sep 5, 2019, 1:30 AM Andrew Kettmann wrote:

>
> > there are known perf issues in computing very large clusters
>
> Is there any documentation/open tickets on this that you have handy? If
> that is the case, then we might be back to looking at separate Znodes.
> Right now if we provide a nodeset on collection creation, it is creating
> them quickly. I don't want to make many changes as this is part of our
> production at this time.
>
> From: Noble Paul
> Sent: Wednesday, September 4, 2019 12:14 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.7.2 Autoscaling policy - Poor performance
>
> there are known perf issues in computing very large clusters
>
> give it a try with the following rules
>
> "FOO_CUSTOMER":[
>   {
>     "replica":"0",
>     "sysprop.HELM_CHART":"!FOO_CUSTOMER",
>     "strict":"true"},
>   {
>     "replica":"<2",
>     "node":"#ANY",
>     "strict":"false"}]
>
> On Wed, Sep 4, 2019 at 1:49 AM Andrew Kettmann wrote:
> >
> > Currently our 7.7.2 cluster has ~600 hosts and each collection is using
> > an autoscaling policy based on system property. Our goal is a single core
> > per host (container, running on K8S). However as we have rolled more
> > containers/collections into the cluster any creation/move actions are
> > taking a huge amount of time. In fact we generally hit the 180 second
> > timeout if we don't schedule it as async. Though the action gets completed
> > anyway. Looking at the code, it looks like for each core it is considering
> > the entire cluster.
> >
> > Right now our autoscaling policies look like this, note we are feeding a
> > sysprop on startup for each collection to map to specific containers:
> >
> > "FOO_CUSTOMER":[
> >   {
> >     "replica":"#ALL",
> >     "sysprop.HELM_CHART":"FOO_CUSTOMER",
> >     "strict":"true"},
> >   {
> >     "replica":"<2",
> >     "node":"#ANY",
> >     "strict":"false"}]
> >
> > Does name based filtering allow wildcards ? Also would that likely fix
> > the issue of the time it takes for Solr to decide where cores can go? Or
> > any other suggestions for making this more efficient on the Solr overseer?
> > We do have dedicated overseer nodes, but the leader maxes out CPU for
> > awhile while it is thinking about this.
> >
> > We are considering putting each collection into its own zookeeper
> > znode/chroot if we can't support this many nodes per overseer. I would
> > like to avoid that if possible, but also creating a collection in sub 10
> > minutes would be neat too.
> >
> > I appreciate any input/suggestions anyone has!
>
> --
> -
> Noble Paul


Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-04 Thread Andrew Kettmann

> there are known perf issues in computing very large clusters

Is there any documentation/open tickets on this that you have handy? If that is 
the case, then we might be back to looking at separate Znodes. Right now if we 
provide a nodeset on collection creation, it is creating them quickly. I don't 
want to make many changes as this is part of our production at this time. 
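
(For anyone reading along, by "a nodeset" I mean the createNodeSet parameter on
the CREATE call, so placement is explicit rather than computed by the policy
engine; roughly like this, with made-up node names:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=foo_customer&collection.configName=foo_conf&numShards=3&replicationFactor=2&createNodeSet=solr-101.DOMAIN:8983_solr,solr-102.DOMAIN:8983_solr,solr-103.DOMAIN:8983_solr'
)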





From: Noble Paul 

Sent: Wednesday, September 4, 2019 12:14 AM

To: solr-user@lucene.apache.org 

Subject: Re: Solr 7.7.2 Autoscaling policy - Poor performance

 


there are known perf issues in computing very large clusters



give it a try with the following rules



"FOO_CUSTOMER":[

  {

    "replica":"0",

    "sysprop.HELM_CHART":"!FOO_CUSTOMER",

    "strict":"true"},

  {

    "replica":"<2",

    "node":"#ANY",

    "strict":"false"}]



On Wed, Sep 4, 2019 at 1:49 AM Andrew Kettmann wrote:

>
> Currently our 7.7.2 cluster has ~600 hosts and each collection is using an
> autoscaling policy based on system property. Our goal is a single core per
> host (container, running on K8S). However as we have rolled more
> containers/collections into the cluster any creation/move actions are taking
> a huge amount of time. In fact we generally hit the 180 second timeout if we
> don't schedule it as async. Though the action gets completed anyway. Looking
> at the code, it looks like for each core it is considering the entire cluster.

>

> Right now our autoscaling policies look like this, note we are feeding a 
> sysprop on startup for each collection to map to specific containers:

>

> "FOO_CUSTOMER":[

>   {

> "replica":"#ALL",

> "sysprop.HELM_CHART":"FOO_CUSTOMER",

> "strict":"true"},

>   {

> "replica":"<2",

> "node":"#ANY",

> "strict":"false"}]

>

> Does name based filtering allow wildcards ? Also would that likely fix the
> issue of the time it takes for Solr to decide where cores can go? Or any
> other suggestions for making this more efficient on the Solr overseer? We do
> have dedicated overseer nodes, but the leader maxes out CPU for awhile while
> it is thinking about this.

>

> We are considering putting each collection into its own zookeeper 
> znode/chroot if we can't support this many nodes per overseer. I would like 
> to avoid that if possible, but also creating a collection in sub 10 minutes 
> would be neat too.

>

> I appreciate any input/suggestions anyone has!

>








-- 

-

Noble Paul



Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Noble Paul
there are known perf issues in computing very large clusters

give it a try with the following rules

"FOO_CUSTOMER":[
  {
"replica":"0",
"sysprop.HELM_CHART":"!FOO_CUSTOMER",
"strict":"true"},
  {
"replica":"<2",
"node":"#ANY",
"strict":"false"}]

On Wed, Sep 4, 2019 at 1:49 AM Andrew Kettmann wrote:
>
> Currently our 7.7.2 cluster has ~600 hosts and each collection is using an 
> autoscaling policy based on system property. Our goal is a single core per 
> host (container, running on K8S). However as we have rolled more 
> containers/collections into the cluster any creation/move actions are taking 
> a huge amount of time. In fact we generally hit the 180 second timeout if we 
> don't schedule it as async. Though the action gets completed anyway. Looking 
> at the code, it looks like for each core it is considering the entire cluster.
>
> Right now our autoscaling policies look like this, note we are feeding a 
> sysprop on startup for each collection to map to specific containers:
>
> "FOO_CUSTOMER":[
>   {
> "replica":"#ALL",
> "sysprop.HELM_CHART":"FOO_CUSTOMER",
> "strict":"true"},
>   {
> "replica":"<2",
> "node":"#ANY",
> "strict":"false"}]
>
> Does name based filtering allow wildcards ? Also would that likely fix the 
> issue of the time it takes for Solr to decide where cores can go? Or any 
> other suggestions for making this more efficient on the Solr overseer? We do 
> have dedicated overseer nodes, but the leader maxes out CPU for awhile while 
> it is thinking about this.
>
> We are considering putting each collection into its own zookeeper 
> znode/chroot if we can't support this many nodes per overseer. I would like 
> to avoid that if possible, but also creating a collection in sub 10 minutes 
> would be neat too.
>
> I appreciate any input/suggestions anyone has!
>



-- 
-
Noble Paul


Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Mark Miller
Hook up a profiler to the overseer and see what it's doing, file a JIRA and
note the hotspots or what methods appear to be hanging out.
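
Even a couple of thread dumps taken while a create/move is grinding would show
where the time goes; for example, against the overseer JVM (the pid is whatever
your container reports):

# take a few dumps ~10s apart while the operation is running
jstack -l <overseer-jvm-pid> > overseer-threads-$(date +%s).txt

# or watch which threads are burning CPU
top -H -p <overseer-jvm-pid>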

On Tue, Sep 3, 2019 at 1:15 PM Andrew Kettmann wrote:

>
> > You’re going to want to start by having more than 3gb for memory in my
> > opinion but the rest of your set up is more complex than I’ve dealt with.
>
> right now the overseer is set to a max heap of 3GB, but is only using
> ~260MB of heap, so memory doesn't seem to be the issue unless there is a
> part of the picture I am missing there?
>
> Our overseers only jobs are being overseer and holding the .system
> collection. I would imagine if the overseer were hitting memory constraints
> it would have allocated more than 300MB of the total 3GB it is allowed,
> right?
>


-- 
- Mark

http://about.me/markrmiller


Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann

> You’re going to want to start by having more than 3gb for memory in my 
> opinion but the rest of your set up is more complex than I’ve dealt with.

right now the overseer is set to a max heap of 3GB, but is only using ~260MB of 
heap, so memory doesn't seem to be the issue unless there is a part of the 
picture I am missing there?

Our overseers' only jobs are being overseer and holding the .system collection.
I would imagine if the overseer were hitting memory constraints it would have
allocated more than 300MB of the total 3GB it is allowed, right?
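
(If anyone wants to sanity-check the same thing on their own nodes, the JVM
group of the Metrics API reports heap used/max; something along these lines:

curl 'http://localhost:8983/solr/admin/metrics?group=jvm&prefix=memory.heap'
)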



Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Dave
You’re going to want to start by having more than 3gb for memory in my opinion 
but the rest of your set up is more complex than I’ve dealt with. 

On Sep 3, 2019, at 1:10 PM, Andrew Kettmann wrote:

>> How many zookeepers do you have? How many collections? What is there size?
>> How much CPU / memory do you give per container? How much heap in comparison 
>> to total memory of the container ?
> 
> 3 Zookeepers.
> 733 containers/nodes
> 735 total cores. Each core ranges from ~4-10GB of index. (Autoscaling splits 
> at 12GB)
> 10 collections, ranging from 147 shards at most, to 3 at least. Replication 
> factor of 2 other than .system which has 3 replicas.
> Each container has a min/max heap of 750MB other than the overseer containers 
> which have a min/max of 3GB.
> Containers aren't hard limited by K8S on memory or CPU but the machines the 
> containers are on have 4 cores and ~13GB of ram.
> 
> Now that I look at the CPU usage on a per container basis, it looks like it 
> is maxing out all four cores on the VM that is hosting the overseer 
> container. Barely using the heap (300MB).
> 
> I suppose that means that if we put the overseers on machines with more 
> cores, it might be able to get things done a bit faster. Though that still 
> seems like a limited solution as we are going to grow this cluster at least 
> double in size if not larger.
> 
> We are using the solr:7.7.2 container.
> 
> Java Options on the home page are below:
>-DHELM_CHART=overseer
>-DSTOP.KEY=solrrocks
>-DSTOP.PORT=7983
>-Dhost=overseer-solr-0.solr.DOMAIN
>-Djetty.home=/opt/solr/server
>-Djetty.port=8983
>-Dsolr.data.home=
>-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
>-Dsolr.install.dir=/opt/solr
>-Dsolr.jetty.https.port=8983
>-Dsolr.log.dir=/opt/solr/server/logs
>-Dsolr.log.level=INFO
>-Dsolr.solr.home=/opt/solr/server/home
>-Duser.timezone=UTC
>-DzkClientTimeout=6
>
> -DzkHost=zookeeper-1.DOMAIN:2181,zookeeper-2.DOMAIN:2181,zookeeper-3.DOMAIN:2181/ZNODE
>-XX:+CMSParallelRemarkEnabled
>-XX:+CMSScavengeBeforeRemark
>-XX:+ParallelRefProcEnabled
>-XX:+UseCMSInitiatingOccupancyOnly
>-XX:+UseConcMarkSweepGC
>-XX:-OmitStackTraceInFastThrow
>-XX:CMSInitiatingOccupancyFraction=50
>-XX:CMSMaxAbortablePrecleanTime=6000
>-XX:ConcGCThreads=4
>-XX:MaxTenuringThreshold=8
>-XX:NewRatio=3
>-XX:ParallelGCThreads=4
>-XX:PretenureSizeThreshold=64m
>-XX:SurvivorRatio=4
>-XX:TargetSurvivorRatio=90
>
> -Xlog:gc*:file=/opt/solr/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
>-Xmx3g
>-Xmx3g
>-Xss256k
> 
> 
> 


Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann
> How many zookeepers do you have? How many collections? What is there size?
> How much CPU / memory do you give per container? How much heap in comparison 
> to total memory of the container ?

3 Zookeepers.
733 containers/nodes
735 total cores. Each core ranges from ~4-10GB of index. (Autoscaling splits at 
12GB)
10 collections, ranging from 147 shards at most, to 3 at least. Replication 
factor of 2 other than .system which has 3 replicas.
Each container has a min/max heap of 750MB other than the overseer containers 
which have a min/max of 3GB.
Containers aren't hard limited by K8S on memory or CPU but the machines the 
containers are on have 4 cores and ~13GB of ram.

Now that I look at the CPU usage on a per container basis, it looks like it is 
maxing out all four cores on the VM that is hosting the overseer container. 
Barely using the heap (300MB).
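
(For anyone wanting to check the same thing: the Collections API can tell you
which node currently holds the overseer role, plus some queue/operation stats,
e.g.:

curl 'http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS'
)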

I suppose that means that if we put the overseers on machines with more cores, 
it might be able to get things done a bit faster. Though that still seems like 
a limited solution as we are going to grow this cluster at least double in size 
if not larger.

We are using the solr:7.7.2 container.

Java Options on the home page are below:
-DHELM_CHART=overseer
-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=overseer-solr-0.solr.DOMAIN
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dsolr.data.home=
-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
-Dsolr.install.dir=/opt/solr
-Dsolr.jetty.https.port=8983
-Dsolr.log.dir=/opt/solr/server/logs
-Dsolr.log.level=INFO
-Dsolr.solr.home=/opt/solr/server/home
-Duser.timezone=UTC
-DzkClientTimeout=6

-DzkHost=zookeeper-1.DOMAIN:2181,zookeeper-2.DOMAIN:2181,zookeeper-3.DOMAIN:2181/ZNODE
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90

-Xlog:gc*:file=/opt/solr/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xmx3g
-Xmx3g
-Xss256k





Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Jörn Franke
How many zookeepers do you have? How many collections? What is their size?
How much CPU / memory do you give per container? How much heap in comparison to
total memory of the container?

> On 03.09.2019 at 17:49, Andrew Kettmann wrote:
> 
> Currently our 7.7.2 cluster has ~600 hosts and each collection is using an 
> autoscaling policy based on system property. Our goal is a single core per 
> host (container, running on K8S). However as we have rolled more 
> containers/collections into the cluster any creation/move actions are taking 
> a huge amount of time. In fact we generally hit the 180 second timeout if we 
> don't schedule it as async. Though the action gets completed anyway. Looking 
> at the code, it looks like for each core it is considering the entire cluster.
> 
> Right now our autoscaling policies look like this, note we are feeding a 
> sysprop on startup for each collection to map to specific containers:
> 
> "FOO_CUSTOMER":[
>  {
>"replica":"#ALL",
>"sysprop.HELM_CHART":"FOO_CUSTOMER",
>"strict":"true"},
>  {
>"replica":"<2",
>"node":"#ANY",
>"strict":"false"}]
> 
> Does name based filtering allow wildcards ? Also would that likely fix the 
> issue of the time it takes for Solr to decide where cores can go? Or any 
> other suggestions for making this more efficient on the Solr overseer? We do 
> have dedicated overseer nodes, but the leader maxes out CPU for awhile while 
> it is thinking about this.
> 
> We are considering putting each collection into its own zookeeper 
> znode/chroot if we can't support this many nodes per overseer. I would like 
> to avoid that if possible, but also creating a collection in sub 10 minutes 
> would be neat too.
> 
> I appreciate any input/suggestions anyone has!
> 


Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann
Currently our 7.7.2 cluster has ~600 hosts and each collection is using an 
autoscaling policy based on system property. Our goal is a single core per host 
(container, running on K8S). However as we have rolled more 
containers/collections into the cluster any creation/move actions are taking a 
huge amount of time. In fact we generally hit the 180 second timeout if we 
don't schedule it as async. Though the action gets completed anyway. Looking at 
the code, it looks like for each core it is considering the entire cluster.
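
(By "schedule it as async" I mean the stock async= mechanism on the Collections
API; roughly like this, with a made-up request id and node name:

curl 'http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=foo_customer&replica=core_node12&targetNode=solr-123.DOMAIN:8983_solr&async=move-12345'

curl 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=move-12345'
)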

Right now our autoscaling policies look like this, note we are feeding a 
sysprop on startup for each collection to map to specific containers:

"FOO_CUSTOMER":[
  {
"replica":"#ALL",
"sysprop.HELM_CHART":"FOO_CUSTOMER",
"strict":"true"},
  {
"replica":"<2",
"node":"#ANY",
"strict":"false"}]

Does name-based filtering allow wildcards? Also, would that likely fix the
issue of the time it takes for Solr to decide where cores can go? Or any other
suggestions for making this more efficient on the Solr overseer? We do have
dedicated overseer nodes, but the leader maxes out CPU for a while while it is
thinking about this.

We are considering putting each collection into its own zookeeper znode/chroot 
if we can't support this many nodes per overseer. I would like to avoid that if 
possible, but also creating a collection in sub 10 minutes would be neat too.
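
(Concretely that would mean running each collection as its own small SolrCloud
cluster under its own chroot, e.g. one set of nodes started with

-DzkHost=zookeeper-1.DOMAIN:2181,zookeeper-2.DOMAIN:2181,zookeeper-3.DOMAIN:2181/FOO_CUSTOMER

and the next set under a different chroot, so each overseer only ever has to
reason about its own handful of nodes. The chroot names here are illustrative.)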

I appreciate any input/suggestions anyone has!

Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836
