Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Yago Riveiro
bq: "That is really a job for streaming, not simple faceting.”

True, it's the next step to improve our performance (right now we are using
JSON facets), and 6.3.0 has a lot of useful tools for working with streaming
expressions. The release we ran before 6.3 was 5.3.1, where streaming
expressions were buggy in some scenarios.
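
For context, a minimal sketch of the kind of expression we would move to,
sent to the /stream handler with curl. The collection name (clientdata) and
field name (url) are placeholders, and /export requires docValues on the
fields it sorts and returns:

    # Count docs per unique URL; tuples stream sorted from /export,
    # so rollup() aggregates one bucket at a time instead of all at once
    curl --data-urlencode 'expr=rollup(
        search(clientdata, q="*:*", fl="url", sort="url asc", qt="/export"),
        over="url",
        count(*))' \
      "http://localhost:8983/solr/clientdata/stream"

Because the stream arrives sorted by url, no single node ever has to hold the
full set of unique URLs in memory, which is exactly the problem we hit with
facets at limit=-1.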

bq: "Okay. You could create a new collection with the wanted amount of shards 
and do a full re-index into that.”

True, you are right, but we are trying to avoid that (this point falls under
"keep management low").

Solr is an amazing tool, but it lacks auto-magic management. You have all the
power, and therefore all the work :p

Following your advice, I will review the topology of my collections and try to
pinpoint the over-sharded ones.

--

/Yago Riveiro

On 27 Dec 2016 21:54, Toke Eskildsen wrote:
> Yago Riveiro wrote:
> > One thing that I forgot to mention is that my clients can aggregate
> > by any field in the schema with limit=-1. This is not a problem with
> > 99% of the fields, but 2 or 3 of them are URLs. URLs have very
> > high cardinality, and one of the reasons for sharding collections is
> > to lower the memory footprint so a node doesn't blow up, and to do
> > the last merge on a big machine.
>
> That is really a job for streaming, not simple faceting.
>
> Even if you insist on faceting, the problem remains that your merger needs to
> be powerful enough to process the full result set. Using that machine with a
> single-shard collection instead would eliminate the excessive overhead of
> doing distributed faceting on millions of values, freeing up a lot of
> hardware that could be used to beef up the single-shard machine even more.
>
> [Toke: You can always split later]
>
> > Every time I run the SPLITSHARD command, it fails in a different way.
> > IMHO, right now Solr doesn't have an efficient way to rebalance a
> > collection's shards.
>
> Okay. You could create a new collection with the desired number of shards
> and do a full re-index into that.
>
> [Toke: "And yes, more logistics on your part as one size no longer fits all”]
>
> > The key point of this deployment is to reduce the amount of management
> > as much as possible,
>
> That is your prerogative. I hope my suggestions can then be used by other
> people with similar challenges.
>
> - Toke Eskildsen


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Toke Eskildsen
Yago Riveiro wrote:
> One thing that I forgot to mention is that my clients can aggregate
> by any field in the schema with limit=-1. This is not a problem with
> 99% of the fields, but 2 or 3 of them are URLs. URLs have very
> high cardinality, and one of the reasons for sharding collections is
> to lower the memory footprint so a node doesn't blow up, and to do
> the last merge on a big machine.

That is really a job for streaming, not simple faceting.

Even if you insist on faceting, the problem remains that your merger needs to
be powerful enough to process the full result set. Using that machine with a
single-shard collection instead would eliminate the excessive overhead of
doing distributed faceting on millions of values, freeing up a lot of hardware
that could be used to beef up the single-shard machine even more.

[Toke: You can always split later]

> Every time I run the SPLITSHARD command, it fails in a different way.
> IMHO, right now Solr doesn't have an efficient way to rebalance a
> collection's shards.

Okay. You could create a new collection with the desired number of shards and
do a full re-index into that.
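
A sketch of that route with the Collections API; the collection, alias, and
config set names below are just placeholders:

    # Create a right-sized target collection and re-index into it
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=clientdata_v2&numShards=1&replicationFactor=2&collection.configName=clientdata_conf"

    # When re-indexing is done, repoint an alias so clients never notice
    curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=clientdata&collections=clientdata_v2"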

[Toke: "And yes, more logistics on your part as one size no longer fits all”]

> The key point of this deployment is to reduce the amount of management
> as much as possible,

That is your prerogative. I hope my suggestions can then be used by other
people with similar challenges.

- Toke Eskildsen


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Yago Riveiro
One thing that I forgot to mention is that my clients can aggregate by any
field in the schema with limit=-1. This is not a problem with 99% of the
fields, but 2 or 3 of them are URLs. URLs have very high cardinality, and one
of the reasons for sharding collections is to lower the memory footprint so a
node doesn't blow up, and to do the last merge on a big machine.

"Should a collection grow past whatever threshold you determine, you can always 
split it.”

Every time I run the SPLITSHARD command, it fails in a different way. IMHO,
right now Solr doesn't have an efficient way to rebalance a collection's
shards.

"And yes, more logistics on your part as one size no longer fits all”

The key point of this deployment is to reduce the amount of management as much
as possible. Solr has improved cluster management a lot compared to the 4.x
releases. Even so, it remains difficult to manage a big cluster without custom
tools.

Solr continues to improve with each version, and I have seen issues with a lot
of nice stuff, like SOLR-9735 and SOLR-9241.

--

/Yago Riveiro

On 26 Dec 2016 22:10, Toke Eskildsen wrote:
> Yago Riveiro wrote:
> > My cluster holds more than 10B documents stored in 15TB.
> >
> > The size of my collections is variable, but I have collections with 800M
> > documents distributed over the 12 nodes; the number of documents per shard
> > is ~66M, and indeed the performance is good.
>
> The math supports Erick's point about over-sharding. On average you have:
> 15 TB / 1200 collections / 12 shards ~= 1 GB/shard
> 10B docs / 1200 collections / 12 shards ~= 700K docs/shard
>
> While your 12 shards fit well with your large collections, such as the one 
> you described above, they are a very poor match for your average collection. 
> Assuming your collections behave roughly the same way as each other, your 
> average and smaller than average collections would be much better off with 
> just 1 shard (and 2 replicas). That eliminates the overhead of distributed 
> search-requests (for that collection) and lowers your overall shard-count 
> significantly. Should a collection grow past whatever threshold you 
> determine, you can always split it.
>
> Better performance, lower hardware requirements, more manageable shard 
> amount. And yes, more logistics on your part as one size no longer fits all.
>
> - Toke Eskildsen


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-26 Thread Toke Eskildsen
Yago Riveiro wrote:
> My cluster holds more than 10B documents stored in 15TB.
>
> The size of my collections is variable, but I have collections with 800M
> documents distributed over the 12 nodes; the number of documents per shard
> is ~66M, and indeed the performance is good.

The math supports Erick's point about over-sharding. On average you have:
15 TB / 1200 collections / 12 shards ~= 1 GB/shard
10B docs / 1200 collections / 12 shards ~= 700K docs/shard

While your 12 shards fit well with your large collections, such as the one you 
described above, they are a very poor match for your average collection. 
Assuming your collections behave roughly the same way as each other, your 
average and smaller than average collections would be much better off with just 
1 shard (and 2 replicas). That eliminates the overhead of distributed 
search-requests (for that collection) and lowers your overall shard-count 
significantly. Should a collection grow past whatever threshold you determine, 
you can always split it.
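
Splitting is a single Collections API call (names below are placeholders). On
large shards it can run for a long time, so the async parameter plus
REQUESTSTATUS polling is advisable:

    # Split shard1 into two sub-shards, tracked as async request "split-1"
    curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=clientdata&shard=shard1&async=split-1"

    # Poll until the split reports completed
    curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-1"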

Better performance, lower hardware requirements, more manageable shard amount. 
And yes, more logistics on your part as one size no longer fits all.

- Toke Eskildsen


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-26 Thread Yago Riveiro
My cluster holds more than 10B documents stored in 15TB.

The size of my collections is variable, but I have collections with 800M
documents distributed over the 12 nodes; the number of documents per shard
is ~66M, and indeed the performance is good.

I need the collections to isolate the data of my clients and for scalability
reasons. Isolating data in collections gives me the power to allocate the data
to new machines in an easy way, or to promote my clients to better hardware.
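
For example, promoting a client to better hardware is just replica placement
through the Collections API; the collection, node, and replica names here are
placeholders:

    # Add a replica of the client's shard on the new, beefier node
    curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=client42&shard=shard1&node=newhost:8983_solr"

    # Once it reports active, drop the replica on the old node
    curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=client42&shard=shard1&replica=core_node3"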

In a situation like that, fast restarts are critical to ensure availability
and to recover from situations where 2 or more nodes go down.




-
Best regards


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-15 Thread Erick Erickson
Right, so if I'm doing the math right you have 2,400 replicas per JVM?
I'm not clear whether each node has a single JVM or not.

Anyway. 2048 is indeed much too high. If nothing else, dropping it to,
say, 64 would show whether this was the real root of your problem or not.
Even if it slowed startup unacceptably, it would show you that this was,
indeed, the problem.
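
If it helps, the knob lives in solr.xml; a sketch of just the relevant
fragment, with everything else elided:

    <solr>
      <!-- number of threads used to load cores in parallel at startup -->
      <int name="coreLoadThreads">64</int>
    </solr>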

Is this a multi-tenant situation? I'm trying to understand why you need
so many cores. Having 1,200 collections each with 12 shards seems like
massive over-sharding. How many docs exist in each core? I'm
wondering if you've backed yourself into a corner by unnecessary sharding.
If you could, say, reduce your shards per collection to 2 (or even one?) you
might get out of this bind cheaply.
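
A quick way to answer the docs-per-core question, using the core name from
your stack trace; distrib=false makes only the local core answer, and numFound
in the response is that core's document count:

    curl "http://localhost:8983/solr/collection1_shard3_replica2/select?q=*:*&rows=0&distrib=false"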

I regularly see 50M docs on a single shard give very good performance
FWIW.

Best,
Erick

On Thu, Dec 15, 2016 at 11:55 AM, Yago Riveiro wrote:
> Yes, I changed the value of coreLoadThreads.
>
> With the default value a node takes like 40 minutes to be available with all 
> replicas up.
>
> Right now I have ~1.2K collections with 12 shards each, 2 replicas spread
> over 12 nodes. Indeed, the value I configured may be too high (2048), but I
> can start nodes in 10 minutes.
>
> Maybe I need to lower the value to something more conservative.
>
> --
>
> /Yago Riveiro
>
> On 15 Dec 2016, 16:43, Erick Erickson wrote:
>> Hmmm, have you changed coreLoadThreads? We had a problem with this a
>> while back with loading lots and lots of cores, see:
>> https://issues.apache.org/jira/browse/SOLR-7280
>>
>> But that was fixed in 6.2, so unless you changed the number of threads
>> used to load cores it shouldn't be a problem on 6.3...
>>
>> The symptom was also that replicas would never change to "active",
>> they'd be stuck in recovery or down.
>>
>> Best,
>> Erick
>>
>> On Thu, Dec 15, 2016 at 3:07 AM, Yago Riveiro wrote:
>> > Hi,
>> >
>> > I'm getting this error in my log
>> >
>> > 12/15/2016, 9:28:18 AM ERROR true ExecutorUtil Uncaught exception
>> > java.lang.StackOverflowError thrown by thread:
>> > coreZkRegister-1-thread-48-processing-n:XXX.XXX.XXX.XXX:8983_solr
>> > x:collection1_shard3_replica2 s:shard3 c:collection1-visitors r:core_node5
>> > java.lang.Exception: Submitter stack trace
>> > at
>> > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:204)
>> > at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:204)
>> > at org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:505)
>> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> > at
>> > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> > at
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> > at
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> > at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> >
>> > -
>> > Best regards


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-15 Thread Yago Riveiro
Yes, I changed the value of coreLoadThreads.

With the default value a node takes like 40 minutes to be available with all 
replicas up.

Right now I have ~1.2K collections with 12 shards each, 2 replicas spread over
12 nodes. Indeed, the value I configured may be too high (2048), but I can
start nodes in 10 minutes.

Maybe I need to lower the value to something more conservative.

--

/Yago Riveiro

On 15 Dec 2016, 16:43, Erick Erickson wrote:
> Hmmm, have you changed coreLoadThreads? We had a problem with this a
> while back with loading lots and lots of cores, see:
> https://issues.apache.org/jira/browse/SOLR-7280
>
> But that was fixed in 6.2, so unless you changed the number of threads
> used to load cores it shouldn't be a problem on 6.3...
>
> The symptom was also that replicas would never change to "active",
> they'd be stuck in recovery or down.
>
> Best,
> Erick
>
> > On Thu, Dec 15, 2016 at 3:07 AM, Yago Riveiro wrote:
> > Hi,
> >
> > I'm getting this error in my log
> >
> > 12/15/2016, 9:28:18 AM ERROR true ExecutorUtil Uncaught exception
> > java.lang.StackOverflowError thrown by thread:
> > coreZkRegister-1-thread-48-processing-n:XXX.XXX.XXX.XXX:8983_solr
> > x:collection1_shard3_replica2 s:shard3 c:collection1-visitors r:core_node5
> > java.lang.Exception: Submitter stack trace
> > at
> > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:204)
> > at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:204)
> > at org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:505)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> >
> >
> >
> > -
> > Best regards


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-15 Thread Erick Erickson
Hmmm, have you changed coreLoadThreads? We had a problem with this a
while back with loading lots and lots of cores, see:
https://issues.apache.org/jira/browse/SOLR-7280

But that was fixed in 6.2, so unless you changed the number of threads
used to load cores it shouldn't be a problem on 6.3...

The symptom was also that replicas would never change to "active",
they'd be stuck in recovery or down.

Best,
Erick

On Thu, Dec 15, 2016 at 3:07 AM, Yago Riveiro wrote:
> Hi,
>
> I'm getting this error in my log
>
> 12/15/2016, 9:28:18 AM ERROR true ExecutorUtil Uncaught exception
> java.lang.StackOverflowError thrown by thread:
> coreZkRegister-1-thread-48-processing-n:XXX.XXX.XXX.XXX:8983_solr
> x:collection1_shard3_replica2 s:shard3 c:collection1-visitors r:core_node5
> java.lang.Exception: Submitter stack trace
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:204)
> at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:204)
> at 
> org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:505)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> -
> Best regards


Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-15 Thread Yago Riveiro
Hi,

I'm getting this error in my log

12/15/2016, 9:28:18 AM ERROR true ExecutorUtil Uncaught exception
java.lang.StackOverflowError thrown by thread:
coreZkRegister-1-thread-48-processing-n:XXX.XXX.XXX.XXX:8983_solr
x:collection1_shard3_replica2 s:shard3 c:collection1-visitors r:core_node5
java.lang.Exception: Submitter stack trace
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:204)
at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:204)
at 
org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:505)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



-
Best regards