Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Tim Vaillancourt
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
  at java.lang.Thread.run(Thread.java:724)

On your live_nodes question, I don't have historical data on this from when
the crash occurred, which I guess is what you're looking for. I could add
this to our monitoring for future tests, however. I'd be glad to continue
further testing, but I think first more monitoring is needed to understand
this further. Could we come up with a list of metrics that would be useful
to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?
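
For metric 1, a minimal sketch of polling the live_nodes count straight from
ZooKeeper with the plain ZooKeeper Java client; the address is a placeholder
and the cluster is assumed to have no chroot, neither of which comes from this
thread:

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Prints the number of children of /live_nodes, the znode where SolrCloud
    // registers each live instance. Run it on an interval and feed the output
    // into whatever collects the other metrics.
    public class LiveNodesCount {
        public static void main(String[] args) throws Exception {
            String zkHost = args.length > 0 ? args[0] : "localhost:2181"; // placeholder address
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(zkHost, 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live_nodes=" + liveNodes.size() + " " + liveNodes);
            zk.close();
        }
    }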

Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller markrmil...@gmail.com wrote:


Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other
than allowing for some more threads to possibly run at once. If I had to
guess, I would say it's likely this patch fixes the deadlock issue and you're
seeing another issue - which looks like the system cannot keep up with the
requests or something for some reason - perhaps due to some OS networking
settings or something (more guessing). Connection refused happens generally
when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU
usage spikes or something like that?

Clamping down further on the overall number of threads might help (which
would require making something configurable). How many nodes are listed in
zk under live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com wrote:


Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after
a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).

Our environment:

   Solr 4.3.1 w/SOLR-5216 patch.
   Jetty 9, Java 1.7.
   3 solr instances, 1 per physical server.
   1 collection.
   3 shards.
   2 replicas (each instance is a leader and a replica).
   Soft autoCommit is 1000ms.
   Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with Connection Refused exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script normalizes the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote:

Thanks!

-Original message-

From:Erick Erickson erickerick...@gmail.com
Sent: Friday 6th September 2013 16:20
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume

Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:


Hi Mark,

Got an issue to watch?

Thanks,
Markus

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Erick Erickson

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Tim Vaillancourt

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Mark Miller

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Erick Erickson

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Tim Vaillancourt

Re: SolrCloud 4.x hangs under high update volume

2013-09-11 Thread Erick Erickson

Re: SolrCloud 4.x hangs under high update volume

2013-09-10 Thread Tim Vaillancourt

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Erick Erickson
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi Mark,

 Got an issue to watch?

 Thanks,
 Markus

 -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 4th September 2013 16:55
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  I'm going to try and fix the root cause for 4.5 - I've suspected what it
 is since early this year, but it's never personally been an issue, so it's
 rolled along for a long time.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
   Hey guys,
  
   I am looking into an issue we've been having with SolrCloud since the
   beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
 4.4.0
   yet). I've noticed other users with this same issue, so I'd really
 like to
   get to the bottom of it.
  
   Under a very, very high rate of updates (2000+/sec), after 1-12 hours
 we
   see stalled transactions that snowball to consume all Jetty threads in
 the
   JVM. This eventually causes the JVM to hang with most threads waiting
 on
   the condition/stack provided at the bottom of this message. At this
 point
   SolrCloud instances then start to see their neighbors (who also have
 all
   threads hung) as down w/Connection Refused, and the shards become
 down
   in state. Sometimes a node or two survives and just returns 503s no
 server
   hosting shard errors.
  
   As a workaround/experiment, we have tuned the number of threads sending
   updates to Solr, as well as the batch size (we batch updates from
 client -
   solr), and the Soft/Hard autoCommits, all to no avail. Turning off
   Client-to-Solr batching (1 update = 1 call to Solr), which also did not
   help. Certain combinations of update threads and batch sizes seem to
   mask/help the problem, but not resolve it entirely.
  
   Our current environment is the following:
   - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
   - 3 x Zookeeper instances, external Java 7 JVM.
   - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
 and
   a replica of 1 shard).
   - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
 good
   day.
   - 5000 max jetty threads (well above what we use when we are healthy),
   Linux-user threads ulimit is 6000.
   - Occurs under Jetty 8 or 9 (many versions).
   - Occurs under Java 1.6 or 1.7 (several minor versions).
   - Occurs under several JVM tunings.
   - Everything seems to point to Solr itself, and not a Jetty or Java
 version
   (I hope I'm wrong).
  
   The stack trace that is holding up all my Jetty QTP threads is the
   following, which seems to be waiting on a lock that I would very much
 like
   to understand further:
  
   java.lang.Thread.State: WAITING (parking)
     at sun.misc.Unsafe.park(Native Method)
     - parking to wait for 0x0007216e68d8 (a java.util.concurrent.Semaphore$NonfairSync)
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
     at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
     at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
     at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486
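
The trace above parks in Semaphore.acquire() by way of Solr's
AdjustableSemaphore, which appears to bound concurrent distributed update
requests. The following toy program (not Solr source; the pool and permit
sizes are invented for illustration) sketches how that pattern can wedge a
bounded pool: tasks hold permits while waiting on further tasks that need a
thread from the same, already-exhausted pool.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Toy illustration only: a fixed pool whose tasks block in Semaphore.acquire()
    // while the work that would release those permits needs a thread from the
    // same, already-exhausted pool.
    public class SemaphoreStarvationDemo {
        static final ExecutorService POOL = Executors.newFixedThreadPool(4); // stand-in for the Jetty QTP
        static final Semaphore PERMITS = new Semaphore(2);                   // stand-in for AdjustableSemaphore

        public static void main(String[] args) {
            for (int i = 0; i < 100; i++) {       // flood of concurrent "updates"
                POOL.submit(SemaphoreStarvationDemo::handleUpdate);
            }
            // No shutdown(): the point is to observe the hang in a thread dump.
        }

        static void handleUpdate() {
            try {
                PERMITS.acquire();                // parks here once both permits are taken
                try {
                    // Forwarding the update needs another pool thread, but every
                    // pool thread is already stuck inside handleUpdate().
                    POOL.submit(SemaphoreStarvationDemo::forwardToReplica).get();
                } finally {
                    PERMITS.release();
                }
            } catch (Exception e) {
                Thread.currentThread().interrupt();
            }
        }

        static void forwardToReplica() {
            // Pretend network call to a replica.
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        }
    }

With those numbers, every worker ends up parked either in acquire() or waiting
on a task that can never be scheduled, which lines up with the "all Jetty QTP
threads waiting" symptom described in this thread.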

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a
few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).
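
As a rough sketch of the client-side batching being described (the actual
indexer isn't posted in this thread), the 4.x SolrJ CloudSolrServer API can be
used like this; the ZooKeeper addresses, collection name, and field names are
placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Sends documents in batches of 10 and never calls commit(); the server-side
    // soft/hard autoCommit settings are what make the updates visible.
    public class BatchedIndexer {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // placeholder ZK hosts
            solr.setDefaultCollection("collection1");                                 // placeholder collection

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("body_t", "example document " + i);
                batch.add(doc);
                if (batch.size() == 10) {        // 10 docs per update request, as in the test
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.shutdown();
        }
    }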

Our environment:

Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with Connection Refused exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script normalizes the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb
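
The normalization script itself isn't posted; a rough stand-in for the idea
(assuming Log4j-style lines that contain the literal severity ERROR, and
treating any run of digits as noise) could look like:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Groups ERROR lines from a Solr log after stripping the volatile parts
    // (leading date/time fields, digits), then prints each distinct message
    // with a count, in the order it was first seen.
    public class ErrorLogSummary {
        public static void main(String[] args) throws IOException {
            String path = args.length > 0 ? args[0] : "solr.log";
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
                if (!line.contains("ERROR")) {
                    continue;
                }
                String normalized = line
                        .replaceAll("^\\S+\\s+\\S+\\s+", "")   // drop the first two fields (typically date/time)
                        .replaceAll("\\d+", "N");              // collapse numbers (ports, line numbers, ids)
                Integer seen = counts.get(normalized);
                counts.put(normalized, seen == null ? 1 : seen + 1);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getValue() + "x " + e.getKey());
            }
        }
    }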

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote:

 Thanks!

 -Original message-
  From:Erick Erickson erickerick...@gmail.com
  Sent: Friday 6th September 2013 16:20
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  Markus:
 
  See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
  On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
 
   Hi Mark,
  
   Got an issue to watch?
  
   Thanks,
   Markus
  
   -Original message-
From:Mark Miller markrmil...@gmail.com
Sent: Wednesday 4th September 2013 16:55
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume
   
I'm going to try and fix the root cause for 4.5 - I've suspected
 what it
   is since early this year, but it's never personally been an issue, so
 it's
   rolled along for a long time.
   
Mark
   
Sent from my iPhone
   
On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
   wrote:
   
 Hey guys,

 I am looking into an issue we've been having with SolrCloud since
 the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't
 tested
   4.4.0
 yet). I've noticed other users with this same issue, so I'd really
   like to
 get to the bottom of it.

 Under a very, very high rate of updates (2000+/sec), after 1-12
 hours
   we
 see stalled transactions that snowball to consume all Jetty
 threads in
   the
 JVM. This eventually causes the JVM to hang with most threads
 waiting
   on
 the condition/stack provided at the bottom of this message. At this
   point
 SolrCloud instances then start to see their neighbors (who also
 have
   all
 threads hung) as down w/Connection Refused, and the shards become
   down
 in state. Sometimes a node or two survives and just returns 503s
 no
   server
 hosting shard errors.

 As a workaround/experiment, we have tuned the number of threads
 sending
 updates to Solr, as well as the batch size (we batch updates from
   client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also
 did not
 help. Certain combinations of update threads and batch sizes seem
 to
 mask/help the problem, but not resolve it entirely.

 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
   and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
 on a
   good
 day.
 - 5000 max jetty threads (well above what we use when we are
 healthy),
 Linux-user threads ulimit is 6000.
 - Occurs under Jetty 8 or 9 (many versions).
 - Occurs under Java 1.6 or 1.7 (several minor versions).
 - Occurs under several JVM tunings.
 - Everything seems to point to Solr itself, and not a Jetty or Java
   version
 (I hope I'm wrong).

 The stack trace that is holding up all my Jetty QTP threads is the
 following, which seems to be waiting on a lock that I would very
 much
   like
 to understand further:

 java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
 java.util.concurrent.Semaphore$NonfairSync)
at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:186

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than
allowing for some more threads to possibly run at once. If I had to guess, I
would say it's likely this patch fixes the deadlock issue and you're seeing
another issue - which looks like the system cannot keep up with the requests or
something for some reason - perhaps due to some OS networking settings or
something (more guessing). Connection refused happens generally when there is
nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage 
spikes or something like that?

Clamping down further on the overall number of threads might help (which would
require making something configurable). How many nodes are listed in zk under
live_nodes?

Mark

Sent from my iPhone
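
On the thread-clamping suggestion above: in embedded Jetty 9 the worker pool
can be capped roughly as below. This is only a sketch of the general mechanism
with placeholder sizes and port; a packaged Solr install would set the
equivalent limits through its own jetty.xml rather than code.

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    // Bounds the worker pool that handles all HTTP (and thus all /update) traffic.
    public class CappedJetty {
        public static void main(String[] args) throws Exception {
            QueuedThreadPool pool = new QueuedThreadPool(500, 10); // max 500, min 10 threads (placeholders)
            Server server = new Server(pool);
            ServerConnector http = new ServerConnector(server);
            http.setPort(8983); // placeholder port
            server.addConnector(http);
            server.start();
            server.join();
        }
    }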

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com wrote:

 Hey guys,
 
 (copy of my post to SOLR-5216)
 
 We tested this patch and unfortunately encountered some serious issues after a
 few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
 writing about 5000 docs/sec total, using autoCommit to commit the updates
 (no explicit commits).
 
 Our environment:
 
Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
 
 After about 6 hours of stress-testing this patch, we see many of these
 stalled transactions (below), and the Solr instances start to see each
 other as down, flooding our Solr logs with Connection Refused exceptions,
 and otherwise no obviously-useful logs that I could see.
 
 I did notice some stalled transactions on both /select and /update,
 however. This never occurred without this patch.
 
 Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
 Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
 
 Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
 My script normalizes the ERROR-severity stack traces and returns them in
 order of occurrence.
 
 Summary of my solr.log: http://pastebin.com/pBdMAWeb
 
 Thanks!
 
 Tim Vaillancourt
 
 
 On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote:
 
 Thanks!
 
 -Original message-
 From:Erick Erickson erickerick...@gmail.com
 Sent: Friday 6th September 2013 16:20
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 Markus:
 
 See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
 On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 
 Hi Mark,
 
 Got an issue to watch?
 
 Thanks,
 Markus
 
 -Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Wednesday 4th September 2013 16:55
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 I'm going to try and fix the root cause for 4.5 - I've suspected
 what it
 is since early this year, but it's never personally been an issue, so
 it's
 rolled along for a long time.
 
 Mark
 
 Sent from my iPhone
 
 On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
 Hey guys,
 
 I am looking into an issue we've been having with SolrCloud since
 the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't
 tested
 4.4.0
 yet). I've noticed other users with this same issue, so I'd really
 like to
 get to the bottom of it.
 
 Under a very, very high rate of updates (2000+/sec), after 1-12
 hours
 we
 see stalled transactions that snowball to consume all Jetty
 threads in
 the
 JVM. This eventually causes the JVM to hang with most threads
 waiting
 on
 the condition/stack provided at the bottom of this message. At this
 point
 SolrCloud instances then start to see their neighbors (who also
 have
 all
 threads hung) as down w/Connection Refused, and the shards become
 down
 in state. Sometimes a node or two survives and just returns 503s
 no
 server
 hosting shard errors.
 
 As a workaround/experiment, we have tuned the number of threads
 sending
 updates to Solr, as well as the batch size (we batch updates from
 client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also
 did not
 help. Certain combinations of update threads and batch sizes seem
 to
 mask/help the problem, but not resolve it entirely.
 
 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
 and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
 on a
 good
 day.
 - 5000 max jetty threads (well above what we use when we are
 healthy),
 Linux-user threads

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
 anything interesting change with the rest of the system? CPU
 usage spikes or something like that?
 
 Clamping down further on the overall number of threads night help (which
 would require making something configurable). How many nodes are listed in
 zk under live_nodes?
 
 Mark
 
 Sent from my iPhone
 
 On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
 Hey guys,
 
 (copy of my post to SOLR-5216)
 
 We tested this patch and unfortunately encountered some serious issues a
 few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
 are
 writing about 5000 docs/sec total, using autoCommit to commit the updates
 (no explicit commits).
 
 Our environment:
 
   Solr 4.3.1 w/SOLR-5216 patch.
   Jetty 9, Java 1.7.
   3 solr instances, 1 per physical server.
   1 collection.
   3 shards.
   2 replicas (each instance is a leader and a replica).
   Soft autoCommit is 1000ms.
   Hard autoCommit is 15000ms.
 
 After about 6 hours of stress-testing this patch, we see many of these
 stalled transactions (below), and the Solr instances start to see each
 other as down, flooding our Solr logs with Connection Refused
 exceptions,
 and otherwise no obviously-useful logs that I could see.
 
 I did notice some stalled transactions on both /select and /update,
 however. This never occurred without this patch.
 
 Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
 Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
 
 Lastly, I have a summary of the ERROR-severity logs from this 24-hour
 soak.
 My script normalizes the ERROR-severity stack traces and returns them
 in
 order of occurrence.
 
 Summary of my solr.log: http://pastebin.com/pBdMAWeb
 
 Thanks!
 
 Tim Vaillancourt
 
 
 On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
 Thanks!
 
 -Original message-
 From:Erick Erickson erickerick...@gmail.com
 Sent: Friday 6th September 2013 16:20
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 Markus:
 
 See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
 On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 
 Hi Mark,
 
 Got an issue to watch?
 
 Thanks,
 Markus
 
 -Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Wednesday 4th September 2013 16:55
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 I'm going to try and fix the root cause for 4.5 - I've suspected
 what it
 is since early this year, but it's never personally been an issue, so
 it's
 rolled along for a long time.
 
 Mark
 
 Sent from my iPhone
 
 On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
 Hey guys,
 
 I am looking into an issue we've been having with SolrCloud since
 the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't
 tested
 4.4.0
 yet). I've noticed other users with this same issue, so I'd really
 like to
 get to the bottom of it.
 
 Under a very, very high rate of updates (2000+/sec), after 1-12
 hours
 we
 see stalled transactions that snowball to consume all Jetty
 threads in
 the
 JVM. This eventually causes the JVM to hang with most threads
 waiting
 on
 the condition/stack provided at the bottom of this message. At this
 point
 SolrCloud instances then start to see their neighbors (who also
 have
 all
 threads hung) as down w/Connection Refused, and the shards become
 down
 in state. Sometimes a node or two survives and just returns 503s
 no
 server
 hosting shard errors.
 
 As a workaround/experiment, we have tuned the number of threads
 sending
 updates to Solr, as well as the batch size (we batch updates from
 client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also
 did not
 help. Certain combinations of update threads and batch sizes seem
 to
 mask/help the problem, but not resolve it entirely.
 
 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
 and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
 on a
 good
 day.
 - 5000 max jetty threads (well above what we use when we are
 healthy),
 Linux-user threads ulimit is 6000.
 - Occurs under Jetty 8 or 9 (many versions).
 - Occurs under Java 1.6 or 1.7 (several minor versions).
 - Occurs under several JVM tunings.
 - Everything seems to point to Solr itself, and not a Jetty or Java
 version
 (I hope I'm wrong).
 
 The stack trace that is holding up all my Jetty QTP threads is the
 following, which seems to be waiting on a lock that I would very
 much
 like
 to understand further:
 
 java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for  0x0007216e68d8

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
 
  (copy of my post to SOLR-5216)
 
  We tested this patch and unfortunately encountered some serious issues after a
  few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
 are
  writing about 5000 docs/sec total, using autoCommit to commit the updates
  (no explicit commits).
 
  Our environment:
 
 Solr 4.3.1 w/SOLR-5216 patch.
 Jetty 9, Java 1.7.
 3 solr instances, 1 per physical server.
 1 collection.
 3 shards.
 2 replicas (each instance is a leader and a replica).
 Soft autoCommit is 1000ms.
 Hard autoCommit is 15000ms.
 
  After about 6 hours of stress-testing this patch, we see many of these
  stalled transactions (below), and the Solr instances start to see each
  other as down, flooding our Solr logs with Connection Refused
 exceptions,
  and otherwise no obviously-useful logs that I could see.
 
  I did notice some stalled transactions on both /select and /update,
  however. This never occurred without this patch.
 
  Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
  Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
 
  Lastly, I have a summary of the ERROR-severity logs from this 24-hour
 soak.
  My script normalizes the ERROR-severity stack traces and returns them
 in
  order of occurrence.
 
  Summary of my solr.log: http://pastebin.com/pBdMAWeb
 
  Thanks!
 
  Tim Vaillancourt
 
 
  On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Thanks!
 
  -Original message-
  From:Erick Erickson erickerick...@gmail.com
  Sent: Friday 6th September 2013 16:20
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  Markus:
 
  See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
  On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
 
  Hi Mark,
 
  Got an issue to watch?
 
  Thanks,
  Markus
 
  -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 4th September 2013 16:55
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  I'm going to try and fix the root cause for 4.5 - I've suspected
  what it
  is since early this year, but it's never personally been an issue, so
  it's
  rolled along for a long time.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
  wrote:
 
  Hey guys,
 
  I am looking into an issue we've been having with SolrCloud since
  the
  beginning of our testing, all the way from 4.1 to 4.3 (haven't
  tested
  4.4.0
  yet). I've noticed other users with this same issue, so I'd really
  like to
  get to the bottom of it.
 
  Under a very, very high rate of updates (2000+/sec), after 1-12
  hours
  we
  see stalled transactions that snowball to consume all Jetty
  threads in
  the
  JVM. This eventually causes the JVM to hang with most threads
  waiting
  on
  the condition/stack provided at the bottom of this message. At this
  point
  SolrCloud instances then start to see their neighbors (who also
  have
  all
  threads hung) as down w/Connection Refused, and the shards become
  down
  in state. Sometimes a node or two survives and just returns 503s
  no
  server
  hosting shard errors.
 
  As a workaround/experiment, we have tuned the number of threads
  sending
  updates to Solr, as well as the batch size (we batch updates from
  client -
  solr), and the Soft/Hard autoCommits, all to no avail. Turning off
  Client-to-Solr batching (1 update = 1 call to Solr), which also
  did not
  help. Certain combinations of update threads and batch sizes seem
  to
  mask/help the problem, but not resolve it entirely.
 
  Our current environment is the following:
  - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
  - 3 x Zookeeper instances, external Java 7 JVM.
  - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
  shard
  and
  a replica of 1 shard).
  - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
  on a
  good
  day.
  - 5000 max jetty threads (well above what we use when we are
  healthy),
  Linux-user threads ulimit is 6000.
  - Occurs under Jetty 8 or 9 (many versions).
  - Occurs under Java 1.6 or 1.7 (several minor versions).
  - Occurs under several JVM tunings.
  - Everything seems to point to Solr itself, and not a Jetty or Java
  version
  (I hope I'm wrong).
 
  The stack trace that is holding up all my Jetty QTP threads is the
  following, which seems to be waiting on a lock that I would very
  much
  like
  to understand further:
 
  java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
  java.util.concurrent.Semaphore$NonfairSync)
at
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
 cannot keep up with
 the
  requests or something for some reason - perhaps due to some OS
 networking
  settings or something (more guessing). Connection refused happens
 generally
  when there is nothing listening on the port.
 
  Do you see anything interesting change with the rest of the system? CPU
  usage spikes or something like that?
 
  Clamping down further on the overall number of threads might help (which
  would require making something configurable). How many nodes are listed
 in
  zk under live_nodes?
 
  Mark
 
  Sent from my iPhone
 
  On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com
  wrote:
 
  Hey guys,
 
  (copy of my post to SOLR-5216)
 
  We tested this patch and unfortunately encountered some serious issues after
 a
  few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
  are
  writing about 5000 docs/sec total, using autoCommit to commit the
 updates
  (no explicit commits).
 
  Our environment:
 
Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
 
  After about 6 hours of stress-testing this patch, we see many of these
  stalled transactions (below), and the Solr instances start to see each
  other as down, flooding our Solr logs with Connection Refused
  exceptions,
  and otherwise no obviously-useful logs that I could see.
 
  I did notice some stalled transactions on both /select and /update,
  however. This never occurred without this patch.
 
  Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
  Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
 
  Lastly, I have a summary of the ERROR-severity logs from this 24-hour
  soak.
  My script normalizes the ERROR-severity stack traces and returns them
  in
  order of occurrence.
 
  Summary of my solr.log: http://pastebin.com/pBdMAWeb
 
  Thanks!
 
  Tim Vaillancourt
 
 
  On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io
  wrote:
 
  Thanks!
 
  -Original message-
  From:Erick Erickson erickerick...@gmail.com
  Sent: Friday 6th September 2013 16:20
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  Markus:
 
  See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
  On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
 
  Hi Mark,
 
  Got an issue to watch?
 
  Thanks,
  Markus
 
  -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 4th September 2013 16:55
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  I'm going to try and fix the root cause for 4.5 - I've suspected
  what it
  is since early this year, but it's never personally been an issue,
 so
  it's
  rolled along for a long time.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 
  wrote:
 
  Hey guys,
 
  I am looking into an issue we've been having with SolrCloud since
  the
  beginning of our testing, all the way from 4.1 to 4.3 (haven't
  tested
  4.4.0
  yet). I've noticed other users with this same issue, so I'd really
  like to
  get to the bottom of it.
 
  Under a very, very high rate of updates (2000+/sec), after 1-12
  hours
  we
  see stalled transactions that snowball to consume all Jetty
  threads in
  the
  JVM. This eventually causes the JVM to hang with most threads
  waiting
  on
  the condition/stack provided at the bottom of this message. At
 this
  point
  SolrCloud instances then start to see their neighbors (who also
  have
  all
  threads hung) as down w/Connection Refused, and the shards
 become
  down
  in state. Sometimes a node or two survives and just returns 503s
  no
  server
  hosting shard errors.
 
  As a workaround/experiment, we have tuned the number of threads
  sending
  updates to Solr, as well as the batch size (we batch updates from
  client -
  solr), and the Soft/Hard autoCommits, all to no avail. Turning off
  Client-to-Solr batching (1 update = 1 call to Solr), which also
  did not
  help. Certain combinations of update threads and batch sizes seem
  to
  mask/help the problem, but not resolve it entirely.
 
  Our current environment is the following:
  - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
  - 3 x Zookeeper instances, external Java 7 JVM.
  - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
  shard
  and
  a replica of 1 shard).
  - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
  on a
  good
  day.
  - 5000 max jetty threads (well above what we use when we are
  healthy),
  Linux-user threads ulimit is 6000.
  - Occurs under Jetty 8 or 9 (many versions).
  - Occurs under Java 1.6 or 1.7 (several minor versions).
  - Occurs under several JVM tunings.
  - Everything seems to point to Solr itself

Re: SolrCloud 4.x hangs under high update volume

2013-09-05 Thread Tim Vaillancourt
Update: It is a bit too soon to tell, but about 6 hours into testing there
are no crashes with this patch. :)

We are pushing 500 batches of 10 updates per second to the 3-node, 3-shard
cluster I mentioned above, i.e. 5000 updates per second in total.
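
For reference, here is a rough sketch of the kind of SolrJ feeder we use for
this test. The class name, zkHost string, and field names below are just
illustrative placeholders (our real harness runs several sender threads), so
treat it as a sketch of the shape of the load, not our actual code:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdateSoak {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        // 500 batches/sec x 10 docs/batch ~= 5000 docs/sec. Commits are left
        // entirely to soft/hard autoCommit, so there are no explicit commits.
        while (true) {
            long start = System.currentTimeMillis();
            for (int batch = 0; batch < 500; batch++) {
                List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>(10);
                for (int i = 0; i < 10; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", UUID.randomUUID().toString());
                    doc.addField("text", "sample payload " + i);
                    docs.add(doc);
                }
                solr.add(docs);   // one update request per 10-doc batch
            }
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed < 1000) {
                Thread.sleep(1000 - elapsed);   // crude one-second pacing
            }
        }
    }
}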

More tomorrow after a 24 hr soak!

Tim

On Wednesday, 4 September 2013, Tim Vaillancourt wrote:

 Thanks so much for the explanation Mark, I owe you one (many)!

 We have this on our high TPS cluster and will run it through its paces
 tomorrow. I'll provide any feedback I can, more soon! :D

 Cheers,

 Tim



RE: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Greg Walters
Tim,

Take a look at 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html
 and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue that 
you're reporting for a while; then I applied the patch from SOLR-4816 to my
clients and the problems went away. If you don't feel like applying the patch,
it looks like the fix should be included in the 4.5 release. Also note
that the problem happens more frequently when the replication factor is greater 
than 1.
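
For anyone wondering what that means on the client side: my understanding is
that with the SOLR-4816 patch applied, CloudSolrServer hashes each document's
uniqueKey and sends the update straight to the leader of the owning shard, so
the only change for the caller is to index through CloudSolrServer rather than
a plain HTTP client or load balancer. A minimal sketch (the zkHost string,
collection, and field names are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LeaderAwareIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer client = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "routed to its shard leader by the client");
        client.add(doc);      // no extra forward hop between Solr nodes
        client.shutdown();    // commits still come from autoCommit
    }
}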

Thanks,
Greg

-Original Message-
From: Tim Vaillancourt [mailto:t...@elementspace.com] 
Sent: Tuesday, September 03, 2013 6:31 PM
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.x hangs under high update volume

Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning 
of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've 
noticed other users with this same issue, so I'd really like to get to the 
bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see 
stalled transactions that snowball to consume all Jetty threads in the JVM. 
This eventually causes the JVM to hang with most threads waiting on the 
condition/stack provided at the bottom of this message. At this point SolrCloud 
instances then start to see their neighbors (who also have all threads hung) as 
down w/Connection Refused, and the shards become down
in state. Sometimes a node or two survives and just returns 503s no server 
hosting shard errors.

As a workaround/experiment, we have tuned the number of threads sending updates 
to Solr, as well as the batch size (we batch updates from client - solr), and 
the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching 
(1 update = 1 call to Solr), which also did not help. Certain combinations of 
update threads and batch sizes seem to mask/help the problem, but not resolve 
it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a 
replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), 
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I 
hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the following, 
which seems to be waiting on a lock that I would very much like to understand 
further:

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
I'm going to try and fix the root cause for 4.5 - I've suspected what it is 
since early this year, but it's never personally been an issue, so it's rolled 
along for a long time. 

Mark

Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com wrote:

 Hey guys,
 
 I am looking into an issue we've been having with SolrCloud since the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
 yet). I've noticed other users with this same issue, so I'd really like to
 get to the bottom of it.
 
 Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
 see stalled transactions that snowball to consume all Jetty threads in the
 JVM. This eventually causes the JVM to hang with most threads waiting on
 the condition/stack provided at the bottom of this message. At this point
 SolrCloud instances then start to see their neighbors (who also have all
 threads hung) as down w/Connection Refused, and the shards become down
 in state. Sometimes a node or two survives and just returns 503s no server
 hosting shard errors.
 
 As a workaround/experiment, we have tuned the number of threads sending
 updates to Solr, as well as the batch size (we batch updates from client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also did not
 help. Certain combinations of update threads and batch sizes seem to
 mask/help the problem, but not resolve it entirely.
 
 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
 day.
 - 5000 max jetty threads (well above what we use when we are healthy),
 Linux-user threads ulimit is 6000.
 - Occurs under Jetty 8 or 9 (many versions).
 - Occurs under Java 1.6 or 1.7 (several minor versions).
 - Occurs under several JVM tunings.
 - Everything seems to point to Solr itself, and not a Jetty or Java version
 (I hope I'm wrong).
 
 The stack trace that is holding up all my Jetty QTP threads is the
 following, which seems to be waiting on a lock that I would very much like
 to understand further:
 
 java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
 java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
at
 org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
at
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
at
 org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
at
 org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
at
 org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
at
 org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at
 

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Kevin Osborn
I am having this issue as well. I did apply this patch. Unfortunately, it
did not resolve the issue in my case.


On Wed, Sep 4, 2013 at 7:01 AM, Greg Walters
 gwalt...@sherpaanalytics.com wrote:

 Tim,

 Take a look at
 http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.htmland
 https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue
 that you're reporting for a while then I applied the patch from SOLR-4816
 to my clients and the problems went away. If you don't feel like applying
 the patch it looks like it should be included in the release of version
 4.5. Also note that the problem happens more frequently when the
 replication factor is greater than 1.

 Thanks,
 Greg

 -Original Message-
 From: Tim Vaillancourt [mailto:t...@elementspace.com]
 Sent: Tuesday, September 03, 2013 6:31 PM
 To: solr-user@lucene.apache.org
 Subject: SolrCloud 4.x hangs under high update volume

 Hey guys,

 I am looking into an issue we've been having with SolrCloud since the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
 yet). I've noticed other users with this same issue, so I'd really like to
 get to the bottom of it.

 Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
 see stalled transactions that snowball to consume all Jetty threads in the
 JVM. This eventually causes the JVM to hang with most threads waiting on
 the condition/stack provided at the bottom of this message. At this point
 SolrCloud instances then start to see their neighbors (who also have all
 threads hung) as down w/Connection Refused, and the shards become down
 in state. Sometimes a node or two survives and just returns 503s no
 server hosting shard errors.

 As a workaround/experiment, we have tuned the number of threads sending
 updates to Solr, as well as the batch size (we batch updates from client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also did not
 help. Certain combinations of update threads and batch sizes seem to
 mask/help the problem, but not resolve it entirely.

 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
 day.
 - 5000 max jetty threads (well above what we use when we are healthy),
 Linux-user threads ulimit is 6000.
 - Occurs under Jetty 8 or 9 (many versions).
 - Occurs under Java 1.6 or 1.7 (several minor versions).
 - Occurs under several JVM tunings.
 - Everything seems to point to Solr itself, and not a Jetty or Java
 version (I hope I'm wrong).

 The stack trace that is holding up all my Jetty QTP threads is the
 following, which seems to be waiting on a lock that I would very much like
 to understand further:

 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x0007216e68d8 (a
 java.util.concurrent.Semaphore$NonfairSync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
 at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
 at

 org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
 at

 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
 at

 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
 at

 org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
 at

 org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
 at

 org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
 at

 org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
 at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at

 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
There is an issue if I remember right, but I can't find it right now.

If anyone that has the problem could try this patch, that would be very
helpful: http://pastebin.com/raw.php?i=aaRWwSGP

- Mark


On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi Mark,

 Got an issue to watch?

 Thanks,
 Markus

 -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 4th September 2013 16:55
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  I'm going to try and fix the root cause for 4.5 - I've suspected what it
 is since early this year, but it's never personally been an issue, so it's
 rolled along for a long time.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
   Hey guys,
  
   I am looking into an issue we've been having with SolrCloud since the
   beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
 4.4.0
   yet). I've noticed other users with this same issue, so I'd really
 like to
   get to the bottom of it.
  
   Under a very, very high rate of updates (2000+/sec), after 1-12 hours
 we
   see stalled transactions that snowball to consume all Jetty threads in
 the
   JVM. This eventually causes the JVM to hang with most threads waiting
 on
   the condition/stack provided at the bottom of this message. At this
 point
   SolrCloud instances then start to see their neighbors (who also have
 all
   threads hung) as down w/Connection Refused, and the shards become
 down
   in state. Sometimes a node or two survives and just returns 503s no
 server
   hosting shard errors.
  
   As a workaround/experiment, we have tuned the number of threads sending
   updates to Solr, as well as the batch size (we batch updates from
 client -
   solr), and the Soft/Hard autoCommits, all to no avail. Turning off
   Client-to-Solr batching (1 update = 1 call to Solr), which also did not
   help. Certain combinations of update threads and batch sizes seem to
   mask/help the problem, but not resolve it entirely.
  
   Our current environment is the following:
   - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
   - 3 x Zookeeper instances, external Java 7 JVM.
   - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
 and
   a replica of 1 shard).
   - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
 good
   day.
   - 5000 max jetty threads (well above what we use when we are healthy),
   Linux-user threads ulimit is 6000.
   - Occurs under Jetty 8 or 9 (many versions).
   - Occurs under Java 1.6 or 1.7 (several minor versions).
   - Occurs under several JVM tunings.
   - Everything seems to point to Solr itself, and not a Jetty or Java
 version
   (I hope I'm wrong).
  
   The stack trace that is holding up all my Jetty QTP threads is the
   following, which seems to be waiting on a lock that I would very much
 like
   to understand further:
  
   java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for  0x0007216e68d8 (a
   java.util.concurrent.Semaphore$NonfairSync)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
  at
  
 org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
  at
  
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
  at
  
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
  at
  
 org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
  at
  
 org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
  at
  
 org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
  at
  
 org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
  at
  
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
  at
  
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Tim Vaillancourt
Thanks guys! :)

Mark: this patch is much appreciated, I will try to test this shortly,
hopefully today.

For my curiosity/understanding, could someone explain to me quickly what
locks SolrCloud takes on updates? Was I on to something in thinking that more
shards decrease the chance of locking?

Secondly, I was wondering if someone could summarize what this patch
'fixes'? I'm not too familiar with Java and the solr codebase (working on
that though :D).

Cheers,

Tim



On 4 September 2013 09:52, Mark Miller markrmil...@gmail.com wrote:

 There is an issue if I remember right, but I can't find it right now.

 If anyone that has the problem could try this patch, that would be very
 helpful: http://pastebin.com/raw.php?i=aaRWwSGP

 - Mark


 On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma markus.jel...@openindex.io
 wrote:

  Hi Mark,
 
  Got an issue to watch?
 
  Thanks,
  Markus
 
  -Original message-
   From:Mark Miller markrmil...@gmail.com
   Sent: Wednesday 4th September 2013 16:55
   To: solr-user@lucene.apache.org
   Subject: Re: SolrCloud 4.x hangs under high update volume
  
   I'm going to try and fix the root cause for 4.5 - I've suspected what
 it
  is since early this year, but it's never personally been an issue, so
 it's
  rolled along for a long time.
  
   Mark
  
   Sent from my iPhone
  
   On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
  wrote:
  
Hey guys,
   
I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
  4.4.0
yet). I've noticed other users with this same issue, so I'd really
  like to
get to the bottom of it.
   
Under a very, very high rate of updates (2000+/sec), after 1-12 hours
  we
see stalled transactions that snowball to consume all Jetty threads
 in
  the
JVM. This eventually causes the JVM to hang with most threads waiting
  on
the condition/stack provided at the bottom of this message. At this
  point
SolrCloud instances then start to see their neighbors (who also have
  all
threads hung) as down w/Connection Refused, and the shards become
  down
in state. Sometimes a node or two survives and just returns 503s no
  server
hosting shard errors.
   
As a workaround/experiment, we have tuned the number of threads
 sending
updates to Solr, as well as the batch size (we batch updates from
  client -
solr), and the Soft/Hard autoCommits, all to no avail. Turning off
Client-to-Solr batching (1 update = 1 call to Solr), which also did
 not
help. Certain combinations of update threads and batch sizes seem to
mask/help the problem, but not resolve it entirely.
   
Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
  and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
  good
day.
- 5000 max jetty threads (well above what we use when we are
 healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java
  version
(I hope I'm wrong).
   
The stack trace that is holding up all my Jetty QTP threads is the
following, which seems to be waiting on a lock that I would very much
  like
to understand further:
   
java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  0x0007216e68d8 (a
java.util.concurrent.Semaphore$NonfairSync)
   at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   at
   
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
   at
   
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
   at
   
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
   at
   
 
 org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
   at
   
 
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
   at
   
 
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
   at
   
 
 org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
   at
   
 
 org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
   at
   
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
The 'lock' or semaphore was added to cap the number of threads that would be 
used. Previously, the number of threads in use could spike to many, many 
thousands on heavy updates. A limit on the number of outstanding requests was 
put in place to keep this from happening. Something like 16 * the number of 
hosts in the cluster.

I assume the deadlock comes from the fact that requests are of two kinds: 
forwards to the leader, and distrib updates from the leader to replicas. A 
forward to the leader actually waits for the leader to distrib the updates to 
the replicas before returning. I believe this is what can lead to deadlock. 

This is likely why the patch for the CloudSolrServer can help the situation - 
it removes the need to forward to the leader because it sends to the correct 
leader to begin with. Only useful if you are adding docs with CloudSolrServer 
though, and more like a workaround than a fix.

The patch uses a separate 'limiting' semaphore for the two cases.
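
To make that concrete, here is a toy sketch of the throttling scheme. This is
my own illustration, not the actual SolrCmdDistributor code or the patch: it
only shows why a single shared cap can produce a circular wait between the two
request types, and how splitting the cap into two semaphores avoids it.

import java.util.concurrent.Semaphore;

public class UpdateThrottle {
    // Separate caps for the two kinds of update traffic. With one shared
    // semaphore, a thread holding a permit for a forward-to-leader request
    // blocks until the leader finishes its distrib-to-replica requests,
    // which may themselves be stuck waiting for permits from the same pool.
    private final Semaphore forwardPermits;   // node -> shard leader forwards
    private final Semaphore distribPermits;   // leader -> replica updates

    public UpdateThrottle(int hostCount) {
        int cap = 16 * hostCount;   // "something like 16 * the number of hosts"
        this.forwardPermits = new Semaphore(cap);
        this.distribPermits = new Semaphore(cap);
    }

    public void submitForward(Runnable request) throws InterruptedException {
        forwardPermits.acquire();
        try {
            request.run();   // waits for the leader's own distrib fan-out
        } finally {
            forwardPermits.release();
        }
    }

    public void submitDistrib(Runnable request) throws InterruptedException {
        distribPermits.acquire();
        try {
            request.run();   // leader sending the update to a replica
        } finally {
            distribPermits.release();
        }
    }
}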

- Mark

On Sep 4, 2013, at 10:22 AM, Tim Vaillancourt t...@elementspace.com wrote:

 Thanks guys! :)
 
 Mark: this patch is much appreciated, I will try to test this shortly, 
 hopefully today.
 
 For my curiosity/understanding, could someone explain to me quickly what 
 locks SolrCloud takes on updates? Was I on to something that more shards 
 decrease the chance for locking?
 
 Secondly, I was wondering if someone could summarize what this patch 'fixes'? 
 I'm not too familiar with Java and the solr codebase (working on that though 
 :D).
 
 Cheers,
 
 Tim
 
 
 
 On 4 September 2013 09:52, Mark Miller markrmil...@gmail.com wrote:
 There is an issue if I remember right, but I can't find it right now.
 
 If anyone that has the problem could try this patch, that would be very
 helpful: http://pastebin.com/raw.php?i=aaRWwSGP
 
 - Mark
 
 
 On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma 
  markus.jel...@openindex.io wrote:
 
  Hi Mark,
 
  Got an issue to watch?
 
  Thanks,
  Markus
 
  -Original message-
   From:Mark Miller markrmil...@gmail.com
   Sent: Wednesday 4th September 2013 16:55
   To: solr-user@lucene.apache.org
   Subject: Re: SolrCloud 4.x hangs under high update volume
  
   I'm going to try and fix the root cause for 4.5 - I've suspected what it
  is since early this year, but it's never personally been an issue, so it's
  rolled along for a long time.
  
   Mark
  
   Sent from my iPhone
  
   On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
  wrote:
  
Hey guys,
   
I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
  4.4.0
yet). I've noticed other users with this same issue, so I'd really
  like to
get to the bottom of it.
   
Under a very, very high rate of updates (2000+/sec), after 1-12 hours
  we
see stalled transactions that snowball to consume all Jetty threads in
  the
JVM. This eventually causes the JVM to hang with most threads waiting
  on
the condition/stack provided at the bottom of this message. At this
  point
SolrCloud instances then start to see their neighbors (who also have
  all
threads hung) as down w/Connection Refused, and the shards become
  down
in state. Sometimes a node or two survives and just returns 503s no
  server
hosting shard errors.
   
As a workaround/experiment, we have tuned the number of threads sending
updates to Solr, as well as the batch size (we batch updates from
  client -
solr), and the Soft/Hard autoCommits, all to no avail. Turning off
Client-to-Solr batching (1 update = 1 call to Solr), which also did not
help. Certain combinations of update threads and batch sizes seem to
mask/help the problem, but not resolve it entirely.
   
Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
  and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
  good
day.
- 5000 max jetty threads (well above what we use when we are healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java
  version
(I hope I'm wrong).
   
The stack trace that is holding up all my Jetty QTP threads is the
following, which seems to be waiting on a lock that I would very much
  like
to understand further:
   
java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  0x0007216e68d8 (a
java.util.concurrent.Semaphore$NonfairSync)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186

RE: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Markus Jelsma
Hi Mark,

Got an issue to watch?

Thanks,
Markus
 
-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Wednesday 4th September 2013 16:55
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 I'm going to try and fix the root cause for 4.5 - I've suspected what it is 
 since early this year, but it's never personally been an issue, so it's 
 rolled along for a long time. 
 
 Mark
 
 Sent from my iPhone
 
 On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com wrote:
 
  Hey guys,
  
  I am looking into an issue we've been having with SolrCloud since the
  beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
  yet). I've noticed other users with this same issue, so I'd really like to
  get to the bottom of it.
  
  Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
  see stalled transactions that snowball to consume all Jetty threads in the
  JVM. This eventually causes the JVM to hang with most threads waiting on
  the condition/stack provided at the bottom of this message. At this point
  SolrCloud instances then start to see their neighbors (who also have all
  threads hung) as down w/Connection Refused, and the shards become down
  in state. Sometimes a node or two survives and just returns 503s no server
  hosting shard errors.
  
  As a workaround/experiment, we have tuned the number of threads sending
  updates to Solr, as well as the batch size (we batch updates from client -
  solr), and the Soft/Hard autoCommits, all to no avail. Turning off
  Client-to-Solr batching (1 update = 1 call to Solr), which also did not
  help. Certain combinations of update threads and batch sizes seem to
  mask/help the problem, but not resolve it entirely.
  
  Our current environment is the following:
  - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
  - 3 x Zookeeper instances, external Java 7 JVM.
  - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
  a replica of 1 shard).
  - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
  day.
  - 5000 max jetty threads (well above what we use when we are healthy),
  Linux-user threads ulimit is 6000.
  - Occurs under Jetty 8 or 9 (many versions).
  - Occurs under Java 1.6 or 1.7 (several minor versions).
  - Occurs under several JVM tunings.
  - Everything seems to point to Solr itself, and not a Jetty or Java version
  (I hope I'm wrong).
  
  The stack trace that is holding up all my Jetty QTP threads is the
  following, which seems to be waiting on a lock that I would very much like
  to understand further:
  
  java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x0007216e68d8 (a
  java.util.concurrent.Semaphore$NonfairSync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at
  java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
 at
  java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
 at
  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
 at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
 at
  org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
 at
  org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
 at
  org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
 at
  org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
 at
  org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
 at
  org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
 at
  org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
 at
  org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
 at
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
 at
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
 at
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
 at
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at
  org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
 at
  org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
 at
  org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
 at
  org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Tim Vaillancourt
Thanks so much for the explanation Mark, I owe you one (many)!

We have this on our high TPS cluster and will run it through its paces
tomorrow. I'll provide any feedback I can, more soon! :D

Cheers,

Tim


SolrCloud 4.x hangs under high update volume

2013-09-03 Thread Tim Vaillancourt
Hey guys,

I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
yet). I've noticed other users with this same issue, so I'd really like to
get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
see stalled transactions that snowball to consume all Jetty threads in the
JVM. This eventually causes the JVM to hang with most threads waiting on
the condition/stack provided at the bottom of this message. At this point
SolrCloud instances then start to see their neighbors (who also have all
threads hung) as "down" w/Connection Refused, and the shards become "down"
in state. Sometimes a node or two survives and just returns 503s "no server
hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending
updates to Solr, as well as the batch size (we batch updates from client ->
solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning off
Client-to-Solr batching (1 update = 1 call to Solr), which did not
help. Certain combinations of update threads and batch sizes seem to
mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
day.
- 5000 max jetty threads (well above what we use when we are healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version
(I hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the
following, which seems to be waiting on a lock that I would very much like
to understand further:

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at