Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
We are planning an upgrade to 4.4, but it's still weeks out. We offer a
high-availability search service, and a number of changes in 4.4 are not
backward compatible (e.g. clusterstate.json and no solr.xml), so there must
be lots of testing. Additionally, this upgrade cannot be performed without
downtime.

Regardless, I need to find a band-aid right now.  Does anyone know if it's
possible to set the timeout for distributed update requests to/from the
leader?  Currently we see it's set to 0.  Maybe via a -D startup param, or
something?

Jed

On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Jed,

This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
that is about to be released.  We did not have fun working with 4.0 in
SolrCloud mode a few months ago.  You will save time, hair, and money
if you convince your manager to let you use Solr 4.4. :)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
 Hi Shawn,

 I have been trying to duplicate this problem without success for the
 last 2 weeks, which is one reason I'm getting flustered.  It seems
 reasonable to be able to duplicate it, but I can't.

 We do have a story to upgrade, but it is still weeks if not months
 before that gets rolled out to production.

 We have another cluster running the same version, with 8 shards and
 8 replicas, each shard at 100GB, and with more load and more indexing
 requests, and it does not have this problem.  There we send docs in
 batches and all fields are stored, whereas the trouble index has only
 1 or 2 stored fields and we send docs only 1 at a time.

 Could that have anything to do with it?

 Jed


 Sent from Samsung Mobile



  Original message 
 From: Shawn Heisey s...@elyograg.org
 Date: 07.09.2013 18:33 (GMT+01:00)
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Hangs During Updates for over 10 minutes


 On 7/9/2013 9:50 AM, Jed Glazner wrote:
 I'll give you the high level before delving deep into setup etc. I
 have been struggling at work with a seemingly random problem where Solr
 will hang for 10-15 minutes during updates.  This outage always seems
 to be immediately preceded by an EOF exception on the replica.  Then
 10-15 minutes later we see an exception on the leader for a socket
 timeout to the replica.  The leader will then tell the replica to
 recover, which in most cases it does, and then the outage is over.

 Here are the setup details:

 We are currently using Solr 4.0.0 with an external ZK ensemble of 5
 machines.

 After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
 and have since been fixed.  You're five releases and about nine months
 behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your
 configuration is up to date with changes to the example config between
 4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
 testbed, duplicate your current problem, and upgrade the testbed to see
 if the problem goes away.  A testbed will also give you practice for a
 smooth upgrade of your production system.

 Thanks,
 Shawn




Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Daniel Collins
We had something similar in terms of update times suddenly spiking up for
no obvious reason.  We never got quite as bad as you in terms of the other
knock on effects, but we certainly saw updates jumping from 10ms up to
3ms, all our external queues backed up and we rejected some updates,
then after a while things quietened down.

We were running Solr 4.3.0, but with Java 6 and the CMS GC.  We swapped to
Java 7 and the G1 GC (and increased heap size from 8GB to 12GB) and the
problem went away.

Now, I admit it's not exactly the same as your case, and we never had the
follow-on effects, but I'd consider Java 7 and the G1 GC; it has certainly
reduced the spikes in our indexing times.

We run the following settings now (the usual caveats apply, it might not
work for you).

GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000

I set MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
application pauses.  That's our goal: if we have to use more memory in the
short term then so be it, but we can't afford application pauses,
because we are using NRT (soft commits every 1s, hard commits every 60s)
and we get a lot of updates.

I know there have been other discussions on G1 and it has received mixed
results overall, but for us it seems to be a winner.
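
For anyone trying a collector switch like this, it can help to confirm
which collectors the JVM is actually running and how much time they have
accumulated.  The following is only a minimal sketch using the standard
java.lang.management API; the class name GcReport is made up for
illustration, and the reported time is the JVM's own approximate
accumulated collection time, not a precise pause measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: lists the collectors active in this JVM.
public class GcReport {

    /** Returns one line per collector: name, collection count, accumulated time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: collections=%d, totalTimeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // With -XX:+UseG1GC the names typically start with "G1".
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

Run inside the same JVM as the application (or read the same beans over
JMX) before and after an indexing spike; a large jump in totalTimeMs
around a spike would point towards GC rather than the network.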

Hope that helps,






Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
Hey Daniel,

Thanks for the response.  I think we'll give this a try to see if this
helps.

Jed.






Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Erick Erickson
Jed:

I'm not sure changing the Java runtime is any less scary than upgrading Solr.

Wait, I know! Ask your manager if you can do both at once (evil smirk). I have
a t-shirt that says "I don't test, but when I do it's in production..."

Erick


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner






Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Shawn Heisey

On 7/10/2013 6:57 AM, Jed Glazner wrote:

So we'll do what we can quickly to see if we can 'band-aid' the problem
until we can upgrade to Solr 4.4.  Speaking of band-aids, does anyone know
of a way to change the socket timeout/connection timeout for distributed
updates?


If you need to change HttpClient parameters for CloudSolrServer, here's 
how you can do it:


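import org.apache.http.client.HttpClient;
import org.apache.solr.client.solrj.ResponseParser;
import org.apache.solr.client.solrj.impl.BinaryResponseParser;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

// ZooKeeper ensemble plus chroot, as a single connection string
String zkHost =
    "zk1.REDACTED.com:2181,zk2.REDACTED.com:2181,zk3.REDACTED.com:2181/chroot";

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
params.set(HttpClientUtil.PROP_SO_TIMEOUT, 30);            // socket read timeout, ms
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 5000);  // connect timeout, ms
HttpClient client = HttpClientUtil.createClient(params);
ResponseParser parser = new BinaryResponseParser();
// The load balancer and the cloud client share the tuned HttpClient
LBHttpSolrServer lbServer = new LBHttpSolrServer(client, parser);
CloudSolrServer server = new CloudSolrServer(zkHost, lbServer);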

Thanks,
Shawn



Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
Hi Shawn, this code is for the SolrJ lib, which we already use.

I'm talking about Solr's internal communication from leader to replica via the 
DistributedCmdUpdate class.  I want to force the leader to time out after a 
fixed period instead of waiting for 15 minutes for the server to figure out 
that the other end of the socket was closed.  I don't know of any flags or 
settings in solrconfig.xml to do this, or if it's even possible without 
modifying source code.
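
To make the difference between a timeout of 0 and a fixed timeout
concrete, here is a small sketch using only plain java.net sockets.  It
is not Solr code and does not change any Solr setting; the class and
method names are invented for illustration.  The local listener stands
in for a replica that has gone silent, and the client socket stands in
for the leader's side of the connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: shows how SO_TIMEOUT bounds a blocking read.
public class ReadTimeoutDemo {

    /**
     * Connects to a local listener that never sends a byte, then tries to
     * read with the given SO_TIMEOUT. Returns true if the read gave up with
     * a SocketTimeoutException. A value of 0 is rejected here because it
     * means "block indefinitely".
     */
    static boolean readTimesOut(int soTimeoutMs) {
        if (soTimeoutMs <= 0) {
            throw new IllegalArgumentException("use a positive timeout; 0 would block forever");
        }
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket leaderSide = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            leaderSide.setSoTimeout(soTimeoutMs);
            try {
                leaderSide.getInputStream().read(); // the peer never answers
                return false;
            } catch (SocketTimeoutException timedOut) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read gave up after 200 ms: " + readTimesOut(200));
    }
}
```

With SO_TIMEOUT left at 0 the same read would simply wait until the
operating system notices the dead connection, which matches the long
stall described above.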

Jed

Sent from Samsung Mobile






Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Shawn Heisey

On 7/10/2013 6:57 AM, Jed Glazner wrote:

So, while it's 'just as risky' as you say, it's 'less risky' than a new
version of java and is possible to implement without downtime.


I believe that if you update one node at a time, there should be no 
downtime.  I've not actually tried this, so it would be a very good idea 
for you to try on a testbed.



It is actually something of a pain point that the upgrade path for
SolrCloud seems to frequently require downtime (clusterstate.json changes
in 4.1, and then again this big change in 4.4 with no solr.xml).


Looking through CHANGES.txt, I cannot see any issues mentioning a format 
change in clusterstate.json except for SOLR-3815, which was fixed in 
4.0, not 4.1.  I do see some commits on that issue after 4.0 was 
released, but they would have gone into 4.2.1, not 4.1, and the 
description for one of those later commits says that it adds information 
to clusterstate.json, it doesn't say anything about changing the format. 
 What documentation or issues are you seeing regarding a format change 
in 4.1?


As far as I know, elimination of solr.xml has not happened yet, and will 
not happen in the 4.x timeframe.  There is a new solr.xml format for 
core discovery that will be used in the 4.4 example, but it is 
completely optional - you will be able to continue to use the existing 
format in all 4.x releases.  Things are likely to be different in 5.0, 
but nobody is working on actual release plans for 5.0 yet.


Thanks,
Shawn



Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Otis Gospodnetic
+1 for G1.  We just had a happy client this week switch to G1 after
seeing stop-the-world (STW) pauses with CMS.  I can't share their JVM
metrics from SPM, but I can share ours:
http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
(HBase, not Solr, but we've seen the same effect with ElasticSearch,
for example, so I'm optimistic about seeing the same effects with
Solr, too).

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm




Solr Hangs During Updates for over 10 minutes

2013-07-09 Thread Jed Glazner
I'll give you the high level before delving deep into setup etc. I have been 
struggling at work with a seemingly random problem where Solr will hang for 
10-15 minutes during updates.  This outage always seems to be immediately 
preceded by an EOF exception on the replica.  Then 10-15 minutes later we see 
an exception on the leader for a socket timeout to the replica.  The leader 
will then tell the replica to recover, which in most cases it does, and then 
the outage is over.

Here are the setup details:

We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines. 
We have 2 active collections, each with only 1 shard (we have about 15 
collections in total, but most are empty or have fewer than 100 docs). The 
first index (collection1) is 6.5GB and has ~18M documents.  The 2nd index 
(collection2) is 9GB and has about 13M documents. In all cases the leader 
resides on one server and the replica resides on the other.  Both servers are 
AWS XL High Mem instances (8 CPUs @ 2.67GHz, 70GB RAM) with the index residing 
on a 1TB RAID 10 array using ephemeral storage disks.  We are starting Solr 
using the embedded Jetty with the following Java memory and GC options:

-Xmx16382m -Xms4092m -XX:MaxPermSize=256m -Xss256k -XX:NewSize=1536m 
-XX:SurvivorRatio=16 -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC 
-XX:ParallelCMSThreads=2 -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 
-XX:+CMSParallelRemarkEnabled
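
When a long flag list like this is passed through a start script, it is
easy for a flag to get lost.  As a rough sketch (the class name
JvmSettings is hypothetical, and this uses only the standard
java.lang.management API, nothing from Solr), the effective settings can
be read back from inside the JVM:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports the heap ceiling and flags this JVM was started with.
public class JvmSettings {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Command-line arguments passed to the JVM itself (e.g. -Xmx, -XX flags). */
    static List<String> jvmFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): " + maxHeapMb());
        for (String flag : jvmFlags()) {
            System.out.println("flag: " + flag);
        }
    }
}
```

The reported maximum heap is usually somewhat below the -Xmx value, so a
small difference is not by itself a sign of a lost flag.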

Both collections receive a constant stream of updates ~10k per hour (both 
adds/deletes).  Approximately once per day the following events transpire:


 1.  We see a log entry for a distributed update that takes just over 50 
seconds (QTime=50012), followed by an EOF write exception on the replica. In 
all cases this exception is triggered by an update to the 9GB collection.
 2.  Occasionally we'll see a 503 shard update error on the leader, but usually 
not.
 3.  Approximately 15 minutes after this exception we see a timeout error for 
this distributed update request on the leader.
 4.  The leader then creates a new connection and tells the replica to recover, 
which it does, and everything is OK again.
 5.  During the 15-minute window from when the replica throws the EOF until the 
SocketTimeout on the leader, no other updates are processed.
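
Since the slow distributed update in step 1 appears to be the first
visible sign, one option is to scan the replica's request log for
leader-forwarded updates with a large QTime.  The sketch below is only
an illustration: the class name, the threshold, and the sample line
(modelled on the log entry in this message) are assumptions, and the
pattern would need adjusting to the actual log format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper: flags slow updates forwarded from the leader.
public class SlowUpdateScanner {

    private static final Pattern QTIME = Pattern.compile("\\bQTime=(\\d+)");

    /** Returns the QTime in milliseconds from a request log line, or -1 if absent. */
    static long qTimeMs(String logLine) {
        Matcher m = QTIME.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    /** True if the line is a leader-to-replica update slower than thresholdMs. */
    static boolean isSlowDistributedUpdate(String logLine, long thresholdMs) {
        return logLine.contains("path=/update")
                && logLine.contains("update.distrib=FROMLEADER")
                && qTimeMs(logLine) > thresholdMs;
    }

    public static void main(String[] args) {
        // Sample line modelled on the replica log; host name is a placeholder.
        String sample = "INFO: [collection2_0] webapp=/solr path=/update "
                + "params={distrib.from=http://host.example.com:8983/solr/collection2_0/"
                + "&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=50012";
        System.out.println("QTime ms: " + qTimeMs(sample));
        System.out.println("slow (>10 s): " + isSlowDistributedUpdate(sample, 10_000));
    }
}
```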

ERROR ON REPLICA:

Jul 8, 2013 6:38:16 PM org.apache.solr.core.SolrCore execute
INFO: [collection2_0] webapp=/solr path=/update 
params={distrib.from=http://Solr4-1-1.domain.com:8983/solr/collection2_0/&update.distrib=FROMLEADER&wt=javabin&version=2}
 status=0 QTime=50012

Jul 8, 2013 6:38:16 PM org.apache.solr.common.SolrException log
SEVERE: null:org.eclipse.jetty.io.EofException
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:154)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:101)
at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:203)
at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:196)
at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:94)
at org.apache.solr.response.BinaryResponseWriter.write(BinaryResponseWriter.java:49)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:404)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at 
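
For anyone trying to spot these events before the hang is noticed, here is a 
minimal sketch (not something from the thread) that scans a replica log for 
slow updates forwarded by the leader, such as the QTime=50012 entry above. It 
assumes the java.util.logging layout shown in this message; the function name 
and the 10 second threshold are hypothetical choices for illustration.

```python
import re
from datetime import datetime

# Header line of a java.util.logging entry, e.g.
# "Jul 8, 2013 6:38:16 PM org.apache.solr.core.SolrCore execute"
HEADER = re.compile(r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) \S+ \S+", re.M)
QTIME = re.compile(r"QTime=(\d+)")


def slow_leader_updates(log_text, threshold_ms=10000):
    """Return (timestamp, qtime_ms) for leader-forwarded updates slower than threshold_ms.

    Hypothetical helper: the threshold is an assumption, tune it to your traffic.
    """
    results = []
    # Split the log into entries; each entry starts at a timestamp header line.
    starts = [m.start() for m in HEADER.finditer(log_text)]
    for begin, end in zip(starts, starts[1:] + [len(log_text)]):
        entry = log_text[begin:end]
        # Only look at updates the leader forwarded to this replica.
        if "update.distrib=FROMLEADER" not in entry:
            continue
        qtime = QTIME.search(entry)
        if qtime and int(qtime.group(1)) > threshold_ms:
            stamp = datetime.strptime(
                HEADER.match(entry).group(1), "%b %d, %Y %I:%M:%S %p"
            )
            results.append((stamp, int(qtime.group(1))))
    return results
```

Running this over the replica log and comparing the returned timestamps with 
the SocketTimeout entries on the leader gives the length of each hang window.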

Re: Solr Hangs During Updates for over 10 minutes

2013-07-09 Thread Shawn Heisey

On 7/9/2013 9:50 AM, Jed Glazner wrote:

I'll give you the high level before delving deep into setup etc. I have been 
struggling at work with a seemingly random problem where Solr will hang for 
10-15 minutes during updates.  This outage always seems to be immediately 
preceded by an EOF exception on the replica.  Then 10-15 minutes later we see 
an exception on the leader for a socket timeout to the replica.  The leader 
will then tell the replica to recover, which in most cases it does, and then 
the outage is over.

Here are the setup details:

We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines.


After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced 
and have since been fixed.  You're five releases and about nine months 
behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your 
configuration is up to date with changes to the example config between 
4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0 
testbed, duplicate your current problem, and upgrade the testbed to see 
if the problem goes away.  A testbed will also give you practice for a 
smooth upgrade of your production system.


Thanks,
Shawn



Re: Solr Hangs During Updates for over 10 minutes

2013-07-09 Thread Jed Glazner
Hi Shawn,

I have been trying to duplicate this problem without success for the last 2 
weeks, which is one reason I'm getting flustered.   It seems reasonable to be 
able to duplicate it, but I can't.

 We do have a story to upgrade, but it is still weeks if not months before 
that gets rolled out to production.

We have another cluster running the same version but with 8 shards and 8 
replicas, with each shard at 100GB, more load and more indexing requests, and 
without this problem. There we send docs in batches and all fields are 
stored, whereas the trouble index has only 1 or 2 stored fields and we only 
send docs 1 at a time.

Could that have anything to do with it?
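
To make the difference between the two indexing patterns concrete when 
reproducing this on a testbed, here is a rough sketch of one-document-per-request 
versus batched update payloads. The helper names, the batch size, and the 
client-side timeout are hypothetical, and whether your update endpoint accepts 
a JSON array of documents is an assumption to check against your Solr version.

```python
import json
import urllib.request


def one_at_a_time(docs):
    """One JSON payload per document (the pattern used against the trouble index)."""
    return [json.dumps([doc]) for doc in docs]


def in_batches(docs, batch_size=100):
    """One JSON payload per batch_size documents (the pattern used on the healthy cluster)."""
    return [
        json.dumps(docs[i:i + batch_size])
        for i in range(0, len(docs), batch_size)
    ]


def post_update(url, payload, timeout_s=30):
    """POST one payload with an explicit client-side timeout.

    The url is supplied by the caller; nothing here is specific to a Solr release.
    """
    req = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status
```

Replaying the same documents through both payload builders against a testbed 
is one way to check whether the request pattern alone triggers the hang.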

Jed


Sent from Samsung Mobile



 Original message 
From: Shawn Heisey s...@elyograg.org
Date: 07.09.2013 18:33 (GMT+01:00)
To: solr-user@lucene.apache.org
Subject: Re: Solr Hangs During Updates for over 10 minutes


On 7/9/2013 9:50 AM, Jed Glazner wrote:
 I'll give you the high level before delving deep into setup etc. I have been 
 struggling at work with a seemingly random problem where Solr will hang for 
 10-15 minutes during updates.  This outage always seems to be immediately 
 preceded by an EOF exception on the replica.  Then 10-15 minutes later we 
 see an exception on the leader for a socket timeout to the replica.  The 
 leader will then tell the replica to recover, which in most cases it does, and 
 then the outage is over.

 Here are the setup details:

 We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines.

After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
and have since been fixed.  You're five releases and about nine months
behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your
configuration is up to date with changes to the example config between
4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
testbed, duplicate your current problem, and upgrade the testbed to see
if the problem goes away.  A testbed will also give you practice for a
smooth upgrade of your production system.

Thanks,
Shawn



Re: Solr Hangs During Updates for over 10 minutes

2013-07-09 Thread Otis Gospodnetic
Hi Jed,

This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
that is about to be released.  We did not have fun working with 4.0 in
SolrCloud mode a few months ago.  You will save time, hair, and money
if you convince your manager to let you use Solr 4.4. :)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
 Hi Shawn,

 I have been trying to duplicate this problem without success for the last 2 
 weeks, which is one reason I'm getting flustered.   It seems reasonable to be 
 able to duplicate it, but I can't.

  We do have a story to upgrade, but it is still weeks if not months before 
 that gets rolled out to production.

 We have another cluster running the same version but with 8 shards and 8 
 replicas, with each shard at 100GB, more load and more indexing requests, and 
 without this problem. There we send docs in batches and all fields are 
 stored, whereas the trouble index has only 1 or 2 stored fields and we only 
 send docs 1 at a time.

 Could that have anything to do with it?

 Jed


 Sent from Samsung Mobile



  Original message 
 From: Shawn Heisey s...@elyograg.org
 Date: 07.09.2013 18:33 (GMT+01:00)
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Hangs During Updates for over 10 minutes


 On 7/9/2013 9:50 AM, Jed Glazner wrote:
 I'll give you the high level before delving deep into setup etc. I have been 
 struggling at work with a seemingly random problem where Solr will hang for 
 10-15 minutes during updates.  This outage always seems to be immediately 
 preceded by an EOF exception on the replica.  Then 10-15 minutes later we 
 see an exception on the leader for a socket timeout to the replica.  The 
 leader will then tell the replica to recover, which in most cases it does, and 
 then the outage is over.

 Here are the setup details:

 We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines.

 After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
 and have since been fixed.  You're five releases and about nine months
 behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your
 configuration is up to date with changes to the example config between
 4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
 testbed, duplicate your current problem, and upgrade the testbed to see
 if the problem goes away.  A testbed will also give you practice for a
 smooth upgrade of your production system.

 Thanks,
 Shawn