Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-08 Thread Reverend Chip
On 12/8/2010 7:30 AM, Jonathan Ellis wrote: On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip rev.c...@gmail.com wrote: Full DEBUG level logs would be a space problem; I'm loading at least 1T per node (after 3x replication), and these events are rare. Can the DEBUG logs be limited to the specific

Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-07 Thread Reverend Chip
which is fixed in rc2. On Mon, Dec 6, 2010 at 6:58 PM, Reverend Chip rev.c...@gmail.com wrote: I'm running a big test -- ten nodes with 3T of disk each -- using 0.7.0rc1. After some tuning help (thanks, Tyler), lots of this is working as it should. However, a serious event occurred as well

Re: Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-07 Thread Reverend Chip
are rare. Can the DEBUG logs be limited to the specific modules relevant to diagnosing the gossip problem and, secondarily, the failure to report replication failure? On Tue, Dec 7, 2010 at 2:37 PM, Reverend Chip rev.c...@gmail.com wrote: No, I'm afraid that's
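
For anyone with the same question: log4j levels can be raised per package instead of globally, which keeps DEBUG output to a manageable size. A minimal sketch, assuming a 0.7-era install where conf/log4j-server.properties is the active logging config and gossip lives under org.apache.cassandra.gms (verify the package names against your build):

    # leave the root logger at INFO
    log4j.rootLogger=INFO,stdout,R
    # gossip and failure detection only
    log4j.logger.org.apache.cassandra.gms=DEBUG
    # write path, to catch dropped or timed-out mutations
    log4j.logger.org.apache.cassandra.db=DEBUG

A restart picks up the change; some builds also expose a JMX call for setting log levels on a live node.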

Node goes AWOL briefly; failed replication does not report error to client, though consistency=ALL

2010-12-06 Thread Reverend Chip
I'm running a big test -- ten nodes with 3T of disk each -- using 0.7.0rc1. After some tuning help (thanks, Tyler), lots of this is working as it should. However, a serious event occurred as well -- the server froze up -- and though mutations were dropped, no error was reported to the client.
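
For context on what the client-side contract is supposed to be here: with consistency=ALL the coordinator must get an acknowledgement from every replica within the RPC timeout, and is expected to surface a timed-out or unavailable error to the client if it does not. The timeout itself is a cassandra.yaml setting; a minimal sketch, assuming 0.7-era defaults (10000 ms is the stock value, not a recommendation):

    # cassandra.yaml
    # how long the coordinator waits for replica acks before
    # reporting a timed-out mutation back to the client
    rpc_timeout_in_ms: 10000

The complaint in this thread is precisely that mutations were dropped without any such error ever reaching the client.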

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
Did I answer the question sufficiently? I need repair to work, and the cluster is sick. On 11/14/2010 2:17 PM, Jonathan Ellis wrote: What exception is causing it to fail/retry? On Sun, Nov 14, 2010 at 3:49 PM, Chip Salzenberg rev.c...@gmail.com wrote: My by-now infamous eight-node cluster
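
For others debugging a wedged repair: the stream sessions that produce figures like 1096% can be inspected directly, and a stuck session generally has to be abandoned and the repair re-run. A rough sketch of the usual commands on a 0.7-era build (host name and keyspace are placeholders; check nodetool's usage output for the exact subcommand names on your version):

    # show active incoming/outgoing streams and their progress
    nodetool -h 10.0.0.21 netstats

    # re-run anti-entropy repair for a single keyspace on that node
    nodetool -h 10.0.0.21 repair Keyspace1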

Re: Gossip yoyo under write load

2010-11-15 Thread Reverend Chip
On 11/15/2010 11:34 AM, Rob Coli wrote: On 11/13/10 11:59 AM, Reverend Chip wrote: Swapping could conceivably be a factor; the JVM is 32G out of 72G, but the machine is 2.5G into swap anyway. I'm going to disable swap and see if the gossip issues resolve. Are you using JNA/memlock
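
On the swap question: the usual setup is to take swap out of the picture entirely and let the JVM lock its heap through JNA, which Cassandra attempts automatically at startup when jna.jar is on the classpath. A rough sketch, assuming a Linux host and that the daemon runs as a user named cassandra (both assumptions; adjust to your environment):

    # disable swap now, and remove/comment the swap entries in /etc/fstab
    sudo swapoff -a

    # /etc/security/limits.conf -- allow the daemon to mlock its memory
    cassandra  soft  memlock  unlimited
    cassandra  hard  memlock  unlimited

If the memlock limit is too low, the mlockall attempt fails with only a warning in the Cassandra log, so it is easy to believe memory is locked when it is not.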

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
On 11/15/2010 12:09 PM, Jonathan Ellis wrote: On Mon, Nov 15, 2010 at 1:03 PM, Reverend Chip rev.c...@gmail.com wrote: I find X.21's data disk is full. nodetool ring says that X.21 has a load of only 326.2 GB, but the 1T partition is full. Load only tracks live data -- is the rest tmp files

Re: repair takes two days, and ends up stuck: stream at 1096% (yes, really)

2010-11-15 Thread Reverend Chip
On 11/15/2010 2:01 PM, Jonathan Ellis wrote: On Mon, Nov 15, 2010 at 3:05 PM, Reverend Chip rev.c...@gmail.com wrote: There are a lot of non-tmps that were not included in the load figure. Having stopped the server and deleted tmp files, the data are still using way more space than ring
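
For the disk-accounting question: the load figure in nodetool ring only counts live SSTables, so leftover -tmp- files from interrupted streams or flushes, data from ranges the node no longer owns, and compacted SSTables the JVM has not yet deleted all consume disk without being reported. A rough sketch of how to account for the gap, assuming the default /var/lib/cassandra/data layout (substitute your data_file_directories):

    # on-disk usage per keyspace
    du -sh /var/lib/cassandra/data/*

    # partially written SSTables left behind by aborted streams/flushes
    find /var/lib/cassandra/data -name '*-tmp-*'

    # after token moves or new nodes, drop data outside this node's ranges
    nodetool -h 10.0.0.21 cleanup

On these versions compacted SSTables are only unlinked once the JVM garbage-collects the last references to them, so a full GC or a restart can also free a surprising amount of space.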

Re: Gossip yoyo under write load

2010-11-13 Thread Reverend Chip
On 11/12/2010 6:46 PM, Jonathan Ellis wrote: On Fri, Nov 12, 2010 at 3:19 PM, Chip Salzenberg rev.c...@gmail.com wrote: After rebooting my 0.7.0beta3+ cluster to increase threads (read=100 write=200 ... they're beefy machines) and putting it under load again, I find gossip reporting yoyo
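
Two knobs are relevant to the symptoms in this thread: the stage thread counts being raised here are plain cassandra.yaml settings, and the failure detector's sensitivity can be relaxed so that brief stalls under load do not flap nodes between up and down. A minimal sketch of the 0.7-era settings (the thread counts are the values quoted in this message; the conviction threshold is only an illustration, not a tested recommendation):

    # cassandra.yaml
    concurrent_reads: 100
    concurrent_writes: 200

    # default is 8; a higher value makes the failure detector slower
    # to declare a briefly unresponsive node dead
    phi_convict_threshold: 12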

Re: Cluster fragility

2010-11-13 Thread Reverend Chip
and target - DEBUG level logs - instructions for how to reproduce On Thu, Nov 11, 2010 at 7:46 PM, Reverend Chip rev.c...@gmail.com wrote: I've been running tests, first with a four-node and then an eight-node cluster. I started with 0.7.0 beta3, but have since updated to a more recent Hudson

Re: node won't leave

2010-11-07 Thread Reverend Chip
On 11/6/2010 8:26 PM, Jonathan Ellis wrote: On Sat, Nov 6, 2010 at 4:51 PM, Reverend Chip rev.c...@gmail.com wrote: Am I to understand that ring maintenance requests can just fail when partially complete, in the same manner as a regular insert might fail, perhaps due to inter-node RPC
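
For reference, the two ways to take a node out of the ring on a 0.7-era build, and the usual recovery when the operation stalls partway (host and token below are placeholders):

    # run against the node that is leaving; it streams its data away first
    nodetool -h 10.0.0.21 decommission

    # run against any live node when the leaving node is dead or wedged
    nodetool -h 10.0.0.20 removetoken 85070591730234615865843651857942052864

    # confirm the node is no longer shown as Leaving
    nodetool -h 10.0.0.20 ring

If a decommission fails partway, the rest of the ring can be left believing the node is still Leaving; a removetoken issued from a healthy node (some versions also offer a force variant) is the usual way to finish the job.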