The errors start as soon as node #5 goes down and last until I restart
node #5.

bin/nodeprobe cluster is accurate (it quickly registers when #5 is down,
and when it is back up again).

Since I set the replication factor to 3, I'm confused as to why (after
the first few seconds or so) there is still an error just because one
host is temporarily down.

The way I have the test set up: on each of the five nodes, a script
calls get_key_range over and over against "localhost".  The behavior
varies depending on which node I take down: for one particular host, it
is the only one giving errors (the other 4 nodes keep working).  For the
other 4 hosts, only 2 or 3 nodes continue to work (i.e. the downed node
and either one or two other nodes are the ones giving errors).  Note:
the nodes that keep working never fail at all, not even for a few
seconds.
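
In case it helps, the polling script is essentially the loop below.
This is a minimal sketch rather than my exact script: it assumes the
Thrift-generated Python bindings (a "cassandra" package exposing
Cassandra.Client, generated from interface/cassandra.thrift) and a
get_key_range(keyspace, column_family, start, finish, count) signature;
adjust the arguments to match whatever your build's cassandra.thrift
declares.  The keyspace and column family names are placeholders.

# Minimal sketch of the per-node polling loop (assumptions as noted above).
import time
import traceback

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra  # Thrift-generated bindings (name may differ)

KEYSPACE = 'Keyspace1'       # placeholder, not my real keyspace
COLUMN_FAMILY = 'Standard1'  # placeholder, not my real column family

def poll(host='localhost', port=9160, interval=1.0):
    while True:
        socket = TSocket.TSocket(host, port)
        transport = TTransport.TBufferedTransport(socket)
        client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
        try:
            transport.open()
            # Empty start/finish asks for all keys; count caps the result.
            keys = client.get_key_range(KEYSPACE, COLUMN_FAMILY, '', '', 100)
            print('%s OK, %d keys' % (time.ctime(), len(keys)))
        except Exception:
            print('%s FAILED' % time.ctime())
            traceback.print_exc()
        finally:
            if transport.isOpen():
                transport.close()
        time.sleep(interval)

if __name__ == '__main__':
    poll()

Each node runs this loop against its own localhost on the Thrift port
(9160), which is what produces the per-node pass/fail pattern described
above.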

I am running this on 4GB "cloud server" boxes at Rackspace.  I can set
up just about any test needed to help debug this and capture output or
logs, and I can give a Cassandra developer access if it would help.  Of
course I can include whatever config files or log files would be
helpful; I just don't want to spam the list unless it is relevant.

Thanks again,

Simon


On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> getting temporary errors when a node goes down, until the other nodes'
> failure detectors realize it's down, is normal.  (this should only
> take a dozen seconds, or so.)
>
> but after that it should route requests to other nodes, and it should
> also realize when you restart #5 that it is alive again.  those are
> two separate issues.
>
> can you verify that "bin/nodeprobe cluster" shows that node 1
> eventually does/does not see #5 dead, and alive again?
>
> -Jonathan
>
> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith <simongsm...@gmail.com> wrote:
>> I'm seeing an issue similar to:
>>
>> http://issues.apache.org/jira/browse/CASSANDRA-169
>>
>> Here is when I see it.  I'm running Cassandra on 5 nodes using the
>> OrderPreservingPartitioner, and have populated Cassandra with 78
>> records, and I can use get_key_range via Thrift just fine.  Then, if I
>> manually kill one of the nodes (node #5), the node I've been using to
>> call get_key_range (node #1) will time out and return the error:
>>
>>  Thrift: Internal error processing get_key_range
>>
>> And the Cassandra output shows the same trace as in 169:
>>
>> ERROR - Encountered IOException on connection:
>> java.nio.channels.SocketChannel[closed]
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>        at 
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>>        at 
>> org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>>        at 
>> org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>>        at 
>> org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
>> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
>> ERROR - Internal error processing get_key_range
>> java.lang.RuntimeException: java.util.concurrent.TimeoutException:
>> Operation timed out.
>>        at 
>> org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>>        at 
>> org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>        at 
>> org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>>        at 
>> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>>        at 
>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>        at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>        at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>        at java.lang.Thread.run(Thread.java:675)
>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>        at 
>> org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>>        ... 7 more
>>
>>
>>
>> If it gave an error just once, I could simply catch the error and try
>> again.  But get_key_range calls to the node I had been querying (node
>> #1) never work again (it is still up and responds fine to multiget
>> Thrift calls), sometimes not even after I restart the downed node
>> (node #5).  I end up having to restart node #1 in addition to node #5.
>> The behavior of the other 3 nodes varies: some of them are also unable
>> to respond to get_key_range calls, while others do respond.
>>
>> My question is: what path should I go down to reproduce this problem?
>> I'm using Aug 27 trunk code - should I update my Cassandra install
>> before gathering more information for this issue, and if so, to which
>> version (0.4 or trunk)?  If anyone is familiar with this issue, could
>> you let me know what I might be doing wrong, or what my next
>> info-gathering step should be?
>>
>> Thank you,
>>
>> Simon Smith
>> Arcode Corporation
>>
>
