Re: All subsequent CAS requests time out after heavy use of new CAS feature

horschi Thu, 15 Dec 2016 06:15:46 -0800

Hi,

I would like to revive this old thread. I did some debugging and found that
the timeouts come from StorageProxy.proposePaxos():
callback.isFullyRefused() returns false, which triggers a
WriteTimeout.


Looking at my ccm cluster logs, I can see that two replica nodes return
different results in their ProposeVerbHandler. In my opinion the
coordinator should not throw an exception in such a case, but should
instead retry the operation.
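
To make the case I mean concrete, here is a minimal sketch of how the replies
to a propose round could be classified. It is a simplified model and not the
actual StorageProxy code: the names classify_propose and ProposeOutcome are
hypothetical, and I am assuming that "fully refused" means every replica
answered and none of them accepted.

```python
from enum import Enum


class ProposeOutcome(Enum):
    ACCEPTED = "accepted"  # enough replicas accepted the proposal
    REFUSED = "refused"    # every replica rejected it (assumed meaning of "fully refused")
    PARTIAL = "partial"    # mixed replies: some accepted, but fewer than required


def classify_propose(accepts: int, refusals: int, required: int, replicas: int) -> ProposeOutcome:
    """Classify the replies to one Paxos propose round (hypothetical model)."""
    if accepts >= required:
        return ProposeOutcome.ACCEPTED
    if accepts == 0 and refusals == replicas:
        return ProposeOutcome.REFUSED
    # Neither a quorum of accepts nor a unanimous refusal.
    return ProposeOutcome.PARTIAL


# The situation in the log below: node2 accepts, node1 and node3 reject,
# and 2 of the 3 replicas are required.
print(classify_propose(accepts=1, refusals=2, required=2, replicas=3))
```

In this model the log below falls into the PARTIAL case, which is the one that
currently surfaces as a WriteTimeout and which I think should be retried.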

What do the CAS/Paxos experts on this list think about this? Feel free to
instruct me to do further tests or code changes. I'd be glad to help.

Log:

node1/logs/system.log:WARN  [SharedPool-Worker-5] 2016-12-15 14:48:36,896
PaxosState.java:124 - Rejecting proposal for
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node1/logs/system.log-    Row: id=@ | value=<tombstone>) because inProgress
is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
key=locktest_ 1 columns=[[] | [value]]
--
node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15 14:48:36,980
StorageProxy.java:506 - proposePaxos:
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node1/logs/system.log-    Row: id=@ | value=<tombstone>)//1//0
--
node2/logs/system.log:WARN  [SharedPool-Worker-7] 2016-12-15 14:48:36,969
PaxosState.java:117 - Accepting proposal:
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node2/logs/system.log-    Row: id=@ | value=<tombstone>)
--
node3/logs/system.log:WARN  [SharedPool-Worker-2] 2016-12-15 14:48:36,897
PaxosState.java:124 - Rejecting proposal for
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node3/logs/system.log-    Row: id=@ | value=<tombstone>) because inProgress
is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
key=locktest_ 1 columns=[[] | [value]]


kind regards,
Christian


On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers <datag...@aol.com> wrote:

> My thinking was that, due to the size of the data, there may be I/O
> issues. But it sounds more like you're competing for locks and hitting a
> deadlock issue.
>
> Regards,
> Denise
> Cell - (860)989-3431
>
> Sent from my iPhone
>
> On Apr 15, 2016, at 9:00 AM, horschi <hors...@gmail.com> wrote:
>
> Hi Denise,
>
> in my case it's a small blob I am writing (should be around 100 bytes):
>
>      CREATE TABLE "Lock" (
>          lockname varchar,
>          id varchar,
>          value blob,
>          PRIMARY KEY (lockname, id)
>      ) WITH COMPACT STORAGE
>          AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
> 'chunk_length_kb' : '8' };
>
> Are you asking because large values are known to cause issues? Do you have
> anything specific in mind?
>
> kind regards,
> Christian
>
>
>
>
> On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers <datag...@aol.com> wrote:
>
>> Also, what type of data were you reading/writing?
>>
>> Regards,
>> Denise
>>
>> Sent from my iPad
>>
>> On Apr 15, 2016, at 8:29 AM, horschi <hors...@gmail.com> wrote:
>>
>> Hi Jan,
>>
>> were you able to resolve your problem?
>>
>> We are trying the same and also see a lot of WriteTimeouts:
>> WriteTimeoutException: Cassandra timeout during write query at
>> consistency SERIAL (2 replica were required but only 1 acknowledged the
>> write)
>>
>> How many clients were competing for a lock in your case? In our case it's
>> only two :-(
>>
>> cheers,
>> Christian
>>
>>
>> On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli <rc...@eventbrite.com>
>> wrote:
>>
>>> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen <
>>> jan.algermis...@nordsc.com> wrote:
>>>
>>>> I am experimenting with C* 2.0 ( and today's java-driver 2.0 snapshot)
>>>> for implementing distributed locks.
>>>>
>>>
>>> [ and I'm experiencing the problem described in the subject ... ]
>>>
>>>
>>>> Any idea how to approach this problem?
>>>>
>>>
>>> 1) Upgrade to 2.0.1 release.
>>> 2) Try to reproduce symptoms.
>>> 3) If able to, file a JIRA at
>>> https://issues.apache.org/jira/secure/Dashboard.jspa including repro steps
>>> 4) Reply to this thread with the JIRA ticket URL
>>>
>>> =Rob
>>>
>>>
>>>
>>
>>
>
