Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread horschi
Hi Tom,

"The idea of join_ring=false is that other nodes are not aware of the new
node, and therefore never send requests to it. The new node can then be
repaired"
Nicely explained, but I still see the issue that this node would not
receive writes during that time. So after the repair the node would still
miss data.
Again, what is needed is either some joining state or a write-survey mode
that disables reads but still accepts writes.
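For reference, write survey mode is switched on with a startup property. A
minimal sketch, assuming 2.0.17+ where survey mode is supposed to be fixed:

    # start the node in write-survey mode: it takes writes but serves no reads
    cassandra -Dcassandra.write_survey=true

    # once the node has been repaired or rebuilt, let it start serving reads
    nodetool join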



"To set up a new DC, I was hoping that you could also rebuild (instead of a
repair) a new node while join_ring=false, but that seems not to work."
Correct. The node does not get any tokens with join_ring=false. And again,
your node won't receive any writes while you are rebuilding. Therefore you
will have outdated data at the point when you are done rebuilding.


kind regards,
Christian





On Tue, Sep 8, 2015 at 10:00 AM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> "one drawback: the node joins the cluster as soon as the bootstrapping
>> begins."
>> I am not sure I understand this correctly. It will get tokens, but not
>> load data if you combine it with autobootstrap=false.
>>
> Joining the cluster means that all other nodes become aware of the new
> node, and therefore it might receive reads. And yes, it will not have any
> data, because auto_bootstrap=false.
>
>
>
>> How I see it: You should be able to start all the new nodes in the new DC
>> with autobootstrap=false and survey-mode=true. Then you should have a new
>> DC with nodes that have tokens but no data. Then you can start rebuild on
>> all new nodes. During this process, the new nodes should get writes, but
>> not serve reads.
>>
> Maybe you're right.
>
>
>>
>> "It turns out that join_ring=false in this scenario does not solve this
>> problem"
>> I also don't see how join_ring would help here. (Actually I have no clue
>> where you would ever need that option)
>>
> The idea of join_ring=false is that other nodes are not aware of the new
> node, and therefore never send requests to it. The new node can then be
> repaired (see https://issues.apache.org/jira/browse/CASSANDRA-6961). To
> set up a new DC, I was hoping that you could also rebuild (instead of a
> repair) a new node while join_ring=false, but that seems not to work.
>
>>
>>
>> "Currently I'm trying to auto_bootstrap my new DC. The good thing is that
>> it doesn't accept reads from other DCs."
>> The joining-state actually works perfectly. The joining state is a state
>> where nodes take writes, but do not serve reads. It would be really cool if you
>> could boot a node into the joining state. Actually, write_survey should
>> basically be the same.
>>
> It would be great if you could select the DC from where it's bootstrapped,
> similar to nodetool rebuild. I'm currently bootstrapping a node in
> San-Jose. It decides to stream all data from another DC in Amsterdam, while
> we also have another DC in San-Jose, right next to it. Streaming data
> across the Atlantic takes a lot more time :(
>
>
>
>>
>> kind regards,
>> Christian
>>
>> PS: I would love to see the results, if you perform any tests on the
>> write-survey. Please share it here on the mailing list :-)
>>
>>
>>
>> On Mon, Sep 7, 2015 at 11:10 PM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> No, I never tried survey mode. I didn't know it until now, but from the
>>> info I was able to find it looks like it is meant for a different purpose.
>>> Maybe it can be used to bootstrap a new DC, though.
>>>
>>> On the other hand, the auto_bootstrap=false + rebuild scenario seems to
>>> be designed to do exactly what I need, except that it has one drawback: the
>>> node joins the cluster as soon as the bootstrapping begins.
>>>
>>> It turns out that join_ring=false in this scenario does not solve this
>>> problem, since nodetool rebuild does not do anything if C* is started with
>>> this option.
>>>
>>> A workaround could be to ensure that only LOCAL_* CL is used by all
>>> clients, but even then I'm seeing failed queries, because they're
>>> mysteriously routed to the new DC every now and then.
>>>
>>> Currently I'm trying to auto_bootstrap my new DC. The good thing is that
>>> it doesn't accept reads from other DCs. The bad thing is that a) I can't
>>> choose where it streams its data from, and b) the two nodes I've been
>>> trying to bootstrap crashed when they were almost finished...
>>>
>>>
>>>
>>> On Mon, Sep 7, 2015 at 10:22 PM, horschi  wrote:
>>>
 Hi Tom,

 this sounds very much like my thread: "auto_bootstrap=false broken?"

 Did you try booting the new node with survey-mode? I wanted to try
 this, but I am waiting for 2.0.17 to come out (survey mode is broken in
 earlier versions). Imho survey mode is what you (and me too) want: start a
 node, accepting writes, but not serving reads. I have not tested it yet,
 but I think it should work.

 Also the manual join mentioned in CASSANDRA-9667 sounds very interesting.

Re: How to prevent queries being routed to new DC?

2015-09-08 Thread Tom van den Berge
Hi Anuj,

That could indeed explain reads on my new DC. However, what I'm seeing in
my client application is that every now and then, a read query does not
produce any result, while I'm sure that it should. If I understand the read
repair process correctly, it will never cause a read query to fail to find a
replica, right?
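For reference, the per-table settings Anuj mentions below would be set
roughly like this (keyspace and table names are placeholders):

    ALTER TABLE my_ks.my_table
      WITH read_repair_chance = 0.0
      AND dclocal_read_repair_chance = 0.1;

and mutation drops can be checked with nodetool tpstats.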



On Tue, Sep 8, 2015 at 4:40 AM, Anuj Wadehra  wrote:

> Hi Tom,
>
> While reading data (even at CL LOCAL_QUORUM), if the data on the different nodes
> required to meet the CL in your local cluster doesn't match, data will be read
> from the remote DC for read repair if read_repair_chance is not 0.
>
> Imp points:
> 1.If you are reading and writing at local_quorum you can set
> read_repair_chance to 0 to prevent cross dc read repair.
> 2. For enabling dc local read repairs you can use
> dclocal_read_repair_chance and set read_repair_chance to 0.
> 3. If you are experiencing frequent requests being routed due to digest
> mismatch you may need to investigate mutation drops in your cluster using
> tpstats.
>
> Refer to similar issue raised by us :
> https://issues.apache.org/jira/browse/CASSANDRA-8479
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
> --
> *From*:"Tom van den Berge" 
> *Date*:Tue, 8 Sep, 2015 at 1:31 am
> *Subject*:Re: How to prevent queries being routed to new DC?
>
> NetworkTopologyStrategy
>
> On Mon, Sep 7, 2015 at 4:39 PM, Ryan Svihla  wrote:
>
>> What's your keyspace replication strategy?
>>
>> On Thu, Sep 3, 2015 at 3:16 PM Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> Thanks for your help so far!
>>>
>>> I have some problems trying to understand the jira mentioned by Rob :(
>>>
>>> I'm currently trying to set up the first node in the new DC with
>>> auto_bootstrap = true. The node then becomes visible with status "joining",
>>> which (hopefully) prevents other DCs from sending queries to it.
>>>
>>> Do you think this will work?
>>>
>>>
>>>
>>> On Thu, Sep 3, 2015 at 9:46 PM, Robert Coli 
>>> wrote:
>>>
 On Thu, Sep 3, 2015 at 12:25 PM, Bryan Cheng 
 wrote:

> I'd recommend you enable tracing and do a few queries in a controlled
> environment to verify that queries are being routed to your new nodes.
> Provided you have followed the procedure outlined above (specifically, 
> have
> set auto_bootstrap to false in your new cluster), rebuild has not been 
> run,
> the application is not connecting to the new cluster, and all your queries
> are run at LOCAL_* quorum levels, I do not believe those queries should be
> routed to the new dc.
>

 Other than CASSANDRA-9753, this is true.

 https://issues.apache.org/jira/browse/CASSANDRA-9753 (Unresolved; ):
 "LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch"

 =Rob


>>> --
>> Regards,
>>
>> Ryan Svihla
>
>
>
>
> --
> Tom van den Berge
> Lead Software Engineer
> Drillster | Middenburcht 136
> 3452 MT Vleuten
> Netherlands | +31 30 755 53 30
> www.drillster.com
>


Re: Some love for multi-partition LWT?

2015-09-08 Thread Marek Lewandowski
Are you absolutely sure that a lock is required? I could imagine that
multiple Paxos rounds could be played for different partitions and these
rounds would be dependent on each other.

Performance aside, can you please elaborate on where you see such a need
for a lock?
On 8 Sep 2015 00:05, "DuyHai Doan"  wrote:

> Multi partitions LWT is not supported currently on purpose. To support it,
> we would have to emulate a distributed lock which is pretty bad for
> performance.
>
> On Mon, Sep 7, 2015 at 10:38 PM, Marek Lewandowski <
> marekmlewandow...@gmail.com> wrote:
>
>> Hello there,
>>
>> would you be interested in having multi-partition support for lightweight
>> transactions, in other words to have the ability to express something like:
>>
>> INSERT INTO … IF NOT EXISTS AND
>> UPDATE … IF EXISTS AND
>> UPDATE … IF colX = ‘xyz’
>>
>> where each statement refers to a row living potentially on a different set
>> of nodes.
>> In other words: imagine a batch with conditions, but one which allows you to
>> specify multiple statements with conditions for rows in different
>> partitions.
>>
>> Do you find it very useful, moderately useful or you don’t need that
>> because you have some superb business logic handling of such cases for
>> example?
>>
>> Let me know.
>> Regards,
>> Marek
>
>
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread Tom van den Berge
>
> "one drawback: the node joins the cluster as soon as the bootstrapping
> begins."
> I am not sure I understand this correctly. It will get tokens, but not
> load data if you combine it with autobootstrap=false.
>
Joining the cluster means that all other nodes become aware of the new
node, and therefore it might receive reads. And yes, it will not have any
data, because auto_bootstrap=false.



> How I see it: You should be able to start all the new nodes in the new DC
> with autobootstrap=false and survey-mode=true. Then you should have a new
> DC with nodes that have tokens but no data. Then you can start rebuild on
> all new nodes. During this process, the new nodes should get writes, but
> not serve reads.
>
Maybe you're right.


>
> "It turns out that join_ring=false in this scenario does not solve this
> problem"
> I also don't see how join_ring would help here. (Actually I have no clue
> where you would ever need that option)
>
The idea of join_ring=false is that other nodes are not aware of the new
node, and therefore never send requests to it. The new node can then be
repaired (see https://issues.apache.org/jira/browse/CASSANDRA-6961). To set
up a new DC, I was hoping that you could also rebuild (instead of a repair)
a new node while join_ring=false, but that seems not to work.

>
>
> "Currently I'm trying to auto_bootstrap my new DC. The good thing is that
> it doesn't accept reads from other DCs."
> The joining-state actually works perfectly. The joining state is a state
> where nodes take writes, but do not serve reads. It would be really cool if you
> could boot a node into the joining state. Actually, write_survey should
> basically be the same.
>
It would be great if you could select the DC from where it's bootstrapped,
similar to nodetool rebuild. I'm currently bootstrapping a node in
San-Jose. It decides to stream all data from another DC in Amsterdam, while
we also have another DC in San-Jose, right next to it. Streaming data
across the Atlantic takes a lot more time :(
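For reference, with rebuild you can at least name the source DC explicitly
(the DC name below is just a placeholder):

    nodetool rebuild -- SanJose

Bootstrap offers no such option, as far as I can tell.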



>
> kind regards,
> Christian
>
> PS: I would love to see the results, if you perform any tests on the
> write-survey. Please share it here on the mailing list :-)
>
>
>
> On Mon, Sep 7, 2015 at 11:10 PM, Tom van den Berge <
> tom.vandenbe...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> No, I never tried survey mode. I didn't know it until now, but from the
>> info I was able to find it looks like it is meant for a different purpose.
>> Maybe it can be used to bootstrap a new DC, though.
>>
>> On the other hand, the auto_bootstrap=false + rebuild scenario seems to
>> be designed to do exactly what I need, except that it has one drawback: the
>> node joins the cluster as soon as the bootstrapping begins.
>>
>> It turns out that join_ring=false in this scenario does not solve this
>> problem, since nodetool rebuild does not do anything if C* is started with
>> this option.
>>
>> A workaround could be to ensure that only LOCAL_* CL is used by all
>> clients, but even then I'm seeing failed queries, because they're
>> mysteriously routed to the new DC every now and then.
>>
>> Currently I'm trying to auto_bootstrap my new DC. The good thing is that
>> it doesn't accept reads from other DCs. The bad thing is that a) I can't
>> choose where it streams its data from, and b) the two nodes I've been
>> trying to bootstrap crashed when they were almost finished...
>>
>>
>>
>> On Mon, Sep 7, 2015 at 10:22 PM, horschi  wrote:
>>
>>> Hi Tom,
>>>
>>> this sounds very much like my thread: "auto_bootstrap=false broken?"
>>>
>>> Did you try booting the new node with survey-mode? I wanted to try this,
>>> but I am waiting for 2.0.17 to come out (survey mode is broken in earlier
>>> versions). Imho survey mode is what you (and me too) want: start a node,
>>> accepting writes, but not serving reads. I have not tested it yet, but I
>>> think it should work.
>>>
>>> Also the manual join mentioned in CASSANDRA-9667 sounds very interesting.
>>>
>>> kind regards,
>>> Christian
>>>
>>> On Mon, Sep 7, 2015 at 10:11 PM, Tom van den Berge 
>>> wrote:
>>>
 Running nodetool rebuild on a node that was started with
 join_ring=false does not work, unfortunately. The nodetool command returns
 immediately, after a message appears in the log that the streaming of data
 has started. After that, nothing happens.

 Tom


 On Fri, Sep 12, 2014 at 5:47 PM, Robert Coli 
 wrote:

> On Fri, Sep 12, 2014 at 6:57 AM, Tom van den Berge 
> wrote:
>
>> Wouldn't it be far more efficient if a node that is rebuilding itself
>> is responsible for not accepting reads until the rebuild is complete? 
>> E.g.
>> by marking it as "Joining", similar to a node that is being bootstrapped?
>>
>
> Yes, and Cassandra 2.0.7 and above contain this long desired
> functionality.
>
> 

Re: Some love for multi-partition LWT?

2015-09-08 Thread woolfel

There's quite a bit of literature on the topic. Look at what is in acmqueue and 
you'll see what others are saying is accurate.

To get guarantees you need a distributed lock or a different design like Datomic.
Look at what Rich Hickey has done with Datomic.



Sent from my iPhone

> On Sep 8, 2015, at 5:54 AM, DuyHai Doan  wrote:
> 
> "I could imagine that multiple paxos rounds could be played for different 
> partitions and these rounds would be dependent on each other" 
> 
> Example of cluster of 10 nodes (N1 ... N10) and RF=3.
> 
> Suppose a LWT with 2 partitions and 2 mutations M1 & M2, coordinator C1. It 
> will imply 2 Paxos rounds with the same ballot B1, one round affecting N1, 
> N2, N3 for mutation M1 and the other round affecting N4, N5 and N6 for 
> mutation M2.
> 
> Suppose that the prepare/promise, read/results phases are successful for both 
> Paxos rounds (see here for different LWT phases 
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0)
> 
> The coordinator C1 then sends a propose() message to N1 ... N6 with 
> respective mutations M1 and M2. N1, N2 and N3 will reply accept() but imagine 
> another coordinator C2 has just proposed a higher ballot B2 (B2 > B1) for 
> nodes N4, N5 & N6 only. Those nodes will reply NoAck() (or won't reply) to C1. 
> 
> Then the whole multi-partition LWT will need to be aborted because we cannot 
> proceed with mutation M2. But N1, N2 and N3 already accepted the mutation M1 
> and it could be committed by any subsequent Paxos round on N1, N2 and N3
> 
> We certainly don't want that. 
> 
> So how to abort safely mutation M1 on N1, N2 and N3 in this case ? Of course 
> the coordinator C1 could send itself another prepare() message with ballot B3 
> higher than B1 to abort the accepted value in N1, N2 and N3, but we do not 
> have ANY GUARANTEE that in the meantime, there is another Paxos round 
> impacting N1, N2 and N3 with ballot higher than B1 that will commit the 
> undesirable mutation M1...
> 
> This simple example shows how hard it is to implement multi-partition Paxos 
> rounds. The fact that you have multiple Paxos rounds that are dependent on 
> each other breaks the safety guarantees of the original Paxos paper. 
> 
>  The only way I can see that will ensure safety in this case is to forbid any 
> Paxos round on N1 ... N6 as long as the current rounds are not finished yet, 
> and this is exactly what a distributed lock does.
> 
> 
> 
> 
> 
> 
> 
> 
> 
>> On Tue, Sep 8, 2015 at 8:15 AM, Marek Lewandowski 
>>  wrote:
>> Are you absolutely sure that lock is required? I could imagine that multiple 
>> paxos rounds could be played for different partitions and these rounds would 
>> be dependent on each other.
>> 
>> Performance aside, can you please elaborate where do you see such need for 
>> lock?
>> 
>>> On 8 Sep 2015 00:05, "DuyHai Doan"  wrote:
>>> Multi partitions LWT is not supported currently on purpose. To support it, 
>>> we would have to emulate a distributed lock which is pretty bad for 
>>> performance.
>>> 
 On Mon, Sep 7, 2015 at 10:38 PM, Marek Lewandowski 
  wrote:
 Hello there,
 
 would you be interested in having multi-partition support for lightweight 
 transactions, in other words to have the ability to express something like:
 
 INSERT INTO … IF NOT EXISTS AND
 UPDATE … IF EXISTS AND
 UPDATE … IF colX = ‘xyz’
 
 where each statement refers to a row living potentially on a different set 
 of nodes.
 In other words: imagine a batch with conditions, but one which allows you to 
 specify multiple statements with conditions for rows in different 
 partitions.
 
 Do you find it very useful, moderately useful or you don’t need that 
 because you have some superb business logic handling of such cases for 
 example?
 
 Let me know.
 Regards,
 Marek
> 


Re: Some love for multi-partition LWT?

2015-09-08 Thread Marek Lewandowski
"This simple example shows how hard it is to implement multi-partition
Paxos rounds. The fact that you have multiple Paxos rounds that are
dependent on each other breaks the safety guarantees of the original Paxos
paper. "
What if this dependency is explicitly specified in the proposal?
Assume that each proposal consists of all mutations in this case: {M1,M2}.
So N1,N2,N3 receive proposal {M1,M2} and nodes N4,N5,N6 receive proposal
{M1,M2}. (Yes, in this case I assume that nodes will get more data than
they need, but that's the cost).

If C2 wins with a higher ballot at nodes N4, N5, N6 after N1, N2, N3 accepted
{M1,M2} from C1, then C2 sees the in-progress proposal {M1, M2}, which it has
to complete, so it will try nodes N1, N2, N3 and get agreement, as well as
nodes N4, N5, N6, and eventually commit, unless another coordinator jumps in
and trumps the Paxos rounds.

Do you think it could work?

"To guarantee you need a distributed lock or a different design like
datomic. Look at what rich hickey has done with datomic "
I'll look into that, thanks.

2015-09-08 12:52 GMT+02:00 :

>
> There's quite a bit of literature on the topic. Look at what is in
> acmqueue and you'll see what others are saying is accurate.
>
> To guarantee you need a distributed lock or a different design like
> datomic. Look at what rich hickey has done with datomic
>
>
>
> Sent from my iPhone
>
> On Sep 8, 2015, at 5:54 AM, DuyHai Doan  wrote:
>
> "I could imagine that multiple paxos rounds could be played for different
> partitions and these rounds would be dependent on each other"
>
> Example of cluster of 10 nodes (N1 ... N10) and RF=3.
>
> Suppose a LWT with 2 partitions and 2 mutations M1 & M2, coordinator C1.
> It will imply 2 Paxos rounds with the same ballot B1, one round affecting
> N1, N2, N3 for mutation M1 and the other round affecting N4, N5 and N6 for
> mutation M2.
>
> Suppose that the prepare/promise, read/results phases are successful for
> both Paxos rounds (see here for different LWT phases
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
> )
>
> The coordinator C1 then sends a propose() message to N1 ... N6 with
> respective mutations M1 and M2. N1, N2 and N3 will reply accept() but
> imagine another coordinator C2 has just proposed a higher ballot B2 (B2 >
> B1) for nodes N4, N5 & N6 only. Those nodes will reply NoAck() (or won't
> reply) to C1.
>
> Then the whole multi-partition LWT will need to be aborted because we
> cannot proceed with mutation M2. But N1, N2 and N3 already accepted the
> mutation M1 and it could be committed by any subsequent Paxos round on N1,
> N2 and N3
>
> We certainly don't want that.
>
> So how to abort safely mutation M1 on N1, N2 and N3 in this case ? Of
> course the coordinator C1 could send itself another prepare() message with
> ballot B3 higher than B1 to abort the accepted value in N1, N2 and N3, but
> we do not have ANY GUARANTEE that in the meantime, there is another Paxos
> round impacting N1, N2 and N3 with ballot higher than B1 that will commit
> the undesirable mutation M1...
>
> This simple example shows how hard it is to implement multi-partition
> Paxos rounds. The fact that you have multiple Paxos rounds that are
> dependent on each other breaks the safety guarantees of the original Paxos
> paper.
>
>  The only way I can see that will ensure safety in this case is to forbid
> any Paxos round on N1 ... N6 as long as the current rounds are not finished
> yet, and this is exactly what a distributed lock does.
>
>
>
>
>
>
>
>
>
> On Tue, Sep 8, 2015 at 8:15 AM, Marek Lewandowski <
> marekmlewandow...@gmail.com> wrote:
>
>> Are you absolutely sure that lock is required? I could imagine that
>> multiple paxos rounds could be played for different partitions and these
>> rounds would be dependent on each other.
>>
>> Performance aside, can you please elaborate where do you see such need
>> for lock?
>> On 8 Sep 2015 00:05, "DuyHai Doan"  wrote:
>>
>>> Multi partitions LWT is not supported currently on purpose. To support
>>> it, we would have to emulate a distributed lock which is pretty bad for
>>> performance.
>>>
>>> On Mon, Sep 7, 2015 at 10:38 PM, Marek Lewandowski <
>>> marekmlewandow...@gmail.com> wrote:
>>>
 Hello there,

 would you be interested in having multi-partition support for
 lightweight transactions, in other words to have the ability to express
 something like:

 INSERT INTO … IF NOT EXISTS AND
 UPDATE … IF EXISTS AND
 UPDATE … IF colX = ‘xyz’

 where each statement refers to a row living potentially on a different
 set of nodes.
 In other words: imagine a batch with conditions, but one which allows
 you to specify multiple statements with conditions for rows in different
 partitions.

 Do you find it very useful, moderately useful or you don’t need that
 because you have some superb business logic handling of such cases for
 

Re: Some love for multi-partition LWT?

2015-09-08 Thread Peter Lin
I would caution against using Paxos for distributed transactions in an inappropriate
way. The model has to be logically and mathematically correct, otherwise
you end up with corrupt data. In the worst case, it could cause cascading
failure that brings down the cluster. I've seen distributed systems come to
a grinding halt due to disagreement between nodes.

As far as I know, based on existing literature and first-hand experience,
there are only two ways: a central transaction manager or a distributed lock.

Distributed locks are expensive and can result in cluster-wide deadlock.
Look at the data grid space if you want to see where and when distributed
locks cause major headaches. Just ask anyone that has tried to use a
distributed cache as a distributed in-memory database and they'll tell you
how poorly that scales.

I recommend spending a couple of months reading up on the topic, there's a
wealth of literature on this. These are very old problems and lots of smart
people have spent time figuring out what works and doesn't work.

On Tue, Sep 8, 2015 at 7:30 AM, Marek Lewandowski <
marekmlewandow...@gmail.com> wrote:

> "This simple example shows how hard it is to implement multi-partition
> Paxos rounds. The fact that you have multiple Paxos rounds that are
> dependent on each other breaks the safety guarantees of the original Paxos
> paper. "
> What if this dependency is explicitly specified in proposal.
> Assume that each proposal consists of all mutations in this case: {M1,M2}.
> So N1,N2,N3 receive proposal {M1,M2} and nodes N4,N5,N6 receive proposal
> {M1,M2}. (Yes, in this case I assume that nodes will get more data than
> they need, but that's the cost).
>
> If C2 wins with higher ballot at nodes N4, N5, N6  after N1,N2,N3
>  accepted {M1,M2} from C1, then C2 sees in progress proposal {M1, M2} which
> it has to complete, so it will try nodes N1,N2,N3 and get agreement as well
> as nodes N4,N5,N6 and eventually commit unless another coordinator jumps in
> and trumps paxos rounds.
>
> Do you think it could work?
>
> "To guarantee you need a distributed lock or a different design like
> datomic. Look at what rich hickey has done with datomic "
> I'll look into that, thanks.
>
> 2015-09-08 12:52 GMT+02:00 :
>
>>
>> There's quite a bit of literature on the topic. Look at what is in
>> acmqueue and you'll see what others are saying is accurate.
>>
>> To guarantee you need a distributed lock or a different design like
>> datomic. Look at what rich hickey has done with datomic
>>
>>
>>
>> Sent from my iPhone
>>
>> On Sep 8, 2015, at 5:54 AM, DuyHai Doan  wrote:
>>
>> "I could imagine that multiple paxos rounds could be played for
>> different partitions and these rounds would be dependent on each other"
>>
>> Example of cluster of 10 nodes (N1 ... N10) and RF=3.
>>
>> Suppose a LWT with 2 partitions and 2 mutations M1 & M2, coordinator C1.
>> It will imply 2 Paxos rounds with the same ballot B1, one round affecting
>> N1, N2, N3 for mutation M1 and the other round affecting N4, N5 and N6 for
>> mutation M2.
>>
>> Suppose that the prepare/promise, read/results phases are successful for
>> both Paxos rounds (see here for different LWT phases
>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>> )
>>
>> The coordinator C1 then sends a propose() message to N1 ... N6 with
>> respective mutations M1 and M2. N1, N2 and N3 will reply accept() but
>> imagine another coordinator C2 has just proposed a higher ballot B2 (B2 >
>> B1) for nodes N4, N5 & N6 only. Those nodes will reply NoAck() (or won't
>> reply) to C1.
>>
>> Then the whole multi-partition LWT will need to be aborted because we
>> cannot proceed with mutation M2. But N1, N2 and N3 already accepted the
>> mutation M1 and it could be committed by any subsequent Paxos round on N1,
>> N2 and N3
>>
>> We certainly don't want that.
>>
>> So how to abort safely mutation M1 on N1, N2 and N3 in this case ? Of
>> course the coordinator C1 could send itself another prepare() message with
>> ballot B3 higher than B1 to abort the accepted value in N1, N2 and N3, but
>> we do not have ANY GUARANTEE that in the meantime, there is another Paxos
>> round impacting N1, N2 and N3 with ballot higher than B1 that will commit
>> the undesirable mutation M1...
>>
>> This simple example shows how hard it is to implement multi-partition
>> Paxos rounds. The fact that you have multiple Paxos rounds that are
>> dependent on each other breaks the safety guarantees of the original Paxos
>> paper.
>>
>>  The only way I can see that will ensure safety in this case is to forbid
>> any Paxos round on N1 ... N6 as long as the current rounds are not finished
>> yet, and this is exactly what a distributed lock does.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 8, 2015 at 8:15 AM, Marek Lewandowski <
>> marekmlewandow...@gmail.com> wrote:
>>
>>> Are you absolutely sure that lock is required? I could imagine that
>>> multiple paxos rounds could 

Re: Some love for multi-partition LWT?

2015-09-08 Thread DuyHai Doan
"Do you think it could work?"

At first glance, maybe, but it would involve a huge number of round trips
and a lot of contention. You'll risk serious deadlocks.

Second, to prove that a solution works, you'll need to prove that it works
for ALL situations, not just a few. Proving something wrong is easy: you
need only one counter-example. Proving that something is true in all cases
is much, much harder. The fact that I don't find any counter-example to
your proposal right now does not mean it's correct and will work. I may be
forgetting some corner cases.



On Tue, Sep 8, 2015 at 1:53 PM, Peter Lin  wrote:

>
> I would caution using paxos for distributed transaction in an
> inappropriate way. The model has to be logically and mathematically
> correct, otherwise you end up with corrupt data. In the worst case, it
> could cause cascading failure that brings down the cluster. I've seen
> distributed systems come to a grinding halt due to disagreement between
> nodes.
>
> As far as I know based on existing literature and first hand experience,
> there's only 2 ways: a central transaction manager or distributed lock.
>
> distributed locks are expensive and can result in cluster wide deadlock.
> Look at the data grid space if you want to see where and when distributed
> locks cause major headaches. Just ask anyone that has tried to use a
> distributed cache as a distributed in-memory database and they'll tell you
> how poorly that scales.
>
> I recommend spending a couple of months reading up on the topic, there's a
> wealth of literature on this. These are very old problems and lots of smart
> people have spent time figuring out what works and doesn't work.
>
> On Tue, Sep 8, 2015 at 7:30 AM, Marek Lewandowski <
> marekmlewandow...@gmail.com> wrote:
>
>> "This simple example shows how hard it is to implement multi-partition
>> Paxos rounds. The fact that you have multiple Paxos rounds that are
>> dependent on each other breaks the safety guarantees of the original Paxos
>> paper. "
>> What if this dependency is explicitly specified in proposal.
>> Assume that each proposal consists of all mutations in this case: {M1,M2}.
>> So N1,N2,N3 receive proposal {M1,M2} and nodes N4,N5,N6 receive proposal
>> {M1,M2}. (Yes, in this case I assume that nodes will get more data than
>> they need, but that's the cost).
>>
>> If C2 wins with higher ballot at nodes N4, N5, N6  after N1,N2,N3
>>  accepted {M1,M2} from C1, then C2 sees in progress proposal {M1, M2} which
>> it has to complete, so it will try nodes N1,N2,N3 and get agreement as well
>> as nodes N4,N5,N6 and eventually commit unless another coordinator jumps in
>> and trumps paxos rounds.
>>
>> Do you think it could work?
>>
>> "To guarantee you need a distributed lock or a different design like
>> datomic. Look at what rich hickey has done with datomic "
>> I'll look into that, thanks.
>>
>> 2015-09-08 12:52 GMT+02:00 :
>>
>>>
>>> There's quite a bit of literature on the topic. Look at what is in
>>> acmqueue and you'll see what others are saying is accurate.
>>>
>>> To guarantee you need a distributed lock or a different design like
>>> datomic. Look at what rich hickey has done with datomic
>>>
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Sep 8, 2015, at 5:54 AM, DuyHai Doan  wrote:
>>>
>>> "I could imagine that multiple paxos rounds could be played for
>>> different partitions and these rounds would be dependent on each other"
>>>
>>> Example of cluster of 10 nodes (N1 ... N10) and RF=3.
>>>
>>> Suppose a LWT with 2 partitions and 2 mutations M1 & M2, coordinator C1.
>>> It will imply 2 Paxos rounds with the same ballot B1, one round affecting
>>> N1, N2, N3 for mutation M1 and the other round affecting N4, N5 and N6 for
>>> mutation M2.
>>>
>>> Suppose that the prepare/promise, read/results phases are successful for
>>> both Paxos rounds (see here for different LWT phases
>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>> )
>>>
>>> The coordinator C1 then sends a propose() message to N1 ... N6 with
>>> respective mutations M1 and M2. N1, N2 and N3 will reply accept() but
>>> imagine another coordinator C2 has just proposed a higher ballot B2 (B2 >
>>> B1) for nodes N4, N5 & N6 only. Those nodes will reply NoAck() (or won't
>>> reply) to C1.
>>>
>>> Then the whole multi-partition LWT will need to be aborted because we
>>> cannot proceed with mutation M2. But N1, N2 and N3 already accepted the
>>> mutation M1 and it could be committed by any subsequent Paxos round on N1,
>>> N2 and N3
>>>
>>> We certainly don't want that.
>>>
>>> So how to abort safely mutation M1 on N1, N2 and N3 in this case ? Of
>>> course the coordinator C1 could send itself another prepare() message with
>>> ballot B3 higher than B1 to abort the accepted value in N1, N2 and N3, but
>>> we do not have ANY GUARANTEE that in the meantime, there is another Paxos
>>> round impacting N1, N2 and N3 with 

Re: Some love for multi-partition LWT?

2015-09-08 Thread DuyHai Doan
"I could imagine that multiple paxos rounds could be played for different
partitions and these rounds would be dependent on each other"

Example of cluster of 10 nodes (N1 ... N10) and RF=3.

Suppose a LWT with 2 partitions and 2 mutations M1 & M2, coordinator C1. It
will imply 2 Paxos rounds with the same ballot B1, one round affecting N1,
N2, N3 for mutation M1 and the other round affecting N4, N5 and N6 for
mutation M2.

Suppose that the prepare/promise, read/results phases are successful for
both Paxos rounds (see here for different LWT phases
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0)

The coordinator C1 then sends a propose() message to N1 ... N6 with
respective mutations M1 and M2. N1, N2 and N3 will reply accept() but
imagine another coordinator C2 has just proposed a higher ballot B2 (B2 >
B1) for nodes N4, N5 & N6 only. Those nodes will reply NoAck() (or won't
reply) to C1.

Then the whole multi-partition LWT will need to be aborted because we
cannot proceed with mutation M2. But N1, N2 and N3 already accepted the
mutation M1 and it could be committed by any subsequent Paxos round on N1,
N2 and N3

We certainly don't want that.

So how do we safely abort mutation M1 on N1, N2 and N3 in this case? Of
course the coordinator C1 could send another prepare() message with a ballot
B3 higher than B1 to abort the accepted value on N1, N2 and N3, but we have
NO GUARANTEE that, in the meantime, another Paxos round impacting N1, N2 and
N3 with a ballot higher than B1 hasn't already committed the undesirable
mutation M1...

This simple example shows how hard it is to implement multi-partition Paxos
rounds. The fact that you have multiple Paxos rounds that are dependent on
each other breaks the safety guarantees of the original Paxos paper.

 The only way I can see that will ensure safety in this case is to forbid
any Paxos round on N1 ... N6 as long as the current rounds are not finished
yet, and this is exactly what a distributed lock does.









On Tue, Sep 8, 2015 at 8:15 AM, Marek Lewandowski <
marekmlewandow...@gmail.com> wrote:

> Are you absolutely sure that lock is required? I could imagine that
> multiple paxos rounds could be played for different partitions and these
> rounds would be dependent on each other.
>
> Performance aside, can you please elaborate where do you see such need for
> lock?
> On 8 Sep 2015 00:05, "DuyHai Doan"  wrote:
>
>> Multi partitions LWT is not supported currently on purpose. To support
>> it, we would have to emulate a distributed lock which is pretty bad for
>> performance.
>>
>> On Mon, Sep 7, 2015 at 10:38 PM, Marek Lewandowski <
>> marekmlewandow...@gmail.com> wrote:
>>
>>> Hello there,
>>>
>>> would you be interested in having multi-partition support for
>>> lightweight transactions, in other words to have the ability to express
>>> something like:
>>>
>>> INSERT INTO … IF NOT EXISTS AND
>>> UPDATE … IF EXISTS AND
>>> UPDATE … IF colX = ‘xyz’
>>>
>>> where each statement refers to a row living potentially on a different set
>>> of nodes.
>>> In other words: imagine a batch with conditions, but one which allows you to
>>> specify multiple statements with conditions for rows in different
>>> partitions.
>>>
>>> Do you find it very useful, moderately useful or you don’t need that
>>> because you have some superb business logic handling of such cases for
>>> example?
>>>
>>> Let me know.
>>> Regards,
>>> Marek
>>
>>
>>


Re: Replacing dead node and cassandra.replace_address

2015-09-08 Thread Maciek Sakrejda
On Tue, Sep 8, 2015 at 11:14 AM, sai krishnam raju potturi <
pskraj...@gmail.com> wrote:

> Once the new node is bootstrapped, you could remove replace_address
> from the env.sh file
>
Thanks, but how do I know when bootstrapping is completed?


Re: Old SSTables lying around

2015-09-08 Thread Vidur Malik
I did add a comment to that ticket, but the problem seems slightly
different; moreover the compaction strategies are different. I'm also not
seeing the error that abliss is reporting.

On Tue, Sep 8, 2015 at 2:47 PM, Robert Coli  wrote:

> On Tue, Sep 8, 2015 at 10:32 AM, Vidur Malik  wrote:
>
>> Bump on this. Anybody have any insight/need more info?
>>
>
> With your report, I have heard 3 similar issues on 2.2.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-10270
>
> is the ticket abliss has opened on this topic. If you think it's the same
> issue, please share your experience there. :)
>
> =Rob
>
>



-- 

Vidur Malik

ShopKeep
800.820.9814



Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Bryan Cheng
Tom, I don't believe so; it seems the symptom would be an indefinite (or
very long) hang.

To clarify, is this issue restricted to LOCAL_QUORUM? Can you issue a
LOCAL_ONE SELECT and retrieve the expected data back?
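Something like the following in cqlsh would show the difference (query shape
taken from the trace you posted; the values are placeholders):

    CONSISTENCY LOCAL_ONE;
    SELECT * FROM aggregate WHERE type = '...' AND typeId = '...';

    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM aggregate WHERE type = '...' AND typeId = '...';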

On Tue, Sep 8, 2015 at 12:02 PM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> Just to be sure: can this bug result in a 0-row result while it should be
> > 0 ?
> Op 8 sep. 2015 6:29 PM schreef "Tyler Hobbs" :
>
> See https://issues.apache.org/jira/browse/CASSANDRA-9753
>>
>> On Tue, Sep 8, 2015 at 10:22 AM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> I've been bugging you a few times, but now I've got trace data for a
>>> query with LOCAL_QUORUM that is being sent to a remote data center.
>>>
>>> The setup is as follows:
>>> NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
>>> Both DC1 and DC2 have 2 nodes.
>>> In DC2, one node is currently being rebuilt, and therefore does not
>>> contain all data (yet).
>>>
>>> The client app connects to a node in DC1, and sends a SELECT query with
>>> CL LOCAL_QUORUM, which in this case means (1/2)+1 = 1.
>>> If all is ok, the query always produces a result, because the requested
>>> rows are guaranteed to be available in DC1.
>>>
>>> However, the query sometimes produces no result. I've been able to
>>> record the traces of these queries, and it turns out that the coordinator
>>> node in DC1 sometimes sends the query to DC2, to the node that is being
>>> rebuilt, and does not have the requested rows. I've included an example
>>> trace below.
>>>
>>> The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
>>> is in DC2.
>>> I've verified that the  CL=LOCAL_QUORUM by printing it when the query is
>>> sent (I'm using the datastax java driver).
>>>
>>>  activity
>>>| source   | source_elapsed | thread
>>>
>>> ---+--++-
>>>Message received from /
>>> 10.55.156.67 |  10.88.4.194 | 48 |
>>> MessagingService-Incoming-/10.55.156.67
>>>  Executing single-partition query on
>>> aggregate |  10.88.4.194 |286 |
>>> SharedPool-Worker-2
>>>   Acquiring sstable
>>> references |  10.88.4.194 |306 |
>>> SharedPool-Worker-2
>>>Merging memtable
>>> tombstones |  10.88.4.194 |321 |
>>> SharedPool-Worker-2
>>> Partition index lookup allows skipping sstable
>>> 107 |  10.88.4.194 |458 |
>>> SharedPool-Worker-2
>>> Bloom filter allows skipping sstable
>>> 1 |  10.88.4.194 |489 | SharedPool-Worker-2
>>>  Skipped 0/2 non-slice-intersecting sstables, included 0 due to
>>> tombstones |  10.88.4.194 |496 |
>>> SharedPool-Worker-2
>>> Merging data from memtables and 0
>>> sstables |  10.88.4.194 |500 |
>>> SharedPool-Worker-2
>>>  Read 0 live and 0 tombstone
>>> cells |  10.88.4.194 |513 |
>>> SharedPool-Worker-2
>>>Enqueuing response to /
>>> 10.55.156.67 |  10.88.4.194 |613 |
>>> SharedPool-Worker-2
>>>   Sending message to /
>>> 10.55.156.67 |  10.88.4.194 |672 |
>>> MessagingService-Outgoing-/10.55.156.67
>>> Parsing SELECT * FROM Aggregate WHERE type=? AND
>>> typeId=?; | 10.55.156.67 | 10 |
>>> SharedPool-Worker-4
>>>Sending message to /
>>> 10.88.4.194 | 10.55.156.67 |   4335 |
>>>  MessagingService-Outgoing-/10.88.4.194
>>> Message received from /
>>> 10.88.4.194 | 10.55.156.67 |   6328 |
>>>  MessagingService-Incoming-/10.88.4.194
>>>Seeking to partition beginning in data
>>> file | 10.55.156.67 |  10417 |
>>> SharedPool-Worker-3
>>>  Key cache hit for sstable
>>> 389 | 10.55.156.67 |  10586 |
>>> SharedPool-Worker-3
>>>
>>> My question is: how is it possible that the query is sent to a node in
>>> DC2?
>>> Since DC1 has 2 nodes and RF 1, the query should always be sent to the
>>> other node in DC1 if the coordinator does not have a replica, right?
>>>
>>> Thanks,
>>> Tom
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>


Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Tom van den Berge
Just to be sure: can this bug result in a 0-row result while it should be >
0 ?
Op 8 sep. 2015 6:29 PM schreef "Tyler Hobbs" :

> See https://issues.apache.org/jira/browse/CASSANDRA-9753
>
> On Tue, Sep 8, 2015 at 10:22 AM, Tom van den Berge <
> tom.vandenbe...@gmail.com> wrote:
>
>> I've been bugging you a few times, but now I've got trace data for a
>> query with LOCAL_QUORUM that is being sent to a remote data center.
>>
>> The setup is as follows:
>> NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
>> Both DC1 and DC2 have 2 nodes.
>> In DC2, one node is currently being rebuilt, and therefore does not
>> contain all data (yet).
>>
>> The client app connects to a node in DC1, and sends a SELECT query with
>> CL LOCAL_QUORUM, which in this case means (1/2)+1 = 1.
>> If all is ok, the query always produces a result, because the requested
>> rows are guaranteed to be available in DC1.
>>
>> However, the query sometimes produces no result. I've been able to record
>> the traces of these queries, and it turns out that the coordinator node in
>> DC1 sometimes sends the query to DC2, to the node that is being rebuilt,
>> and does not have the requested rows. I've included an example trace below.
>>
>> The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
>> is in DC2.
>> I've verified that the  CL=LOCAL_QUORUM by printing it when the query is
>> sent (I'm using the datastax java driver).
>>
>>  activity
>>| source   | source_elapsed | thread
>>
>> ---+--++-
>>Message received from /
>> 10.55.156.67 |  10.88.4.194 | 48 |
>> MessagingService-Incoming-/10.55.156.67
>>  Executing single-partition query on
>> aggregate |  10.88.4.194 |286 |
>> SharedPool-Worker-2
>>   Acquiring sstable
>> references |  10.88.4.194 |306 |
>> SharedPool-Worker-2
>>Merging memtable
>> tombstones |  10.88.4.194 |321 |
>> SharedPool-Worker-2
>> Partition index lookup allows skipping sstable
>> 107 |  10.88.4.194 |458 |
>> SharedPool-Worker-2
>> Bloom filter allows skipping sstable
>> 1 |  10.88.4.194 |489 | SharedPool-Worker-2
>>  Skipped 0/2 non-slice-intersecting sstables, included 0 due to
>> tombstones |  10.88.4.194 |496 |
>> SharedPool-Worker-2
>> Merging data from memtables and 0
>> sstables |  10.88.4.194 |500 |
>> SharedPool-Worker-2
>>  Read 0 live and 0 tombstone
>> cells |  10.88.4.194 |513 |
>> SharedPool-Worker-2
>>Enqueuing response to /
>> 10.55.156.67 |  10.88.4.194 |613 |
>> SharedPool-Worker-2
>>   Sending message to /
>> 10.55.156.67 |  10.88.4.194 |672 |
>> MessagingService-Outgoing-/10.55.156.67
>> Parsing SELECT * FROM Aggregate WHERE type=? AND
>> typeId=?; | 10.55.156.67 | 10 |
>> SharedPool-Worker-4
>>Sending message to /
>> 10.88.4.194 | 10.55.156.67 |   4335 |
>>  MessagingService-Outgoing-/10.88.4.194
>> Message received from /
>> 10.88.4.194 | 10.55.156.67 |   6328 |
>>  MessagingService-Incoming-/10.88.4.194
>>Seeking to partition beginning in data
>> file | 10.55.156.67 |  10417 |
>> SharedPool-Worker-3
>>  Key cache hit for sstable
>> 389 | 10.55.156.67 |  10586 |
>> SharedPool-Worker-3
>>
>> My question is: how is it possible that the query is sent to a node in
>> DC2?
>> Since DC1 has 2 nodes and RF 1, the query should always be sent to the
>> other node in DC1 if the coordinator does not have a replica, right?
>>
>> Thanks,
>> Tom
>>
>>
>>
>>
>>
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread Tom van den Berge
> Running nodetool rebuild on a node that was started with join_ring=false
>> does not work, unfortunately. The nodetool command returns immediately,
>> after a message appears in the log that the streaming of data has started.
>> After that, nothing happens.
>
>
> Per driftx, the author of CASSANDRA-6961, this sounds like a bug. If you
> can repro, please file a JIRA and let the list know the URL.
>

I just filed https://issues.apache.org/jira/browse/CASSANDRA-10287.

(I wasn't convinced that join_ring is supposed to work in conjunction with
nodetool rebuild, since CASSANDRA-6961 only speaks of repair.)


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread Robert Coli
On Tue, Sep 8, 2015 at 1:39 AM, horschi  wrote:

> "The idea of join_ring=false is that other nodes are not aware of the new
> node, and therefore never send requests to it. The new node can then be
> repaired"
> Nicely explained, but I still see the issue that this node would not
> receive writes during that time. So after the repair the node would still
> miss data.
> Again, what is needed is either some joining-state or write-survey that
> allows disabling reads, but still accepts writes.
>

https://issues.apache.org/jira/browse/CASSANDRA-6961
"
We can *almost* set join_ring to false, then repair, and then join the ring
to narrow the window (actually, you can do this and everything succeeds
because the node doesn't know it's a member yet, which is probably a bit of
a bug.) If instead we modified this to put the node in hibernate, like
replace_address does, it could work almost like replace, except you could
run a repair (manually) while in the hibernate state, and then flip to
normal when it's done.
"

Since 2.0.7, you should be able to use join_ring=false + repair to do the
operation this thread discusses.
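A rough sketch of that sequence, using the standard startup property (I
haven't verified this exact sequence end to end):

    # start the node without joining the ring; other nodes won't route requests to it
    cassandra -Dcassandra.join_ring=false

    # repair it while it is invisible to clients
    nodetool repair

    # then let it join the ring
    nodetool join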

Has anyone here tried and found it wanting? If so, in what way?

For the record, I find various statements in this thread confusing and
likely to be wrong :

" And again, your node won't receive any writes while you are rebuilding. "


If your RF has been increased in the new DC, sure you will, you'll get the
writes you're supposed to get because of your RF? The challenge with
rebuild is premature reads from the new DC, not losing writes?

Running nodetool rebuild on a node that was started with join_ring=false
> does not work, unfortunately. The nodetool command returns immediately,
> after a message appears in the log that the streaming of data has started.
> After that, nothing happens.


Per driftx, the author of CASSANDRA-6961, this sounds like a bug. If you
can repro, please file a JIRA and let the list know the URL.

=Rob


Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Nate McCall
>
> Just to be sure: can this bug result in a 0-row result while it should be
> > 0 ?
>
Per Tyler's reference to CASSANDRA-9753
, you would see this
if the read was routed by speculative retry to the nodes that were not yet
finished being built.

Does this work as anticipated when you set speculative_retry to NONE?
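That is a per-table setting; roughly (keyspace name is a placeholder, table
name taken from the quoted trace):

    ALTER TABLE my_ks.aggregate WITH speculative_retry = 'NONE';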




-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Tom van den Berge
Nate,

I've disabled it, and it's been running for about an hour now without
problems, while before, the problem occurred roughly every few minutes. I
guess it's safe to say that this proves that CASSANDRA-9753
 is the cause of the
problem.

I'm very happy to finally know the cause of this problem! Thanks for
pointing me in the right direction.
Tom

On Tue, Sep 8, 2015 at 9:13 PM, Nate McCall  wrote:

> Just to be sure: can this bug result in a 0-row result while it should be
>> > 0 ?
>>
> Per Tyler's reference to CASSANDRA-9753
> , you would see
> this if the read was routed by speculative retry to the nodes that were not
> yet finished being built.
>
> Does this work as anticipated when you set speculative_retry to NONE?
>
>
>
>
> --
> -
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread horschi
Hi Robert,

I tried to set up a new node with join_ring=false once. In my test that
node did not pick a token in the ring. I assume running repair or rebuild
would not do anything in that case: No tokens = no data. But I must admit:
I have not tried running rebuild.

Is a new node with join_ring=false supposed to pick tokens? From driftx's
comment in CASSANDRA-6961 I take it that it should not.

Tom: What does "nodetool status" say after you started the new node
with join_ring=false?
In my test I got a node that was not in the ring at all.

kind regards,
Christian



On Tue, Sep 8, 2015 at 9:05 PM, Robert Coli  wrote:

>
>
> On Tue, Sep 8, 2015 at 1:39 AM, horschi  wrote:
>
>> "The idea of join_ring=false is that other nodes are not aware of the
>> new node, and therefore never send requests to it. The new node can then be
>> repaired"
>> Nicely explained, but I still see the issue that this node would not
>> receive writes during that time. So after the repair the node would still
>> miss data.
>> Again, what is needed is either some joining-state or write-survey that
>> allows disabling reads, but still accepts writes.
>>
>
> https://issues.apache.org/jira/browse/CASSANDRA-6961
> "
> We can *almost* set join_ring to false, then repair, and then join the
> ring to narrow the window (actually, you can do this and everything
> succeeds because the node doesn't know it's a member yet, which is probably
> a bit of a bug.) If instead we modified this to put the node in hibernate,
> like replace_address does, it could work almost like replace, except you
> could run a repair (manually) while in the hibernate state, and then flip
> to normal when it's done.
> "
>
> Since 2.0.7, you should be able to use join_ring=false + repair to do the
> operation this thread discusses.
>
> Has anyone here tried and found it wanting? If so, in what way?
>
> For the record, I find various statements in this thread confusing and
> likely to be wrong :
>
> " And again, your node won't receive any writes while you are rebuilding.
>>  "
>
>
> If your RF has been increased in the new DC, sure you will, you'll get the
> writes you're supposed to get because of your RF? The challenge with
> rebuild is premature reads from the new DC, not losing writes?
>
> Running nodetool rebuild on a node that was started with join_ring=false
>> does not work, unfortunately. The nodetool command returns immediately,
>> after a message appears in the log that the streaming of data has started.
>> After that, nothing happens.
>
>
> Per driftx, the author of CASSANDRA-6961, this sounds like a bug. If you
> can repro, please file a JIRA and let the list know the URL.
>
> =Rob
>
>
>


Re: who does generate timestamp during the write?

2015-09-08 Thread ibrahim El-sanosi
Yes, thank you a lot

On Tue, Sep 8, 2015 at 5:25 PM, Tyler Hobbs  wrote:

>
> On Sat, Sep 5, 2015 at 8:32 AM, ibrahim El-sanosi <
> ibrahimsaba...@gmail.com> wrote:
>
>> So in this scenario, the latest data that wrote to the replicas is [K1,
>> V2] which should be the correct one, but it reads [K1,V1] because of divert
>> clock.
>>
>> Can such scenario occur?
>>
>
> Yes, it most certainly can.  There are a couple of pieces of advice for
> this.  First, run NTP on all of your servers.  Second, if clock drift of a
> second or so would cause problems for your data model (like your example),
> change your data model.  Usually this means creating separate rows for each
> version of the value (by adding a timeuuid to the primary key, for example),
> but in some cases lightweight transactions may also be suitable.
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread Robert Coli
On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:

> I tried to set up a new node with join_ring=false once. In my test that
> node did not pick a token in the ring. I assume running repair or rebuild
> would not do anything in that case: No tokens = no data. But I must admit:
> I have not tried running rebuild.
>

I admit I haven't been following this thread closely, perhaps I have missed
what exactly it is you're trying to do.

It's possible you'd need to :

1) join the node with auto_bootstrap=false
2) immediately stop it
3) re-start it with join_ring=false

To actually use repair or rebuild in this way.

However, if your goal is to create a new data-center and rebuild a node
there without any risk of reading from that node while creating the new
data center, you can just :

1) create nodes in new data-center, with RF=0 for that DC
2) change RF in that DC
3) run rebuild on new data-center nodes (see the command sketch after this list)
4) while doing so, don't talk to new data-center coordinators from your
client
5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center reads
from your client
6) modulo the handful of current bugs which make 5) currently imperfect

What problem are you encountering with this procedure? If it's this ...

I've learned from experience that the node immediately joins the cluster,
> and starts accepting reads (from other DCs) for the range it owns.


This seems to be the incorrect assumption at the heart of the confusion.
You "should" be able to prevent this behavior entirely via correct use of
ConsistencyLevel and client configuration.

In an ideal world, I'd write a detailed blog post explaining this... :/ in
my copious spare time...

=Rob


Question about consistency

2015-09-08 Thread Eric Plowe
I'm using Cassandra as a storage mechanism for session state persistence
for an ASP.NET web application. I am seeing issues where the session state
is persisted on one page (setting a value: Session["key"] = "value"), and
when it redirects to another (from a postback event) and checks for the
existence of the value that was set, it doesn't exist.

It's a 12-node cluster with 2 data centers (6 and 6) running 2.1.9. The
keyspace that the column family lives in has an RF of 3 for each data center.
The session state provider is using the DataStax C# driver v2.1.6.
Writes and reads are at LOCAL_QUORUM.

The cluster and web servers have their time synced and we've ruled out
clock drift issues.

The issue doesn't happen all the time, maybe two to three times a day.

Any insight as to what to look at next? Thanks!

~Eric Plowe


Re: Question about consistency

2015-09-08 Thread Robert Coli
On Tue, Sep 8, 2015 at 4:40 PM, Eric Plowe  wrote:

> I'm using Cassandra as a storage mechanism for session state persistence
> for an ASP.NET web application. I am seeing issues where the session
> state is persisted on one page (setting a value: Session["key"] =
> "value" and when it redirects to another (from a post back event) and check
> for the existence of the value that was set, it doesn't exist.
>
> It's a 12 node cluster with 2 data centers (6 and 6) running 2.1.9. The
> key space that the column family lives has a RF of 3 for each data
> center. The session state provider is using the the datastax csharp driver
> v2.1.6. Writes and reads are at LOCAL_QUORUM.
>

1) Write to DC_A with LOCAL_QUORUM
2) Replication to DC_B takes longer than it takes to...
3) Read from DC_B with LOCAL_QUORUM, do not see the write from 1)

If you want to be able to read your writes from DC_A in DC_B, you're going
to need to use EACH_QUORUM.
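In cqlsh terms, the difference looks roughly like this (the sessions table
is made up):

    -- the write must be acknowledged by a quorum in every DC
    CONSISTENCY EACH_QUORUM;
    INSERT INTO sessions (session_id, data) VALUES ('abc', 'value');

    -- a later read then only needs a quorum in the local DC to see it
    CONSISTENCY LOCAL_QUORUM;
    SELECT data FROM sessions WHERE session_id = 'abc';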

=Rob


Re: Replacing dead node and cassandra.replace_address

2015-09-08 Thread Vasileios Vlachos
I think you should be able to see the streaming progress by running nodetool
netstats. I also think system.log displays similar information about
streaming/when streaming is finished. Shouldn't the state of the node change
to UP when bootstrap is completed as well?

People, correct me if I'm wrong here...
On 8 Sep 2015 20:56, "Maciek Sakrejda"  wrote:

> On Tue, Sep 8, 2015 at 11:14 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> Once the new node is bootstrapped, you could remove replacement_address
>> from the env.sh file
>>
> Thanks, but how do I know when bootstrapping is completed?
>


Re: Question about consistency

2015-09-08 Thread Eric Plowe
Rob,

All writes/reads are happening from DC1. DC2 is a backup. The web app does
not handle live requests from DC2.

Regards,

Eric Plowe

On Tuesday, September 8, 2015, Robert Coli  wrote:

> On Tue, Sep 8, 2015 at 4:40 PM, Eric Plowe  > wrote:
>
>> I'm using Cassandra as a storage mechanism for session state persistence
>> for an ASP.NET web application. I am seeing issues where the session
>> state is persisted on one page (setting a value: Session["key"] =
>> "value" and when it redirects to another (from a post back event) and check
>> for the existence of the value that was set, it doesn't exist.
>>
>> It's a 12 node cluster with 2 data centers (6 and 6) running 2.1.9. The
>> key space that the column family lives has a RF of 3 for each data
>> center. The session state provider is using the the datastax csharp driver
>> v2.1.6. Writes and reads are at LOCAL_QUORUM.
>>
>
> 1) Write to DC_A with LOCAL_QUORUM
> 2) Replication to DC_B takes longer than it takes to...
> 3) Read from DC_B with LOCAL_QUORUM, do not see the write from 1)
>
> If you want to be able to read your writes from DC_A in DC_B, you're going
> to need to use EACH_QUORUM.
>
> =Rob
>
>


Re: Question about consistency

2015-09-08 Thread Eric Plowe
To further expand: we have two data centers, Miami and Dallas. Dallas is
our disaster recovery data center. The cluster has 12 nodes, 6 in Miami and
6 in Dallas. The servers in Miami only read/write to Miami, using the
data-center-aware load balancing policy of the driver. We see the problem
when writing to and reading from the Miami data center with LOCAL_QUORUM.

Regards,

Eric

On Tuesday, September 8, 2015, Eric Plowe  wrote:

> Rob,
>
> All writes/reads are happening from DC1. DC2 is a backup. The web app does
> not handle live requests from DC2.
>
> Regards,
>
> Eric Plowe
>
> On Tuesday, September 8, 2015, Robert Coli  > wrote:
>
>> On Tue, Sep 8, 2015 at 4:40 PM, Eric Plowe  wrote:
>>
>>> I'm using Cassandra as a storage mechanism for session state persistence
>>> for an ASP.NET web application. I am seeing issues where the session
>>> state is persisted on one page (setting a value: Session["key"] =
>>> "value" and when it redirects to another (from a post back event) and check
>>> for the existence of the value that was set, it doesn't exist.
>>>
>>> It's a 12 node cluster with 2 data centers (6 and 6) running 2.1.9. The
>>> key space that the column family lives has a RF of 3 for each data
>>> center. The session state provider is using the the datastax csharp driver
>>> v2.1.6. Writes and reads are at LOCAL_QUORUM.
>>>
>>
>> 1) Write to DC_A with LOCAL_QUORUM
>> 2) Replication to DC_B takes longer than it takes to...
>> 3) Read from DC_B with LOCAL_QUORUM, do not see the write from 1)
>>
>> If you want to be able to read your writes from DC_A in DC_B, you're
>> going to need to use EACH_QUORUM.
>>
>> =Rob
>>
>>


Re: who does generate timestamp during the write?

2015-09-08 Thread Tyler Hobbs
On Sat, Sep 5, 2015 at 8:32 AM, ibrahim El-sanosi 
wrote:

> So in this scenario, the latest data that wrote to the replicas is [K1,
> V2], which should be the correct one, but it reads [K1,V1] because of clock
> drift.
>
> Can such scenario occur?
>

Yes, it most certainly can.  There are a couple of pieces of advice for
this.  First, run NTP on all of your servers.  Second, if clock drift of a
second or so would cause problems for your data model (like your example),
change your data model.  Usually this means creating separate rows for each
version of the value (by adding a timeuuid to the primary key, for example),
but in some cases lightweight transactions may also be suitable.
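For example, a versioned layout along these lines (just a sketch; the
keyspace, table and column names are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class VersionedValueSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");   // placeholder keyspace

        // One row per version; the newest version clusters first, so picking
        // "the latest value" no longer depends on write timestamps at all.
        session.execute("CREATE TABLE IF NOT EXISTS versioned_values ("
                + "k text, written timeuuid, v text, "
                + "PRIMARY KEY (k, written)) "
                + "WITH CLUSTERING ORDER BY (written DESC)");

        // Each writer inserts a new version...
        session.execute("INSERT INTO versioned_values (k, written, v) VALUES ('K1', now(), 'V2')");
        // ...and readers take the first (newest) row for the key.
        session.execute("SELECT v FROM versioned_values WHERE k = 'K1' LIMIT 1");

        cluster.close();
    }
}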


-- 
Tyler Hobbs
DataStax 


Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Tyler Hobbs
See https://issues.apache.org/jira/browse/CASSANDRA-9753

On Tue, Sep 8, 2015 at 10:22 AM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> I've been bugging you a few times, but now I've got trace data for a query
> with LOCAL_QUORUM that is being sent to a remote data center.
>
> The setup is as follows:
> NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
> Both DC1 and DC2 have 2 nodes.
> In DC2, one node is currently being rebuilt, and therefore does not
> contain all data (yet).
>
> The client app connects to a node in DC1, and sends a SELECT query with CL
> LOCAL_QUORUM, which in this case means (1/2)+1 = 1.
> If all is ok, the query always produces a result, because the requested
> rows are guaranteed to be available in DC1.
>
> However, the query sometimes produces no result. I've been able to record
> the traces of these queries, and it turns out that the coordinator node in
> DC1 sometimes sends the query to DC2, to the node that is being rebuilt,
> and does not have the requested rows. I've included an example trace below.
>
> The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
> is in DC2.
> I've verified that the CL is LOCAL_QUORUM by printing it when the query is
> sent (I'm using the datastax java driver).
>
> activity | source | source_elapsed | thread
> ---------+--------+----------------+-------
> Message received from /10.55.156.67 | 10.88.4.194 | 48 | MessagingService-Incoming-/10.55.156.67
> Executing single-partition query on aggregate | 10.88.4.194 | 286 | SharedPool-Worker-2
> Acquiring sstable references | 10.88.4.194 | 306 | SharedPool-Worker-2
> Merging memtable tombstones | 10.88.4.194 | 321 | SharedPool-Worker-2
> Partition index lookup allows skipping sstable 107 | 10.88.4.194 | 458 | SharedPool-Worker-2
> Bloom filter allows skipping sstable 1 | 10.88.4.194 | 489 | SharedPool-Worker-2
> Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 10.88.4.194 | 496 | SharedPool-Worker-2
> Merging data from memtables and 0 sstables | 10.88.4.194 | 500 | SharedPool-Worker-2
> Read 0 live and 0 tombstone cells | 10.88.4.194 | 513 | SharedPool-Worker-2
> Enqueuing response to /10.55.156.67 | 10.88.4.194 | 613 | SharedPool-Worker-2
> Sending message to /10.55.156.67 | 10.88.4.194 | 672 | MessagingService-Outgoing-/10.55.156.67
> Parsing SELECT * FROM Aggregate WHERE type=? AND typeId=?; | 10.55.156.67 | 10 | SharedPool-Worker-4
> Sending message to /10.88.4.194 | 10.55.156.67 | 4335 | MessagingService-Outgoing-/10.88.4.194
> Message received from /10.88.4.194 | 10.55.156.67 | 6328 | MessagingService-Incoming-/10.88.4.194
> Seeking to partition beginning in data file | 10.55.156.67 | 10417 | SharedPool-Worker-3
> Key cache hit for sstable 389 | 10.55.156.67 | 10586 | SharedPool-Worker-3
>
> My question is: how is it possible that the query is sent to a node in
> DC2?
> Since DC1 has 2 nodes and RF 1, the query should always be sent to the
> other node in DC1 if the coordinator does not have a replica, right?
>
> Thanks,
> Tom
>
>
>
>
>


-- 
Tyler Hobbs
DataStax 


Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Tom van den Berge
I've been bugging you a few times, but now I've got trace data for a query
with LOCAL_QUORUM that is being sent to a remote data center.

The setup is as follows:
NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
Both DC1 and DC2 have 2 nodes.
In DC2, one node is currently being rebuilt, and therefore does not contain
all data (yet).

The client app connects to a node in DC1, and sends a SELECT query with CL
LOCAL_QUORUM, which in this case means (1/2)+1 = 1.
If all is ok, the query always produces a result, because the requested
rows are guaranteed to be available in DC1.

However, the query sometimes produces no result. I've been able to record
the traces of these queries, and it turns out that the coordinator node in
DC1 sometimes sends the query to DC2, to the node that is being rebuilt,
and does not have the requested rows. I've included an example trace below.

The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
is in DC2.
I've verified that the CL is LOCAL_QUORUM by printing it when the query is
sent (I'm using the datastax java driver).

activity | source | source_elapsed | thread
---------+--------+----------------+-------
Message received from /10.55.156.67 | 10.88.4.194 | 48 | MessagingService-Incoming-/10.55.156.67
Executing single-partition query on aggregate | 10.88.4.194 | 286 | SharedPool-Worker-2
Acquiring sstable references | 10.88.4.194 | 306 | SharedPool-Worker-2
Merging memtable tombstones | 10.88.4.194 | 321 | SharedPool-Worker-2
Partition index lookup allows skipping sstable 107 | 10.88.4.194 | 458 | SharedPool-Worker-2
Bloom filter allows skipping sstable 1 | 10.88.4.194 | 489 | SharedPool-Worker-2
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 10.88.4.194 | 496 | SharedPool-Worker-2
Merging data from memtables and 0 sstables | 10.88.4.194 | 500 | SharedPool-Worker-2
Read 0 live and 0 tombstone cells | 10.88.4.194 | 513 | SharedPool-Worker-2
Enqueuing response to /10.55.156.67 | 10.88.4.194 | 613 | SharedPool-Worker-2
Sending message to /10.55.156.67 | 10.88.4.194 | 672 | MessagingService-Outgoing-/10.55.156.67
Parsing SELECT * FROM Aggregate WHERE type=? AND typeId=?; | 10.55.156.67 | 10 | SharedPool-Worker-4
Sending message to /10.88.4.194 | 10.55.156.67 | 4335 | MessagingService-Outgoing-/10.88.4.194
Message received from /10.88.4.194 | 10.55.156.67 | 6328 | MessagingService-Incoming-/10.88.4.194
Seeking to partition beginning in data file | 10.55.156.67 | 10417 | SharedPool-Worker-3
Key cache hit for sstable 389 | 10.55.156.67 | 10586 | SharedPool-Worker-3

My question is: how is it possible that the query is sent to a node in DC2?
Since DC1 has 2 nodes and RF 1, the query should always be sent to the
other node in DC1 if the coordinator does not have a replica, right?
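For what it's worth, a trace like the one above can be captured from the java
driver roughly as sketched below (the statement and consistency level mirror
the query in the trace; the helper name and anything else is made up):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ExecutionInfo;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TracedRead {
    static void tracedRead(Session session, String type, String typeId) {
        SimpleStatement stmt = new SimpleStatement(
                "SELECT * FROM Aggregate WHERE type = ? AND typeId = ?", type, typeId);
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        stmt.enableTracing();

        ResultSet rs = session.execute(stmt);
        ExecutionInfo info = rs.getExecutionInfo();
        QueryTrace trace = info.getQueryTrace();   // lazily fetched from system_traces
        System.out.println("coordinator: " + trace.getCoordinator());
        for (QueryTrace.Event e : trace.getEvents()) {
            System.out.println(e.getSource() + "  " + e.getDescription());
        }
    }
}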

Thanks,
Tom


Re: Old SSTables lying around

2015-09-08 Thread Vidur Malik
Bump on this. Anybody have any insight/need more info?

On Fri, Sep 4, 2015 at 5:09 PM, Vidur Malik  wrote:

> Hey,
>
> We're running a Cassandra 2.2.0 cluster with 8 nodes. We are doing
> frequent updates to our data and we have very few reads, and we are using
> Leveled Compaction with a sstable_size_in_mb of 160MB. We don't have that
> much data currently since we're just testing the cluster.
> We are seeing the SSTable count linearly increase even though `nodetool
> compactionhistory` shows that compaction have definitely ran. When I ran
> nodetool cfstats, I get the following output:
>
> Table: tender_summaries
>
> SSTable count: 56
>
> SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
>
> Does it make sense that there is such a huge difference between the number
> of SStables in each level and the total count of SStables? It seems like
> old SSTables are lying around and never cleaned-up/compacted. Does this
> theory sound plausible? If so, what could be the problem?
>
> Thanks!
>



-- 

Vidur Malik

ShopKeep
800.820.9814



Replacing dead node and cassandra.replace_address

2015-09-08 Thread Maciek Sakrejda
According to the docs [1], when replacing a Cassandra node, I should start
the replacement with cassandra.replace_address specified. Does that just
become part of the replacement node's startup configuration? Can I (or do I
have to) stop specifying it at some point? Does this affect subsequent node
restarts (whether intentional or due to a crash)?

I'm running Cassandra 2.1.

Thanks,
Maciek

[1]:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html