Re: cassandra 1.0.10 : Bootstrapping 7 node cluster to 14 nodes

2012-11-02 Thread Manu Zhang
My guess is that 108 has become a new replica for the data streamed from 103,
104 and 107, which is decided by your per-keyspace replica placement strategy.
When we bootstrap, we do not simply stream data from 102 to 108. Rather, we
calculate all the ranges that 108 will be responsible for, and stream each of
those ranges from a node that currently holds it. So look at it from the
perspective of the data rather than of the nodes.
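
To make the range arithmetic concrete, here is a rough sketch (not Cassandra's
actual code) of which existing nodes are replicas for the ranges 108 picks up.
It assumes SimpleStrategy with RF=3, which is an assumption; your keyspace's
real replication settings may differ. Cassandra then picks one streaming
source per range from each replica set.

# Rough sketch, assuming SimpleStrategy and RF=3 -- substitute your
# keyspace's real replication factor.
RF = 3

old_ring = %w[101 102 103 104 105 106 107]       # nodes in clockwise token order
new_ring = %w[101 108 102 103 104 105 106 107]   # 108 inserted between 101 and 102

# SimpleStrategy: the replicas of the range ending at ring[i] are that node
# plus the next RF-1 nodes walking clockwise around the ring.
def replicas(ring, i, rf)
  (0...rf).map { |k| ring[(i + k) % ring.size] }
end

pos = new_ring.index('108')

# 108 becomes a replica for the RF ranges ending at its own token and at the
# RF-1 tokens preceding it; each such range already lives on existing nodes.
(0...RF).each do |back|
  i           = (pos - back) % new_ring.size
  range_start = new_ring[(i - 1) % new_ring.size]
  range_end   = new_ring[i]
  # the node owning this range's end token on the old ring
  # (108's own slice still belongs to the next old node clockwise, i.e. 102)
  owner = old_ring.include?(range_end) ? range_end : new_ring[(i + 1) % new_ring.size]
  puts "(#{range_start}, #{range_end}] currently replicated on: " \
       "#{replicas(old_ring, old_ring.index(owner), RF).join(', ')}"
end

With those assumptions this prints replica sets drawn from 102, 103, 104, 101
and 107, which is why netstats can legitimately show 103, 104 and 107
streaming even though 102 alone holds everything 108 needs.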


On Fri, Nov 2, 2012 at 12:41 AM, Brennan Saeta sa...@coursera.org wrote:

 The other nodes all have copies of the same data. To optimize performance,
 all of them stream different parts of the data, even though 102 has all the
 data that 108 needs. (I think. I'm not an expert.) -Brennan


 On Thu, Nov 1, 2012 at 9:31 AM, Ramesh Natarajan rames...@gmail.com wrote:

 I am trying to bootstrap a Cassandra 1.0.10 cluster from 7 nodes to 14 nodes.

 My seed nodes are 101, 102, 103 and 104.

 Here is my initial ring

 Address         DC          Rack   Status  State    Load      Owns    Token
                                                                       145835300108973627198589117470757804908
 192.168.1.101   datacenter1 rack1  Up      Normal   8.16 GB   14.29%  0
 192.168.1.102   datacenter1 rack1  Up      Normal   8.68 GB   14.29%  24305883351495604533098186245126300818
 192.168.1.103   datacenter1 rack1  Up      Normal   8.45 GB   14.29%  48611766702991209066196372490252601636
 192.168.1.104   datacenter1 rack1  Up      Normal   8.16 GB   14.29%  72917650054486813599294558735378902454
 192.168.1.105   datacenter1 rack1  Up      Normal   8.33 GB   14.29%  97223533405982418132392744980505203272
 192.168.1.106   datacenter1 rack1  Up      Normal   8.71 GB   14.29%  121529416757478022665490931225631504090
 192.168.1.107   datacenter1 rack1  Up      Normal   8.41 GB   14.29%  145835300108973627198589117470757804908

 I added a new node, 108, with an initial_token between 101 and 102. After I
 start bootstrapping, I see the node placed in the correct place in the ring:

 Address         DC          Rack   Status  State    Load       Owns    Token
                                                                        145835300108973627198589117470757804908
 192.168.1.101   datacenter1 rack1  Up      Normal   8.16 GB    14.29%  0
 192.168.1.108   datacenter1 rack1  Up      Joining  114.61 KB  7.14%   12152941675747802266549093122563150409
 192.168.1.102   datacenter1 rack1  Up      Normal   8.68 GB    7.14%   24305883351495604533098186245126300818
 192.168.1.103   datacenter1 rack1  Up      Normal   8.4 GB     14.29%  48611766702991209066196372490252601636
 192.168.1.104   datacenter1 rack1  Up      Normal   8.15 GB    14.29%  72917650054486813599294558735378902454
 192.168.1.105   datacenter1 rack1  Up      Normal   8.33 GB    14.29%  97223533405982418132392744980505203272
 192.168.1.106   datacenter1 rack1  Up      Normal   8.71 GB    14.29%  121529416757478022665490931225631504090
 192.168.1.107   datacenter1 rack1  Up      Normal   8.41 GB    14.29%  145835300108973627198589117470757804908

 What puzzles me is that when I look at netstats I see nodes 107, 104 and
 103 streaming data to 108. Can someone explain why this happens? I was
 under the impression that only node 102 needed to split its range and
 send data to 108. Am I missing something?


 Streaming from: /192.168.1.107
 Streaming from: /192.168.1.104
 Streaming from: /192.168.1.103


 Thanks
 Ramesh









Re: repair, compaction, and tombstone rows

2012-11-02 Thread horschi
Hi Sylvain,

might I ask why repair cannot simply ignore anything that is older than
gc_grace, as Aaron proposed? I agree that repair should not collect any
tombstones itself. But to my mind it sounds reasonable to make repair
ignore timed-out data: because the timestamp is created on the client,
there is no reason to repair such data, right?

We are using TTLs quite heavily and I noticed that every repair
increases the load on all nodes by 1-2 GB, where each node holds about
20-30 GB of data. I don't know whether this grows with the data volume; the
data is mostly time-series data.
I even noticed an increase when running two repairs directly after each
other. So even when the data was just repaired, there is still data being
transferred. I assume this is due to some columns timing out within that
timeframe and the entire row being repaired.

regards,
Christian

On Thu, Nov 1, 2012 at 9:43 AM, Sylvain Lebresne sylv...@datastax.com wrote:

  Is this a feature or a bug?

 Neither, really. Repair doesn't do any gcable tombstone collection, and
 it would be really hard to change that (besides, it's not its job). So
 if, when you run repair, there are sstables with tombstones that could
 be collected but have not been yet, then yes, they will be streamed. Now the
 theory is that compaction will run often enough that gcable tombstones
 will be collected in a reasonably timely fashion, so in general you will
 never have lots of such tombstones (making the fact that repair
 streams them largely irrelevant). That being said, in practice, I don't
 doubt that there are a few scenarios like yours where this can still
 lead to doing too much useless work.

 I believe the main problem is that size-tiered compaction has a
 tendency to not compact the largest sstables very often, meaning that
 you could have large sstables with mostly gcable tombstones sitting
 around. In the upcoming Cassandra 1.2,
 https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
 Until then, if you are not afraid of a little bit of scripting, one
 option could be, before running a repair, to run a small script that
 checks the creation time of your sstables. If an sstable is old
 enough (for some value of "old enough" that depends on the TTL you use
 on all your columns), you may want to force a compaction of that sstable
 (using the JMX call forceUserDefinedCompaction()). The goal is
 to get rid of as many outdated tombstones as possible before running the
 repair (you could also run a major compaction prior to
 the repair, but major compactions have a lot of nasty effects, so I
 wouldn't recommend that a priori).
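
 A minimal sketch of that pre-repair check in Ruby. DATA_DIR and MAX_AGE are
 assumptions to adjust for your setup; the printed sstables are candidates
 you would then pass to the CompactionManager's forceUserDefinedCompaction()
 operation through whatever JMX client you use.

 # Minimal, hedged sketch of the pre-repair check described above.
 # DATA_DIR and MAX_AGE are assumptions -- point DATA_DIR at the keyspace's
 # data directory and derive MAX_AGE from your TTLs plus gc_grace_seconds.
 DATA_DIR = '/var/lib/cassandra/data/mykeyspace'   # assumed path
 MAX_AGE  = 10 * 24 * 3600                         # seconds; e.g. TTL + gc_grace

 Dir.glob(File.join(DATA_DIR, '*-Data.db')).each do |sstable|
   age = Time.now - File.mtime(sstable)
   # Old enough that, given the TTLs, it likely holds mostly expired data and
   # tombstones: report it as a candidate for forceUserDefinedCompaction().
   puts "compaction candidate: #{File.basename(sstable)}" if age > MAX_AGE
 end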

 --
 Sylvain



Re: Multiple counters value after restart

2012-11-02 Thread Alain RODRIGUEZ
I ran the same cql query against my 3 nodes (after adding the third and
repairing each of them):

On the new node:

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';

 20121029#myevent
------------------
             4983

On the 2 others (old nodes):

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';

 20121029#myevent
------------------
             4254

And the value read at CL.QUORUM is 4943, which is the correct value.

How is it possible that QUORUM reads 4943 with only 1 node out of 3
answering that count?
How could a new node get a value that none of the other existing nodes has?
Is there a way to fix the data (isn't repair supposed to do that)?

Alain



2012/11/1 Alain RODRIGUEZ arodr...@gmail.com

 Can you try it though, or run a repair?

 Repairing didn't help.

 My first thought is to use QUORUM

 This fixes the problem. However, my data is probably still inconsistent,
 even if I now always read the same value. The point is that with CL.QUORUM
 I can't handle a crash; I can't even restart a node...

 I will add a third server.
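
 (For reference, a quick sketch of the quorum arithmetic that makes a third
 server necessary. This is just the standard quorum = floor(RF/2) + 1 formula,
 nothing specific to this cluster:)

 # quorum = (RF / 2) + 1 live replicas are required for a QUORUM read or write
 [2, 3].each do |rf|
   quorum = rf / 2 + 1
   puts "RF=#{rf}: QUORUM needs #{quorum} replicas, tolerates #{rf - quorum} down"
 end
 # RF=2: QUORUM needs 2 replicas, tolerates 0 down  (a single restart blocks QUORUM)
 # RF=3: QUORUM needs 2 replicas, tolerates 1 down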

   But isn't Cassandra supposed to handle a server crash? When a server
 crashes I guess it doesn't drain first...

 I was asking to understand how you did the upgrade.

 Ok. On my side I am just concerned about the possibility of using counters
 with CL.ONE while correctly handling a crash or restart without a drain.

 Alain



 2012/11/1 aaron morton aa...@thelastpickle.com

 What CL are you using ?

 I think this may be what causes the issue. I'm writing and reading at CL
 ONE. I didn't drain before stopping Cassandra, and this may have produced a
 failure in the current counters (those that were being written when I stopped
 a server).

 My first thought is to use QUORUM. But with only two nodes it's hard to
 get strong consistency using QUORUM.
 Can you try it though, or run a repair?

 But isn't Cassandra supposed to handle a server crash? When a server
 crashes I guess it doesn't drain first...

 I was asking to understand how you did the upgrade.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 1/11/2012, at 11:39 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 What version of cassandra are you using ?

 1.1.2

 Can you explain this further?

 I had an unexplained amount of reads (up to 1800 r/s and 90 MB/s) on one
 server, while the other was doing about 200 r/s and 5 MB/s max. I fixed it by
 rebooting the server. This server is dedicated to Cassandra. I can't tell
 you more about it because I don't understand it... but a simple Cassandra
 restart wasn't enough.

 Was something writing to the cluster ?

 Yes, we have some activity and perform about 600 w/s.

 Did you drain for the upgrade ?

 We upgraded a long time ago, and to 1.1.2. That warning is about
 version 1.1.6.

 What changes did you make ?

 In cassandra.yaml I just changed the compaction_throughput_mb_per_sec
 property to slow down my compactions a bit. I don't think the problem comes
 from there.

 Are you saying that a particular counter column is giving different
 values for different reads ?

 Yes, this is exactly what I was saying. Sorry if something is wrong with
 my English, it's not my mother tongue.

 What CL are you using ?

 I think this may be what causes the issue. I'm writing and reading at CL
 ONE. I didn't drain before stopping Cassandra, and this may have produced a
 failure in the current counters (those that were being written when I stopped
 a server).

 But isn't Cassandra supposed to handle a server crash? When a server
 crashes I guess it doesn't drain first...

 Thank you for your time Aaron, once again.

 Alain



 2012/10/31 aaron morton aa...@thelastpickle.com

 What version of cassandra are you using ?

  I finally restarted Cassandra. It didn't solve the problem, so I stopped
 Cassandra again on that node and restarted my EC2 server. This solved the
 issue (1800 r/s down to 100 r/s).

 Can you explain this further?
 Was something writing to the cluster ?
 Did you drain for the upgrade ?
 https://github.com/apache/cassandra/blob/cassandra-1.1/NEWS.txt#L17

 Today I changed my cassandra.yaml and restarted this same server to apply
 my config.

 What changes did you make ?

 I just noticed that my homepage (which uses a Cassandra counter and
 refreshes every second) shows me 4 different values: 2 of them repeatedly
 (5000 and 4000) and the other 2 more rarely (5500 and 3800).

 Are you saying that a particular counter column is giving different
 values for different reads ?
 What CL are you using ?

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 31/10/2012, at 3:39 AM, Jason Wee peich...@gmail.com wrote:

Maybe enable debug logging in log4j-server.properties and go through the
log to see what actually happens?

On Tue, Oct 30, 2012 at 7:31 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi,

 I have an issue with counters, 

Re: repair, compaction, and tombstone rows

2012-11-02 Thread Rob Coli
On Fri, Nov 2, 2012 at 2:46 AM, horschi hors...@gmail.com wrote:
 might I ask why repair cannot simply ignore anything that is older than
 gc_grace, as Aaron proposed? I agree that repair should not collect any
 tombstones itself. But to my mind it sounds reasonable to make repair
 ignore timed-out data: because the timestamp is created on the client, there
 is no reason to repair such data, right?

IIRC, tombstone timestamps are written by the server, at compaction
time. Therefore if you have RF=X, you have X different timestamps
relative to GCGraceSeconds. I believe there was another thread about
two weeks ago in which Sylvain detailed the problems with what you are
proposing, when someone else asked approximately the same question.

 I even noticed an increase when running two repairs directly after each
 other. So even when the data was just repaired, there is still data being
 transferred. I assume this is due to some columns timing out within that
 timeframe and the entire row being repaired.

Merkle trees are an optimization; what they trade for this
optimization is over-repair.

(FWIW, I agree that, if possible, this particular case of over-repair
would be nice to eliminate.)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: repair, compaction, and tombstone rows

2012-11-02 Thread horschi
 IIRC, tombstone timestamps are written by the server, at compaction
 time. Therefore if you have RF=X, you have X different timestamps
 relative to GCGraceSeconds. I believe there was another thread about
 two weeks ago in which Sylvain detailed the problems with what you are
 proposing, when someone else asked approximately the same question.

Oh yes, I forgot about the thread. I assume you are talking about:
http://grokbase.com/t/cassandra/user/12ab6pbs5n/unnecessary-tombstones-transmission-during-repair-process

I think there are multiple issues here that relate to each other:

1) Repair uses the local timestamp of DeletedColumns for Merkle tree
calculation. This is what the other thread was about.
Alexey claims that this was fixed by some other commit:
https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch
But honestly, I don't see how that solves it. I do understand how Alexey's
patch from a few messages earlier would solve it (by overriding the
updateDigest method in DeletedColumn).

2) ExpiringColumns should not be used for Merkle tree calculation if they
have timed out.
I checked LazilyCompactedRow and saw that it does not exclude any timed-out
columns: it loops over all columns and calls updateDigest on them, without
any condition. IMHO ExpiringColumn.updateDigest() should check its own
isMarkedForDelete() first before making any digest changes (we cannot simply
call isMarkedForDelete() from LazilyCompactedRow because we don't want this
for DeletedColumns).

3) Cassandra should not create tombstones for expiring columns.
I am not 100% sure, but it looks to me like Cassandra creates tombstones
for expired ExpiringColumns. This makes me wonder whether we could delete
expired columns directly. The digests for an ExpiringColumn and a
DeletedColumn can never match, due to the different implementations, so there
will always be a repair if compactions are not synchronous across nodes.
IMHO it should be valid to delete ExpiringColumns directly, because the TTL
is given by the client and should expire on all nodes at the same time.

All of this together should reduce over-repair.


 Merkle trees are an optimization; what they trade for this
 optimization is over-repair.

 (FWIW, I agree that, if possible, this particular case of over-repair
 would be nice to eliminate.)

Of course, better to over-repair than to corrupt something.


Re: distribution of token ranges with virtual nodes

2012-11-02 Thread Eric Evans
On Fri, Nov 2, 2012 at 12:38 AM, Manu Zhang owenzhang1...@gmail.com wrote:
 It splits into a contiguous range, because truly upgrading to vnode
 functionality is another step.

 That confuses me. As I understand it, there is no point in having 256 tokens
 on the same node if I don't commit the shuffle.

This isn't exactly true.  By-partition operations (think repair,
streaming, etc.) will be more reliable in the sense that if they fail
and need to be restarted, there is less that is lost or needs redoing.
Also, if all you did was migrate from 1 token per node to 256
contiguous tokens per node, normal topology changes (bootstrapping new
nodes, decommissioning old ones) would gradually work to redistribute
the partitions.  And, from a topology perspective, splitting the one
partition into many contiguous partitions is a no-op; it's safe to do,
and there is no cost to speak of from a computational or I/O
perspective.

On the other hand, shuffling requires moving tokens around the
cluster.  If you completely randomize placement, it follows that you
will need to relocate all of the cluster's data, so it's quite costly.
It's also precedent-setting, and not thoroughly tested yet.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu


how to detect stream closed in twitter/cassandra api?

2012-11-02 Thread Yuhan Zhang
Hi all,

I'm using the twitter/cassandra Ruby client and trying to pool a connection in
a static variable:

@@client = Cassandra.new(keyspace, host, :retries => retries,
                         :connect_timeout => connect_timeout,
                         :timeout => timeout, :exception_classes => [])

But the connection returns a "stream closed" error after a while. Is there a
method in the twitter/cassandra client to detect this state?
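
For what it's worth, one minimal sketch of a workaround, under the assumption
that the dead connection only shows up as an exception on the next call (the
exact exception class depends on the thrift transport, so rescuing broadly
here is an assumption, not documented gem behaviour; the keyspace/host values
below are hypothetical placeholders): drop the cached client and rebuild it
once.

require 'cassandra'   # twitter/cassandra gem

class CassandraPool
  # Placeholders -- use the same values you already pass to Cassandra.new.
  KEYSPACE = 'mykeyspace'
  HOST     = '127.0.0.1:9160'

  @@client = nil

  def self.build_client
    Cassandra.new(KEYSPACE, HOST, :retries => 3, :connect_timeout => 1,
                  :timeout => 5, :exception_classes => [])
  end

  def self.with_client
    @@client ||= build_client
    yield @@client
  rescue StandardError
    # The stream is presumed dead: drop the cached client, reconnect, retry once.
    @@client = build_client
    yield @@client
  end
end

# usage: CassandraPool.with_client { |c| ... any read or write call on c ... }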


Thank you..

Yuhan



Insert via CQL

2012-11-02 Thread Vivek Mishra
Hi,
Any idea how to insert into a column family, for a column of type blob,
via a CQL query?

-Vivek


Re: Insert via CQL

2012-11-02 Thread Eric Evans
On Fri, Nov 2, 2012 at 8:09 PM, Vivek Mishra mishra.v...@gmail.com wrote:
 Any idea how to insert into a column family, for a column of type blob,
 via a CQL query?

Yes, most of them involve binary data that is hex-encoded ascii. :)

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu