Re: CPU hotspot at BloomFilterSerializer#deserialize
Yes, it contains a big row that goes up to 2GB with more than a million columns.

I've run tests with 10 million small columns and got reasonable performance. I've not looked at 1 million large columns.

- BloomFilterSerializer#deserialize does readLong iteratively at each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2G row (from the 1.0.7 source).

There is only one Bloom filter per row in an SSTable, not one per column index/page. It could take a while if there are a lot of sstables in the read. nodetool cfhistograms will let you know: run it once to reset the counts, then do your test, then run it again.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 4/02/2013, at 4:13 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It is interesting the press C* got about having 2 billion columns in a row. You *can* do it, but it brings to light some realities of what that means.

On Sun, Feb 3, 2013 at 8:09 AM, Takenori Sato ts...@cloudian.com wrote:

Hi Aaron, thanks for your answers. That helped me get the big picture.

Yes, it contains a big row that goes up to 2GB with more than a million columns. Let me confirm that I understand correctly:

- The stack trace is from a Slice By Names query, and the deserialization is at step 3, "Read the row level Bloom Filter", on your blog.
- BloomFilterSerializer#deserialize does readLong iteratively at each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2G row (from the 1.0.7 source).

Correct? That makes sense: Slice By Names queries against such a wide row could be a CPU bottleneck. In fact, in our test environment, a BloomFilterSerializer#deserialize of such a case takes more than 10ms, up to 100ms.

Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order.

Interesting. Could the query pattern make a difference?
We thought the only solution is to change the data structure (don't use such a wide row if it is retrieved by a Slice By Names query). Anyway, will give it a try!

Best, Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton aa...@thelastpickle.com wrote:

5. the problematic Data file contains only 5 to 10 keys' data but is large (2.4G)

So very large rows? What does nodetool cfstats or cfhistograms say about the row sizes?

1. what is happening?

I think this is partially large rows and partially the query pattern. This is only roughly correct http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk here http://www.datastax.com/events/cassandrasummit2012/presentations

3. any more info required to proceed?

Do some tests with different query techniques…

Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote:

Hi all,

We have a situation where CPU load on some of our nodes in a cluster has spiked occasionally since last November, triggered by requests for rows that reside on two specific sstables.

We confirmed the following (when spiked):

version: 1.0.7 (current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0

1. profiling showed that BloomFilterSerializer#deserialize was the hotspot (70% of the total load by running threads)

* the stack trace looked like this (simplified):

90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
...
89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
66.7% - java.io.DataInputStream.readLong

2. Usually, 1 should be so fast that profiling by sampling cannot detect it
3. no pressure on Cassandra's VM heap nor on the machine overall
4. a little I/O traffic on our 8 disks/node (up to 100 tps/disk by iostat 1 1000)
5. the problematic Data file contains only 5 to 10 keys' data but is large (2.4G)
6. the problematic Filter file size is only 256B (could be normal)

So now, I am trying to read the Filter file in the same way BloomFilterSerializer#deserialize does, as closely as I can, in order to see if the file is somehow wrong.

Could you give me some advice on:

1. what is happening?
2. the best way to simulate BloomFilterSerializer#deserialize
3. any more info required to proceed?

Thanks, Takenori
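The 500,000-loop figure quoted in this thread can be checked with quick arithmetic. This is a back-of-envelope sketch only; the 4 KB page size and one readLong per page follow the thread's description of the 1.0.7 source.

```python
# Back-of-envelope check of the thread's numbers: a 2 GB row indexed
# in 4 KB pages implies roughly half a million pages, and the thread
# says BloomFilterSerializer#deserialize does a readLong per page.

PAGE_SIZE = 4 * 1024       # column index page size in bytes (per the thread)
ROW_SIZE = 2 * 1024 ** 3   # 2 GB row

pages = ROW_SIZE // PAGE_SIZE
print(pages)  # 524288 -- matches the "could be 500,000 loops" estimate
```

At roughly 20 ns per call that is on the order of 10 ms per sstable touched, which lines up with the 10-100 ms deserialize times reported above.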
Re: cassandra cqlsh error
Grab 1.2.1, it's fixed there: http://cassandra.apache.org/download/

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 4:37 AM, Kumar, Anjani anjani.ku...@infogroup.com wrote:

I am facing a problem while trying to run cqlsh. Here is what I did:

1. I downloaded the tarballs for both the 1.1.7 and 1.2.0 versions.
2. Unzipped and untarred them.
3. Started Cassandra.
4. And then tried starting cqlsh, but I get the following error in both versions:

Connection error: Invalid method name: 'set_cql_version'

Before installing the DataStax 1.1.7 and 1.2.0 Cassandra, I had installed Cassandra through "sudo apt-get install cassandra" on my Ubuntu box. Since it doesn't have CQL support (at least I can't find it), I thought of installing the DataStax version of Cassandra, but still no luck starting cqlsh so far. Any suggestion?

Thanks, Anjani
Re: Pycassa vs YCSB results.
The first thing I noticed is that your script uses the Python threading library, which is hampered by the Global Interpreter Lock: http://docs.python.org/2/library/threading.html

You don't really have multiple threads running in parallel; try using the multiprocessing library instead.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 7:15 AM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, could someone please give me any hints why the pycassa client (attached) is much slower than YCSB? Is it something to attribute to the performance difference between Python and Java? Or does the pycassa API have some performance limitations? I don't see any client statements affecting the pycassa performance. Please have a look at the simple Python script attached and let me know your suggestions.

thanks pradeep

On Thu, Jan 31, 2013 at 4:53 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

On Thu, Jan 31, 2013 at 4:49 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Thanks. Please find the script as an attachment. Just re-iterating: it's a simple Python script which submits 4 threads. This script has been scheduled on 8 cores using the taskset unix command, thus running 32 threads/node, and then scaled to 16 nodes.

thanks pradeep

On Thu, Jan 31, 2013 at 4:38 PM, Tyler Hobbs ty...@datastax.com wrote:

Can you provide the python script that you're using? (I'm moving this thread to the pycassa mailing list (pycassa-disc...@googlegroups.com), which is a better place for this discussion.)

On Thu, Jan 31, 2013 at 6:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, I am trying to benchmark Cassandra on a 12 data node cluster using 16 clients (each client uses 32 threads) with a custom pycassa client and YCSB. I found the maximum number of operations/second achieved using the pycassa client is nearly 70k+ reads/second, whereas with YCSB it is ~120k reads/second.

Any thoughts on why I see this huge difference in performance? Here is the description of the setup.

Pycassa client (a simple Python script):
1. Each pycassa client starts 4 threads, where each thread runs 76896 queries.
2. A shell script is used to submit 4 threads per core using the taskset unix command on an 8-core single node (8 * 4 * 76896 queries).
3. Another shell script is used to scale the single-node shell script to 16 nodes (total queries now: 16 * 8 * 4 * 76896).

I tried to keep the YCSB configuration as similar as possible to my custom pycassa benchmarking setup.

YCSB: launched 16 YCSB clients on 16 nodes, where each client uses 32 threads for execution and queries 32 * 76896 keys, i.e. 100% reads.

The dataset is different in each case, but has:
1. the same number of total records.
2. the same number of fields.
3. almost the same field length.

Could you please let me know why I see this huge performance difference and whether there is any way I can improve operations/second using the pycassa client?

thanks pradeep -- Tyler Hobbs DataStax pycassa_client.py
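Aaron's suggestion to switch from threading to multiprocessing can be sketched as below. This is a hypothetical rewrite, not the poster's actual script: `fetch` stands in for the pycassa `cf.get(key)` call so the structure is runnable without a cluster.

```python
# Sketch of the multiprocessing approach: worker processes each run
# the fetch loop, so CPython's GIL no longer serializes the clients.
from multiprocessing import Pool

def fetch(key):
    # Stand-in for the real per-key work; in the benchmark this would
    # be cf.get(key) against a pycassa ColumnFamily.
    return key.strip().upper()

def run(keys, workers=4):
    # Each worker process would normally create its own connection
    # pool; pycassa connections are not safely shareable across forks.
    with Pool(processes=workers) as pool:
        return pool.map(fetch, keys)

if __name__ == "__main__":
    print(run(["key1\n", "key2\n"]))
```

In the real client, each worker should open its own pycassa ConnectionPool inside the child process rather than inheriting one from the parent.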
Re: Pycassa vs YCSB results.
The simple thing to do would be to use the multiprocessing package and eliminate all shared state. On a multicore box, Python threads can run on different cores and battle over obtaining the GIL.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 11:34 PM, Tim Wintle timwin...@gmail.com wrote:

On Tue, 2013-02-05 at 21:38 +1300, aaron morton wrote:

The first thing I noticed is that your script uses the Python threading library, which is hampered by the Global Interpreter Lock: http://docs.python.org/2/library/threading.html You don't really have multiple threads running in parallel; try using the multiprocessing library.

Python _should_ release the GIL around IO-bound work, so this is a situation where the GIL shouldn't be an issue. (It's actually a very good use for Python's threads, as there's no serialization overhead for message passing between processes as there would be in most multi-process examples.)

A constant factor 2 slowdown really doesn't seem that significant for two different implementations, and I would not worry about this unless you're talking about thousands of machines. If you are talking about enough machines that this is real $$$, then I do think the Python code can be optimised a lot. I'm talking about language/VM specific optimisations, so I'm assuming CPython (the standard /usr/bin/python, as in the shebang).
I don't know how much of a difference this will make, but I'd be interested in hearing your results. I would start by rewriting this:

def start_cassandra_client(Threadname):
    f = open(Threadname, 'w')
    for key in lines:
        key = key.strip()
        st = time.time()
        f.write(str(cf.get(key)) + '\n')
        et = time.time()
        f.write('Time taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')
    f.close()

as something like this:

def start_cassandra_client(Threadname):
    # Avoid variable names outside this scope
    time_fn = time.time
    colfam = cf
    f = open(Threadname, 'w')
    for key in lines:
        key = key.strip()
        st = time_fn()
        f.write(str(colfam.get(key)) + '\n')
        et = time_fn()
        f.write('Time taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')
    f.close()

If you don't consider it cheating compared to the Java version, I would also move the key.strip() call to module initialisation instead of doing it once per thread, as there's a lot of function dispatch overhead in Python.

I'd also closely compare the IO going on in both versions (the .write calls). For example, this may be significantly faster:

        et = time_fn()
        f.write(str(colfam.get(key)) + '\nTime taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')

I haven't read your Java code, and I don't know Java IO semantics well enough to compare the behaviour of both.

Tim

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 7:15 AM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, could someone please give me any hints why the pycassa client (attached) is much slower than YCSB? Is it something to attribute to the performance difference between Python and Java? Or does the pycassa API have some performance limitations? I don't see any client statements affecting the pycassa performance. Please have a look at the simple Python script attached and let me know your suggestions.
thanks pradeep
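Tim's point about binding `time.time` to a local name can be measured with `timeit`. This is a minimal sketch of the micro-optimisation only; absolute numbers vary by machine, so no expected timings are claimed.

```python
# Demonstrates the name-lookup overhead Tim describes: in a tight loop,
# looking up time.time as a module attribute each iteration costs a
# dict lookup that a local binding avoids.
import time
import timeit

def global_lookup(n=100000):
    out = 0.0
    for _ in range(n):
        out = time.time()      # module attribute lookup every iteration
    return out

def local_lookup(n=100000):
    time_fn = time.time        # bound once; locals use fast array access
    out = 0.0
    for _ in range(n):
        out = time_fn()
    return out

if __name__ == "__main__":
    print("global:", timeit.timeit(global_lookup, number=10))
    print("local: ", timeit.timeit(local_lookup, number=10))
```

On CPython the local-binding version is typically somewhat faster, though the difference is small next to the per-query network round trip.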
Re: Operation Consideration with Counter Column Families
Are there any specific operational considerations one should make when using counter column families?

Performance, as they incur a read and a write. There were also some issues with overcounts in log replay (see CHANGES.txt).

How are counter column families stored on disk?

The same as regular CFs.

How do they affect compaction?

They don't.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 7:47 AM, Drew Kutcharian d...@venarc.com wrote:

Hey guys, are there any specific operational considerations one should make when using counter column families? How are counter column families stored on disk? How do they affect compaction?

-- Drew
Re: unbalanced ring
Use nodetool status with vnodes: http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

The different load can be caused by rack affinity; are all the nodes in the same rack? Another simple check is whether you have created some very big rows.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 8:40 AM, stephen.m.thomp...@wellsfargo.com wrote:

So I have three nodes in a ring in one data center. My configuration has num_tokens: 256 set and initial_token commented out. When I look at the ring, it shows me all of the token ranges of course, and basically identical data for each range on each node. Here is the Cliff's Notes version of what I see:

[root@Config3482VM2 apache-cassandra-1.2.0]# bin/nodetool ring
Datacenter: 28
==========
Replicas: 1
Address        Rack  Status  State   Load      Owns     Token
                                                        9187343239835811839
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026347817059713363
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026276684526453414
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026205551993193465
(etc)
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9187343239835811840
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9151314442816847872
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9115285645797883904
(etc)
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   -9223372036854775808
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   36028797018963967
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   72057594037927935
(etc)

So at this point I have a number of questions. The biggest question is about Load. Why does the .125 node have 2.85 GB, .126 have 1.15 GB, and .127 have only 69.13 KB? These boxes are all comparable and all configured identically.

partitioner: org.apache.cassandra.dht.Murmur3Partitioner

I'm sorry to ask so many questions; I'm having a hard time finding documentation that explains this stuff.

Stephen
Re: Clarification on num_tokens setting
With N nodes, the ring is divided into N*num_tokens. Correct?

There are always num_tokens tokens in the ring. Each node has (num_tokens / N) * RF ranges on it.

so the ranges of keys are not uniform, although with enough nodes in the cluster there probably won't be any really large ranges. Correct?

Even without vnodes there was no guarantee that nodes had contiguous key ranges.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 5:43 AM, Baron Schwartz ba...@xaprb.com wrote:

As I understand the num_tokens setting, it makes Cassandra do the following pseudocode when a new node is added:

for 1...num_tokens do
    my_token = rand(0, 2^128 - 1)
    next_token = min(tokens in cluster where token > my_token)
    my_range = (my_token, next_token - 1)
done

Now the new node owns num_tokens chunks of keys that previously belonged to other nodes. My point is, with 1 node in the cluster, the ring is divided into num_tokens ranges. With N nodes, the ring is divided into N*num_tokens. Correct? The docs do not make this clear to me.

And another point: the tokens are randomly chosen, so the ranges of keys are not uniform, although with enough nodes in the cluster there probably won't be any really large ranges. Correct?
Re: Clarification on num_tokens setting
There is always num_tokens tokens in the ring.

I got this wrong. Each node *does* have num_tokens tokens.

With N nodes, the ring is divided into N*num_tokens. Correct?

Yes.

In other words, it is a cluster-wide parameter. Correct?

Yes.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 10:36 AM, Andrey Ilinykh ailin...@gmail.com wrote:

On Tue, Feb 5, 2013 at 12:42 PM, aaron morton aa...@thelastpickle.com wrote:

With N nodes, the ring is divided into N*num_tokens. Correct?

There is always num_tokens tokens in the ring. Each node has (num_tokens / N) * RF ranges on it.

That means every node should have the same num_tokens parameter? In other words, it is a cluster-wide parameter. Correct?

Thank you, Andrey
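The corrected arithmetic in this thread can be written out explicitly. This is a sketch of the counting only (a 6-node cluster with the default num_tokens of 256 and RF=3 is an assumed example, not from the thread):

```python
# Per the corrected answer: each node owns num_tokens tokens, so a
# cluster of N nodes divides the ring into N * num_tokens ranges,
# and with replication factor RF each node stores data for roughly
# num_tokens * RF of those ranges.

def ring_ranges(nodes, num_tokens=256):
    return nodes * num_tokens

def ranges_per_node(num_tokens=256, rf=3):
    return num_tokens * rf

print(ring_ranges(6))     # 1536 ranges in a 6-node cluster
print(ranges_per_node())  # 768  ranges replicated onto each node
```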
Re: Operation Consideration with Counter Column Families
Thanks Aaron, so will there only be one value for each counter column per sstable, just like regular columns?

Yes.

For some reason I was under the impression that Cassandra keeps a log of all the increments rather than the actual value.

Not as far as I understand.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 11:15 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks Aaron, so will there only be one value for each counter column per sstable, just like regular columns? For some reason I was under the impression that Cassandra keeps a log of all the increments, not the actual value.

On Feb 5, 2013, at 12:36 PM, aaron morton aa...@thelastpickle.com wrote:

Are there any specific operational considerations one should make when using counter column families?

Performance, as they incur a read and a write. There were also some issues with overcounts in log replay (see CHANGES.txt).

How are counter column families stored on disk?

The same as regular CFs.

How do they affect compaction?

They don't.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 7:47 AM, Drew Kutcharian d...@venarc.com wrote: Hey guys, are there any specific operational considerations one should make when using counter column families? How are counter column families stored on disk? How do they affect compaction? -- Drew
Re: DataModel Question
2) DynamicComposites: I read somewhere that they are not recommended?

You probably won't need them.

Your current model will not sort messages by the time they arrive in a day. The sort order will be based on message type and the message ID. I'm assuming you want to order messages, so put the time uuid at the start of the composite columns. If you often want to get the most recent messages, use a reverse comparator. You could probably also have wider rows if you want to; not sure how many messages kids send a day, but you may get by with weekly partitions.

The CLI model could be:

row_key: phone_number : day
column: time_uuid : message_id : message_type

You could also pack extra data using JSON, ProtoBuffers etc. and store more than just the message in the column value.

If you are using CQL 3, consider this:

create table messages (
    phone_number     text,
    day              timestamp,
    message_sequence timeuuid,  -- your timestamp
    message_id       int,
    message_type     text,
    message_body     text,
    PRIMARY KEY ((phone_number, day), message_sequence, message_id)
);

(phone_number, day) is the partition key, the same as the thrift row key. message_sequence, message_id are the clustering columns; all instances will be grouped / ordered by these columns.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 1:47 AM, Kanwar Sangha kan...@mavenir.com wrote:

1) Version is 1.2
2) DynamicComposites: I read somewhere that they are not recommended?
3) Good point. I need to think about that one.

From: Tamar Fraenkel [mailto:ta...@tok-media.com] Sent: 06 February 2013 00:50 To: user@cassandra.apache.org Subject: Re: DataModel Question

Hi! I have a couple of questions regarding your model:

1. What Cassandra version are you using? I am still working with 1.0 and this seems to make sense, but 1.2 gives you much more power I think.
2. Maybe I don't understand your model, but I think you need DynamicComposite columns, as user columns differ in number of components and maybe type.
3. How do you associate the SMS or MMS with the user you are chatting with? Is it done by a separate CF?

Thanks, Tamar

Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956

On Wed, Feb 6, 2013 at 8:23 AM, Vivek Mishra mishra.v...@gmail.com wrote:

Avoid super columns. If you need sorted, wide rows then go for composite columns. -Vivek

On Wed, Feb 6, 2013 at 7:09 AM, Kanwar Sangha kan...@mavenir.com wrote:

Hi - We are designing Cassandra-based storage for the following use cases:

- Store SMS messages
- Store MMS messages
- Store chat history

What would be the ideal way to design the data model for this kind of application? I am thinking along these lines:

Row key: composite key [PhoneNum : Day]
- Example: 19876543456:05022013

Dynamic column families:
- Composite column key for SMS [SMS:MessageId:TimeUUID]
- Composite column key for MMS [MMS:MessageId:TimeUUID]
- Composite column key for the user I am chatting with [UserId:198765432345] - this can have multiple values, since each chat conversation can have many messages. Should this be a super column?

198:05022013 SMS::ttt SMS:xxx12:ttt MMS::ttt :19 198:05022013 1987888:05022013

Thanks, Kanwar
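Aaron's point that the time uuid must come *first* in the composite for time ordering can be illustrated with a minimal sketch. Plain tuples stand in for composite columns here (the sample messages are invented for illustration): composites compare component by component, left to right.

```python
# Composite columns sort by their leftmost component first. If the
# message type leads, columns group by type before time; if the time
# leads, a slice returns messages in arrival order.

# (message_type, message_id, arrival_time) -- invented sample data
msgs = [
    ("SMS", 1, 5),
    ("MMS", 2, 7),
    ("SMS", 3, 10),
]

# Original model: type : id : time leads with the type
type_first = sorted((t, i, ts) for (t, i, ts) in msgs)

# Aaron's suggestion: time uuid first, so columns sort by arrival
time_first = sorted((ts, i, t) for (t, i, ts) in msgs)

print([ts for (_, _, ts) in type_first])  # [7, 5, 10] -- not time ordered
print([ts for (ts, _, _) in time_first])  # [5, 7, 10] -- time ordered
```

A reverse comparator simply flips this ordering so the newest messages come first in a slice.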
Re: Cassandra 1.1.8 timeouts on clients
First check your node for IO errors. You have some bad data there. When you restart Cassandra it may identify which sstables are corrupt. You can then stop the node and remove them. You will then need to run repair to replace the missing data.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 1:21 PM, Terry Cumaranatunge cumar...@gmail.com wrote:

I may have found a trigger that is causing these problems. Has anyone seen these compaction problems in 1.1? I did run scrub on all my 1.0 data to convert it to 1.1 and fix level-manifest problems before I started running 1.1.

1st node:

ERROR [CompactionExecutor:281] 2013-02-06 23:56:16,183 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:281,1,main]
java.io.IOError: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
    at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
    at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
    at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
    at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:98)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
    at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:144)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:234)
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:112)
    ... 21 more

2nd node:

ERROR [CompactionExecutor:266] 2013-02-06 23:51:35,181 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:266,1,main]
java.io.IOError: java.io.EOFException
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
    at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
    at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140
Re: Directory structure after upgrading 1.0.8 to 1.2.1
The -old.json is an artefact of Levelled Compaction. You should see a non -old file in the current CF folder.

I'm not sure what would have created the -old CF dir. Does the timestamp indicate it was created at the time the server first started as a 1.2 node?

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 10:39 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote:

After upgrading from 1.0.8, I see that the directory structure has changed and now has a structure like keyspace/columnfamily (part of the 1.1.x migration). But I also see that directories appear like keyspace/columnfamily-old, and the content of that 'old' directory is only one file, columnfamily-old.json.

Questions: Should this xxx-old.json file be in the other directory? Should the extra directory xxx-old not be created? Or was that intentionally done, and is it allowed to remove these directories (manually…)?

Thanks
Re: Can't remove contents of table with truncate or drop
Double check the truncate worked; all nodes must be available for it to execute. If you can provide the output from cqlsh from truncating and selecting, that would be helpful.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 8/02/2013, at 2:55 AM, Jabbar aja...@gmail.com wrote:

Hello, I'm having problems truncating or deleting the contents of a table. If I truncate the table and then do a select count(*), I get a value above zero. If I drop the table and recreate it, select count(*) still returns a non-zero value. The truncate or delete operation does not return any errors.

I am using Cassandra 1.2.1 with Java 1.6.0u39 64-bit on CentOS 6.3.

My keyspace definition is:

CREATE KEYSPACE studata WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '3' };

My table definition is:

CREATE TABLE datapoints (
    siteid bigint,
    channel int,
    time timestamp,
    data float,
    PRIMARY KEY ((siteid, channel), time)
) WITH bloom_filter_fp_chance=0.01 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.00 AND
    gc_grace_seconds=864000 AND
    read_repair_chance=0.10 AND
    replicate_on_write='true' AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};

It has 3,504,000,000 rows, consisting of 100,000 partition keys. Is there anything that I'm doing wrong?

-- Thanks A Jabbar Azam
Re: DataModel Question
Go day / phone instead of phone / day this way you won't have a rk growing forever . Not sure I understand. +1 for month partition. When I go offline and come online again, I need to retrieve all pending messages from all my conversations. You need to have some sort of token that includes the last time stamp seen by the client. Then make as many queries as necessary to get the missing data. I guess this makes the data model span across many CFs ? Yes. Sorry I have not considered conversations. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 3:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Go day / phone instead of phone / day this way you won't have a rk growing forever . A comprise would be month / phone as the row key and then use the date time as the first part of a composite column. On Thursday, February 7, 2013, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron ! My use case is modeled like “skype” which stores IM + SMS + MMS in one conversation. I need to have the following functionality – ·When I go offline and come online again, I need to retrieve all pending messages from all my conversations. ·I should be able to select a contact and view the ‘history’ of the messages (last 7 days, last 14 days, last 21 days…) ·If I log in to a different device, I should be able to synch at least a “few days” of messages. ·One conversation can have multiple participants. ·Support full synch or delta synch based on number of messages/history. I guess this makes the data model span across many CFs ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 06 February 2013 22:20 To: user@cassandra.apache.org Subject: Re: DataModel Question 2) DynamicComposites : I read somewhere that they are not recommended ? You probably wont need them. Your current model will not sort message by the time they arrive in a day. The sort order will be based on Message type and the message ID. 
I'm assuming you want to order messages, so put the time uuid at the start of the composite columns. If you often want to get the most recent messages use a reverse comparator. You could probably also have wider rows if you want to; not sure how many messages kids send a day, but you may get by with weekly partitions. The CLI model could be:

row_key: phone_number : day
column: time_uuid : message_id : message_type

You could also pack extra data using JSON, ProtoBuffers etc. and store more than just the message in the column value. If you are using CQL 3 consider this:

create table messages (
    phone_number text,
    day timestamp,
    message_sequence timeuuid, -- your timestamp
    message_id int,
    message_type text,
    message_body text,
    PRIMARY KEY ( (phone_number, day), message_sequence, message_id )
);

(phone_number, day) is the partition key, same as the Thrift row key. message_sequence, message_id are the clustering columns; all instances will be grouped / ordered by these columns. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com
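Illustrative only (the helper names below are mine, not from the thread): the month / phone bucketing Edward suggests, plus the "last timestamp seen by the client" token Aaron describes, gives a returning client a bounded set of partitions to read when it syncs.

```python
from datetime import datetime, timezone

def row_key(phone: str, when: datetime) -> str:
    """Month-bucketed row key: the partition stops growing once the month ends."""
    return f"{when.strftime('%Y-%m')}:{phone}"

def partitions_since(phone: str, last_seen: datetime, now: datetime):
    """Row keys a client must query to catch up on messages since last_seen."""
    keys = []
    y, m = last_seen.year, last_seen.month
    while (y, m) <= (now.year, now.month):
        keys.append(row_key(phone, datetime(y, m, 1, tzinfo=timezone.utc)))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys
```

A client that was last online on 2012-12-20 and reconnects on 2013-02-07 would query the 2012-12, 2013-01 and 2013-02 partitions, filtering each by the time uuid of its last seen message.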
Re: Netflix/Astynax Client for Cassandra
I'm going to guess Netflix are running Astynax in production with Cassandra 1.1. cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 6:50 AM, Cassa L lcas...@gmail.com wrote: Thank you all for the responses to this thread. I am planning to use Cassandra 1.1.9 with Astynax. Does anyone have a Cassandra 1.x version running in production with Astynax? Did you come across any show-stopper issues? Thanks LCassa On Thu, Feb 7, 2013 at 8:50 AM, Bartłomiej Romański b...@sentia.pl wrote: Hi, Does anyone know about virtual node support in Astynax? Is it handled correctly? Especially with ConnectionPoolType.TOKEN_AWARE? Thanks, BR
Re: are CFs consistent after a repair
'nodetool -pr repair'
Assuming nodetool repair -pr. If there is no write activity, all reads (at any CL level) will return the same value after a successful repair. If there is write activity there is always a possibility of inconsistencies, and so only access where R + W > N (e.g. QUORUM + QUORUM) will be consistent. Can you drill down into the consistency problem? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 7:01 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote: I'm confused about consistency. I have a 6-node cluster (RF=3) and I have a table that was known to be inconsistent across replicas (a Hadoop app was sensitive to this). So I did a 'nodetool -pr repair' on every node in the cluster. After the repairs were complete, the Hadoop app still indicated inconsistencies. Is this to be expected? Brian
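The R + W > N rule Aaron cites can be written down directly (an illustrative sketch, not Cassandra code): a read is guaranteed to see the latest write only when the read and write replica sets must overlap.

```python
def quorum(n: int) -> int:
    """Replicas contacted at QUORUM for replication factor n: floor(n/2) + 1."""
    return n // 2 + 1

def strongly_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set intersects every write set, i.e. R + W > N."""
    return r + w > n

# RF=3: QUORUM writes + QUORUM reads -> 2 + 2 > 3, consistent
# RF=3: ONE writes + ONE reads       -> 1 + 1 <= 3, a read may miss the latest write
```

This is why, once writes continue after a repair, only R + W > N access patterns stay consistent.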
Re: High CPU usage during repair
During repair I see high CPU consumption,
Repair reads the data and computes a hash; this is a CPU intensive operation. Is the CPU overloaded or is it just under load?
I run Cassandra version 1.0.11, on 3 node setup on EC2 instances.
What machine size?
there are compactions waiting.
That's normally ok. How many are waiting?
I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16
That will remove throttling on compaction and the validation compaction used for the repair, which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I run repair weekly, using a scheduled cron job. During repair I see high CPU consumption, and messages in the log file
INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264
From time to time, there are also messages of the form
INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms
Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. 
I have the default settings: compaction_throughput_mb_per_sec: 16, in_memory_compaction_limit_in_mb: 64, multithreaded_compaction: false, compaction_preheat_key_cache: true. I am thinking of the following solution, and wanted to ask if I am on the right track: I thought of adding a call to my repair script, before repair starts, to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16. Is this the right solution? Thanks, Tamar Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956
Re: Read-repair working, repair not working?
I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
What CL are you using for reads and writes?
I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
- Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?
I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.
- Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, and for what proportion of total requests)?
It would, but the INFO logging for the AES is pretty good. I would hold off for now.
- Is there something obvious that I could be missing here?
When a new AES session starts it logs this:
logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));
When it completes it logs this:
logger.info(String.format("[repair #%s] session completed successfully", getName()));
Or this on failure:
logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 9:56 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi, I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency / availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring. However, running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour. After a lot of fruitless investigation, I decided to wipe, re-install and re-populate the nodes. The re-install and repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt. When I run repair commands, I don’t see any errors in the logs. I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions. I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count. Questions :
- Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?
- Would increasing the logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, and for what proportion of total requests)?
- Is there something obvious that I could be missing here?
Many thanks, Brian
Re: Issues with writing data to Cassandra column family using a Hive script
Don't use the variable length Cassandra integer, use the Int32Type. It also sounds like you want to use a DoubleType rather than FloatType. http://www.datastax.com/docs/datastax_enterprise2.2/solutions/about_hive#hive-to-cassandra-table-mapping Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 4:15 PM, Dinusha Dilrukshi sdddilruk...@gmail.com wrote: Hi All, Data was originally stored in a column family called test_cf. The definition of the column family is as follows:

CREATE COLUMN FAMILY test_cf WITH COMPARATOR = 'IntegerType' AND key_validation_class = UTF8Type AND default_validation_class = FloatType;

And the following is the sample data set contained in test_cf:

cqlsh:temp_ks> select * from test_cf;
 key            | column1    | value
----------------+------------+-------
 localhost:8282 | 1350468600 |    76
 localhost:8282 | 1350468601 |    76

The Hive script (shown at the end of the mail) is used to take the data from the above column family test_cf and insert it into a new column family called cpu_avg_5min_new7. The column family description of cpu_avg_5min_new7 is also the same as test_cf. The issue is, the data written into the cpu_avg_5min_new7 column family after executing the Hive script is as follows. It's not in the format of the data present in the original column family test_cf. Any explanations would be highly appreciated.
cqlsh:temp_ks> select * from cpu_avg_5min_new7;
 key            | column1                  | value
----------------+--------------------------+----------
 localhost:8282 | 232340574229062170849328 | 1.09e-05
 localhost:8282 | 232340574229062170849329 | 1.09e-05

Hive script:

drop table cpu_avg_5min_new7_hive;
CREATE EXTERNAL TABLE IF NOT EXISTS cpu_avg_5min_new7_hive (src_id STRING, start_time INT, cpu_avg FLOAT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "temp_ks",
    "cassandra.ks.username" = "xxx",
    "cassandra.ks.password" = "xxx",
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "cpu_avg_5min_new7"
);
drop table xxx;
CREATE EXTERNAL TABLE IF NOT EXISTS xxx (src_id STRING, start_time INT, cpu_avg FLOAT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "temp_ks",
    "cassandra.ks.username" = "xxx",
    "cassandra.ks.password" = "xxx",
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "test_cf"
);
insert overwrite table cpu_avg_5min_new7_hive select src_id, start_time, cpu_avg from xxx;

Regards, Dinusha.
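The thread doesn't show the exact bytes Hive wrote, but the class of failure Aaron points at can be sketched: Int32Type is a fixed 4-byte big-endian value, while IntegerType is a variable-length two's-complement integer that decodes however many bytes it is given. A Python illustration (not the actual Hive serialization path):

```python
import struct

ts = 1350468600  # one of the epoch-second column names from test_cf above

# Int32Type: always exactly 4 big-endian bytes.
int32_bytes = struct.pack(">i", ts)

# IntegerType: minimal-length big-endian two's-complement,
# like Java's BigInteger.toByteArray().
var_bytes = ts.to_bytes((ts.bit_length() + 8) // 8, "big", signed=True)
assert int32_bytes == var_bytes  # happens to be 4 bytes for this value

# But if a writer hands Cassandra a wider value (say an 8-byte long plus
# stray framing bytes), IntegerType folds every byte into one big number:
mangled = struct.pack(">q", ts) + b"\x00\x00"
print(int.from_bytes(mangled, "big", signed=True))  # ts << 16, not ts
```

Extra bytes interpreted under IntegerType is one way a small timestamp can come back as a 24-digit column name; pinning the comparator to Int32Type removes that ambiguity.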
Re: Cassandra 1.1.2 - 1.1.8 upgrade
I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware: nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike mthero...@yahoo.com wrote: Thank you, Another question on this topic. Upgrading from 1.1.2 to 1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2/1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: always check NEWS.txt; for instance, for cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike mthero...@yahoo.com wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 to 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone had any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything remain as-is? Thanks, -Mike
Re: Cassandra flush spin?
Sounds like flushing due to memory consumption. The flush log messages include the number of ops, so you can see if this node was processing more mutations than the others. Try to see if there was more (serialised) data being written or more operations being processed. Also, just for fun, check the JVM and yaml settings are as expected. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:29 AM, Mike mthero...@yahoo.com wrote: Hello, We just hit a very odd issue in our Cassandra cluster. We are running Cassandra 1.1.2 in a 6 node cluster. We use a replication factor of 3, and all operations utilize LOCAL_QUORUM consistency. We noticed a large performance hit in our application's maintenance activities and I've been investigating. I discovered a node in the cluster that was flushing a memtable like crazy. It was flushing every 2-3 minutes, and has been apparently doing this for days. Typically, during this time of day, a flush would happen every 30 minutes or so.

alldb.sh "cat /var/log/cassandra/system.log | grep \"flushing high-traffic column family CFS(Keyspace='open', ColumnFamily='msgs')\" | grep 02-08 | wc -l"
[1] 18:41:04 [SUCCESS] db-1c-1 59
[2] 18:41:05 [SUCCESS] db-1c-2 48
[3] 18:41:05 [SUCCESS] db-1a-1 1206
[4] 18:41:05 [SUCCESS] db-1d-2 54
[5] 18:41:05 [SUCCESS] db-1a-2 56
[6] 18:41:05 [SUCCESS] db-1d-1 52

I restarted the database node, and, at least for now, the problem appears to have stopped. There are a number of things that don't make sense here. We use a replication factor of 3, so if this was being caused by our application, I would have expected 3 nodes in the cluster to have issues. Also, I would have expected the issue to continue once the node restarted. Another information point of interest, and I'm wondering if it has exposed a bug, is that this node was recently converted to use ephemeral storage on EC2, and was restored from a snapshot. After the restore, a nodetool repair was run. 
However, repair was going to run into some heavy activity for our application, and we canceled that validation compaction (2 of the 3 anti-entropy sessions had completed). The spin appears to have started at the start of the second session. Any hints? -Mike
Re: persisted ring state
Is that the right way to do it?
No. If you want to change the token for a node use nodetool move. Changing it like this will not make the node change its token, because after startup the token is stored in the System.LocationInfo CF.
or -Dcassandra.load_ring_state=false|true is only limited to changes to seed/listen_address ?
It's used when a node somehow has a bad view of the ring, and you want it to forget things. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 3:35 AM, S C as...@outlook.com wrote: In one of the scenarios that I encountered, I needed to change the token on the node. I added the new token and started the node with -Dcassandra.load_ring_state=false in anticipation that the node would not pick it up from the locally persisted data. Is that the right way to do it? or is -Dcassandra.load_ring_state=false|true only limited to changes to seed/listen_address ? Thanks, SC
Re: High CPU usage during repair
What machine size?
m1.large
If you are seeing high CPU move to an m1.xlarge, that's the sweet spot.
That's normally ok. How many are waiting?
I have seen 4 this morning
That's not really abnormal. The pending task count goes up when a file *may* be eligible for compaction, not when there is a compaction task waiting. If you suddenly create a number of new SSTables for a CF the pending count will rise; however, one of the tasks may compact all the sstables waiting for compaction, so the count will suddenly drop as well.
Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?
Yes. If you are seeing performance degrade during compaction or repair try reducing the throughput. I would attribute most of the problems you have described to using m1.large. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 9:16 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! Thanks for the response. See my answers and questions below. Thanks! Tamar Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Sun, Feb 10, 2013 at 10:04 PM, aaron morton aa...@thelastpickle.com wrote: During repair I see high CPU consumption, Repair reads the data and computes a hash; this is a CPU intensive operation. Is the CPU overloaded or is it just under load? Usually just load, but in the past two weeks I have seen CPU of over 90%! I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. What machine size? m1.large there are compactions waiting. That's normally ok. How many are waiting? 
I have seen 4 this morning I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16 That will remove throttling on compaction and the validation compaction used for the repair. Which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect. Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I run repair weekly, using a scheduled cron job. During repair I see high CPU consumption, and messages in the log file INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264 From time to time, there are also messages of the form INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. 
Re: CQL 3 compound row key error
That sounds like a bug, or something that is still under work. Sylvain has his finger on all things CQL. Can you raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA ? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 4:01 PM, Shahryar Sedghi shsed...@gmail.com wrote: I am moving my application from 1.1 to 1.2.1 to utilize the secondary index and simplify the data model. In 1.1 I was concatenating some fields into one, separated by ":", for the row key, and it was a big string. In 1.2 I use the compound row key shown in the following test case (interval and seq):

CREATE TABLE test (
    interval text,
    seq int,
    id int,
    severity int,
    PRIMARY KEY ((interval, seq), id)
) WITH CLUSTERING ORDER BY (id DESC);
CREATE INDEX ON test(severity);

select * from test where severity = 3 and interval = 't' and seq = 1;

results: Bad Request: Start key sorts after end key. This is not allowed; you probably should not specify end key at all under random partitioner

If I define the table as this:

CREATE TABLE test (
    interval text,
    id int,
    severity int,
    PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id DESC);

select * from test where severity = 3 and interval = 't1';

works fine. Is it a bug? Thanks in Advance Shahryar -- Life is what happens while you are making other plans. ~ John Lennon
Re: Read-repair working, repair not working?
CL.ONE : this is primarily for performance reasons …
This makes reasoning about correct behaviour a little harder. If there is any way you can run some tests with R + W > N strong consistency I would encourage you to do so. You will then have a baseline of what works.
(say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
The high number of inconsistencies seems to match the massive number of dropped Mutation messages. Even if Anti Entropy is running, if the node in HK is dropping so many messages there will be inconsistencies. It looks like the HK node is overloaded. I would check the logs for GC messages, check for CPU steal in a virtualised env, check for sufficient CPU + memory resources, check for IO stress.
20 node cluster running v1.0.7 split between 5 data centres, I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I
Do all DCs have the same number of nodes ?
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 9:13 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi Aaron, Many thanks for your reply - answers below. Cheers, Brian
What CL are you using for reads and writes? I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC.
CL.ONE : this is primarily for performance reasons but also because there are only three local nodes as you suggest and we need at least some resiliency. In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help? 
Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause? However I see some pretty aggressive data transfer speeds during the initial repairs, and the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up? Here are the tpstats for one of these nodes :

Pool Name                Active   Pending   Completed   Blocked   All time blocked
ReadStage                     0         0     4919185         0                  0
RequestResponseStage          0         0    16869994         0                  0
MutationStage                 0         0    16764910         0                  0
ReadRepairStage               0         0        3703         0                  0
ReplicateOnWriteStage         0         0           0         0                  0
GossipStage                   0         0      845225         0                  0
AntiEntropyStage              0         0       52441         0                  0
MigrationStage                0         0        4362         0                  0
MemtablePostFlusher           0         0         952         0                  0
StreamStage                   0         0          24         0                  0
FlushWriter                   0         0         960         0                  5
MiscStage                     0         0        3592         0                  0
AntiEntropySessions           4         4         121         0                  0
InternalResponseStage         0         0           0         0                  0
HintedHandoff                 1         2          55         0                  0

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR         150597
BINARY                   0
READ                781490
MUTATION            853846
REQUEST_RESPONSE         0

The numbers of READ_REPAIR, READ and MUTATION operations are non-negligible. The nodes in Europe/North America have effectively zero dropped messages. This suggests network latency is probably a significant factor? [the network ping from Europe to a HK node is ~250ms, so I wouldn’t have expected it to be such a problem?]
It would, but the INFO logging for the AES is pretty good. I would hold off for now.
Ok.
[AES session logging] Yes, I see the expected start/end logs, so that's another thing off the list. On 10 Feb 2013, at 20:12, aaron morton aa...@thelastpickle.com wrote: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned
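Not part of the thread, but the scale of the problem falls out of simple ratios on the tpstats output above, comparing dropped message counts to the completed stage counts (a rough measure, since dropped and completed are counted separately):

```python
# Figures taken from the tpstats output quoted in the message above.
completed = {"MutationStage": 16764910, "ReadStage": 4919185}
dropped = {"MUTATION": 853846, "READ": 781490}

mutation_rate = dropped["MUTATION"] / completed["MutationStage"]  # ~5.1%
read_rate = dropped["READ"] / completed["ReadStage"]              # ~15.9%
print(f"mutations dropped: {mutation_rate:.1%}, reads dropped: {read_rate:.1%}")
```

The read drop ratio is roughly three times the mutation drop ratio, which supports Aaron's diagnosis that the HK node is overloaded rather than merely far away.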
Re: Cassandra 1.1.2 - 1.1.8 upgrade
You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The NEWS.txt file is your friend there. As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so, just to catch any craziness. I then upgrade all the nodes and run upgradesstables through the cluster. You can stagger upgradesstables to run on every RF'th node in the cluster to reduce the impact. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 8:05 PM, Michal Michalski mich...@opera.com wrote: 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2/1.1.9 configuration for a number of days. I'm about to upgrade my 1.1.0 cluster and http://www.datastax.com/docs/1.1/install/upgrading#info says: "If you are upgrading to Cassandra 1.1.9 from a version earlier than 1.1.7, all nodes must be upgraded before any streaming can take place. Until you upgrade all nodes, you cannot add version 1.1.7 nodes or later to a 1.1.7 or earlier cluster." Which one is correct then? Can I run a mixed 1.1.2 (in my case 1.1.0) / 1.1.9 cluster or not? M.
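Aaron's "every RF'th node" staggering can be sketched (illustrative Python, not a real tool): with ring-ordered nodes and SimpleStrategy-style placement, replicas of a row sit on adjacent nodes, so a batch of every RF'th node never contains two replicas of the same data; at most one replica per row is busy running upgradesstables at a time.

```python
def staggered_batches(nodes, rf):
    """Split ring-ordered nodes into rf batches of every rf'th node, so any
    rf adjacent nodes (one replica set under SimpleStrategy) never share a batch."""
    return [nodes[i::rf] for i in range(rf)]

batches = staggered_batches(["n1", "n2", "n3", "n4", "n5", "n6"], rf=3)
# -> [["n1", "n4"], ["n2", "n5"], ["n3", "n6"]]
```

With RF=3 and QUORUM access, the two untouched replicas in each set keep serving reads and writes while the third churns through its sstables.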
Re: Cassandra jmx stats ReadCount
Are you using counters? They require a read before write. Also secondary index CFs require a read before write. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 1:26 PM, Daning Wang dan...@netseer.com wrote: We have an 8 node cluster on Cassandra 1.1.0, with replication factor 3. We found that when you just insert data, not only does WriteCount increase, the ReadCount also increases. How could this happen? I am under the impression that ReadCount only counts the reads from clients. Thanks, Daning
Re: Directory structure after upgrading 1.0.8 to 1.2.1
I think it's a little more subtle than that: https://issues.apache.org/jira/browse/CASSANDRA-5242 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 10:21 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote: Yes, they are new directories. I did some debugging … The Cassandra code is org.apache.cassandra.db.Directories::migrateFile. It is detecting that it is a manifest (based on the .json extension), but then it does not take into account that something like MyColumnFamily-old.json can exist. It then uses MyColumnFamily-old as a directory name in a call to a function destDir = getOrCreate(ksDir, dirname, additionalPath), while it should be MyColumnFamily. So I guess that the cfname computation should be adapted to include the “-old.json” manifest files. Ignace From: aaron morton [mailto:aa...@thelastpickle.com] Sent: vrijdag 8 februari 2013 03:09 To: user@cassandra.apache.org Subject: Re: Directory structure after upgrading 1.0.8 to 1.2.1 the -old.json is an artefact of Levelled Compaction. You should see a non -old file in the current CF folder. I'm not sure what would have created the -old CF dir. Does the timestamp indicate it was created at the time the server first started as a 1.2 node? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 7/02/2013, at 10:39 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote: After upgrading from 1.0.8 I see that the directory structure has changed and now has a structure like keyspace/columnfamily (part of the 1.1.x migration). But I also see that directories appear like keyspace/columnfamily-old, and the content of that ‘old’ directory is only one file, columnfamily-old.json. Questions : Should this xxx-old.json file be in the other directory? Should the extra directory xxx-old not be created? Or was that intentionally done and is it allowed to remove these directories ( manually … )? Thanks
Re: Healthy JVM GC
-Xms8049M -Xmx8049M -Xmn800M

That's a healthy amount of memory for the JVM. If you are using Row Caches, reduce their size and/or ensure you are using Serializing (off heap) caches. Also consider changing the yaml conf flush_largest_memtables_at from 0.75 to 0.80 so it is different to the CMS occupancy setting. If you have a lot of rows, 100's of millions, consider reducing the bloom filter false positive ratio. Or just upgrade to 1.2 which uses less JVM memory. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 9/02/2013, at 7:46 AM, André Cruz andre.c...@co.sapo.pt wrote: Hello. I've noticed I get the frequent JVM warning in the logs about the heap being full:

WARN [ScheduledTasks:1] 2013-02-08 18:14:20,410 GCInspector.java (line 145) Heap is 0.731554347747841 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:14:20,418 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FilesPerBlock') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:14:20,418 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-FilesPerBlock@1804403938(6275300/63189158 serialized/live bytes, 52227 ops)
INFO [FlushWriter:4500] 2013-02-08 18:14:20,419 Memtable.java (line 264) Writing Memtable-FilesPerBlock@1804403938(6275300/63189158 serialized/live bytes, 52227 ops)
INFO [FlushWriter:4500] 2013-02-08 18:14:21,059 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/FilesPerBlock/Disco-FilesPerBlock-he-6154-Data.db (6332375 bytes) for commitlog position ReplayPosition(segmentId=1357730625412, position=10756636)
WARN [ScheduledTasks:1] 2013-02-08 18:23:31,970 GCInspector.java (line 145) Heap is 0.6835904101057064 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:23:31,971 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='BlocksKnownPerUser') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:23:31,972 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-BlocksKnownPerUser@2072550435(1834642/60143054 serialized/live bytes, 67010 ops)
INFO [FlushWriter:4501] 2013-02-08 18:23:31,972 Memtable.java (line 264) Writing Memtable-BlocksKnownPerUser@2072550435(1834642/60143054 serialized/live bytes, 67010 ops)
INFO [FlushWriter:4501] 2013-02-08 18:23:32,827 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/BlocksKnownPerUser/Disco-BlocksKnownPerUser-he-484930-Data.db (7404407 bytes) for commitlog position ReplayPosition(segmentId=1357730625413, position=6093472)
WARN [ScheduledTasks:1] 2013-02-08 18:29:46,198 GCInspector.java (line 145) Heap is 0.6871977390878024 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:29:46,199 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FileRevision') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:29:46,200 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-FileRevision@1526026442(7245147/63711465 serialized/live bytes, 23779 ops)
INFO [FlushWriter:4502] 2013-02-08 18:29:46,201 Memtable.java (line 264) Writing Memtable-FileRevision@1526026442(7245147/63711465 serialized/live bytes, 23779 ops)
INFO [FlushWriter:4502] 2013-02-08 18:29:46,769 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/FileRevision/Disco-FileRevision-he-5438-Data.db (5480642 bytes) for commitlog position ReplayPosition(segmentId=1357730625413, position=29816878)
INFO [ScheduledTasks:1] 2013-02-08 18:34:13,442 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 352 ms for 1 collections, 5902597760 used; max is 8357150720
WARN [ScheduledTasks:1] 2013-02-08 18:34:13,442 GCInspector.java (line 145) Heap is 0.7062930845406603 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:34:13,443 StorageService.java (line 2855) Flushing CFS(Keyspace
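The emergency-flush trigger discussed above is a simple occupancy check against the flush_largest_memtables_at threshold from cassandra.yaml. A minimal sketch of that arithmetic (illustrative only, not the actual Cassandra code), using the numbers from the ConcurrentMarkSweep log line above:

```python
def heap_occupancy(used_bytes, capacity_bytes):
    """Fraction of the JVM heap in use, as GCInspector reports it."""
    return used_bytes / capacity_bytes

def should_emergency_flush(used_bytes, capacity_bytes, flush_largest_memtables_at=0.75):
    """True when heap occupancy crosses the flush threshold."""
    return heap_occupancy(used_bytes, capacity_bytes) >= flush_largest_memtables_at

# 5902597760 used; max is 8357150720, per the GC log line above
occ = heap_occupancy(5902597760, 8357150720)
print(round(occ, 3))  # 0.706, matching the "Heap is 0.7062930845406603 full" warning
```

Raising the threshold to 0.80 (as suggested above) moves it further away from heap levels the cluster normally sits at, so these flushes fire less often.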
Re: Bootstrapping a new node to a virtual node cluster
Just checking if this sorted itself out? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 10/02/2013, at 1:15 AM, Jouni Hartikainen jouni.hartikai...@reaktor.fi wrote: Hello all, I have a cluster of three nodes running 1.2.1 and I'd like to increase the capacity by adding a new node. I'm using virtual nodes with 256 tokens and planning to use the same configuration for the new node as well. My cluster looks like this before adding the new node:

Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load     Tokens  Owns (effective)  Host ID                               Rack
UN 192.168.154.11  1.49 GB  256     100.0%            234b82a4-3812-4261-adab-deb805942d63  rack1
UN 192.168.154.12  1.6 GB   256     100.0%            577db21e-81ef-45fd-a67b-cfd39455c0f6  rack1
UN 192.168.154.13  1.64 GB  256     100.0%            6187cc5d-d44c-45cb-b738-1b87f5ae3dff  rack1

And corresponding gossipinfo:
/192.168.154.12 RPC_ADDRESS:192.168.154.12 DC:datacenter1 STATUS:NORMAL,-1072164398478041156 LOAD:1.719425018E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:577db21e-81ef-45fd-a67b-cfd39455c0f6 SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1
/192.168.154.11 RPC_ADDRESS:192.168.154.11 DC:datacenter1 STATUS:NORMAL,-1158837144480089281 LOAD:1.514343678E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:234b82a4-3812-4261-adab-deb805942d63 SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1
/192.168.154.13 RPC_ADDRESS:192.168.154.13 DC:datacenter1 STATUS:NORMAL,-1135137292201587328 LOAD:1.765093695E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:6187cc5d-d44c-45cb-b738-1b87f5ae3dff SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1

I have now set the correct network addresses and seeds in the cassandra.yaml of the new node (.14) and then started it with num_tokens set to 256 and initial_token commented out.
Everything seems to go OK as I get the following prints on the log: On node 192.168.154.11: INFO [GossipStage:1] 2013-02-09 12:30:28,126 Gossiper.java (line 784) Node /192.168.154.14 is now part of the cluster INFO [GossipStage:1] 2013-02-09 12:30:28,128 Gossiper.java (line 750) InetAddress /192.168.154.14 is now UP INFO [MiscStage:1] 2013-02-09 12:30:59,255 StreamOut.java (line 114) Beginning transfer to /192.168.154.14 And on node 192.168.154.14 (the new node): INFO 12:30:26,843 Loading persisted ring state INFO 12:30:26,846 Starting up server gossip WARN 12:30:26,853 No host ID found, created a4a0b918-a1c8-4acc-a050-672a96a5f110 (Note: This should happen exactly once per node). INFO 12:30:26,979 Starting Messaging Service on port 7000 INFO 12:30:27,014 JOINING: waiting for ring information INFO 12:30:28,602 Node /192.168.154.11 is now part of the cluster INFO 12:30:28,603 InetAddress /192.168.154.11 is now UP INFO 12:30:28,675 Node /192.168.154.12 is now part of the cluster INFO 12:30:28,678 InetAddress /192.168.154.12 is now UP INFO 12:30:28,751 Node /192.168.154.13 is now part of the cluster INFO 12:30:28,751 InetAddress /192.168.154.13 is now UP INFO 12:30:29,015 JOINING: schema complete, ready to bootstrap INFO 12:30:29,015 JOINING: getting bootstrap token INFO 12:30:29,157 JOINING: sleeping 3 ms for pending range setup INFO 12:30:59,159 JOINING: Starting to bootstrap... 
However, the new node does not show up in nodetool status (even if queried from the new node itself):

Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load     Tokens  Owns (effective)  Host ID                               Rack
UN 192.168.154.11  1.49 GB  256     100.0%            234b82a4-3812-4261-adab-deb805942d63  rack1
UN 192.168.154.12  1.6 GB   256     100.0%            577db21e-81ef-45fd-a67b-cfd39455c0f6  rack1
UN 192.168.154.13  1.64 GB  256     100.0%            6187cc5d-d44c-45cb-b738-1b87f5ae3dff  rack1

It shows up in the gossip still:
/192.168.154.12 RPC_ADDRESS:192.168.154.12 DC:datacenter1 STATUS:NORMAL,-1072164398478041156 LOAD:1.719430632E9 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:577db21e-81ef-45fd-a67b-cfd39455c0f6 SEVERITY:0.0 RELEASE_VERSION:1.2.1-SNAPSHOT RACK:rack1
/192.168.154.14 RPC_ADDRESS:192.168.154.14 DC:datacenter1 STATUS:BOOT,8077752099299332137 LOAD:105101.0 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:a4a0b918-a1c8-4acc-a050-672a96a5f110 RELEASE_VERSION:1.2.1-SNAPSHOT RACK:rack1
/192.168.154.11 RPC_ADDRESS:192.168.154.11 DC:datacenter1 STATUS:NORMAL,-1158837144480089281 LOAD:1.596505929E9 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:234b82a4-3812-4261-adab-deb805942d63 SEVERITY:0.0 RELEASE_VERSION:1.2.1-SNAPSHOT
Re: Deleting old items
So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive delete. In some cases, I don't know the row key and would like to delete all the rows starting by, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
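The range-scan-then-delete approach Aaron describes can be sketched in pure Python, with a dict standing in for a column family and `del` standing in for a row-level delete (all names here are illustrative, this is not a Cassandra client API):

```python
def delete_rows_with_prefix(cf, prefix):
    """Range-scan stand-in: find row keys matching a prefix, then issue
    a row-level delete for each. `cf` models a column family as
    {row_key: columns}."""
    doomed = sorted(key for key in cf if key.startswith(prefix))
    for key in doomed:
        del cf[key]  # in Cassandra this would be a row delete (a tombstone write)
    return doomed

cf = {"1050#a": {"c": 1}, "1050#b": {"c": 2}, "2000#x": {"c": 3}}
print(delete_rows_with_prefix(cf, "1050#"))  # ['1050#a', '1050#b']
```

The real cost is the scan: without knowing the row keys up front, every row has to be visited to decide whether it matches.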
Re: RuntimeException during leveled compaction
Snapshot all nodes so you have a backup: nodetool snapshot -t corrupt

Run nodetool scrub on the errant CF. Look for messages such as "Out of order row detected…" and "%d out of order rows found while scrubbing %s; Those have been written (in order) to a new sstable (%s)" in the logs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 6:13 AM, Andre Sprenger andre.spren...@getanet.de wrote: Hi, I'm running a 6 node Cassandra 1.1.5 cluster on EC2. We switched to leveled compaction a couple of weeks ago, this has been successful. Some days ago 3 of the nodes started to log the following exception during compaction of a particular column family:

ERROR [CompactionExecutor:726] 2013-02-11 13:02:26,582 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:726,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(84590743047470232854915142878708713938, 3133353533383530323237303130313030303232313537303030303132393832) >= current key DecoratedKey(28357704665244162161305918843747894551, 31333430313336313830333831303130313030303230313632303030303036363338) writing into /var/cassandra/data/AdServer/EventHistory/Adserver-EventHistory-tmp-he-68638-Data.db
 at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
 at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
 at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

Compaction does not happen any more for the column family and read performance gets worse because of the growing number of data files accessed during reads. Looks like one or more of the data files are corrupt and have keys that are stored out of order. Any help to resolve this situation would be greatly appreciated. Thanks Andre
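The RuntimeException in the trace above comes from SSTableWriter.beforeAppend, which enforces that keys are appended in increasing order; a corrupt input sstable with out-of-order keys violates that invariant mid-compaction. A toy model of the check (illustrative only, not the real implementation):

```python
class SSTableWriterModel:
    """Toy model of the append-order check that raises in the trace above."""
    def __init__(self):
        self.last_key = None

    def append(self, decorated_key):
        # Keys must arrive in strictly increasing order, as in beforeAppend()
        if self.last_key is not None and self.last_key >= decorated_key:
            raise RuntimeError(
                f"Last written key {self.last_key} >= current key {decorated_key}")
        self.last_key = decorated_key

w = SSTableWriterModel()
w.append(100)
try:
    w.append(50)  # out of order, like the keys from the corrupt data file
except RuntimeError as e:
    print("rejected:", e)
```

This is why scrub is the suggested fix: it rewrites the out-of-order rows into a new, correctly sorted sstable so the invariant holds again.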
Re: Cassandra becnhmark
I see the same keys in both nodes. Replication is not enabled. Why do you say that ? Check the schema for Keyspace1 using the cassandra-cli. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 9:31 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I am trying to do benchmark using the Cassandra-stress tool. They have given an example to insert data across 2 nodes – /tools/stress/bin/stress -d 192.168.1.101,192.168.1.102 -n 1000 But when I run this across my 2 node cluster, I see the same keys in both nodes. Replication is not enabled. Should it not have unique keys in both nodes ? Thanks, Kanwar
Re: what addresses to use in EC2 cluster (whenever an instance restarts it gets a new private ip)?
Cassandra handles nodes changing IP. The important thing to Cassandra is the token, not the IP. In your case did the replacement node have the same token as the failed one? You can normally work around these issues using commands like nodetool removetoken. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 10:04 AM, Andrey Ilinykh ailin...@gmail.com wrote: You have to use private IPs, but if an instance dies you have to bootstrap it with the replace token flag. If you use EC2 I'd recommend Netflix's Priam tool. It manages all that stuff, plus you have S3 backup. Andrey On Mon, Feb 11, 2013 at 11:35 AM, Brian Tarbox tar...@cabotresearch.com wrote: How do I configure my cluster to run in EC2? In my cassandra.yaml I have IP addresses under seed_provider, listen_address and rpc_address. I tried setting up my cluster using just the EC2 private addresses but when one of my instances failed and I restarted it there was a new private address. Suddenly my cluster thought it had five nodes rather than four. Then I tried using Elastic IP addresses (permanent addresses) but it turns out you get charged for network traffic between elastic addresses even if they are within the cluster. So...how do you configure the cluster when the IP addresses can change out from under you? Thanks. Brian Tarbox
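The point that the token, not the IP, identifies a node can be sketched as a ring keyed by token: a node that rejoins with the same token takes over the same range even if its private IP changed, while rejoining with a new token (as happened to Brian) looks like a fifth node. This is a schematic, not Cassandra's membership code:

```python
def join(ring, token, ip):
    """Register a node in the ring. The token is the identity; a node that
    comes back with the same token simply updates its IP."""
    ring[token] = ip

ring = {}
join(ring, -9223372036854775808, "10.0.0.1")
join(ring, 0, "10.0.0.2")
# The token-0 node dies and restarts on a new EC2 private IP:
join(ring, 0, "10.0.0.9")       # same token: still 2 nodes, IP updated
join(ring, 42, "10.0.0.9")      # new token instead: ring grows to 3 "nodes"
print(len(ring), ring[0])
```

Bootstrapping a replacement with the replace-token flag (or letting Priam manage tokens) is what keeps you in the "same token, new IP" case.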
Re: Cassandra 1.1.2 - 1.1.8 upgrade
You have linked to the 1.2 news file, which branched from 1.1 at some point. Look at the news file in the distribution you are installing or here https://github.com/apache/cassandra/blob/cassandra-1.1/NEWS.txt Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 11:14 PM, Michal Michalski mich...@opera.com wrote: OK, thanks Aaron. I ask because NEWS.txt is not a big help in case of 1.1.5 versions because there's no info on them in it (especially on 1.1.7 which seems to be the most important one in this case, according to the DataStax' upgrade instructions) ;-) https://github.com/apache/cassandra/blob/trunk/NEWS.txt M. W dniu 11.02.2013 11:05, aaron morton pisze: You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The news.txt file is your friend there. As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so. Just to catch any crazy. I then upgrade all the nodes and run through the upgrade table. You can stagger upgrade table to be every RF'th node in the cluster to reduce the impact.
Re: Cassandra 1.1.2 - 1.1.8 upgrade
Does anyone know the impact of not running upgrade sstables? Or possibly not running it for several days? nodetool repair will not work. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 11:54 AM, Mike mthero...@yahoo.com wrote: So the upgrade sstables is recommended as part of the upgrade to 1.1.3 if you are using counter columns Also, there was a general recommendation (in another response to my question) to run upgrade sstables because of: upgradesstables always needs to be done between majors. While 1.1.2 - 1.1.8 is not a major, due to an unforeseen bug in the conversion to microseconds you'll need to run upgradesstables. Is this referring to: https://issues.apache.org/jira/browse/CASSANDRA-4432 Does anyone know the impact of not running upgrade sstables? Or possibly not running it for several days? Thanks, -Mike On 2/10/2013 3:27 PM, aaron morton wrote: I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike mthero...@yahoo.com wrote: Thank you, Another question on this topic. Upgrading from 1.1.2-1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2-1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: always check NEWS.txt for instance for cassandra 1.1.3 you need to run nodetool upgradesstables if your cf has counter.
On Wed, Jan 16, 2013 at 11:58 PM, Mike mthero...@yahoo.com wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 - 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra is supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone had any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, is the only thing that needs to change are the JAR files? Can everything remain as-is? Thanks, -Mike
Re: Upgrade to Cassandra 1.2
Were you upgrading to 1.2 AND running the shuffle or just upgrading to 1.2? If you have not run shuffle I would suggest reverting the changes to num_tokens and initial_token. This is a guess because num_tokens is only used at bootstrap. Just get upgraded to 1.2 first, then do the shuffle when things are stable. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote: Thanks Aaron. I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed. - I followed http://www.datastax.com/docs/1.2/install/upgrading, have merged cassandra.yaml, with the following parameters: num_tokens: 256 #initial_token: 0 The initial_token is commented out, the current token should be obtained from the system schema. - I did a rolling upgrade; during the upgrade, I got Broken Pipe errors from the nodes with the old version, is that normal? - After I upgraded 3 nodes (still have 5 to go), I found it is totally wrong, the first node upgraded owns 99.2% of the ring:

[cassy@d5:/usr/local/cassy conf]$ ~/bin/nodetool -h localhost status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load      Tokens  Owns   Host ID                               Rack
DN 10.210.101.117  45.01 GB  254     99.2%  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN 10.210.101.120  45.43 GB  256     0.4%   0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN 10.210.101.111  27.08 GB  256     0.4%   bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1

What was wrong? Please help. I could provide more information if you need. Thanks, Daning

On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote: There is a command line utility in 1.2 to shuffle the tokens… http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

$ ./cassandra-shuffle --help
Missing sub-command argument.
Usage: shuffle [options] <sub-command>

Sub-commands:
  create     Initialize a new shuffle operation
  ls         List pending relocations
  clear      Clear pending relocations
  en[able]   Enable shuffling
  dis[able]  Disable shuffling

Options:
  -dc, --only-dc        Apply only to named DC (create only)
  -tp, --thrift-port    Thrift port number (Default: 9160)
  -p, --port            JMX port number (Default: 7199)
  -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
  -en, --and-enable     Immediately enable shuffling (create only)
  -H, --help            Print help information
  -h, --host            JMX hostname or IP address (Default: localhost)
  -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote: On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote: I'd like to upgrade from 1.1.6 to 1.2.1, one big feature in 1.2 is that it can have multiple tokens in one node. but there is only one token in 1.1.6. how can I upgrade to 1.2.1 then breaking the token to take advantage of this feature? I went through this doc but it does not say how to change the num_token http://www.datastax.com/docs/1.2/install/upgrading Is there other doc about this upgrade path? Thanks, Daning I think for each node you need to change the num_token option in conf/cassandra.yaml (this only splits the current range into num_token parts) and run the bin/cassandra-shuffle command (this spreads it all over the ring).
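Manu's description, that raising num_tokens only splits the node's current range into num_token contiguous parts (shuffle then spreads them around the ring), can be sketched as a range split. This is an illustration of the idea, not shuffle's actual algorithm:

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` roughly equal,
    contiguous slices, as setting num_tokens on an existing node does."""
    width = (end - start) // parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# One node's former single range, split into 256 vnode-sized slices
slices = split_range(0, 2**64, 256)
print(len(slices))  # 256
```

Because the 256 slices are still contiguous, the node owns the same data until shuffle relocates slices to other nodes, which is why upgrading with num_tokens: 256 but without shuffling left Daning's first upgraded node owning almost the whole ring.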
Re: Cassandra 1.2.1 key cache error
This looks like a bug in 1.2 beta https://issues.apache.org/jira/browse/CASSANDRA-4553 Can you confirm you are running 1.2.1 and if you can re-create this with a clean install please create a ticket on https://issues.apache.org/jira/browse/CASSANDRA Thanks - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 13/02/2013, at 1:22 AM, Ahmed Guecioueur ahme...@gmail.com wrote: Hi I am currently evaluating Cassandra on a single node. Running the node seems fine, it responds to Thrift (via Hector) and CQL3 requests to create/delete keyspaces. I have not yet tested any data operations. However, I get the following each time the node is started. This is using the latest production jars (v 1.2.1) downloaded from the Apache website:

INFO [main] 2013-02-07 19:48:55,610 AutoSavingCache.java (line 139) reading saved cache C:\Cassandra\saved_caches\system-local-KeyCache-b.db
WARN [main] 2013-02-07 19:48:55,614 AutoSavingCache.java (line 160) error reading saved cache C:\Cassandra\saved_caches\system-local-KeyCache-b.db
java.io.EOFException
 at java.io.DataInputStream.readInt(Unknown Source)
 at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:349)
 at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:378)
 at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:144)
 at org.apache.cassandra.db.ColumnFamilyStore.init(ColumnFamilyStore.java:277)
 at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
 at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:364)
 at org.apache.cassandra.db.Table.initCf(Table.java:337)
 at org.apache.cassandra.db.Table.init(Table.java:280)
 at org.apache.cassandra.db.Table.open(Table.java:110)
 at org.apache.cassandra.db.Table.open(Table.java:88)
 at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:421)
 at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:177)
 at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
 at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
INFO [SSTableBatchOpen:1] 2013-02-07 19:48:56,212 SSTableReader.java (line 164) Opening C:\Cassandra\data\system_auth\users\system_auth-users-ib-1 (72 bytes)
INFO [main] 2013-02-07 19:48:56,242 CassandraDaemon.java (line 224) completed pre-loading (3 keys) key cache.

That binary file exists, though ofc the content is unreadable. Deleting the file and letting it be recreated doesn't help either. Can anyone suggest any other solutions? Cheers Ahmed
Re: Upgrade to Cassandra 1.2
Restore the settings for num_tokens and initial_token to what they were before you upgraded. They should not be changed just because you are upgrading to 1.2; they are used to enable virtual nodes, which are not necessary to run 1.2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 8:02 AM, Daning Wang dan...@netseer.com wrote: No, I did not run shuffle since the upgrade was not successful. What do you mean by reverting the changes to num_tokens and initial_token? Set num_tokens=1? initial_token should be ignored since it is not bootstrap, right? Thanks, Daning On Tue, Feb 12, 2013 at 10:52 AM, aaron morton aa...@thelastpickle.com wrote: Were you upgrading to 1.2 AND running the shuffle or just upgrading to 1.2? If you have not run shuffle I would suggest reverting the changes to num_tokens and initial_token. This is a guess because num_tokens is only used at bootstrap. Just get upgraded to 1.2 first, then do the shuffle when things are stable. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote: Thanks Aaron. I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed. - I followed http://www.datastax.com/docs/1.2/install/upgrading, have merged cassandra.yaml, with the following parameters: num_tokens: 256 #initial_token: 0 The initial_token is commented out, the current token should be obtained from the system schema. - I did a rolling upgrade; during the upgrade, I got Broken Pipe errors from the nodes with the old version, is that normal?
- After I upgraded 3 nodes (still have 5 to go), I found it is totally wrong, the first node upgraded owns 99.2% of the ring:

[cassy@d5:/usr/local/cassy conf]$ ~/bin/nodetool -h localhost status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load      Tokens  Owns   Host ID                               Rack
DN 10.210.101.117  45.01 GB  254     99.2%  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN 10.210.101.120  45.43 GB  256     0.4%   0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN 10.210.101.111  27.08 GB  256     0.4%   bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1

What was wrong? Please help. I could provide more information if you need. Thanks, Daning

On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote: There is a command line utility in 1.2 to shuffle the tokens… http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

$ ./cassandra-shuffle --help
Missing sub-command argument.

Usage: shuffle [options] <sub-command>

Sub-commands:
  create     Initialize a new shuffle operation
  ls         List pending relocations
  clear      Clear pending relocations
  en[able]   Enable shuffling
  dis[able]  Disable shuffling

Options:
  -dc, --only-dc        Apply only to named DC (create only)
  -tp, --thrift-port    Thrift port number (Default: 9160)
  -p, --port            JMX port number (Default: 7199)
  -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
  -en, --and-enable     Immediately enable shuffling (create only)
  -H, --help            Print help information
  -h, --host            JMX hostname or IP address (Default: localhost)
  -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote: On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote: I'd like to upgrade from 1.1.6 to 1.2.1, one big feature in 1.2 is that it can have multiple tokens in one node. but there is only one token in 1.1.6.
how can I upgrade to 1.2.1 then breaking the token to take advantage of this feature? I went through this doc but it does not say how to change the num_token http://www.datastax.com/docs/1.2/install/upgrading Is there other doc about this upgrade path? Thanks, Daning I think for each node you need to change the num_token option in conf/cassandra.yaml (this only split the current range into num_token parts) and run the bin/cassandra-shuffle command (this spread it all over the ring).
Re: RuntimeException during leveled compaction
That sounds like something wrong with the way the rows are merged during compaction then. Can you run the compaction with DEBUG logging and raise a ticket? You may want to do this with the node not in the ring. Five minutes after it starts, it will run pending compactions, so if compactions are not running they should start again. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 8:11 PM, Andre Sprenger andre.spren...@getanet.de wrote: Aaron, thanks for your help. I ran 'nodetool scrub' and it finished after a couple of hours. But there is no info about out of order rows in the logs and the compaction on the column family still raises the same exception. With the row key I could identify some of the errant SSTables and removed them during a node restart. On some nodes compaction is working for the moment but there are likely more corrupt data files and then I would be in the same situation as before. So I still need some help to resolve this issue! Cheers Andre 2013/2/12 aaron morton aa...@thelastpickle.com snapshot all nodes so you have a backup: nodetool snapshot -t corrupt Run nodetool scrub on the errant CF. Look for messages such as: Out of order row detected… %d out of order rows found while scrubbing %s; Those have been written (in order) to a new sstable (%s) in the logs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 6:13 AM, Andre Sprenger andre.spren...@getanet.de wrote: Hi, I'm running a 6 node Cassandra 1.1.5 cluster on EC2. We switched to leveled compaction a couple of weeks ago, this has been successful.
Some days ago 3 of the nodes started to log the following exception during compaction of a particular column family:

ERROR [CompactionExecutor:726] 2013-02-11 13:02:26,582 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:726,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(84590743047470232854915142878708713938, 3133353533383530323237303130313030303232313537303030303132393832) >= current key DecoratedKey(28357704665244162161305918843747894551, 31333430313336313830333831303130313030303230313632303030303036363338) writing into /var/cassandra/data/AdServer/EventHistory/Adserver-EventHistory-tmp-he-68638-Data.db
 at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
 at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
 at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

Compaction does not happen any more for the column family and read performance gets worse because of the growing number of data files accessed during reads. Looks like one or more of the data files are corrupt and have keys that are stored out of order. Any help to resolve this situation would be greatly appreciated. Thanks Andre
Re: Deleting old items
Is that a feature that could possibly be developed one day ? No. Timestamps are essentially internal implementation used to resolve different values for the same column. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantage/inconvenient of reducing this value ? Yes, min_compaction_threshold, my bad. If you have a wide row and delete a lot of values you will end up with a lot of tombstones. These may dramatically reduce the read performance until they are purged. Reducing the compaction threshold makes compaction happen more frequently. Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later.. How to ensure that I'll always keep a small amount of SSTables then ? AFAIK it's not. There may be some confusion about the location of the settings in CLI vs CQL. Can you point to the docs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 10:14 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Aaron, once again thanks for this answer. So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. Why is there no way of deleting or getting data using the internal timestamp stored alongside of any inserted column (as described here: http://www.datastax.com/docs/1.1/ddl/column_family#standard-columns) ? Is that a feature that could possibly be developed one day ? It could be useful to perform delete of old data or to bring to a dev cluster just the last week of data for example. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantage/inconvenient of reducing this value ? Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later.. How to ensure that I'll always keep a small amount of SSTables then ? Why is this deprecated ? 
Alain 2013/2/12 aaron morton aa...@thelastpickle.com So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive delete. In some cases, I don't know the row key and would like to delete all the rows starting by, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
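Why Aaron says tombstones from wide-row deletes "may dramatically reduce the read performance until they are purged" can be seen in a toy read path: the read has to scan past every tombstone to find the live columns, so work is proportional to live plus deleted data. This is a schematic, not Cassandra's actual read code:

```python
def read_live_columns(row):
    """Scan a wide row, skipping tombstones. Work done is proportional to
    live columns PLUS tombstones, which is why many deletes hurt reads
    until compaction purges the tombstones."""
    scanned, live = 0, []
    for name, value in row:
        scanned += 1
        if value is not None:       # None marks a tombstone in this model
            live.append((name, value))
    return scanned, live

row = [("c1", "v1"), ("c2", None), ("c3", None), ("c4", "v4")]
scanned, live = read_live_columns(row)
print(scanned, len(live))  # 4 columns scanned for only 2 live results
```

More frequent compaction (e.g. lowering min_compaction_threshold to 2, as suggested) shortens the window in which those dead entries are still being scanned.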
Re: Deleting old items during compaction (WAS: Deleting old items)
That's what the TTL does. Manually delete all the older data now, then start using TTL. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 11:08 PM, Ilya Grebnov i...@metricshub.com wrote: Hi, We are looking for a solution to the same problem. We have a wide column family with counters and we want to delete old data, such as data more than 1 month old. One potential idea was to implement a hook in the compaction code and drop the columns which we don't need. Is this a viable option? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, February 12, 2013 9:01 AM To: user@cassandra.apache.org Subject: Re: Deleting old items So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
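The TTL semantics Aaron refers to can be modelled in a few lines. This is a toy, not Cassandra internals: a column written with a TTL records an expiry time, and reads treat expired columns as deleted. It also shows why data written without a TTL (like the old data here) never expires and must be deleted manually.

```python
# Toy model of per-column TTL. The function names and the tuple layout
# are illustrative only.
import time

def write(row, name, value, ttl=None, now=None):
    now = time.time() if now is None else now
    expiry = now + ttl if ttl is not None else None
    row[name] = (value, expiry)

def read(row, name, now=None):
    now = time.time() if now is None else now
    value, expiry = row.get(name, (None, None))
    if expiry is not None and expiry <= now:
        return None  # expired: treated as deleted
    return value

row = {}
write(row, "temp", 21, ttl=30, now=1000)   # expires at t=1030
write(row, "count", 5, now=1000)           # no TTL: never expires
# read(row, "temp", now=1010) -> 21, read(row, "temp", now=1031) -> None
```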
Re: Mutation dropped
You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the CL the client requested was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real-world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL = 1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair. Thanks, Kanwar
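The write path Aaron describes can be sketched as a toy coordinator (simplified, not Cassandra code): the mutation is sent to every replica, but the client call succeeds as soon as CL replicas acknowledge in time. Replicas that drop the mutation are fixed up later by Read Repair or Anti Entropy Repair.

```python
# Toy sketch of why CL=1 writes succeed even while mutations are dropped.
def coordinator_write(replica_acks, consistency_level):
    """replica_acks: one boolean per replica, True if that replica
    acknowledged the mutation before rpc_timeout."""
    acked = sum(replica_acks)
    dropped = len(replica_acks) - acked
    success = acked >= consistency_level
    return success, dropped

# RF=2, CL=1: one replica falls behind and drops the mutation,
# yet the client still sees success.
success, dropped = coordinator_write([True, False], consistency_level=1)
```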
Re: [nodetool] repair with vNodes
I'm a bit late, but for reference. Repair runs in two stages: first, differences are detected. You can monitor the validation compaction with nodetool compactionstats. Then the differences are streamed between the nodes; you can monitor that with nodetool netstats. Nodetool repair command has been running for almost 24hours and I can’t see any activity from the logs or JMX. Grep for session completed Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 11:38 PM, Haithem Jarraya haithem.jarr...@struq.com wrote: Hi, I am new to Cassandra and I would like to hear your thoughts on this. We are running our tests with Cassandra 1.2.1, on a relatively small dataset ~60GB. The nodetool repair command has been running for almost 24 hours and I can’t see any activity from the logs or JMX. What am I missing? Or is there a problem with nodetool repair? What other commands can I run to do a sanity check on the cluster? Can I run nodetool repair on different nodes at the same time? Here is the current test deployment of Cassandra $ nodetool status Datacenter: ams01 (Replication Factor 2) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.70.48.23 38.38 GB 256 19.0% 7c5fdfad-63c6-4f37-bb9f-a66271aa3423 RAC1 UN 10.70.6.78 58.13 GB 256 18.3% 94e7f48f-d902-4d4a-9b87-81ccd6aa9e65 RAC1 UN 10.70.47.126 53.89 GB 256 19.4% f36f1f8c-1956-4850-8040-b58273277d83 RAC1 Datacenter: wdc01 (Replication Factor 1) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.24.116.66 65.81 GB 256 22.1% f9dba004-8c3d-4670-94a0-d301a9b775a8 RAC1 Datacenter: sjc01 (Replication Factor 1) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.55.104.90 63.31 GB 256 21.2% 4746f1bd-85e1-4071-ae5e-9c5baac79469 RAC1 Many Thanks, Haithem
Re: Question on Cassandra Snapshot
With incremental_backup turned OFF in cassandra.yaml - Are all SSTables under /data/TestKeySpace/ColumnFamily at all times? No. They are deleted when they are compacted and no internal operations are referencing them. With incremental_backup turned ON in cassandra.yaml - Are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups? Yes, sort of. *All* SSTables ever created are in the backups directory. Not just the ones currently live. Let's say I have taken a snapshot and moved the /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape, at what point should I be backing up *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach? Back up all files in the snapshot. There may be files with non-.db extensions if you use levelled compaction. When you are finished with the snapshot delete it. If the inode is no longer referenced from the live data dir it will be deleted. I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ what are these timestamp directories? Probably automatic snapshots from dropping KS or CFs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 4:41 AM, S C as...@outlook.com wrote: I appreciate any advice or pointers on this. Thanks in advance. From: as...@outlook.com To: user@cassandra.apache.org Subject: Question on Cassandra Snapshot Date: Thu, 14 Feb 2013 20:47:14 -0600 I have been looking at incremental backups and snapshots. I have done some experimentation but could not come to a conclusion. Can somebody please help me understand it right? /data is my data partition With incremental_backup turned OFF in cassandra.yaml - Are all SSTables under /data/TestKeySpace/ColumnFamily at all times?
With incremental_backup turned ON in cassandra.yaml - Are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups? Let's say I have taken a snapshot and moved the /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape, at what point should I be backing up *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach? I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ what are these timestamp directories? Thanks in advance. SC
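The hardlink behaviour behind snapshots and the inode-matching question above can be demonstrated directly. A snapshot file is a hardlink, so it shares an inode with the live SSTable, and the data blocks stay on disk until the last link is removed. The paths below are a temp-dir stand-in for the data/KeySpace/ColumnFamily layout, not real Cassandra paths.

```python
# Demonstrate hardlinks and inode sharing, as used by snapshots.
import os
import tempfile

datadir = tempfile.mkdtemp()
live = os.path.join(datadir, "ColumnFamily-1-Data.db")
snapdir = os.path.join(datadir, "snapshots")
os.makedirs(snapdir)

with open(live, "w") as f:
    f.write("sstable contents")

snap = os.path.join(snapdir, "ColumnFamily-1-Data.db")
os.link(live, snap)  # what taking a snapshot does per SSTable

# The snapshot shares an inode with the live file...
same_inode = os.stat(live).st_ino == os.stat(snap).st_ino

# ...so when compaction removes the live file, the snapshot copy
# still holds the data until the snapshot itself is cleared.
os.remove(live)
snapshot_survives = os.path.exists(snap)
```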
Re: odd production issue today 1.1.4
There is always this old chestnut http://wiki.apache.org/cassandra/FAQ#ubuntu_hangs A - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 8:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote: With hyper threading a core can show up as two or maybe even four physical system processors, this is something the kernel does. On Fri, Feb 15, 2013 at 11:41 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We ran into an issue today where the website became around 10 times slower. We found out node 5 out of our 6 nodes was hitting 2100% cpu (cat /proc/cpuinfo reveals a 16 processor machine). I am really not sure how to hit 2100% unless we had 21 processors. It bounces between 300% and 2100% so I tried to do a thread dump and had to use -F, and then HotSpot hit a NullPointerException :(. I copied off all my logs after restarting (should have done it before restarting it). Any ideas what I could even look for as to what went wrong with this node? Also, we know our astyanax for some reason is not set up properly yet so we probably would not have seen an issue had we had all nodes in the seed list (which we changed today) as astyanax is supposed to be measuring time per request and changing which nodes it hits, but we know it only hits nodes in our seed list right now as we have not fixed that yet. Our astyanax was hitting 3,4,5,6 and did not have 1 and 2 in the seed list (we roll out a new version next Wed. with the new seed list including the last two, delaying the dynamic discovery config we need to look at). Thanks, Dean Commands I ran with jstack that didn't work out too well…. [cassandra@a5 ~]$ jstack -l 20907 > threads.txt 20907: Unable to open socket file: target process not responding or HotSpot VM not loaded The -F option can be used when the target process is not responding [cassandra@a5 ~]$ jstack -l -F 20907 > threads.txt Attaching to process ID 20907, please wait... Debugger attached successfully. Server compiler detected.
JVM version is 20.7-b02 java.lang.NullPointerException at sun.jvm.hotspot.oops.InstanceKlass.computeSubtypeOf(InstanceKlass.java:426) at sun.jvm.hotspot.oops.Klass.isSubtypeOf(Klass.java:137) at sun.jvm.hotspot.oops.Oop.isA(Oop.java:100) at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:93) at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:39) at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:52) at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45) at sun.jvm.hotspot.tools.JStack.run(JStack.java:60) at sun.jvm.hotspot.tools.Tool.start(Tool.java:221) at sun.jvm.hotspot.tools.JStack.main(JStack.java:86) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.tools.jstack.JStack.runJStackTool(JStack.java:118) at sun.tools.jstack.JStack.main(JStack.java:84) [cassandra@a5 ~]$ java -version java version 1.6.0_32
Re: cassandra vs. mongodb quick question
If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, a JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, and the memory considerations for 100s of millions of rows. If the performance of those operations is acceptable to you, then go crazy. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:05 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So I found out mongodb varies their node size from 1T to 42T per node depending on the profile. So if I was going to be writing a lot but rarely changing rows, could I also use cassandra with a per node size of +20T or is that not advisable? Thanks, Dean
Re: can we pull rows out compressed from cassandra(lots of rows)?
No. The rows are uncompressed deep down in the IO stack. There is compression in the binary protocol http://www.datastax.com/dev/blog/binary-protocol https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol.spec;hb=refs/heads/cassandra-1.2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Thanks, Dean
Re: Deleting old items
I'll email the docs people. I believe they are saying to use compaction throttling rather than this setting, not that this setting does nothing. Although I used this in the last month on a machine with very little RAM to limit compaction memory use. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 17/02/2013, at 7:05 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Can you point to the docs. http://www.datastax.com/docs/1.1/configuration/storage_configuration#max-compaction-threshold And thanks about the rest of your answers, once again ;-). Alain 2013/2/16 aaron morton aa...@thelastpickle.com Is that a feature that could possibly be developed one day ? No. Timestamps are essentially an internal implementation detail used to resolve different values for the same column. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantages/inconveniences of reducing this value ? Yes, min_compaction_threshold, my bad. If you have a wide row and delete a lot of values you will end up with a lot of tombstones. These may dramatically reduce the read performance until they are purged. Reducing the compaction threshold makes compaction happen more frequently. Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later. How to ensure that I'll always keep a small amount of SSTables then ? AFAIK it's not. There may be some confusion about the location of the settings in CLI vs CQL. Can you point me to the docs? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 10:14 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Aaron, once again thanks for this answer. So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No.
Why is there no way of deleting or getting data using the internal timestamp stored alongside of any inserted column (as described here: http://www.datastax.com/docs/1.1/ddl/column_family#standard-columns) ? Is that a feature that could possibly be developed one day ? It could be useful to perform deletes of old data or to bring to a dev cluster just the last week of data, for example. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantages/inconveniences of reducing this value ? Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later. How to ensure that I'll always keep a small amount of SSTables then ? Why is this deprecated ? Alain 2013/2/12 aaron morton aa...@thelastpickle.com So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
Re: Is there any consolidated literature about Read/Write and Data Consistency in Cassandra ?
If you want the underlying ideas try the Dynamo paper, the Big Table paper and the original Cassandra paper from facebook. Start here http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 17/02/2013, at 7:40 AM, mateus mat...@tripleoxygen.net wrote: Like articles with tests and conclusions about it, and such, and not like the documentation in DataStax, or the Cassandra Books. Thank you.
Re: nodetool repair with vnodes
…so it seems to me that it is running on all vnodes ranges. Yes. Also, whatever the node which I launch the command on is, only one node log is moving and is always the same node. Not sure what you mean here. So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. Is there anything I'm missing ? Look for messages with session completed in the log from the AntiEntropyService. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 12:51 AM, Marco Matarazzo marco.matara...@hexkeep.com wrote: Greetings. I'm trying to run nodetool repair on a Cassandra 1.2.1 cluster of 3 nodes with 256 vnodes each. On a pre-1.2 cluster I used to launch a nodetool repair on every node every 24hrs. Now I'm getting a different behavior, and I'm sure I'm missing something. What I see on the command line is: [2013-02-17 10:20:15,186] Starting repair command #1, repairing 768 ranges for keyspace goh_master [2013-02-17 10:48:13,401] Repair session 3d140e10-78e3-11e2-af53-d344dbdd69f5 for range (6556914650761469337,6580337080281832001] finished (…repeat the last line 767 times) …so it seems to me that it is running on all vnodes ranges. Also, whatever the node which I launch the command on is, only one node log is moving and is always the same node. So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. I'm sure I'm making some mistakes, and I just can't find any clue of what's wrong with my nodetool usage in the documentation (if anything is wrong, btw). Is there anything I'm missing ? -- Marco Matarazzo
Re: Deleting old items during compaction (WAS: Deleting old items)
Sorry, missed the Counters part. You are probably interested in this one https://issues.apache.org/jira/browse/CASSANDRA-5228 Add your need to the ticket to help it along. IMHO if you have write once, read many time series data the SSTables are effectively doing horizontal partitioning for you. So being able to drop a partition would make life easier. If you can delete the entire row then the deletes have less impact than per column. However the old rows will not be purged from disk unless all fragments of the row are involved in a compaction process. So it may take some time to purge from disk, depending on the workload. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:43 AM, Ilya Grebnov i...@metricshub.com wrote: According to https://issues.apache.org/jira/browse/CASSANDRA-2103 there is no support for time to live (TTL) on counter columns. Did I miss something? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Sunday, February 17, 2013 9:16 AM To: user@cassandra.apache.org Subject: Re: Deleting old items during compaction (WAS: Deleting old items) That's what the TTL does. Manually delete all the older data now, then start using TTL. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 11:08 PM, Ilya Grebnov i...@metricshub.com wrote: Hi, We are looking for a solution to the same problem. We have a wide column family with counters and we want to delete old data, such as data more than 1 month old. One potential idea was to implement a hook in the compaction code and drop the columns which we don't need. Is this a viable option? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, February 12, 2013 9:01 AM To: user@cassandra.apache.org Subject: Re: Deleting old items So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes.
If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
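Aaron's point above — that a deleted row is purged from disk only when every SSTable holding a fragment of it takes part in a compaction — can be captured in a toy rule. This is not Cassandra code (and it ignores gc_grace_seconds), just the condition he states.

```python
# Toy model: a row's tombstone can be purged only when the compaction
# includes *every* SSTable containing a fragment of that row.
def can_purge(row_key, compacting, all_sstables):
    """True if all SSTables holding this row take part in the compaction."""
    holding = {name for name, keys in all_sstables.items() if row_key in keys}
    return holding <= set(compacting)

sstables = {
    "sst1": {"rowA", "rowB"},  # newest rowA fragment + tombstone
    "sst2": {"rowA"},          # older rowA fragment
    "sst3": {"rowC"},
}

# A compaction that leaves out sst2 cannot purge rowA; one that
# includes both rowA SSTables can.
```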
Re: nodetool repair with vnodes
So, running it periodically on just one node is enough for cluster maintenance ? In the special case where you have RF == number of nodes. The recommended approach is to use -pr and run it on each node periodically. Also: running it with -pr does output: That does not look right. There should be messages about requesting and receiving Merkle trees from other nodes, and that certain CFs are in sync. These are all logged from the AntiEntropyService. Is there a way to run it only for all vnodes on a single physical node ? It should be doing that. Look for messages like this in the log: logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames))); They say how much is going to be synced, and with what. Try running repair with -pr on one of the nodes not already repaired. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 11:12 AM, Marco Matarazzo marco.matara...@hexkeep.com wrote: So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. That's correct, 3 nodes and RF 3. Sorry for not specifying it in the beginning. So, running it periodically on just one node is enough for cluster maintenance ? Does this depend on the fact that every vnode's data is related to the previous and next vnode, and this particular setup making it enough to cover every physical node?
Also: running it with -pr does output: [2013-02-17 12:29:25,293] Nothing to repair for keyspace 'system' [2013-02-17 12:29:25,301] Starting repair command #2, repairing 1 ranges for keyspace keyspace_test [2013-02-17 12:29:28,028] Repair session 487d0650-78f5-11e2-a73a-2f5b109ee83c for range (-9177680845984855691,-9171525326632276709] finished [2013-02-17 12:29:28,028] Repair command #2 finished … that, as far as I can understand, works on the first vnode on the specified node, or so it seems from the output range. Am I right? Is there a way to run it only for all vnodes on a single physical node ? Thank you! -- Marco Matarazzo
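The difference between plain repair and repair -pr in this thread can be illustrated with a toy token ring (a sketch, not Cassandra's real replication code): without -pr a repair covers every range the node replicates, so with RF == number of nodes one run covers the whole cluster; with -pr each node repairs only its primary ranges, so every node must run it for the ring to be covered exactly once.

```python
# Toy ring: 3 nodes, RF 3, one primary range per node for simplicity
# (a vnode cluster just has many more ranges per node).
nodes = ["A", "B", "C"]
rf = 3
ranges = list(range(len(nodes)))  # primary range i is owned by nodes[i]

def replicas(rng):
    """Replicas are the owner plus the next rf-1 nodes on the ring."""
    return {nodes[(rng + k) % len(nodes)] for k in range(rf)}

def repair(node, primary_only):
    if primary_only:  # nodetool repair -pr
        return [r for r in ranges if nodes[r] == node]
    return [r for r in ranges if node in replicas(r)]

# Without -pr and RF == cluster size, one node repairs every range.
full = repair("A", primary_only=False)
# With -pr, running it on every node covers each range exactly once.
covered = sorted(r for n in nodes for r in repair(n, primary_only=True))
```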
Re: Cassandra on Red Hat 6.3
Nothing jumps out. Check /var/log/cassandra/output.log, that's where stdout and stderr are directed. Check file permissions. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 9:08 PM, amulya rattan talk2amu...@gmail.com wrote: I followed the step-by-step instructions for installing Cassandra on Red Hat Linux Server 6.3 from the datastax site, without much success. Apparently it installs fine but starting the cassandra service does nothing (no ports are bound so opscenter/cli doesn't work). When I check the service's status, it shows Cassandra dead but pid file exists. When I try launching Cassandra from /usr/sbin, it throws Error opening zip file or JAR manifest missing : /lib/jamm-0.2.5.jar and stops, so clearly that's why the service isn't running. While I investigate it further, I thought it'd be worthwhile to put this on the list and see if anybody else saw a similar issue. I must point out that this is a fresh machine with a fresh Cassandra installation so no conflicts with any previous installations are possible. So, anybody else came across something similar? ~Amulya
Re: NPE in running ClientOnlyExample
And you can never go wrong relying on the documentation for the python pycassa library; it has some handy tutorials for getting started. http://pycassa.github.com/pycassa/ cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 9:51 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: I hope you have already gone through this link https://github.com/zznate/hector-examples. If not I will suggest you go through it, and you can also refer to http://hector-client.github.com/hector/build/html/documentation.html. Best Regards, On Mon, Feb 18, 2013 at 12:15 AM, Jain Rahul ja...@ivycomptech.com wrote: Thanks Edward, My Bad. I was confused as it does seem to create the keyspace also, as I understand (although I'm not sure): List<CfDef> cfDefList = new ArrayList<CfDef>(); CfDef columnFamily = new CfDef(KEYSPACE, COLUMN_FAMILY); cfDefList.add(columnFamily); try { client.system_add_keyspace(new KsDef(KEYSPACE, "org.apache.cassandra.locator.SimpleStrategy", 1, cfDefList)); int magnitude = client.describe_ring(KEYSPACE).size(); Can I request you to please point me to some examples I can start with. I tried to look at some examples from hector but they seem to be in line with Cassandra's 1.1 version. Regards, Rahul -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: 17 February 2013 21:49 To: user@cassandra.apache.org Subject: Re: NPE in running ClientOnlyExample This is a bad example to follow. This is the internal client the Cassandra nodes use to talk to each other (fat client); usually you do not use this unless you want to write some embedded code on the Cassandra server. Typically clients use thrift/native transport. But you are likely getting the error you are seeing because the keyspace or column family is not created yet.
On Sat, Feb 16, 2013 at 11:41 PM, Jain Rahul ja...@ivycomptech.com wrote: Hi All, I am newbie to Cassandra and trying to run an example program ClientOnlyExample taken from https://raw.github.com/apache/cassandra/cassandra-1.2/examples/client_only/src/ClientOnlyExample.java. But while executing the program it gives me a null pointer exception. Can you guys please help me out what I am missing. I am using Cassandra 1.2.1 version. I have pasted the logs at http://pastebin.com/pmADWCYe Exception in thread main java.lang.NullPointerException at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:71) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:66) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:61) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:56) at org.apache.cassandra.db.RowMutation.add(RowMutation.java:183) at org.apache.cassandra.db.RowMutation.add(RowMutation.java:204) at ClientOnlyExample.testWriting(ClientOnlyExample.java:78) at ClientOnlyExample.main(ClientOnlyExample.java:135) Regards, Rahul This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised. Any actions taken on the basis of this email are at the recipient's own risk.
-- Abhijit Chanda +91-974395
Re: cassandra vs. mongodb quick question
My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non-compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations, then go with it. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity: When using compression, does this affect this one way or another? Is 300G (compressed) SSTable size, or the total size of the data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, a JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, and the memory considerations for 100s of millions of rows. If the performance of those operations is acceptable to you, then go crazy.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:05 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So I found out mongodb varies their node size from 1T to 42T per node depending on the profile. So if I was going to be writing a lot but rarely changing rows, could I also use cassandra with a per node size of +20T or is that not advisable? Thanks, Dean
Re: Mutation dropped
Does the rpc_timeout not control the client timeout ? No. It is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client-requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? 
Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL = 1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
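The accounting Aaron describes — a request is called complete once CL replicas have acknowledged, whatever the remaining replicas do — can be sketched as a toy model (this is an illustration of the rule, not Cassandra's actual coordinator code):

```python
def request_succeeds(acks_received, consistency_level):
    """A coordinator calls a request complete once CL replicas have
    acknowledged, even if the remaining replicas later drop the
    mutation (which shows up as 'mutation dropped' on those nodes)."""
    return acks_received >= consistency_level

# RF=2, CL=1: the client sees success after one ack, while the second
# replica may still drop its copy under load.
print(request_succeeds(acks_received=1, consistency_level=1))  # True
print(request_succeeds(acks_received=1, consistency_level=2))  # False
```

This is why the YCSB client reports no failures even while nodes log dropped mutations: success is defined by CL, not by RF.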
Re: Testing compaction strategies on a single production server?
I *think* it will work. The steps in the blog post change the compaction strategy before RING_DELAY expires to ensure no sstables are created before the strategy is changed. But I think you will be venturing into uncharted territory where there might be dragons. And not the fun Disney kind. While it may be more work, I personally would use one node in write survey mode to test LCS. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 6:28 AM, Henrik Schröder skro...@gmail.com wrote: Well, that answer didn't really help. I know how to make a survey node, and I know how to simulate reads to it, it's just that that's a lot of work, and I wouldn't be sure that the simulated load is the same as the production load. We gather a lot of metrics from our production servers, so we know exactly how they perform over long periods of time. Changing a single server to run a different compaction strategy would allow us to know in detail how a different strategy would impact the cluster. So, is it possible to modify org.apache.cassandra.db.[keyspace].[column family].CompactionStrategyClass through jmx on a production server without any ill effects? Or is this only possible to do on a survey node while it is in a specific state? /Henrik On Tue, Feb 19, 2013 at 3:09 PM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: Just turn off dynamic snitch on the survey node and make read requests from it directly with CL.ONE, watch histograms, compare. Regarding switching compaction strategy there's a lot of info already. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063, Fax +370 5 261 0453 J. 
Jasinskio 16C, LT-01112 Vilnius, Lithuania From: Henrik Schröder [mailto:skro...@gmail.com] Sent: Tuesday, February 19, 2013 15:57 To: user Subject: Testing compaction strategies on a single production server? Hey, Version 1.1 of Cassandra introduced live traffic sampling, which allows you to measure the performance of a node without it really joining the cluster: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling That page mentions that you can change the compaction strategy through jmx if you want to test out a different strategy on your survey node. That's great, but it doesn't give you a complete view of how your performance would change, since you're not doing reads from the survey node. But what would happen if you used jmx to change the compaction strategy of a column family on a single *production* node? Would that be a safe way to test it out, or are there side-effects of doing that live? And if you do that, would running a major compaction transform the entire column family to the new format? Finally, if the test was a success, how do you proceed from there? Just change the schema? /Henrik
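For reference, the runtime switch Henrik asks about is done by setting the CompactionStrategyClass attribute on the column family's MBean. A hedged sketch using the jmxterm CLI — the jar path, host, keyspace and column family names are placeholders, and the MBean/attribute names should be verified against your Cassandra version before touching a production node:

```shell
# Placeholders throughout; verify the MBean and attribute names on your build.
java -jar jmxterm.jar -l localhost:7199 <<'EOF'
set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=MyKeyspace,columnfamily=MyCF \
    CompactionStrategyClass org.apache.cassandra.db.compaction.LeveledCompactionStrategy
EOF
```

Note this changes the in-memory strategy only; a change made through the schema is what persists across restarts.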
Re: Cassandra network latency tuning
I would like to understand how we can capture network latencies between a 1GbE and 10GbE for example. Cassandra reports two latencies. The CF latencies reported by nodetool cfstats, nodetool cfhistograms and the CF MBeans cover the local time it takes to read or write the data. This does not include any local wait times, network latency or coordinator overhead. The Storage Proxy latency from nodetool proxyhistograms and the StorageProxy MBean is the total latency for a request on a coordinator. Under load, with a consistent workload, the CF latency should not vary too much, while the request latency can increase as wait time becomes more of a factor. Additionally, streaming is throttled; you may want to increase the throttle, see the yaml file. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required. You may want to increase both concurrent_writes and concurrent_reads in the yaml file to take advantage of the extra IO. Same for the compaction settings; the comments in the yaml file will help. With SSD and 10GbE you can easily hold more data on each node. Typically we advise 300GB to 500GB per node with HDD and 1GbE, because of the time repair and node replacement take. With SSD and 10GbE those operations will take less time, so you can go higher. If you feel like being thorough, add repair and node replacement (all under load) to your test lineup. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:44 PM, Brandon Walsh brandon_9021...@yahoo.com wrote: I have a 5 node cluster and currently running ver 1.2. Prior to full scale deployment, I'm running some benchmarks using YCSB. From a hadoop cluster deployment we saw an excellent improvement using higher speed networks. 
However Cassandra does not include network latencies and I would like to understand how we can capture network latencies between a 1GbE and 10GbE for ex. As of now all the graphs look the same. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required.
Re: How to limit query results like from row 50 to 100
CQL does not support offset but does have limit. See http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas mateus.ffrei...@hotmail.com wrote: With CQL or an API.
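Since there is no OFFSET, the usual workaround is to page by key: fetch LIMIT n rows, remember the last key returned, and start the next query after it. A minimal sketch that only builds the CQL strings — the column family and the `KEY > …` predicate are illustrative, and with RandomPartitioner rows come back in token order, so key-range paging semantics depend on your partitioner:

```python
def page_query(column_family, last_key=None, page_size=50):
    """Build a CQL query for one page of rows; pass the last key of
    the previous page to fetch the next page. String-building only --
    KEY comparison semantics depend on the partitioner in use."""
    where = f" WHERE KEY > '{last_key}'" if last_key else ""
    return f"SELECT * FROM {column_family}{where} LIMIT {page_size}"

print(page_query("users"))         # first page of 50
print(page_query("users", "bob"))  # next page, starting after 'bob'
```

So "rows 50 to 100" is reached by walking pages, not by jumping to an offset.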
Re: Heap is N.N full. Immediately on startup
My first guess would be the bloom filter and index sampling from lots-o-rows Check the row count in cfstats Check the bloom filter size in cfstats. Background on memory requirements http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides row cache : not persisted and is at 0 keys when this warning is produced Memtables : no write traffic at startup, my app's column families are durable_writes:false Pending tasks : no pending tasks, except for 928 compactions ( not sure where those are coming from ) I drew these conclusions from the StatusLogger output below: INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool NameActive Pending Blocked INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0-1 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage0 0 0 INFO [ScheduledTasks:1] 2013-02-20 
05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingServicen/a 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 00 all org.apache.cassandra.cache.SerializingCacheProvider INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) ColumnFamilyMemtable ops,data INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_1.CF0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_2.CF 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) HiveMetaStore.MetaStore 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.NodeIdInfo 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.IndexInfo 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.LocationInfo 0,0 INFO
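To see why lots of rows translate into heap at startup, the textbook Bloom filter sizing formula gives a feel for the numbers. A back-of-envelope sketch — the 1% false-positive rate below is illustrative, not necessarily what this Cassandra version targets, and index samples add more on top:

```python
import math

def bloom_filter_bytes(n_keys, fp_rate):
    """Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_rate) / (math.log(2) ** 2)
    return int(math.ceil(bits / 8))

# e.g. 100 million rows at an illustrative 1% false-positive rate:
mb = bloom_filter_bytes(100_000_000, 0.01) / 1024 / 1024
print(f"~{mb:.0f} MB of bloom filter")  # ~114 MB, before index samples
```

Multiply by the number of column families and sstable generations and a large fraction of an 8GB heap being occupied at startup stops being surprising.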
Re: SSTable Num
Hi – I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair, and that can take a long time with a lot of data. Also you may need to replace a node one day, and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable ? Or will the sstable compaction continue and eventually we will have 1 file ? Thanks, Kanwar
Re: how to debug slowdowns from these log snippets-more info 2
Some things to consider: Check for contention around the switch lock. This can happen if you get a lot of tables flushing at the same time, or if you have a lot of secondary indexes. It shows up as a pattern in the logs: as soon as the writer starts flushing a memtable, another will be queued. Probably not happening here, but it can be a pain when a lot of memtables are flushed. I would turn on GC logging in cassandra-env.sh and watch that. After a full CMS collection, how full / empty is the tenured heap ? If it still has a lot in it then you are running with too much cache / bloom filter / index sampling. You can also experiment with the Max Tenuring Threshold; try turning it up to 4 to start with. The GC logs will show you how much data is at each tenuring level. You can then see how much data is being tenured, and whether premature tenuring was an issue. I've seen premature tenuring cause issues with wide rows / long reads. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Oh, and my startup command that cassandra logged was a2.bigde.nrel.gov: xss = -ea -javaagent:/opt/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8021M -Xmx8021M -Xmn1600M -XX:+HeapDumpOnOutOfMemoryError -Xss128k And I remember from the docs you don't want to go above 8G or java GC doesn't work out so well. I am not sure why this is not working out though. Dean On 2/20/13 7:16 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Here is the printout before that log, which is probably important as well…
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 3618 ms for 2 collections, 7038159096 used; max is 8243904512 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 57) Pool NameActive Pending Blocked INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 72) ReadStage11 264 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) RequestResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReadRepairStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) MutationStage1288 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) GossipStage 1 7 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) AntiEntropyStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MigrationStage0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) StreamStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) FlushWriter 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MiscStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) commitlog_archiver0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) InternalResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) HintedHandoff 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 77) CompactionManager 4 5 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 89) MessagingServicen/a10,127 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 
StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 100) KeyCache1310719 1310719 all INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 106) RowCache 00 all org.apache.cassandra.cache.SerializingCacheProvider INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 113) ColumnFamilyMemtable ops,data INFO [ScheduledTasks:1] 2013-02-20 07:14:00,379 StatusLogger.java
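The GC-logging and tenuring suggestions above amount to a few JVM flags appended in cassandra-env.sh. A sketch assuming a HotSpot JVM — the flag names are the standard HotSpot ones, and MaxTenuringThreshold=4 is the starting point suggested above, not a recommendation for every workload:

```shell
# Append to cassandra-env.sh (HotSpot flags; verify against your JVM)
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"  # shows data per tenuring level
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"      # survive 4 young GCs before tenuring
echo "$JVM_OPTS"
```

With PrintTenuringDistribution on, the GC log shows how many bytes sit at each age, which is what lets you judge whether objects are being tenured prematurely.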
Re: Mutation dropped
What does rpc_timeout control? Only the reads/writes? Yes. like data stream, streaming_socket_timeout_in_ms in the yaml merkle tree request? Either no time out or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, plus the time it takes the coordinator to do its thing. You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is a reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No. It is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client-requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? 
Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL =1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
Re: very confused by jmap dump of cassandra
Cannot comment too much on the jmap, but I can add my general "compaction is hurting" strategy. Try any or all of the following to get to a stable setup, then increase until things go bang. Set concurrent compactors to 2. Reduce compaction throughput by half. Reduce in_memory_compaction_limit. If you see compactions using a lot of sstables in the logs, reduce max_compaction_threshold. I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. More JVM memory is not the answer. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 7:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I took this jmap dump of cassandra (in production). Before I restarted the whole production cluster, I had some nodes running compaction and it looked like all memory had been consumed (kind of like cassandra is not clearing out the caches or memtables fast enough). I am still trying to debug why compaction causes slowness on the cluster since all cassandra.yaml files are pretty much the defaults with size tiered compaction. The weird thing is I dump and get a 5.4G heap.bin file and load that into Eclipse, which tells me the total is 142.8MB… what? So low, when top was showing 1.9G at the time (and I took this top snapshot later, 2 hours after)… (how is the Eclipse profiler telling me the jmap showed 142.8MB in use instead of 1.9G in use?) 
Tasks: 398 total, 1 running, 397 sleeping, 0 stopped, 0 zombie Cpu(s): 2.8%us, 0.5%sy, 0.0%ni, 96.5%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 32854680k total, 31910708k used, 943972k free, 89776k buffers Swap: 33554424k total, 18288k used, 33536136k free, 23428596k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20909 cassandr 20 0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java 22455 cassandr 20 0 15288 1340 824 R 3.9 0.0 0:00.02 top It almost seems like cassandra is not being good about memory management here, as we slowly get into a situation where compaction is run which takes out our memory (configured for 8G). I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. Has anyone else taken a jmap dump of cassandra? Thanks, Dean
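Aaron's four knobs mostly live in cassandra.yaml (max_compaction_threshold is set per column family in the schema). A sketch of the yaml side with illustrative values — halve or reduce from whatever your current settings are rather than copying these numbers, and check the setting names against your version's yaml comments:

```yaml
# cassandra.yaml -- illustrative conservative values, not defaults
concurrent_compactors: 2
compaction_throughput_mb_per_sec: 8    # half the usual 16
in_memory_compaction_limit_in_mb: 32   # down from the usual 64
```

The idea is to cap how much memory and IO compaction can claim at once, then raise the limits gradually once the cluster is stable.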
Re: cassandra vs. mongodb quick question(good additional info)
If you are lazy like me wolfram alpha can help http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbEa=UnitClash_*TB.*Tebibytes-- 10 hours 15 minutes 43.59 seconds Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote: you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link 19 lut 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote: I thought about this more, and even with a 10Gbit network, it would take 40 days to bring up a replacement node if mongodb did truly have 42T / node like I had heard. I wrote the below email to the person I heard this from going back to basics, which really puts some perspective on it… (and a lot of people don't even have a 10Gbit network like we do) Nodes are hooked up by a 10G network at most right now where that is 10 gigabit. We are talking about 10 Terabytes on disk per node recently. Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second (yes I could have divided by 8 in my head but eh… of course when I saw the number, I went duh) So trying to transfer 10 Terabytes or 10,000 Gigabytes to a node that we are bringing online to replace a dead node would take approximately 5 days??? This means no one else is using the bandwidth too ;). 10,000 Gigabytes * 1 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days. This is more likely 11 days if we only use 50% of the network. So bringing a new node up to speed is more like 11 days once it is crashed. I think this is the main reason the 1 Terabyte limit exists to begin with, right? From an ops perspective, this could sound like a nightmare scenario of waiting 10 days… maybe it is livable though. Either way, I thought it would be good to share the numbers. ALSO, that is assuming the bus with its 10 disks can keep up with 10G. Can it? 
What is the limit of throughput on a bus / second on the computers we have, as on wikipedia there is a huge variance? What is the rate of the disks too (multiplied by 10 of course)? Will they keep up with a 10G rate for bringing a new node online? This all comes into play even more so when you want to double the size of your cluster, of course, as all nodes have to transfer half of what they have to all the new nodes that come online (cassandra actually has a very data center/rack aware topology to transfer data correctly to not use up all bandwidth unnecessarily… I am not sure mongodb has that). Anyways, just food for thought. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Monday, February 18, 2013 1:39 PM To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no Subject: Re: cassandra vs. mongodb quick question My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations, then go with it. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity : When using compression, does this affect this one way or another? Is 300G (compressed) SSTable size, or total size of data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, the memory considerations for 100's of millions of rows. If you
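The wolfram result and Wojciech's figure are easy to reproduce; Dean's 5.55-day number appears to come from a unit slip in the conversion (1 hr / 60 secs where 1 min / 60 secs was meant). A quick sketch of the raw wire-speed arithmetic — real transfers will be slower because of protocol overhead, disk throughput, and shared bandwidth:

```python
def transfer_hours(data_tb, link_gbit, efficiency=1.0):
    """Raw transfer time for data_tb terabytes over a link_gbit link.
    Uses decimal units (1 TB = 1000 GB; 10 Gbit/s = 1.25 GB/s)."""
    gbytes_per_sec = link_gbit / 8 * efficiency
    return data_tb * 1000 / gbytes_per_sec / 3600

print(f"{transfer_hours(42, 10):.1f} h")       # ~9.3 h at full wire speed
print(f"{transfer_hours(10, 10, 0.5):.1f} h")  # 10 TB at 50% utilisation
```

So 10 TB over 10GbE is hours, not days; the days-long figures in practice come from the bottlenecks the thread goes on to list, not the wire.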
Re: key cache size
This is the key cache entry https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cache/KeyCacheKey.java Note that the Descriptor is re-used. If you want to see key cache metrics, including bytes used, use nodetool info. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 3:45 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – What is the approximate overhead of the key cache ? Say each key is 50 bytes. What would be the overhead for this key in the key cache ? Thanks, Kanwar
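As a very rough way to reason about the question: each entry carries the key bytes plus a fixed chunk of JVM overhead (object headers, references, the map entry, and the cached value). The fixed figure below is an assumption for illustration, not measured from Cassandra — use nodetool info for real numbers on your cluster:

```python
def key_cache_entry_bytes(key_len, fixed_overhead=112):
    """Rough heap cost of one key cache entry: the key bytes plus an
    ASSUMED ~112 bytes for object headers, references, the map entry
    and the cached position. Illustrative only -- measure, don't trust."""
    return key_len + fixed_overhead

print(key_cache_entry_bytes(50))  # 162
```

On that assumption a 50-byte key costs on the order of 150-200 bytes cached, so overhead dominates the key itself for short keys.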
Re: Read IO
AFAIK this is still roughly correct http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ It includes information on the page size read from disk. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 5:45 AM, Jouni Hartikainen jouni.hartikai...@reaktor.fi wrote: Hi, On Feb 21, 2013, at 7:52 , Kanwar Sangha kan...@mavenir.com wrote: Hi – Can someone explain the worst case IOPS for a read ? No key cache, no row cache, sampling rate say 512. 1) Bloom filter will be checked to see existence of key (in RAM) 2) Index file sample (in RAM) will be checked to find approx. location in index file on disk 3) 1 IOP to read the actual index file on disk (DISK) 4) 1 IOP to get the data from the location in the sstable (DISK) Is this correct ? As you were asking for the worst case, I would still add one step that would be a seek inside an SSTable from the row start to the queried columns using the column index. However, this applies only if you are querying a subset of columns in the row (not all) and the total row size exceeds column_index_size_in_kb (defaults to 64kB). So, as far as I have understood, the worst case steps (without any caches) are: 1. Check the SSTable bloom filters (in memory) 2. Use index samples to find approx. correct place in the key index file (in memory) 3. Read the key index file until the correct key is found (1st disk seek) 4. Seek to the start of the row in the SSTable file and read the row headers, possibly including the column index (2nd disk seek) 5. Using the column index, seek to the correct place inside the SSTable file to actually read the columns (3rd disk seek) If the row is very wide and you are asking for a random bunch of columns from here and there, the last step might even be needed multiple times. Also, if your row has spread over many SSTables, each of them needs to be accessed (at least up to reading the row headers) to get the complete results for the query. 
All this in mind, if your node has any reasonable amount of reads, I'd say that in practice key index files will be page cached by the OS very quickly and thus a normal read would end up being either one seek (for small rows without the column index) or two (for wider rows). Of course, as Peter already pointed out, the more columns you ask for, the more the disk needs to read. For a continuous set of columns the read should be linear, however. -Jouni
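The worst-case read path Jouni describes can be sketched as a back-of-the-envelope seek counter. This is an illustrative model only; the function name and parameters are mine, not from the Cassandra source:

```python
# Rough model of the worst-case disk seeks for a single cold-cache read,
# following the steps described above. Illustrative only.

def worst_case_seeks(sstables_with_row, row_wider_than_column_index=False,
                     column_index_hits=1):
    """Estimate disk seeks for one read with cold caches.

    Per SSTable holding the row:
      1 seek to scan the key index file,
      1 seek to the row start (headers + column index),
      plus, for wide rows, 1 seek per distinct column-index page touched.
    """
    seeks_per_sstable = 2  # key index file read + row header read
    if row_wider_than_column_index:
        seeks_per_sstable += column_index_hits
    return sstables_with_row * seeks_per_sstable

# A small row in one SSTable: index read + data read.
print(worst_case_seeks(1))                                   # 2
# A wide row spread over 3 SSTables, touching 2 index pages in each.
print(worst_case_seeks(3, row_wider_than_column_index=True,
                       column_index_hits=2))                 # 12
```

Once the key index files are page cached, the first seek per SSTable disappears, which matches Jouni's one-or-two-seek estimate for a warm node.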
Re: SSTable Num
Ok. So for 10 TB, I could have at least 4 SStables files each of 2.5 TB ? You will have many sstables, in your case 32. Each bucket of files (files that are within 50% of the average size of files in a bucket) will contain 3 or fewer files. This article provides some background, but it's working correctly as you have described it http://www.datastax.com/dev/blog/when-to-use-leveled-compaction Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 6:39 AM, Kanwar Sangha kan...@mavenir.com wrote: No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Ok. So for 10 TB, I could have at least 4 SStables files each of 2.5 TB ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 21 February 2013 11:01 To: user@cassandra.apache.org Subject: Re: SSTable Num Hi – I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair and that can take a long time with a lot of data. Also you may need to replace a node one day and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable ? Or will the sstable compaction continue and eventually we will have 1 file ? Thanks, Kanwar
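The size-tiered bucketing rule Aaron describes (files within 50% of a bucket's average size share a bucket, and a bucket is only compacted once min_threshold files, default 4, accumulate) can be sketched roughly like this. It is a simplification, not the actual compaction code:

```python
# Illustrative sketch of size-tiered bucketing as described above.
# Files within 50% of a bucket's average size join that bucket; a bucket
# becomes a compaction candidate once it holds min_threshold (default 4)
# files. Simplified, not the real Cassandra implementation.

def bucket_sstables(sizes):
    buckets = []  # each bucket is a list of sstable sizes
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 1.5 * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def compaction_candidates(sizes, min_threshold=4):
    return [b for b in bucket_sstables(sizes) if len(b) >= min_threshold]

# 32 similarly sized files: one big bucket, eligible for compaction.
print(len(compaction_candidates([100] * 32)))  # 1
# Only 3 similar files: below the threshold, nothing to compact.
print(len(compaction_candidates([100] * 3)))   # 0
```

This is why a node can sit at 32 sstables with no compaction running: each bucket holds fewer than 4 similar-sized files, so nothing is eligible.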
Re: Heap is N.N full. Immediately on startup
To get a good idea of how GC is performing turn on the GC logging in cassandra-env.sh. After a full cms GC event, see how big the tenured heap is. If it's not reducing enough then GC will never get far enough ahead. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 8:37 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Thank you- indeed my index interval is 64 with a CF of 300M rows + bloom filter false positive chance was default. Raising the index interval to 512 didn't fix this alone, so I guess I'll have to set the bloom filter to some reasonable value and scrub. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org user@cassandra.apache.org Date: Thursday 21 February 2013 17:58 To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: Heap is N.N full. Immediately on startup My first guess would be the bloom filter and index sampling from lots-o-rows Check the row count in cfstats Check the bloom filter size in cfstats. 
Background on memory requirements http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides:
- row cache : not persisted and is at 0 keys when this warning is produced
- Memtables : no write traffic at startup, my app's column families are durable_writes:false
- Pending tasks : no pending tasks, except for 928 compactions ( not sure where those are coming from )
I drew these conclusions from the StatusLogger output below:
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name Active Pending Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0 -1 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingService n/a 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 0 0
Re: Mutation dropped
If you are running repair, using QUORUM, and there are no dropped writes you should not be getting DigestMismatch during reads. If everything else looks good, but the request latency is higher than the CF latency, I would check that client load is evenly distributed. Then start looking to see if the request throughput is at its maximum for the cluster. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 8:15 PM, Wei Zhu wz1...@yahoo.com wrote: Thanks Aaron for the great information as always. I just checked cfhistograms and only a handful of read latencies are bigger than 100ms, but for proxyhistograms there are 10 times as many greater than 100ms. We are using QUORUM for reading with RF=3, and I understand the coordinator needs to get the digest from the other nodes and read repair on a mismatch etc. But is it normal to see the latency from proxyhistograms go beyond 100ms? Is there any way to improve that? We are tracking the metrics from the client side and we see the 95th percentile response time averages at 40ms which is a bit high. Our 50th percentile was great, under 3ms. Any suggestion is very much appreciated. Thanks. -Wei - Original Message - From: aaron morton aa...@thelastpickle.com To: Cassandra User user@cassandra.apache.org Sent: Thursday, February 21, 2013 9:20:49 AM Subject: Re: Mutation dropped What does rpc_timeout control? Only the reads/writes? Yes. Like data streams? streaming_socket_timeout_in_ms in the yaml. Merkle tree requests? Either no timeout or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, and the time it takes the coordinator to do its thing.
You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is the reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No, it is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster.
The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL =1. Load balanced between the two
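The rpc_timeout semantics discussed in this thread can be modelled as a toy coordinator: a request is a client-visible success once CL replicas respond within rpc_timeout, even if the remaining replicas drop the mutation. Names and numbers here are illustrative, not from the Cassandra source:

```python
# Toy model of the coordinator behaviour described above: a write
# succeeds once CL replicas acknowledge within rpc_timeout; slower
# replicas count as dropped mutations but do not fail the request.

def coordinate_write(replica_latencies_ms, cl, rpc_timeout_ms=10000):
    acked = [l for l in replica_latencies_ms if l <= rpc_timeout_ms]
    success = len(acked) >= cl
    dropped = len(replica_latencies_ms) - len(acked)
    return success, dropped

# RF=3, QUORUM (CL=2): two fast replicas ack, one overloaded replica
# exceeds rpc_timeout -> request succeeds, one mutation dropped.
print(coordinate_write([2, 5, 20000], cl=2))                    # (True, 1)
# With rpc_timeout set unrealistically low, everything times out.
print(coordinate_write([2, 5, 20000], cl=2, rpc_timeout_ms=1))  # (False, 3)
```

This is why dropped mutations and successful client requests can coexist: the client only waits for CL acknowledgements, not all RF of them.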
Re: Adding new nodes in a cluster with virtual nodes
So, it looks like the repair is required if we want to add new nodes in our platform, but I don't understand why. Bootstrapping should take care of it. But new seed nodes do not bootstrap. Check the logs on the nodes you added to see what messages have bootstrap in them. Anytime you are worried about things like this throw in a nodetool repair. If you are using QUORUM for reads and writes you will still be getting consistent data, so long as you have only added one node. Or one node every RF'th node. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 9:55 PM, Jean-Armel Luce jaluc...@gmail.com wrote: Hi Aaron, Thanks for your answer. I apologize, I made a mistake in my 1st mail. The cluster was only 12 nodes instead of 16 (it is a test cluster). There are 2 datacenters b1 and s1. Here is the result of nodetool status after adding a new node in the 1st datacenter (dc s1):
root@node007:~# nodetool status
Datacenter: b1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.234.72.135 10.71 GB 256 44.6% 2fc583b2-822f-4347-9fab-5e9d10d548c9 c01
UN 10.234.72.134 16.74 GB 256 63.7% f209a8c5-7e1b-45b5-aa80-ed679bbbdbd1 e01
UN 10.234.72.139 17.09 GB 256 62.0% 95661392-ccd8-4592-a76f-1c99f7cdf23a e07
UN 10.234.72.138 10.96 GB 256 42.9% 0d6725f0-1357-423d-85c1-153fb94257d5 e03
UN 10.234.72.137 11.09 GB 256 45.7% 492190d7-3055-4167-8699-9c6560e28164 e03
UN 10.234.72.136 11.91 GB 256 41.1% 3872f26c-5f2d-4fb3-9f5c-08b4c7762466 c01
Datacenter: s1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.98.255.139 16.94 GB 256 43.8% 3523e80c-8468-4502-b334-79eabc3357f0 g10
UN 10.98.255.138 12.62 GB 256 42.4% a2bcddf1-393e-453b-9d4f-9f7111c01d7f i02
UN 10.98.255.137 10.59 GB 256 38.4% f851b6ee-f1e4-431b-8beb-e7b173a77342 i02
UN 10.98.255.136 11.89 GB 256 42.9% 36fe902f-3fb1-4b6d-9e2c-71e601fa0f2e a09
UN 10.98.255.135 10.29 GB 256 40.4% e2d020a5-97a9-48d4-870c-d10b59858763 a09
UN 10.98.255.134 16.19 GB 256 52.3% 73e3376a-5a9f-4b8a-a119-c87ae1fafdcb h06
UN 10.98.255.140 127.84 KB 256 39.9% 3d5c33e6-35d0-40a0-b60d-2696fd5cbf72 g10
We can see that the new node (10.98.255.140) contains only 127.84 KB. We saw also that there was no network traffic between the nodes. Then we added a new node in the 2nd datacenter (dc b1):
root@node007:~# nodetool status
Datacenter: b1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.234.72.135 12.95 GB 256 42.0% 2fc583b2-822f-4347-9fab-5e9d10d548c9 c01
UN 10.234.72.134 20.11 GB 256 53.1% f209a8c5-7e1b-45b5-aa80-ed679bbbdbd1 e01
UN 10.234.72.140 122.25 KB 256 41.9% 501ea498-8fed-4cc8-a23a-c99492bc4f26 e07
UN 10.234.72.139 20.46 GB 256 40.2% 95661392-ccd8-4592-a76f-1c99f7cdf23a e07
UN 10.234.72.138 13.21 GB 256 40.9% 0d6725f0-1357-423d-85c1-153fb94257d5 e03
UN 10.234.72.137 13.34 GB 256 42.9% 492190d7-3055-4167-8699-9c6560e28164 e03
UN 10.234.72.136 14.16 GB 256 39.0% 3872f26c-5f2d-4fb3-9f5c-08b4c7762466 c01
Datacenter: s1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.98.255.139 19.19 GB 256 43.8% 3523e80c-8468-4502-b334-79eabc3357f0 g10
UN 10.98.255.138 14.9 GB 256 42.4% a2bcddf1-393e-453b-9d4f-9f7111c01d7f i02
UN 10.98.255.137 12.49 GB 256 38.4% f851b6ee-f1e4-431b-8beb-e7b173a77342 i02
UN 10.98.255.136 14.13 GB 256 42.9% 36fe902f-3fb1-4b6d-9e2c-71e601fa0f2e a09
UN 10.98.255.135 12.16 GB 256 40.4% e2d020a5-97a9-48d4-870c-d10b59858763 a09
UN 10.98.255.134 18.85 GB 256 52.3% 73e3376a-5a9f-4b8a-a119-c87ae1fafdcb h06
UN 10.98.255.140 2.24 GB 256 39.9% 3d5c33e6-35d0-40a0-b60d-2696fd5cbf72 g10
We can see that the 2nd new node (10.234.72.140) contains only 122.25 KB. The new node in the 1st datacenter contains now 2.24 GB because we
Re: operations progress on DBA operations?
nodetool compactionstats Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 3:44 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I am used to systems running a first phase calculating how many files it will need to go through and then logging out the percent done, or X files out of total files done. I ran this command and it is logging nothing: nodetool upgradesstables databus5 nreldata; I have 130Gigs of data on my node and not all of it in that one column family above. How can I tell how far it is in its process? It has been running for about 10 minutes already. I don't see anything in the log files either. Thanks, Dean
Re: ReverseIndexExample
We are trying to answer client library specific questions on the client-dev list, see the link at the bottom here http://cassandra.apache.org/ If you can ask a more specific question I'll answer it there. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 3:44 AM, Everton Lima peitin.inu...@gmail.com wrote: Hello, Has anyone already used ReverseIndexQuery from Astyanax? I was trying to understand it, but I executed the example from the Astyanax site and could not understand it. Can someone help me, please? Thanks -- Everton Lima Aleixo Master's student in Computer Science at UFG Programmer at LUPA
Re: disabling bloomfilter not working? or did I do this wrong?
Bloom Filter Space Used: 2318392048 Just to be sane do a quick check of the -Filter.db files on disk for this CF. If they are very small try a restart on the node. Number of Keys (estimate): 1249133696 Hey a billion rows on a node, what an age we live in :) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So in the cli, I ran update column family nreldata with bloom_filter_fp_chance=1.0; Then I ran nodetool upgradesstables databus5 nreldata; But my bloom filter size is still around 2 gig (and I want to free up this heap) According to the nodetool cfstats command… Column Family: nreldata SSTable count: 10 Space used (live): 96841497731 Space used (total): 96841497731 Number of Keys (estimate): 1249133696 Memtable Columns Count: 7066 Memtable Data Size: 4286174 Memtable Switch Count: 924 Read Count: 19087150 Read Latency: 0.595 ms. Write Count: 21281994 Write Latency: 0.013 ms. Pending Tasks: 0 Bloom Filter False Positives: 974393 Bloom Filter False Ratio: 0.8 Bloom Filter Space Used: 2318392048 Compacted row minimum size: 73 Compacted row maximum size: 446 Compacted row mean size: 143
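As a sanity check on the numbers above, the standard Bloom filter sizing formula (bits = -n ln p / (ln 2)^2) applied to the ~1.25 billion keys from cfstats lands close to the reported 2318392048 bytes when using a false positive chance around the old pre-1.2 default (~0.000744). Cassandra's filter differs in implementation detail, so treat this as an estimate only:

```python
import math

# Standard Bloom filter sizing: bits = -n * ln(p) / (ln 2)^2.
# An estimate, not Cassandra's exact filter layout.

def bloom_filter_bytes(num_keys, fp_chance):
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return int(bits / 8)

n = 1_249_133_696  # "Number of Keys (estimate)" from cfstats above

# With an fp chance around the old default (~0.000744, i.e. ~15 bits
# per key) the estimate lines up with the reported ~2.3 GB.
est = bloom_filter_bytes(n, 0.000744)
print(est)  # roughly 2.3e9 bytes
```

The same formula shows why raising bloom_filter_fp_chance shrinks the filter: fewer bits per key are needed as the acceptable false positive rate grows.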
Re: How wide rows are structured in CQL3
Does this effectively create the same storage structure? Yes. SELECT Value FROM X WHERE RowKey = 'RowKey1' AND TimeStamp BETWEEN 100 AND 1000; select value from X where RowKey = 'foo' and timestamp >= 100 and timestamp <= 1000; I also don't understand some of the things like WITH COMPACT STORAGE and CLUSTERING. Some info here, does not cover compact storage http://thelastpickle.com/2013/01/11/primary-keys-in-cql/ Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:36 AM, Boris Solovyov boris.solov...@gmail.com wrote: Hi, My impression from reading docs is that in old versions of Cassandra, you could create very wide rows, say with timestamps as column names for time series data, and read an ordered slice of the row. So:
RowKey    Columns
RowKey1   1:val1 2:val2 3:val3 ... N:valN
With this data I think you could say get RowKey1, cols 100 to 1000 and get a slice of values. (I have no experience with this, just from reading about it.) In CQL3 it looks like this is kind of normalized so I would have CREATE TABLE X ( RowKey text, TimeStamp int, Value text, PRIMARY KEY(RowKey, TimeStamp) ); Does this effectively create the same storage structure? Now, in CQL3, it looks like I should access it like this, SELECT Value FROM X WHERE RowKey = 'RowKey1' AND TimeStamp BETWEEN 100 AND 1000; Does this do the same thing? I also don't understand some of the things like WITH COMPACT STORAGE and CLUSTERING. I'm having a hard time figuring out how this maps to the underlying storage. It is a little more abstract. I feel like the new CQL stuff isn't really explained clearly to me -- is it just a query language that accesses the same underlying structures, or is Cassandra's storage and access model fundamentally different now?
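The mapping Boris asks about can be illustrated with a toy model: one wide row per partition key, cells kept sorted by the clustering column, so a timestamp range predicate becomes one contiguous slice. This is a sketch of the storage idea only, not Cassandra's actual cell encoding:

```python
import bisect

# Illustrative model of the wide-row layout discussed above: cells are
# kept sorted by the clustering column (TimeStamp), so a range predicate
# is answered by one contiguous slice of the row.

class WideRow:
    def __init__(self):
        self.cells = []  # sorted list of (timestamp, value)

    def insert(self, ts, value):
        bisect.insort(self.cells, (ts, value))

    def slice(self, lo, hi):
        """All values with lo <= timestamp <= hi, in clustering order."""
        i = bisect.bisect_left(self.cells, (lo,))
        j = bisect.bisect_right(self.cells, (hi, chr(0x10FFFF)))
        return [v for _, v in self.cells[i:j]]

row = WideRow()
for ts in (50, 100, 200, 1000, 1500):
    row.insert(ts, f"val{ts}")

# Equivalent of: SELECT Value FROM X WHERE RowKey = 'RowKey1'
#                AND TimeStamp >= 100 AND TimeStamp <= 1000
print(row.slice(100, 1000))  # ['val100', 'val200', 'val1000']
```

This is the sense in which CQL3 is "just a query language over the same structures": the clustering column becomes the sorted cell name inside one wide row, and the range query is the old column slice.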
Re: Q on schema migrations
dropped this secondary index after a while. I assume you used UPDATE COLUMN FAMILY in the CLI. How can I avoid this secondary index building on node join? Check the schema using show schema in the cli. Check that all nodes in the cluster have the same schema, using describe cluster in the cli. If they are in disagreement see this http://wiki.apache.org/cassandra/FAQ#schema_disagreement Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 5:17 AM, Igor i...@4friends.od.ua wrote: Hello Cassandra 1.0.7 Some time ago we used a secondary index on one of our CFs. Due to performance reasons we dropped this secondary index after a while. But now, each time I add and bootstrap a new node I see cassandra again build this secondary index on this node (which takes a huge time), and when the index is built it is not used anymore, so I can safely delete the files from disk. How can I avoid this secondary index building on node join? Thanks for your answers!
Re: is there a way to drain node(and prevent reads) and upgrade sstables offline?
To stop all writes and reads disable thrift and gossip via nodetool. This will not stop any in-progress repair sessions nor disconnect fat clients if you have them. There are also cmd line args cassandra.start_rpc and cassandra.join_ring which do the same thing. You can also change the compaction throughput using nodetool. Regarding "multithreaded_compaction = true temporarily": unless you have SSD, leave this guy alone. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 6:04 AM, Michael Kjellman mkjell...@barracuda.com wrote: Couldn't you just disable thrift and leave gossip active? On 2/22/13 9:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We would like to take a node out of the ring and upgradesstables while it is not doing any writes nor reads with the ring. Is this possible? I am thinking from the documentation: 1. nodetool drain 2. ANYTHING to stop reads here 3. Modify cassandra.yaml with compaction_throughput_mb_per_sec = 0 and multithreaded_compaction = true temporarily 4. Restart cassandra and run nodetool upgradesstables keyspace CF 5. Modify cassandra.yaml to revert changes 6. Restart cassandra to join the cluster again. Is this how it should be done? Thanks, Dean
Re: Size Tiered - Leveled Compaction
If you did not use LCS until after the upgrade to 1.1.9 I think you are ok. If in doubt, the steps here look like they helped https://issues.apache.org/jira/browse/CASSANDRA-4644?focusedCommentId=13456137&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13456137 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 6:56 AM, Mike mthero...@yahoo.com wrote: Hello, Still doing research before we potentially move one of our column families from Size Tiered to Leveled compaction this weekend. I was doing some research around some of the bugs that were filed against leveled compaction in Cassandra and I found this: https://issues.apache.org/jira/browse/CASSANDRA-4644 The bug mentions: You need to run the offline scrub (bin/sstablescrub) to fix the sstable overlapping problem from early 1.1 releases. (Running with -m to just check for overlaps between sstables should be fine, since you already scrubbed online which will catch out-of-order within an sstable.) We recently upgraded from 1.1.2 to 1.1.9. Does anyone know if an offline scrub is recommended when switching from STCS to LCS after upgrading from 1.1.2? Any insight would be appreciated, Thanks, -Mike On 2/17/2013 8:57 PM, Wei Zhu wrote: We doubled the SSTable size to 10M. It still generates a lot of SSTables and we don't see much difference in the read latency. We are able to finish the compactions after repair within several hours. We will increase the SSTable size again if we feel the number of SSTables hurts the performance. - Original Message - From: Mike mthero...@yahoo.com To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered - Leveled Compaction Hello Wei, First, thanks for this response. Out of curiosity, what SSTable size did you choose for your use case, and what made you decide on that number?
Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithreaded compaction. We even tried that, it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That is more than 150K files in one directory. (200G/5M = 40K SSTables and each SSTable creates 4 files on disk.) You might want to watch that and decide the SSTable size. By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among the different levels. -Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help.
Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered - Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple
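Wei's file-count arithmetic from earlier in the thread is easy to check (200G of data at LCS's old 5MB default sstable size, with ~4 files per sstable on disk):

```python
# Quick check of the file-count arithmetic quoted above: a 200 GB column
# family at a given sstable size, with ~4 on-disk files per sstable.

def lcs_file_count(data_gb, sstable_mb=5, files_per_sstable=4):
    sstables = (data_gb * 1024) // sstable_mb
    return sstables * files_per_sstable

print(lcs_file_count(200))                  # 163840 files at 5 MB sstables
print(lcs_file_count(200, sstable_mb=10))   # 81920, halved by doubling size
```

Which matches the "more than 150K files in one directory" observation, and shows why doubling the sstable size (as Wei did) directly halves the file count.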
Re: Bulk Loading-Unable to select from CQL3 tables with NO COMPACT STORAGE option after Bulk Loading - Cassandra version 1.2.1
CQL 3 tables that do not use compact storage use Composite Types, which other code may not be expecting. Take a look at the CQL 3 table definitions through cassandra-cli and you may see the changes you need to make when creating the SSTables. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 26/02/2013, at 3:44 AM, praveen.akun...@wipro.com wrote: Hi All, I am using the bulk loader program provided on the Datastax website. http://www.datastax.com/dev/blog/bulk-loading I am able to load data into tables created with the COMPACT STORAGE option and also into tables created without this option. However, I am unable to read data from the table created without COMPACT STORAGE. I created 2 tables as below: CREATE TABLE TABLE1( field1 text PRIMARY KEY, field2 text, field3 text, field4 text ) WITH COMPACT STORAGE; CREATE TABLE TABLE2( field1 text PRIMARY KEY, field2 text, field3 text, field4 text ); Now, I loaded these 2 tables using the Java bulk loader program (create SSTables and load them using the SSTableloader utility). I can read the data from TABLE1, but when I try to read data from TABLE2, I am getting a timeout from both cqlsh and the cli. Is this expected behavior, or am I doing something wrong? Can anyone please help. Thanks Best Regards, Praveen
Re: disabling bloomfilter not working? memory numbers don't add up?
1. Can I stop the node, delete the *Filter.db files and restart the node (is this safe)??? No. 2. Why do I have 5 gig being eaten up by cassandra? nodetool info memory 5.2 gig, key cache: 11 meg and row cache 0 bytes. All bloom filters are also small, 1 meg. If this is the heap memory reported by the JVM, then all you can say is that since the server was started it has allocated at least 5.2 GB of memory; it's not that there is 5.2 GB of live memory in use. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 9:32 AM, Hiller, Dean dean.hil...@nrel.gov wrote: H, my upgrade completed and then I added the node back in and ran my repair. What is weird is that my nreldata column family still shows 156 meg of memory in use (down from 2 gig though!!) and a false positive ratio of .99576 when I have the filter completely disabled (i.e. set to 1.0). I see the *Filter.db files on disk (and their size approximately matches the in-memory size). I tried restarting the node as well. 1. Can I stop the node, delete the *Filter.db files and restart the node (is this safe)??? 2. Why do I have 5 gig being eaten up by cassandra? nodetool info memory 5.2 gig, key cache: 11 meg and row cache 0 bytes. All bloom filters are also small, 1 meg. The exception to #2 is I have nreldata still using 156MB for some reason, but still nowhere close to the 5.2 gig that nodetool shows in use. Thanks, Dean Bloom Filter Space Used: 2318392048 Just to be sane do a quick check of the -Filter.db files on disk for this CF. If they are very small try a restart on the node.
Number of Keys (estimate): 1249133696 Hey a billion rows on a node, what an age we live in :) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So in the cli, I ran update column family nreldata with bloom_filter_fp_chance=1.0; Then I ran nodetool upgradesstables databus5 nreldata; But my bloom filter size is still around 2 gig (and I want to free up this heap) According to the nodetool cfstats command… Column Family: nreldata SSTable count: 10 Space used (live): 96841497731 Space used (total): 96841497731 Number of Keys (estimate): 1249133696 Memtable Columns Count: 7066 Memtable Data Size: 4286174 Memtable Switch Count: 924 Read Count: 19087150 Read Latency: 0.595 ms. Write Count: 21281994 Write Latency: 0.013 ms. Pending Tasks: 0 Bloom Filter False Positives: 974393 Bloom Filter False Ratio: 0.8 Bloom Filter Space Used: 2318392048 Compacted row minimum size: 73 Compacted row maximum size: 446 Compacted row mean size: 143
Re: Retrieving local data
Take a look at the token function with the select statement http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:06 AM, Everton Lima peitin.inu...@gmail.com wrote: Hi people, I need to retrieve some data from a local machine that is running Cassandra. I start the Cassandra daemon with my Java process, so now I need to execute a CQL query, but only against data that is stored on that machine. Is that possible? How? Thanks -- Everton Lima Aleixo BSc in Computer Science from UFG, MSc student in Computer Science at UFG, programmer at LUPA
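For readers finding this thread later: the token() restriction Aaron mentions can be used to scan only the rows a node owns, once you know the node's token ranges. A minimal sketch that just builds the CQL statements (the table and key column names are made up):

```python
def local_token_queries(table, ranges):
    """Build CQL statements restricting a scan to the given token
    ranges (list of (start, end) pairs, exclusive start, inclusive
    end). A wrapping range is split at the ends of the ring."""
    stmts = []
    for start, end in ranges:
        if start < end:
            stmts.append(f"SELECT * FROM {table} "
                         f"WHERE token(key) > {start} AND token(key) <= {end}")
        else:
            # range wraps around the ring: issue two statements
            stmts.append(f"SELECT * FROM {table} WHERE token(key) > {start}")
            stmts.append(f"SELECT * FROM {table} WHERE token(key) <= {end}")
    return stmts

for s in local_token_queries("users", [(0, 100)]):
    print(s)
```

The follow-up in this thread notes the same idea works from Astyanax on 1.1.x by fetching the ring's token ranges from the client.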
Re: cluster with cross data center and local
I assume my only options are to create another cluster or to create another keyspace using LocalStrategy strategy? You do need another key space, but you can still use the NetworkTopologyStrategy. Just set the strategy options to be dc1: 2 and dc2: 0. (check the docs for CLI and CQL for exact strategy options). What's the difference between LocalStrategy and SimpleStrategy? LocalStrategy is used by System keyspaces and secondary indexes to store data on a local node only. You do not want that. IMHO better to use NetworkTopologyStrategy as above than simple. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:41 AM, Keith Wright kwri...@nanigans.com wrote: Hi all, I have a cluster with 2 data centers with an RF 2 keyspace using network topology on 1.1.10. I would like to configure it such that some of the data is not cross data center replicated but is replicated between the nodes of the local data center. I assume my only options are to create another cluster or to create another keyspace using LocalStrategy strategy? What's the difference between LocalStrategy and SimpleStrategy? Thanks!
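The dc1: 2 / dc2: 0 suggestion can be sanity-checked with a toy view of how NetworkTopologyStrategy assigns per-DC replica counts (the option names here are illustrative, not an exact API):

```python
def placement(strategy_options):
    """Toy model of NetworkTopologyStrategy options: each DC receives
    the replica count named in the options; a DC set to 0 gets no
    copies of the keyspace's data."""
    per_dc = {dc: int(rf) for dc, rf in strategy_options.items()}
    receiving = [dc for dc, rf in per_dc.items() if rf > 0]
    return receiving, sum(per_dc.values())

# Keyspace kept local to dc1 while still using NetworkTopologyStrategy:
dcs, total_rf = placement({"dc1": 2, "dc2": 0})
print(dcs, total_rf)
```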
Re: please explain read path when key not in database
This is my understanding from using cassandra for probably around 2 years Sounds about right. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 26/02/2013, at 7:43 AM, Hiller, Dean dean.hil...@nrel.gov wrote: This is my understanding from using cassandra for probably around 2 years….(though I still make mistakes sometimes)…. For CL.ONE read Depending on the client, the client may go through one of its known nodes (co-ordinating node) which goes to the real node (clients like astyanax/hector read in the ring information and usually go direct, so for CL_ONE no co-ordination is really needed). The node it finally gets to may not have the data yet and will return no row while the other 2 nodes might have data. For CL.QUORUM read and RF=3 Client goes to the node with data (again depending on client) and that node sends off a request to one of the other 2. Let's say A does not have the row yet, but B has the row: the results are compared, latest wins, and a repair for that row is kicked off to get all nodes in sync for that row. If the local node responsible for the key replied that it has no data for this key, will the coordinator send digest commands? It looks like CL_ONE does trigger a read repair according to this doc (found googling CL_ONE read repair cassandra) http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CL-ONE-reads-RR-badness-threshold-interaction-td6247418.html http://wiki.apache.org/cassandra/ReadRepair Later, Dean Please explain how this works when I request a key which is not in the database * The closest node (as determined by proximity sorting as described above) will be sent a command to perform an actual data read (i.e., return data to the co-ordinating node). * As required by consistency level, additional nodes may be sent digest commands, asking them to perform the read locally but send back the digest only. 
* For example, at replication factor 3 a read at consistency level QUORUM would require one digest read in addition to the data read sent to the closest node. (See ReadCallback http://wiki.apache.org/cassandra/ReadCallback, instantiated by StorageProxy http://wiki.apache.org/cassandra/StorageProxy) I have multi-DC with NetworkTopologyStrategy and RF:1 per datacenter, and reads are at consistency level ONE. If the local node responsible for the key replied that it has no data for this key, will the coordinator send digest commands? Thanks!
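The data-read-plus-digest-read flow described above can be sketched as a toy coordinator. The function and structure names are invented for illustration; this is not Cassandra's API, just the shape of the comparison:

```python
import hashlib

def read_with_digests(replicas, consistency):
    """Toy coordinator read: ask the closest replica for data, the
    next (consistency - 1) replicas for digests, and flag a mismatch
    that would force a full read plus read repair."""
    nodes = list(replicas)
    data_node, digest_nodes = nodes[0], nodes[1:consistency]
    data = replicas[data_node]
    digest = hashlib.md5(repr(data).encode()).hexdigest()
    mismatch = any(
        hashlib.md5(repr(replicas[n]).encode()).hexdigest() != digest
        for n in digest_nodes
    )
    return data, mismatch

# RF=3, QUORUM (2): one data read plus one digest read, as the wiki
# text above says. A replica with no row (None) still answers with a
# digest, which is how a mismatch is detected for a missing key.
replicas = {"A": {"col": "v1"}, "B": {"col": "v1"}, "C": None}
print(read_with_digests(replicas, consistency=2))
```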
Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure)
Dean, Is this an issue with tables created using CQL 3? OR… An issue with tables created in 1.1.4 using the CLI not being readable after an in-place upgrade to 1.2.2? I did a quick test and it worked. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/03/2013, at 8:18 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Your other option is to create tables 'WITH COMPACT STORAGE'. Basically, if you use COMPACT STORAGE you can create tables as you did before. https://issues.apache.org/jira/browse/CASSANDRA-2995 From an application standpoint, if you can't do sparse, wide rows, you break compatibility with 90% of Cassandra applications. So that rules out almost everything; if you can't provide the same data model, you're creating fragmentation, not pluggability. I now call Cassandra compact storage 'c*' storage, and I call CQL3 storage 'c*++' storage. See debates on c vs C++ to understand why :). On Sun, Mar 3, 2013 at 9:39 PM, Michael Kjellman mkjell...@barracuda.com wrote: Dean, I think if you look back through previous mailing list items you'll find answers to this already, but to summarize: Tables created prior to 1.2 will continue to work after upgrade. New tables created are not exposed by the Thrift API. It is up to client developers to upgrade the client to pull the required metadata for serialization and deserialization of the data from the System column family instead. I don't know Netflix's timetable for an update to Astyanax but I'm sure they are working on it. Alternatively, you can also use the Datastax java driver in your QA environment for now. If you only need to access existing column families this shouldn't be an issue. On 3/3/13 6:31 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I remember huge discussions on backwards compatibility and we have a ton of code using thrift (as do many people out there). We happen to have a startup bean for development that populates data in cassandra for us. 
We cleared out our QA completely (no data) and ran this… it turns out there seems to be no backwards compatibility, as it utterly fails. From the Astyanax point of view, we simply get this (when going back to 1.1.4, everything works fine). I can go down the path of finding out where backwards compatibility breaks, but does this mean essentially everyone has to rewrite their applications? OR is there a list of breaking changes that we can't do anymore? Has anyone tried the latest Astyanax client with 1.2.2? An unexpected error occured caused by exception RuntimeException: com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0]No hosts to borrow from Thanks, Dean Copy, by Barracuda, helps you store, protect, and share all your amazing things. Start today: www.copy.com.
Re: Select X amount of column families in a super column family in Cassandra using PHP?
You'll probably have better luck asking the author directly. Check the tutorial http://cassandra-php-client-library.com/tutorial/fetching-data and tell them what you have tried. For future reference we are trying to direct client specific queries to the client-dev list. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 2/03/2013, at 2:10 PM, Crocker Jordan jcrocker.115...@students.smu.ac.uk wrote: I'm using Kallaspriit's Cassandra/PHP library ( https://github.com/kallaspriit/Cassandra-PHP-Client-Library). I'm trying to select the first x amount of column families within the super column family, however, I'm having absolutely no luck, and google searches don't seem to bring up much. I'm using Random Partitioning, and don't particularly wish to change to OPP as I have read there is a lot more work involved. Any help would be much appreciated.
Re: Column Slice Query performance after deletions
I need something to keep the deleted columns away from my query fetch. Not only the tombstones. It looks like the min compaction might help on this. But I'm not sure yet on what would be a reasonable value for its threshold. Your tombstones will not be purged in a compaction until after gc_grace, and only if all fragments of the row are in the compaction. You're right that you would probably want to run repair during the day if you are going to dramatically reduce gc_grace, to avoid deleted data coming back to life. If you are using a single cassandra row as a queue, you are going to have trouble. Levelled compaction may help a little. If you are reading the most recent entries in the row, assuming the columns are sorted by some time stamp, use the Reverse Comparator and issue slice commands to get the first X cols. That will remove tombstones from the problem. (Am guessing this is not something you do, just mentioning it). Your next option is to change the data model so you don't use the same row all day. After that, consider a message queue. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 2/03/2013, at 12:03 PM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Tombstones stay around until gc grace so you could lower that to see if that fixes the performance issues. If the tombstones get collected, the column will live again, causing data inconsistency since I can't run a repair during the regular operations. Not sure if I got your thoughts on this. Size tiered or leveled compaction? I'm actually running on Size Tiered Compaction, but I've been looking into changing it for Leveled. It seems to be the case. Although even if I achieve some performance, I would still have the same problem with the deleted columns. I need something to keep the deleted columns away from my query fetch. Not only the tombstones. It looks like the min compaction might help on this. 
But I'm not sure yet on what would be a reasonable value for its threshold. On Sat, Mar 2, 2013 at 4:22 PM, Michael Kjellman mkjell...@barracuda.com wrote: Tombstones stay around until gc grace so you could lower that to see if that fixes the performance issues. Size tiered or leveled compaction? On Mar 2, 2013, at 11:15 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: What is your gc_grace set to? Sounds like as the number of tombstone records increases your performance decreases. (Which I would expect) gc_grace is default. Cassandra's data files are write once. Deletes are another write. Until compaction they all live on disk. Making really big rows has this problem. Oh, so it looks like I should lower the min_compaction_threshold for this column family. Right? What does this threshold value really mean? Guys, thanks for the help so far. On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman mkjell...@barracuda.com wrote: What is your gc_grace set to? Sounds like as the number of tombstone records increases your performance decreases. (Which I would expect) On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: I have a daily maintenance of my cluster where I truncate this column family, because its data doesn't need to be kept more than a day. Since all the regular operations on it finish around 4 hours before the end of the day, I regularly run a truncate on it followed by a repair at the end of the day. And every day, when the operations are started (when there are only a few deleted columns), the performance looks pretty good. Unfortunately it degrades throughout the day. On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman mkjell...@barracuda.com wrote: When is the last time you did a cleanup on the cf? On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Hello guys. 
I'm investigating the reasons for performance degradation in my scenario, which is as follows: - I have a column family filled with thousands of columns inside a single row (varies between 10k ~ 200k). I also have thousands of rows, not much more than 15k. - These rows are constantly updated, but the write load is not that intensive; I estimate it at 100 writes/sec on the column family. - Each column represents a message which is read and processed by another process. After reading it, the column is marked for deletion in order to keep it out of the next query on this row. Ok, so I've figured out that after many insertions plus deletion updates, my queries (column slice queries) are taking more time to perform, even if there are only a few columns, fewer than 100. So it looks like the larger the number of columns being deleted, the longer the time spent on a query. - Internally at C*, does column slice query ranges among deleted
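Aaron's Reverse Comparator suggestion earlier in this thread can be illustrated with a toy model of a queue-like row: a forward slice must scan past all the consumed (deleted) columns before reaching live data, while a reversed slice reaches the recent live columns immediately. This is purely illustrative, not C* internals:

```python
def slice_first_live(columns, count, reversed_order=False):
    """Toy slice over a queue-like row: columns are (sort_key, value)
    pairs where value None stands for a tombstone. Returns the first
    `count` live values and how many columns had to be scanned."""
    ordered = sorted(columns, reverse=reversed_order)
    live, scanned = [], 0
    for _, value in ordered:
        scanned += 1
        if value is not None:
            live.append(value)
            if len(live) == count:
                break
    return live, scanned

# 1000 consumed (deleted) entries followed by 10 fresh ones:
row = [(i, None) for i in range(1000)] + \
      [(i, f"msg{i}") for i in range(1000, 1010)]
print(slice_first_live(row, 5))                       # wades through tombstones
print(slice_first_live(row, 5, reversed_order=True))  # hits live columns first
```

This is why the query gets slower through the day: the forward scan cost grows with the number of deleted-but-not-yet-purged columns, not with the number of live ones.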
Re: reading the updated values
my question is how do I get the updated data in cassandra for the last 1 hour or so to be indexed in elasticsearch. You cannot. The best approach is to update Elasticsearch at the same time you update Cassandra. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 11:57 PM, subhankar biswas neo20iit...@gmail.com wrote: hi, I'm trying to use Cassandra as the main data store and Elasticsearch for realtime queries. My question is how do I get the updated data in Cassandra for the last 1 hour or so to be indexed in Elasticsearch. Once I get the updated data from Cassandra I can index it into ES. Is there any specific data model I have to follow to get the recent updates of any CF? thanks subhankar
Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure)
ok, we are talking about all thrift / cli / hector / non-CQL tables not being readable after an upgrade. If you can get some repro steps that would be handy. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 5:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: For us, this was an issue creating tables in 1.1.4 using thrift, then upgrading to 1.2.2. We did not use the cli to create anything. I will try the complete test again today and hopefully get more detail (I didn't know I could not run the same thrift code in 1.2.2 for keyspace creation/table creation). Thanks, Dean From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Sunday, March 3, 2013 11:09 PM To: user@cassandra.apache.org Subject: Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure) Dean, Is this an issue with tables created using CQL 3? OR… An issue with tables created in 1.1.4 using the CLI not being readable after an in-place upgrade to 1.2.2? I did a quick test and it worked. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/03/2013, at 8:18 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Your other option is to create tables 'WITH COMPACT STORAGE'. Basically, if you use COMPACT STORAGE you can create tables as you did before. https://issues.apache.org/jira/browse/CASSANDRA-2995 From an application standpoint, if you can't do sparse, wide rows, you break compatibility with 90% of Cassandra applications. So that rules out almost everything; if you can't provide the same data model, you're creating fragmentation, not pluggability. 
I now call Cassandra compact storage 'c*' storage, and I call CQL3 storage 'c*++' storage. See debates on c vs C++ to understand why :). On Sun, Mar 3, 2013 at 9:39 PM, Michael Kjellman mkjell...@barracuda.com wrote: Dean, I think if you look back through previous mailing list items you'll find answers to this already, but to summarize: Tables created prior to 1.2 will continue to work after upgrade. New tables created are not exposed by the Thrift API. It is up to client developers to upgrade the client to pull the required metadata for serialization and deserialization of the data from the System column family instead. I don't know Netflix's timetable for an update to Astyanax but I'm sure they are working on it. Alternatively, you can also use the Datastax java driver in your QA environment for now. If you only need to access existing column families this shouldn't be an issue. On 3/3/13 6:31 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I remember huge discussions on backwards compatibility and we have a ton of code using thrift (as do many people out there). We happen to have a startup bean for development that populates data in cassandra for us. We cleared out our QA completely (no data) and ran this… it turns out there seems to be no backwards compatibility as it utterly fails. From the Astyanax point of view, we simply get this (when going back to 1.1.4, everything works fine). I can go down the path of finding out where backwards compatibility breaks but does this mean essentially everyone has to rewrite their applications? OR is there a list of breaking changes that we can't do anymore? Has anyone tried the latest astyanax client with 1.2.2 version? 
An unexpected error occured caused by exception RuntimeException: com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0]No hosts to borrow from Thanks, Dean
Re: Unable to instantiate cache provider org.apache.cassandra.cache.SerializingCacheProvider
What version are you using ? As of 1.1 off heap caches no longer require JNA https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L327 Also the row and key caches are now set globally, not per CF https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L324 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 1:33 AM, Jason Wee peich...@gmail.com wrote: This happened some time ago, but for the sake of helping others who encounter it: each column family has a row cache provider; you can see it in the schema, for example : ... and row_cache_provider = 'SerializingCacheProvider' ... If it cannot start the cache provider for some reason, it defaults to the ConcurrentLinkedHashCacheProvider. The SerializingCacheProvider requires the JNA lib; if you place the library into the Cassandra lib directory, this warning should not happen again.
Re: backing up and restoring from only 1 replica?
That would be OK only if you never had a node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh defmike...@gmail.com wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: Retrieving local data
Yes. You can get the token ranges via Astyanax and only ask for rows that are within the token ranges. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 2:25 PM, Everton Lima peitin.inu...@gmail.com wrote: Ok Aaron. But the problem is that I am running Cassandra 1.1.8. I am using it for compatibility with Astyanax 1.56. So, is it possible in Cassandra 1.1.8, too? 2013/2/28 aaron morton aa...@thelastpickle.com Take a look at the token function with the select statement http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:06 AM, Everton Lima peitin.inu...@gmail.com wrote: Hi people, I need to retrieve some data from a local machine that is running Cassandra. I start the Cassandra daemon with my Java process, so now I need to execute a CQL query, but only against data that is stored on that machine. Is that possible? How? Thanks -- Everton Lima Aleixo BSc in Computer Science from UFG, MSc student in Computer Science at UFG, programmer at LUPA
Re: Unable to instantiate cache provider org.apache.cassandra.cache.SerializingCacheProvider
Details are here https://issues.apache.org/jira/browse/CASSANDRA-3271 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 8:04 AM, Jason Wee peich...@gmail.com wrote: version 1.0.8 Just curious, what is the mechanism for off heap in 1.1? Thank you. /Jason On Mon, Mar 4, 2013 at 11:49 PM, aaron morton aa...@thelastpickle.com wrote: What version are you using ? As of 1.1 off heap caches no longer require JNA https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L327 Also the row and key caches are now set globally not per CF https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L324 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 1:33 AM, Jason Wee peich...@gmail.com wrote: This happened sometime ago, but for the sake of helping others if they encounter, each column family has a row cache provider, you can read into the schema, for example : ... and row_cache_provider = 'SerializingCacheProvider' ... it cannot start the cache provider for a reason and as a result, default to the ConcurrentLinkedHashCacheProvider. the serializing cache provider require jna lib, and if you place the library into cassandra lib directory, then this warning should not happen again.
Re: backing up and restoring from only 1 replica?
Hinted Handoff works well. But it's an optimisation with certain safety valves, configuration and throttling that mean it is still not considered the way to ensure on-disk consistency. In general, if a node restarts or drops mutations, HH should get the message there eventually. In specific cases it may not. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 10:40 AM, Mike Koh defmike...@gmail.com wrote: Thanks for the response. Could you elaborate more on the bad things that happen during a restart or message drops that would cause a 1-replica restore to fail? I'm completely on board with not using a restore process that nobody else uses, but I need to convince somebody else who thinks that it will work that it is not a good idea. On 3/4/2013 7:54 AM, aaron morton wrote: That would be OK only if you never had a node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh defmike...@gmail.com wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: anyone see this user-cassandra thread get answered...
Was probably this https://issues.apache.org/jira/browse/CASSANDRA-4597 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 2:05 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I was reading http://mail-archives.apache.org/mod_mbox/cassandra-user/201208.mbox/%3CCAGZm5drRh3VXNpHefR9UjH8H=dhad2y18s0xmam5cs4yfl5...@mail.gmail.com%3E As we are having the same issue in 1.2.2. We modify to LCS and cassandra-cli shows us at LCS on any node we run cassandra cli on, but then looking at cqlsh, it is showing us at SizeTieredCompactionStrategy :(. Thanks, Dean
Re: Consistent problem when solve Digest mismatch
Otherwise, it means the version conflict resolution strongly depends on a global sequence id (timestamp) which needs to be provided by the client? Yes. If you have an area of your data model that has a high degree of concurrency, C* may not be the right match. In 1.1 we have atomic updates so clients see either the entire write or none of it. And sometimes you can design a data model that does not mutate shared values but writes ledger entries instead. See Matt Dennis's talk here http://www.datastax.com/events/cassandrasummit2012/presentations or this post http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans/ Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 4:30 PM, Jason Tang ares.t...@gmail.com wrote: Hi The timestamp provided by my client is unix timestamp (with ntp), and as I said, due to the ntp drift, the local unix timestamp is not accurately synchronized (compared to my case). So in short, the client cannot provide a global sequence number to indicate the event order. But I wonder: I configured Cassandra consistency level as write QUORUM. So for one record, I suppose Cassandra has the ability to decide the final update result. Otherwise, it means the version conflict resolution strongly depends on a global sequence id (timestamp) which needs to be provided by the client? //Tang 2013/3/4 Sylvain Lebresne sylv...@datastax.com The problem is, what exactly is the sequence number you are talking about? Or let me put it another way: if you do have a sequence number that provides a total ordering of your operations, then that is exactly what you should use as your timestamp. What Cassandra calls the timestamp is exactly what you call seqID; it's the number Cassandra uses to decide the order of operations. Except that in real life, provided you have more than one client talking to Cassandra, then providing a total ordering of operations is hard, and in fact not doable efficiently. 
So in practice, people use unix timestamp (with ntp), which provides a very good yet cheap approximation of the real-life order of operations. But again, if you do know how to assign a more precise timestamp, Cassandra lets you use that: you can provide your own timestamp (using unix timestamp is just the default). The point being, unix timestamp is the best approximation we have in practice. -- Sylvain On Mon, Mar 4, 2013 at 9:26 AM, Jason Tang ares.t...@gmail.com wrote: Hi, previously I met a consistency problem; you can refer to the link below for the whole story. http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3CCAFb+LUxna0jiY0V=AvXKzUdxSjApYm4zWk=ka9ljm-txc04...@mail.gmail.com%3E And after checking the code, it seems I found some clue to the problem. Maybe someone can check this. In short, I have a Cassandra cluster (1.0.3), the consistency level is read/write quorum, replication_factor is 3. Here is the event sequence (seqID: NodeA / NodeB / NodeC): 1. New / New / New; 2. Update / Update / Update; 3. Delete / Delete / -. When trying to read from NodeB and NodeC, a Digest mismatch exception is triggered, so Cassandra tries to resolve this version conflict. But the result is the value Update. Here is the suspected root cause: the version conflict is resolved based on time stamp. Node C's local time is a bit earlier than node A's. The Update request was sent from node C with time stamp 00:00:00.050, the Delete from node A with time stamp 00:00:00.020, which is not the same as the event sequence. So the version conflict was resolved incorrectly. Is it true? If yes, then it means consistency level can ensure the conflict is found, but solving it correctly depends on the accuracy of time synchronization, e.g. NTP?
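The conflict Jason describes can be reproduced with a toy last-write-wins model: the surviving version is simply the one with the highest client-supplied timestamp, so a skewed clock can make an earlier Update beat a later Delete (illustrative sketch only, not Cassandra's code):

```python
def reconcile(versions):
    """Last-write-wins reconciliation: the cell with the highest
    client-supplied timestamp survives, regardless of the real-world
    order in which the writes happened."""
    return max(versions, key=lambda v: v[0])

# Node C's Update is stamped 50 ms, node A's later Delete only 20 ms
# (clock skew), so the Delete loses and the value comes back.
update = (50, "Update")
delete = (20, None)  # None stands for a tombstone
print(reconcile([update, delete]))
```

With synchronized clocks the Delete carries the larger timestamp and wins, which is the behaviour the consistency level alone cannot guarantee.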
Re: hinted handoff disabling trade-offs
The advantage of HH is that it reduces the probability of a DigestMismatch when using CL ONE. A DigestMismatch means the read has to run a second time before returning to the client. - No risk of hinted-handoffs building up - No risk of hinted-handoffs flooding a node that just came up See the yaml config settings for the max hint window and the throttling. Can anyone suggest any other factors that I'm missing here, specifically reasons not to do this? If you are doing this for performance, first make sure your data model is efficient, that you are doing the most efficient reads (see my presentation here http://www.datastax.com/events/cassandrasummit2012/presentations), and your caching is bang on. Then consider if you can tune the CL, and if your client is token aware so it directs traffic to a node that has the data. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 9:19 PM, Michael Kjellman mkjell...@barracuda.com wrote: Also, if you have enough hints being created that it's significantly impacting your heap, I have a feeling things are going to get out of sync very quickly. On Mar 4, 2013, at 9:17 PM, Wz1975 wz1...@yahoo.com wrote: Why do you think disabling hinted handoff will improve memory usage? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: hinted handoff disabling trade-offs From: Michael Kjellman mkjell...@barracuda.com To: user@cassandra.apache.org CC: Repair is slow. On Mar 4, 2013, at 8:07 PM, Matt Kap matvey1...@gmail.com wrote: I am looking to get a second opinion about disabling hinted-handoffs. I have an application that can tolerate a fair amount of inconsistency (advertising domain), and so I'm weighing the pros and cons of hinted handoffs. I'm running Cassandra 1.0, looking to upgrade to 1.1 soon. 
Pros of disabling hinted handoffs: - Reduces heap - Improves GC performance - No risk of hinted-handoffs building up - No risk of hinted-handoffs flooding a node that just came up Cons - Some writes can be lost, at least until repair runs Can anyone suggest any other factors that I'm missing here. Specifically reasons not to do this. Cheers! -Matt
Re: Replacing dead node when num_tokens is used
AFAIK you just fire up the new one and let nature take its course :) http://www.datastax.com/docs/1.2/operations/add_replace_nodes#replace-node i.e. you do not need to use -Dcassandra.replace_token. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/03/2013, at 1:06 AM, Jan Kesten j.kes...@enercast.de wrote: Hello, while trying out cassandra I read about the steps necessary to replace a dead node. In my test cluster I used a setup using num_tokens instead of initial_tokens. How do I replace a dead node in this scenario? Thanks, Jan
Re: old data / tombstones are not deleted after ttl
If you have a data model with long-lived and frequently updated rows, you can get around the all-fragments problem by running a user defined compaction. Look for the CompactionManagerMBean on the JMX API https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManagerMBean.java#L67 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/03/2013, at 1:52 AM, Michal Michalski mich...@opera.com wrote: I have read in the documentation that after a major compaction, minor compactions are no longer automatically triggered. Does this mean that I have to run nodetool compact regularly? Or is there a way to get back to the automatic minor compactions? I think it's one of the most confusing parts of the C* docs. There's nothing like a switch for minor compactions that gets magically turned off when you trigger major compaction. Minor compactions won't get triggered automatically for _some_ time, because you'll only have one gargantuan SSTable, and unless you get enough new (smaller) SSTables to get them compacted together (4 by default), no compactions will kick in. Of course you'll still have one huge SSTable and it will take a lot of time to get another 3 of similar size to get them compacted. I think that it will be a problem for your TTL-based data model, as you'll have tons of Tombstones in the newer/smaller SSTables that you won't be able to compact together with the huge SSTable containing data. BTW: As far as I remember, there was an external tool (I don't remember the name) that allows splitting SSTables - I didn't use it, so I can't vouch for it, but you may want to give it a try. M. On 05.03.2013 09:46, Matthias Zeilinger wrote: Short question afterwards: I have read in the documentation that after a major compaction, minor compactions are no longer automatically triggered. Does this mean that I have to run nodetool compact regularly? 
Or is there a way to get back to automatic minor compactions?

Thx,
Br,
Matthias Zeilinger
Production Operation – Shared Services
P: +43 (0) 50 858-31185
M: +43 (0) 664 85-34459
E: matthias.zeilin...@bwinparty.com
bwin.party services (Austria) GmbH
Marxergasse 1B
A-1030 Vienna
www.bwinparty.com

-----Original Message-----
From: Matthias Zeilinger [mailto:matthias.zeilin...@bwinparty.com]
Sent: Tuesday, 05 March 2013 08:03
To: user@cassandra.apache.org
Subject: RE: old data / tombstones are not deleted after ttl

Yes, it was a major compaction. I know it's not a great solution, but I needed something to get rid of the old data, because I ran out of disk space.

Br,
Matthias Zeilinger

-----Original Message-----
From: Michal Michalski [mailto:mich...@opera.com]
Sent: Tuesday, 05 March 2013 07:47
To: user@cassandra.apache.org
Subject: Re: old data / tombstones are not deleted after ttl

Was it a major compaction? I ask because it's definitely a solution that had to work, but it's also a solution that, in general, probably no-one here would suggest using.

M.

On 05.03.2013 07:08, Matthias Zeilinger wrote:

Hi,

I have done a manual compaction via nodetool and this worked. But thanks for the explanation of why it wasn't compacted.

Br,
Matthias Zeilinger

From: Bryan Talbot [mailto:btal...@aeriagames.com]
Sent: Monday, 04 March 2013 23:36
To: user@cassandra.apache.org
Subject: Re: old data / tombstones are not deleted after ttl

Those older files won't be included in a compaction until there are min_compaction_threshold (4) files of that size.
When you get another SSTable -Data.db file that is about 12-18 GB, you'll have 4, and they will be compacted together into one new file. At that time, if there are any rows with only tombstones that are all older than gc_grace, the row will be removed (assuming the row exists exclusively in the 4 input SSTables). Columns whose data is more than TTL seconds old will be written as tombstones. If the row does have column values in SSTables that are not being compacted, the row will not be removed.

-Bryan

On Sun, Mar 3, 2013 at 11:07 PM, Matthias Zeilinger matthias.zeilin...@bwinparty.com wrote:

Hi,

I'm running Cassandra 1.1.5 and have the following issue. I'm using a 10 day TTL on my CF. I can see a lot of tombstones in there, but they aren't deleted
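Bryan's conditions for dropping a row during compaction can be condensed into a small predicate. This is a deliberately simplified model for illustration, not Cassandra's actual purge logic (which also consults bloom filters and per-column timestamps):

```python
def tombstone_purgeable(tombstone_age_s, gc_grace_s, row_in_other_sstables):
    """Simplified model: a tombstone can be dropped during compaction only
    when it is older than gc_grace_seconds AND the row does not also exist
    in an SSTable excluded from this compaction."""
    return tombstone_age_s > gc_grace_s and not row_in_other_sstables

GC_GRACE = 864_000  # 10 days, Cassandra's default gc_grace_seconds

# Tombstone is 10.4 days old and the row lives only in the input SSTables:
print(tombstone_purgeable(900_000, GC_GRACE, row_in_other_sstables=False))  # True
# Same age, but the row also exists in an SSTable outside the compaction:
print(tombstone_purgeable(900_000, GC_GRACE, row_in_other_sstables=True))   # False
```

This is why Matthias's expired columns survive: even tombstones well past gc_grace stay on disk while fragments of the same row sit in SSTables the compaction never touches.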
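Michal's point about the lull after a major compaction can also be sketched in code. The following is a toy version of size-tiered bucketing; the grouping rule and thresholds are illustrative assumptions, not the real SizeTieredCompactionStrategy implementation:

```python
def buckets(sstable_sizes_mb, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group SSTables into similar-size buckets; only buckets holding at
    least min_threshold tables are candidates for a minor compaction
    (simplified sketch of size-tiered compaction)."""
    groups = []
    for size in sorted(sstable_sizes_mb):
        for g in groups:
            avg = sum(g) / len(g)
            if bucket_low * avg <= size <= bucket_high * avg:
                g.append(size)
                break
        else:
            groups.append([size])
    return [g for g in groups if len(g) >= min_threshold]

# One 60 GB table left by a major compaction, plus a trickle of ~50 MB flushes:
print(buckets([60_000, 50, 52, 48, 51]))  # [[48, 50, 51, 52]]
```

The small flushes reach the threshold and compact among themselves, but the 60 GB giant sits alone in its bucket until three more tables of comparable size appear, which for a TTL workload may be never.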
Re: what size file for LCS is best for 300-500G per node?
Don't forget you can test things: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 5/03/2013, at 7:37 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Thanks!
Dean

On 3/4/13 7:12 PM, Wei Zhu wz1...@yahoo.com wrote:

We have 200G and ended up going with 10M. The compaction after repair takes a day to finish. Try to run a repair and see how it goes.

-Wei

----- Original Message -----
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Monday, March 4, 2013 10:52:27 AM
Subject: what size file for LCS is best for 300-500G per node?

Should we really be going with 5MB when it compresses to 3MB? That seems to be on the small side, right? We have ulimit cranked up, so too many open files shouldn't be an issue, but maybe we should go to 10MB or 100MB or something in between? Does anyone have any experience with changing the LCS sizes? I did read somewhere that startup times from opening 100,000 files could be slow, which implies a larger size (so fewer files) might be better?

Thanks,
Dean
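For reference, the LCS file size Wei and Dean are discussing is set per column family. A hedged config sketch, assuming a CQL3-capable node (Cassandra 1.2+) and a hypothetical keyspace/table name; on 1.1 the equivalent is done through cassandra-cli's compaction_strategy_options:

```cql
-- Hypothetical table; sstable_size_in_mb is the LCS target file size.
ALTER TABLE my_ks.my_cf
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 10};
```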