Re: CPU hotspot at BloomFilterSerializer#deserialize
Yes, it contains a big row that goes up to 2GB with more than a million columns.

I've run tests with 10 million small columns and got reasonable performance. I've not looked at 1 million large columns.

- BloomFilterSerializer#deserialize does readLong iteratively at each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2G row (from the 1.0.7 source).

There is only one Bloom filter per row in an SSTable, not one per column index/page. It could take a while if there are a lot of sstables in the read. nodetool cfhistograms will let you know: run it once to reset the counts, then do your test, then run it again.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 4/02/2013, at 4:13 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It is interesting the press C* got about having 2 billion columns in a row. You *can* do it, but it brings to light some realities of what that means.

On Sun, Feb 3, 2013 at 8:09 AM, Takenori Sato ts...@cloudian.com wrote:

Hi Aaron, thanks for your answers. That helped me get the big picture.

Yes, it contains a big row that goes up to 2GB with more than a million columns. Let me confirm that I understand correctly:

- The stack trace is from a Slice By Names query, and the deserialization is at step 3, "Read the row level Bloom Filter", on your blog.
- BloomFilterSerializer#deserialize does readLong iteratively at each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2G row (from the 1.0.7 source).

Correct? That makes sense: Slice By Names queries against such a wide row could be a CPU bottleneck. In fact, in our test environment, a BloomFilterSerializer#deserialize of such a case takes more than 10ms, up to 100ms.

Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order.

Interesting. Could the query pattern make a difference?
We thought the only solution is to change the data structure (don't use such a wide row if it is retrieved by a Slice By Names query). Anyway, will give it a try!

Best, Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton aa...@thelastpickle.com wrote:

5. the problematic Data file contains only 5 to 10 keys' data but is large (2.4G)

So very large rows? What does nodetool cfstats or cfhistograms say about the row sizes?

1. what is happening?

I think this is partially large rows and partially the query pattern. This is only roughly correct http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk here http://www.datastax.com/events/cassandrasummit2012/presentations

3. any more info required to proceed?

Do some tests with different query techniques…

Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote:

Hi all,

We have a situation where CPU load on some of our nodes in a cluster has spiked occasionally since last November, triggered by requests for rows that reside on two specific sstables.

We confirmed the following (when spiked):

version: 1.0.7 (current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0

1. profiling showed that BloomFilterSerializer#deserialize was the hotspot (70% of the total load by running threads)

* the stack trace looked like this (simplified):

90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
...
89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
66.7% - java.io.DataInputStream.readLong

2. Usually, 1 should be so fast that profiling by sampling cannot detect it
3. no pressure on Cassandra's VM heap nor on the machine overall
4. a little I/O traffic on our 8 disks/node (up to 100 tps/disk by iostat 1 1000)
5. the problematic Data file contains only 5 to 10 keys' data but is large (2.4G)
6. the problematic Filter file size is only 256B (could be normal)

So now, I am trying to read the Filter file in the same way BloomFilterSerializer#deserialize does, as closely as I can, in order to see if the file is somehow wrong.

Could you give me some advice on:

1. what is happening?
2. the best way to simulate BloomFilterSerializer#deserialize
3. any more info required to proceed?

Thanks, Takenori
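The 500,000-loop figure quoted in this thread can be checked with quick arithmetic. This is a back-of-envelope sketch only; the 4 KB page size and one readLong per page follow the thread's description of the 1.0.7 source.

```python
# Back-of-envelope check of the thread's numbers: a 2 GB row indexed
# in 4 KB pages implies roughly half a million pages, and the thread
# says BloomFilterSerializer#deserialize does a readLong per page.

PAGE_SIZE = 4 * 1024       # column index page size in bytes (per the thread)
ROW_SIZE = 2 * 1024 ** 3   # 2 GB row

pages = ROW_SIZE // PAGE_SIZE
print(pages)  # 524288 -- matches the "could be 500,000 loops" estimate
```

At roughly 20 ns per call that is on the order of 10 ms per sstable touched, which lines up with the 10-100 ms deserialize times reported above.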
Re: cassandra cqlsh error
Grab 1.2.1, it's fixed there: http://cassandra.apache.org/download/

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 4:37 AM, Kumar, Anjani anjani.ku...@infogroup.com wrote:

I am facing a problem while trying to run cqlsh. Here is what I did:

1. I downloaded the tarballs for both the 1.1.7 and 1.2.0 versions.
2. Unzipped and untarred them.
3. Started Cassandra.
4. And then tried starting cqlsh, but I get the following error in both versions:

Connection error: Invalid method name: 'set_cql_version'

Before installing the DataStax 1.1.7 and 1.2.0 Cassandra, I had installed Cassandra through "sudo apt-get install cassandra" on my Ubuntu box. Since it doesn't have CQL support (at least I can't find it), I thought of installing the DataStax version of Cassandra, but still no luck starting cqlsh so far. Any suggestion?

Thanks, Anjani
Re: Pycassa vs YCSB results.
The first thing I noticed is that your script uses the Python threading library, which is hampered by the Global Interpreter Lock: http://docs.python.org/2/library/threading.html

You don't really have multiple threads running in parallel; try using the multiprocessing library instead.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 7:15 AM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, could someone please give me any hints why the pycassa client (attached) is much slower than YCSB? Is it something to attribute to the performance difference between Python and Java? Or does the pycassa API have some performance limitations? I don't see any client statements affecting the pycassa performance. Please have a look at the simple Python script attached and let me know your suggestions.

thanks pradeep

On Thu, Jan 31, 2013 at 4:53 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

On Thu, Jan 31, 2013 at 4:49 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Thanks. Please find the script as an attachment. Just re-iterating: it's a simple Python script which submits 4 threads. This script has been scheduled on 8 cores using the taskset unix command, thus running 32 threads/node, and then scaled to 16 nodes.

thanks pradeep

On Thu, Jan 31, 2013 at 4:38 PM, Tyler Hobbs ty...@datastax.com wrote:

Can you provide the python script that you're using? (I'm moving this thread to the pycassa mailing list (pycassa-disc...@googlegroups.com), which is a better place for this discussion.)

On Thu, Jan 31, 2013 at 6:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, I am trying to benchmark Cassandra on a 12 data node cluster using 16 clients (each client uses 32 threads) with a custom pycassa client and YCSB. I found the maximum number of operations/second achieved using the pycassa client is nearly 70k+ reads/second, whereas with YCSB it is ~120k reads/second.

Any thoughts on why I see this huge difference in performance? Here is the description of the setup.

Pycassa client (a simple Python script):
1. Each pycassa client starts 4 threads, where each thread runs 76896 queries.
2. A shell script is used to submit 4 threads per core using the taskset unix command on an 8-core single node (8 * 4 * 76896 queries).
3. Another shell script is used to scale the single-node shell script to 16 nodes (total queries now: 16 * 8 * 4 * 76896).

I tried to keep the YCSB configuration as similar as possible to my custom pycassa benchmarking setup.

YCSB: launched 16 YCSB clients on 16 nodes, where each client uses 32 threads for execution and queries 32 * 76896 keys, i.e. 100% reads.

The dataset is different in each case, but has:
1. the same number of total records.
2. the same number of fields.
3. almost the same field length.

Could you please let me know why I see this huge performance difference and whether there is any way I can improve operations/second using the pycassa client?

thanks pradeep -- Tyler Hobbs DataStax pycassa_client.py
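Aaron's suggestion to switch from threading to multiprocessing can be sketched as below. This is a hypothetical rewrite, not the poster's actual script: `fetch` stands in for the pycassa `cf.get(key)` call so the structure is runnable without a cluster.

```python
# Sketch of the multiprocessing approach: worker processes each run
# the fetch loop, so CPython's GIL no longer serializes the clients.
from multiprocessing import Pool

def fetch(key):
    # Stand-in for the real per-key work; in the benchmark this would
    # be cf.get(key) against a pycassa ColumnFamily.
    return key.strip().upper()

def run(keys, workers=4):
    # Each worker process would normally create its own connection
    # pool; pycassa connections are not safely shareable across forks.
    with Pool(processes=workers) as pool:
        return pool.map(fetch, keys)

if __name__ == "__main__":
    print(run(["key1\n", "key2\n"]))
```

In the real client, each worker should open its own pycassa ConnectionPool inside the child process rather than inheriting one from the parent.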
Re: Pycassa vs YCSB results.
The simple thing to do would be to use the multiprocessing package and eliminate all shared state. On a multicore box, Python threads can run on different cores and battle over obtaining the GIL.

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 11:34 PM, Tim Wintle timwin...@gmail.com wrote:

On Tue, 2013-02-05 at 21:38 +1300, aaron morton wrote:

The first thing I noticed is that your script uses the Python threading library, which is hampered by the Global Interpreter Lock: http://docs.python.org/2/library/threading.html You don't really have multiple threads running in parallel; try using the multiprocessing library.

Python _should_ release the GIL around IO-bound work, so this is a situation where the GIL shouldn't be an issue. (It's actually a very good use for Python's threads, as there's no serialization overhead for message passing between processes as there would be in most multi-process examples.)

A constant factor 2 slowdown really doesn't seem that significant for two different implementations, and I would not worry about this unless you're talking about thousands of machines. If you are talking about enough machines that this is real $$$, then I do think the Python code can be optimised a lot. I'm talking about language/VM specific optimisations, so I'm assuming CPython (the standard /usr/bin/python, as in the shebang).
I don't know how much of a difference this will make, but I'd be interested in hearing your results. I would start by rewriting this:

def start_cassandra_client(Threadname):
    f = open(Threadname, 'w')
    for key in lines:
        key = key.strip()
        st = time.time()
        f.write(str(cf.get(key)) + '\n')
        et = time.time()
        f.write('Time taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')
    f.close()

as something like this:

def start_cassandra_client(Threadname):
    # Avoid variable names outside this scope
    time_fn = time.time
    colfam = cf
    f = open(Threadname, 'w')
    for key in lines:
        key = key.strip()
        st = time_fn()
        f.write(str(colfam.get(key)) + '\n')
        et = time_fn()
        f.write('Time taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')
    f.close()

If you don't consider it cheating compared to the Java version, I would also move the key.strip() call to module initialisation instead of doing it once per thread, as there's a lot of function dispatch overhead in Python.

I'd also closely compare the IO going on in both versions (the .write calls). For example, this may be significantly faster:

        et = time_fn()
        f.write(str(colfam.get(key)) + '\nTime taken for a single query is ' + str(round(1000 * (et - st), 2)) + ' milli secs\n')

I haven't read your Java code, and I don't know Java IO semantics well enough to compare the behaviour of both.

Tim

Cheers

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 5/02/2013, at 7:15 AM, Pradeep Kumar Mantha pradeep...@gmail.com wrote:

Hi, could someone please give me any hints why the pycassa client (attached) is much slower than YCSB? Is it something to attribute to the performance difference between Python and Java? Or does the pycassa API have some performance limitations? I don't see any client statements affecting the pycassa performance. Please have a look at the simple Python script attached and let me know your suggestions.
thanks pradeep
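Tim's point about binding `time.time` to a local name can be measured with `timeit`. This is a minimal sketch of the micro-optimisation only; absolute numbers vary by machine, so no expected timings are claimed.

```python
# Demonstrates the name-lookup overhead Tim describes: in a tight loop,
# looking up time.time as a module attribute each iteration costs a
# dict lookup that a local binding avoids.
import time
import timeit

def global_lookup(n=100000):
    out = 0.0
    for _ in range(n):
        out = time.time()      # module attribute lookup every iteration
    return out

def local_lookup(n=100000):
    time_fn = time.time        # bound once; locals use fast array access
    out = 0.0
    for _ in range(n):
        out = time_fn()
    return out

if __name__ == "__main__":
    print("global:", timeit.timeit(global_lookup, number=10))
    print("local: ", timeit.timeit(local_lookup, number=10))
```

On CPython the local-binding version is typically somewhat faster, though the difference is small next to the per-query network round trip.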
Re: Operation Consideration with Counter Column Families
Are there any specific operational considerations one should make when using counter column families?

Performance, as they incur a read and a write. There were also some issues with overcounts in log replay (see CHANGES.txt).

How are counter column families stored on disk?

The same as regular CFs.

How do they affect compaction?

They don't.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 7:47 AM, Drew Kutcharian d...@venarc.com wrote:

Hey guys, are there any specific operational considerations one should make when using counter column families? How are counter column families stored on disk? How do they affect compaction?

-- Drew
Re: unbalanced ring
Use nodetool status with vnodes: http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

The different load can be caused by rack affinity; are all the nodes in the same rack? Another simple check is whether you have created some very big rows.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 8:40 AM, stephen.m.thomp...@wellsfargo.com wrote:

So I have three nodes in a ring in one data center. My configuration has num_tokens: 256 set and initial_token commented out. When I look at the ring, it shows me all of the token ranges of course, and basically identical data for each range on each node. Here is the Cliff's Notes version of what I see:

[root@Config3482VM2 apache-cassandra-1.2.0]# bin/nodetool ring
Datacenter: 28
==========
Replicas: 1
Address        Rack  Status  State   Load      Owns     Token
                                                        9187343239835811839
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026347817059713363
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026276684526453414
10.28.205.125  205   Up      Normal  2.85 GB   33.69%   -3026205551993193465
(etc)
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9187343239835811840
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9151314442816847872
10.28.205.126  205   Up      Normal  1.15 GB   100.00%  -9115285645797883904
(etc)
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   -9223372036854775808
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   36028797018963967
10.28.205.127  205   Up      Normal  69.13 KB  66.30%   72057594037927935
(etc)

So at this point I have a number of questions. The biggest question is about Load. Why does the .125 node have 2.85 GB, .126 have 1.15 GB, and .127 have only 69.13 KB? These boxes are all comparable and all configured identically.

partitioner: org.apache.cassandra.dht.Murmur3Partitioner

I'm sorry to ask so many questions; I'm having a hard time finding documentation that explains this stuff.

Stephen
Re: Clarification on num_tokens setting
With N nodes, the ring is divided into N*num_tokens. Correct?

There are always num_tokens tokens in the ring. Each node has (num_tokens / N) * RF ranges on it.

so the ranges of keys are not uniform, although with enough nodes in the cluster there probably won't be any really large ranges. Correct?

Even without vnodes there was no guarantee that nodes had contiguous key ranges.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 5:43 AM, Baron Schwartz ba...@xaprb.com wrote:

As I understand the num_tokens setting, it makes Cassandra do the following pseudocode when a new node is added:

for 1...num_tokens do
    my_token = rand(0, 2^128 - 1)
    next_token = min(tokens in cluster where token > my_token)
    my_range = (my_token, next_token - 1)
done

Now the new node owns num_tokens chunks of keys that previously belonged to other nodes. My point is, with 1 node in the cluster, the ring is divided into num_tokens ranges. With N nodes, the ring is divided into N*num_tokens. Correct? The docs do not make this clear to me.

And another point: the tokens are randomly chosen, so the ranges of keys are not uniform, although with enough nodes in the cluster there probably won't be any really large ranges. Correct?
Re: Clarification on num_tokens setting
There is always num_tokens tokens in the ring.

I got this wrong. Each node *does* have num_tokens tokens.

With N nodes, the ring is divided into N*num_tokens. Correct?

Yes.

In other words, it is a cluster-wide parameter. Correct?

Yes.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 10:36 AM, Andrey Ilinykh ailin...@gmail.com wrote:

On Tue, Feb 5, 2013 at 12:42 PM, aaron morton aa...@thelastpickle.com wrote:

With N nodes, the ring is divided into N*num_tokens. Correct?

There is always num_tokens tokens in the ring. Each node has (num_tokens / N) * RF ranges on it.

That means every node should have the same num_tokens parameter? In other words, it is a cluster-wide parameter. Correct?

Thank you, Andrey
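The corrected arithmetic in this thread can be written out explicitly. This is a sketch of the counting only (a 6-node cluster with the default num_tokens of 256 and RF=3 is an assumed example, not from the thread):

```python
# Per the corrected answer: each node owns num_tokens tokens, so a
# cluster of N nodes divides the ring into N * num_tokens ranges,
# and with replication factor RF each node stores data for roughly
# num_tokens * RF of those ranges.

def ring_ranges(nodes, num_tokens=256):
    return nodes * num_tokens

def ranges_per_node(num_tokens=256, rf=3):
    return num_tokens * rf

print(ring_ranges(6))     # 1536 ranges in a 6-node cluster
print(ranges_per_node())  # 768  ranges replicated onto each node
```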
Re: Operation Consideration with Counter Column Families
Thanks Aaron, so will there only be one value for each counter column per sstable, just like regular columns?

Yes.

For some reason I was under the impression that Cassandra keeps a log of all the increments rather than the actual value.

Not as far as I understand.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 11:15 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks Aaron, so will there only be one value for each counter column per sstable, just like regular columns? For some reason I was under the impression that Cassandra keeps a log of all the increments, not the actual value.

On Feb 5, 2013, at 12:36 PM, aaron morton aa...@thelastpickle.com wrote:

Are there any specific operational considerations one should make when using counter column families?

Performance, as they incur a read and a write. There were also some issues with overcounts in log replay (see CHANGES.txt).

How are counter column families stored on disk?

The same as regular CFs.

How do they affect compaction?

They don't.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 6/02/2013, at 7:47 AM, Drew Kutcharian d...@venarc.com wrote: Hey guys, are there any specific operational considerations one should make when using counter column families? How are counter column families stored on disk? How do they affect compaction? -- Drew
Re: DataModel Question
2) DynamicComposites: I read somewhere that they are not recommended?

You probably won't need them.

Your current model will not sort messages by the time they arrive in a day. The sort order will be based on message type and the message ID. I'm assuming you want to order messages, so put the time uuid at the start of the composite columns. If you often want to get the most recent messages, use a reverse comparator. You could probably also have wider rows if you want to; not sure how many messages kids send a day, but you may get by with weekly partitions.

The CLI model could be:

row_key: phone_number : day
column: time_uuid : message_id : message_type

You could also pack extra data using JSON, ProtoBuffers etc. and store more than just the message in the column value.

If you are using CQL 3, consider this:

create table messages (
    phone_number     text,
    day              timestamp,
    message_sequence timeuuid,  -- your timestamp
    message_id       int,
    message_type     text,
    message_body     text,
    PRIMARY KEY ((phone_number, day), message_sequence, message_id)
);

(phone_number, day) is the partition key, the same as the thrift row key. message_sequence, message_id are the clustering columns; all instances will be grouped / ordered by these columns.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 1:47 AM, Kanwar Sangha kan...@mavenir.com wrote:

1) Version is 1.2
2) DynamicComposites: I read somewhere that they are not recommended?
3) Good point. I need to think about that one.

From: Tamar Fraenkel [mailto:ta...@tok-media.com] Sent: 06 February 2013 00:50 To: user@cassandra.apache.org Subject: Re: DataModel Question

Hi! I have a couple of questions regarding your model:

1. What Cassandra version are you using? I am still working with 1.0 and this seems to make sense, but 1.2 gives you much more power I think.
2. Maybe I don't understand your model, but I think you need DynamicComposite columns, as user columns differ in number of components and maybe type.
3. How do you associate the SMS or MMS with the user you are chatting with? Is it done by a separate CF?

Thanks, Tamar

Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956

On Wed, Feb 6, 2013 at 8:23 AM, Vivek Mishra mishra.v...@gmail.com wrote:

Avoid super columns. If you need sorted, wide rows then go for composite columns. -Vivek

On Wed, Feb 6, 2013 at 7:09 AM, Kanwar Sangha kan...@mavenir.com wrote:

Hi - We are designing Cassandra-based storage for the following use cases:

- Store SMS messages
- Store MMS messages
- Store chat history

What would be the ideal way to design the data model for this kind of application? I am thinking along these lines:

Row key: composite key [PhoneNum : Day]
- Example: 19876543456:05022013

Dynamic column families:
- Composite column key for SMS [SMS:MessageId:TimeUUID]
- Composite column key for MMS [MMS:MessageId:TimeUUID]
- Composite column key for the user I am chatting with [UserId:198765432345] - this can have multiple values, since each chat conversation can have many messages. Should this be a super column?

198:05022013 SMS::ttt SMS:xxx12:ttt MMS::ttt :19 198:05022013 1987888:05022013

Thanks, Kanwar
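Aaron's point that the time uuid must come *first* in the composite for time ordering can be illustrated with a minimal sketch. Plain tuples stand in for composite columns here (the sample messages are invented for illustration): composites compare component by component, left to right.

```python
# Composite columns sort by their leftmost component first. If the
# message type leads, columns group by type before time; if the time
# leads, a slice returns messages in arrival order.

# (message_type, message_id, arrival_time) -- invented sample data
msgs = [
    ("SMS", 1, 5),
    ("MMS", 2, 7),
    ("SMS", 3, 10),
]

# Original model: type : id : time leads with the type
type_first = sorted((t, i, ts) for (t, i, ts) in msgs)

# Aaron's suggestion: time uuid first, so columns sort by arrival
time_first = sorted((ts, i, t) for (t, i, ts) in msgs)

print([ts for (_, _, ts) in type_first])  # [7, 5, 10] -- not time ordered
print([ts for (ts, _, _) in time_first])  # [5, 7, 10] -- time ordered
```

A reverse comparator simply flips this ordering so the newest messages come first in a slice.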
Re: Cassandra 1.1.8 timeouts on clients
First check your node for IO errors. You have some bad data there. When you restart Cassandra it may identify which sstables are corrupt. You can then stop the node and remove them. You will then need to run repair to replace the missing data.

Hope that helps.

- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 1:21 PM, Terry Cumaranatunge cumar...@gmail.com wrote:

I may have found a trigger that is causing these problems. Has anyone seen these compaction problems in 1.1? I did run scrub on all my 1.0 data to convert it to 1.1 and fix level-manifest problems before I started running 1.1.

1st node:

ERROR [CompactionExecutor:281] 2013-02-06 23:56:16,183 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:281,1,main]
java.io.IOError: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
    at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
    at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
    at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
    at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:98)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
    at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:144)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:234)
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:112)
    ... 21 more

2nd node:

ERROR [CompactionExecutor:266] 2013-02-06 23:51:35,181 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:266,1,main]
java.io.IOError: java.io.EOFException
    at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
    at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
    at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
    at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140
Re: Directory structure after upgrading 1.0.8 to 1.2.1
The -old.json is an artefact of Levelled Compaction. You should see a non -old file in the current CF folder.

I'm not sure what would have created the -old CF dir. Does the timestamp indicate it was created at the time the server first started as a 1.2 node?

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 7/02/2013, at 10:39 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote:

After upgrading from 1.0.8, I see that the directory structure has changed and now has a structure like keyspace/columnfamily (part of the 1.1.x migration). But I also see that directories appear like keyspace/columnfamily-old, and the content of that 'old' directory is only one file, columnfamily-old.json.

Questions: Should this xxx-old.json file be in the other directory? Should the extra directory xxx-old not be created? Or was that intentionally done, and is it allowed to remove these directories (manually…)?

Thanks
Re: Can't remove contents of table with truncate or drop
Double check the truncate worked; all nodes must be available for it to execute. If you can provide the output from cqlsh from truncating and selecting, that would be helpful.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 8/02/2013, at 2:55 AM, Jabbar aja...@gmail.com wrote:

Hello, I'm having problems truncating or deleting the contents of a table. If I truncate the table and then do a select count(*), I get a value above zero. If I drop the table and recreate it, select count(*) still returns a non-zero value. The truncate or delete operation does not return any errors.

I am using Cassandra 1.2.1 with Java 1.6.0u39 64-bit on CentOS 6.3.

My keyspace definition is:

CREATE KEYSPACE studata WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '3' };

My table definition is:

CREATE TABLE datapoints (
    siteid bigint,
    channel int,
    time timestamp,
    data float,
    PRIMARY KEY ((siteid, channel), time)
) WITH bloom_filter_fp_chance=0.01 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.00 AND
    gc_grace_seconds=864000 AND
    read_repair_chance=0.10 AND
    replicate_on_write='true' AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};

It has 3,504,000,000 rows, consisting of 100,000 partition keys. Is there anything that I'm doing wrong?

-- Thanks A Jabbar Azam
Re: DataModel Question
Go day / phone instead of phone / day this way you won't have a rk growing forever . Not sure I understand. +1 for month partition. When I go offline and come online again, I need to retrieve all pending messages from all my conversations. You need to have some sort of token that includes the last time stamp seen by the client. Then make as many queries as necessary to get the missing data. I guess this makes the data model span across many CFs ? Yes. Sorry I have not considered conversations. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 3:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Go day / phone instead of phone / day this way you won't have a rk growing forever . A comprise would be month / phone as the row key and then use the date time as the first part of a composite column. On Thursday, February 7, 2013, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron ! My use case is modeled like “skype” which stores IM + SMS + MMS in one conversation. I need to have the following functionality – ·When I go offline and come online again, I need to retrieve all pending messages from all my conversations. ·I should be able to select a contact and view the ‘history’ of the messages (last 7 days, last 14 days, last 21 days…) ·If I log in to a different device, I should be able to synch at least a “few days” of messages. ·One conversation can have multiple participants. ·Support full synch or delta synch based on number of messages/history. I guess this makes the data model span across many CFs ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 06 February 2013 22:20 To: user@cassandra.apache.org Subject: Re: DataModel Question 2) DynamicComposites : I read somewhere that they are not recommended ? You probably wont need them. Your current model will not sort message by the time they arrive in a day. The sort order will be based on Message type and the message ID. 
I'm assuming you want to order messages, so put the time uuid at the start of the composite columns. If you often want to get the most recent messages use a reverse comparator. You could probably also have wider rows if you want to; not sure how many messages kids send a day, but you may get by with weekly partitions. The CLI model could be:

row_key: phone_number : day
column: time_uuid : message_id : message_type

You could also pack extra data using JSON, ProtoBuffers etc. and store more than just the message in the column value. If you are using CQL 3 consider this:

create table messages (
    phone_number text,
    day timestamp,
    message_sequence timeuuid, -- your timestamp
    message_id int,
    message_type text,
    message_body text,
    PRIMARY KEY ( (phone_number, day), message_sequence, message_id )
);

(phone_number, day) is the partition key, same as the Thrift row key. message_sequence, message_id are the clustering columns; all instances will be grouped / ordered by these columns. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com
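Illustrative only (the helper names below are mine, not from the thread): the month / phone bucketing Edward suggests, plus the "last timestamp seen by the client" token Aaron describes, gives a returning client a bounded set of partitions to read when it syncs.

```python
from datetime import datetime, timezone

def row_key(phone: str, when: datetime) -> str:
    """Month-bucketed row key: the partition stops growing once the month ends."""
    return f"{when.strftime('%Y-%m')}:{phone}"

def partitions_since(phone: str, last_seen: datetime, now: datetime):
    """Row keys a client must query to catch up on messages since last_seen."""
    keys = []
    y, m = last_seen.year, last_seen.month
    while (y, m) <= (now.year, now.month):
        keys.append(row_key(phone, datetime(y, m, 1, tzinfo=timezone.utc)))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys
```

A client that was last online on 2012-12-20 and reconnects on 2013-02-07 would query the 2012-12, 2013-01 and 2013-02 partitions, filtering each by the time uuid of its last seen message.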
Re: Netflix/Astynax Client for Cassandra
I'm going to guess Netflix are running Astynax in production with Cassandra 1.1. cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 6:50 AM, Cassa L lcas...@gmail.com wrote: Thank you all for the responses to this thread. I am planning to use Cassandra 1.1.9 with Astynax. Does anyone have a Cassandra 1.x version running in production with Astynax? Did you come across any show-stopper issues? Thanks LCassa On Thu, Feb 7, 2013 at 8:50 AM, Bartłomiej Romański b...@sentia.pl wrote: Hi, Does anyone know about virtual node support in Astynax? Is it handled correctly? Especially with ConnectionPoolType.TOKEN_AWARE? Thanks, BR
Re: are CFs consistent after a repair
'nodetool -pr repair'
Assuming nodetool repair -pr. If there is no write activity, all reads (at any CL level) will return the same value after a successful repair. If there is write activity there is always a possibility of inconsistencies, and so only access where R + W > N (e.g. QUORUM + QUORUM) will be consistent. Can you drill down into the consistency problem? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 7:01 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote: I'm confused about consistency. I have a 6-node cluster (RF=3) and I have a table that was known to be inconsistent across replicas (a Hadoop app was sensitive to this). So I did a 'nodetool -pr repair' on every node in the cluster. After the repairs were complete, the Hadoop app still indicated inconsistencies. Is this to be expected? Brian
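The R + W > N rule Aaron cites can be written down directly (an illustrative sketch, not Cassandra code): a read is guaranteed to see the latest write only when the read and write replica sets must overlap.

```python
def quorum(n: int) -> int:
    """Replicas contacted at QUORUM for replication factor n: floor(n/2) + 1."""
    return n // 2 + 1

def strongly_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set intersects every write set, i.e. R + W > N."""
    return r + w > n

# RF=3: QUORUM writes + QUORUM reads -> 2 + 2 > 3, consistent
# RF=3: ONE writes + ONE reads       -> 1 + 1 <= 3, a read may miss the latest write
```

This is why, once writes continue after a repair, only R + W > N access patterns stay consistent.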
Re: High CPU usage during repair
During repair I see high CPU consumption,
Repair reads the data and computes a hash; this is a CPU intensive operation. Is the CPU overloaded or is it just under load?
I run Cassandra version 1.0.11, on 3 node setup on EC2 instances.
What machine size?
there are compactions waiting.
That's normally ok. How many are waiting?
I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16
That will remove throttling on compaction and the validation compaction used for the repair, which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I run repair weekly, using a scheduled cron job. During repair I see high CPU consumption, and messages in the log file
INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264
From time to time, there are also messages of the form
INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms
Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. 
I have the default settings: compaction_throughput_mb_per_sec: 16, in_memory_compaction_limit_in_mb: 64, multithreaded_compaction: false, compaction_preheat_key_cache: true. I am thinking of the following solution, and wanted to ask if I am on the right track: I thought of adding a call to my repair script, before repair starts, to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16. Is this the right solution? Thanks, Tamar Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956
Re: Read-repair working, repair not working?
I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
What CL are you using for reads and writes?
I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
- Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?
I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.
- Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, and for what proportion of total requests)?
It would, but the INFO logging for the AES is pretty good. I would hold off for now.
- Is there something obvious that I could be missing here?
When a new AES session starts it logs this:
logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));
When it completes it logs this:
logger.info(String.format("[repair #%s] session completed successfully", getName()));
Or this on failure:
logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 9:56 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi, I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency / availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring. However, running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour. After a lot of fruitless investigation, I decided to wipe, re-install and re-populate the nodes. The re-install and repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt. When I run repair commands, I don’t see any errors in the logs. I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions. I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count. Questions :
- Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?
- Would increasing the logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, and for what proportion of total requests)?
- Is there something obvious that I could be missing here?
Many thanks, Brian
Re: Issues with writing data to Cassandra column family using a Hive script
Don't use the variable length Cassandra integer, use the Int32Type. It also sounds like you want to use a DoubleType rather than FloatType. http://www.datastax.com/docs/datastax_enterprise2.2/solutions/about_hive#hive-to-cassandra-table-mapping Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 4:15 PM, Dinusha Dilrukshi sdddilruk...@gmail.com wrote: Hi All, Data was originally stored in a column family called test_cf. The definition of the column family is as follows:

CREATE COLUMN FAMILY test_cf WITH COMPARATOR = 'IntegerType' AND key_validation_class = UTF8Type AND default_validation_class = FloatType;

And the following is the sample data set contained in test_cf:

cqlsh:temp_ks> select * from test_cf;
 key            | column1    | value
----------------+------------+-------
 localhost:8282 | 1350468600 |    76
 localhost:8282 | 1350468601 |    76

The Hive script (shown at the end of the mail) is used to take the data from the above column family test_cf and insert it into a new column family called cpu_avg_5min_new7. The column family description of cpu_avg_5min_new7 is also the same as test_cf. The issue is, the data written into the cpu_avg_5min_new7 column family after executing the Hive script is as follows. It's not in the format of the data present in the original column family test_cf. Any explanations would be highly appreciated.
cqlsh:temp_ks> select * from cpu_avg_5min_new7;
 key            | column1                  | value
----------------+--------------------------+----------
 localhost:8282 | 232340574229062170849328 | 1.09e-05
 localhost:8282 | 232340574229062170849329 | 1.09e-05

Hive script:

drop table cpu_avg_5min_new7_hive;
CREATE EXTERNAL TABLE IF NOT EXISTS cpu_avg_5min_new7_hive (src_id STRING, start_time INT, cpu_avg FLOAT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "temp_ks",
    "cassandra.ks.username" = "xxx",
    "cassandra.ks.password" = "xxx",
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "cpu_avg_5min_new7"
);
drop table xxx;
CREATE EXTERNAL TABLE IF NOT EXISTS xxx (src_id STRING, start_time INT, cpu_avg FLOAT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "temp_ks",
    "cassandra.ks.username" = "xxx",
    "cassandra.ks.password" = "xxx",
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "test_cf"
);
insert overwrite table cpu_avg_5min_new7_hive select src_id, start_time, cpu_avg from xxx;

Regards, Dinusha.
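The thread doesn't show the exact bytes Hive wrote, but the class of failure Aaron points at can be sketched: Int32Type is a fixed 4-byte big-endian value, while IntegerType is a variable-length two's-complement integer that decodes however many bytes it is given. A Python illustration (not the actual Hive serialization path):

```python
import struct

ts = 1350468600  # one of the epoch-second column names from test_cf above

# Int32Type: always exactly 4 big-endian bytes.
int32_bytes = struct.pack(">i", ts)

# IntegerType: minimal-length big-endian two's-complement,
# like Java's BigInteger.toByteArray().
var_bytes = ts.to_bytes((ts.bit_length() + 8) // 8, "big", signed=True)
assert int32_bytes == var_bytes  # happens to be 4 bytes for this value

# But if a writer hands Cassandra a wider value (say an 8-byte long plus
# stray framing bytes), IntegerType folds every byte into one big number:
mangled = struct.pack(">q", ts) + b"\x00\x00"
print(int.from_bytes(mangled, "big", signed=True))  # ts << 16, not ts
```

Extra bytes interpreted under IntegerType is one way a small timestamp can come back as a 24-digit column name; pinning the comparator to Int32Type removes that ambiguity.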
Re: Cassandra 1.1.2 - 1.1.8 upgrade
I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware: nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike mthero...@yahoo.com wrote: Thank you, Another question on this topic. Upgrading from 1.1.2 to 1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2/1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: always check NEWS.txt; for instance, for cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike mthero...@yahoo.com wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 to 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone had any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything remain as-is? Thanks, -Mike
Re: Cassandra flush spin?
Sounds like flushing due to memory consumption. The flush log messages include the number of ops, so you can see if this node was processing more mutations than the others. Try to see if there was more (serialised) data being written or more operations being processed. Also, just for fun, check the JVM and yaml settings are as expected. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:29 AM, Mike mthero...@yahoo.com wrote: Hello, We just hit a very odd issue in our Cassandra cluster. We are running Cassandra 1.1.2 in a 6 node cluster. We use a replication factor of 3, and all operations utilize LOCAL_QUORUM consistency. We noticed a large performance hit in our application's maintenance activities and I've been investigating. I discovered a node in the cluster that was flushing a memtable like crazy. It was flushing every 2-3 minutes, and has been apparently doing this for days. Typically, during this time of day, a flush would happen every 30 minutes or so.

alldb.sh "cat /var/log/cassandra/system.log | grep \"flushing high-traffic column family CFS(Keyspace='open', ColumnFamily='msgs')\" | grep 02-08 | wc -l"
[1] 18:41:04 [SUCCESS] db-1c-1 59
[2] 18:41:05 [SUCCESS] db-1c-2 48
[3] 18:41:05 [SUCCESS] db-1a-1 1206
[4] 18:41:05 [SUCCESS] db-1d-2 54
[5] 18:41:05 [SUCCESS] db-1a-2 56
[6] 18:41:05 [SUCCESS] db-1d-1 52

I restarted the database node, and, at least for now, the problem appears to have stopped. There are a number of things that don't make sense here. We use a replication factor of 3, so if this was being caused by our application, I would have expected 3 nodes in the cluster to have issues. Also, I would have expected the issue to continue once the node restarted. Another information point of interest, and I'm wondering if it has exposed a bug, is that this node was recently converted to use ephemeral storage on EC2, and was restored from a snapshot. After the restore, a nodetool repair was run. 
However, repair was going to run into some heavy activity for our application, and we canceled that validation compaction (2 of the 3 anti-entropy sessions had completed). The spin appears to have started at the start of the second session. Any hints? -Mike
Re: persisted ring state
Is that the right way to do it?
No. If you want to change the token for a node use nodetool move. Changing it like this will not make the node change its token, because after startup the token is stored in the System.LocationInfo CF.
or -Dcassandra.load_ring_state=false|true is only limited to changes to seed/listen_address ?
It's used when a node somehow has a bad view of the ring, and you want it to forget things. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 3:35 AM, S C as...@outlook.com wrote: In one of the scenarios that I encountered, I needed to change the token on the node. I added the new token and started the node with -Dcassandra.load_ring_state=false in anticipation that the node would not pick it up from the locally persisted data. Is that the right way to do it? or is -Dcassandra.load_ring_state=false|true only limited to changes to seed/listen_address ? Thanks, SC
Re: High CPU usage during repair
What machine size?
m1.large
If you are seeing high CPU move to an m1.xlarge, that's the sweet spot.
That's normally ok. How many are waiting?
I have seen 4 this morning
That's not really abnormal. The pending task count goes up when a file *may* be eligible for compaction, not when there is a compaction task waiting. If you suddenly create a number of new SSTables for a CF the pending count will rise; however, one of the tasks may compact all the sstables waiting for compaction, so the count will suddenly drop as well.
Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?
Yes. If you are seeing performance degrade during compaction or repair try reducing the throughput. I would attribute most of the problems you have described to using m1.large. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 9:16 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! Thanks for the response. See my answers and questions below. Thanks! Tamar Tamar Fraenkel Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Sun, Feb 10, 2013 at 10:04 PM, aaron morton aa...@thelastpickle.com wrote: During repair I see high CPU consumption, Repair reads the data and computes a hash; this is a CPU intensive operation. Is the CPU overloaded or is it just under load? Usually just load, but in the past two weeks I have seen CPU of over 90%! I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. What machine size? m1.large there are compactions waiting. That's normally ok. How many are waiting? 
I have seen 4 this morning I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16 That will remove throttling on compaction and the validation compaction used for the repair. Which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect. Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I run repair weekly, using a scheduled cron job. During repair I see high CPU consumption, and messages in the log file INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264 From time to time, there are also messages of the form INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on 3 node setup on EC2 instances. 
Re: CQL 3 compound row key error
That sounds like a bug, or something that is still under work. Sylvain has his finger on all things CQL. Can you raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA ? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 4:01 PM, Shahryar Sedghi shsed...@gmail.com wrote: I am moving my application from 1.1 to 1.2.1 to utilize the secondary index and simplify the data model. In 1.1 I was concatenating some fields into one, separated by ":", for the row key, and it was a big string. In 1.2 I use the compound row key shown in the following test case (interval and seq):

CREATE TABLE test (
    interval text,
    seq int,
    id int,
    severity int,
    PRIMARY KEY ((interval, seq), id)
) WITH CLUSTERING ORDER BY (id DESC);
CREATE INDEX ON test(severity);

select * from test where severity = 3 and interval = 't' and seq = 1;

results: Bad Request: Start key sorts after end key. This is not allowed; you probably should not specify end key at all under random partitioner

If I define the table as this:

CREATE TABLE test (
    interval text,
    id int,
    severity int,
    PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id DESC);

select * from test where severity = 3 and interval = 't1';

works fine. Is it a bug? Thanks in Advance Shahryar -- Life is what happens while you are making other plans. ~ John Lennon
Re: Read-repair working, repair not working?
CL.ONE : this is primarily for performance reasons …
This makes reasoning about correct behaviour a little harder. If there is any way you can run some tests with R + W > N strong consistency I would encourage you to do so. You will then have a baseline of what works.
(say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
The high number of inconsistencies seems to match the massive number of dropped Mutation messages. Even if Anti Entropy is running, if the node in HK is dropping so many messages there will be inconsistencies. It looks like the HK node is overloaded. I would check the logs for GC messages, check for CPU steal in a virtualised env, check for sufficient CPU + memory resources, check for IO stress.
20 node cluster running v1.0.7 split between 5 data centres, I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I
Do all DCs have the same number of nodes ?
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 9:13 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi Aaron, Many thanks for your reply - answers below. Cheers, Brian
What CL are you using for reads and writes? I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC.
CL.ONE : this is primarily for performance reasons but also because there are only three local nodes as you suggest and we need at least some resiliency. In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help? 
Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause? However I see some pretty aggressive data transfer speeds during the initial repairs, and the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up? Here are the tpstats for one of these nodes :

Pool Name                Active   Pending   Completed   Blocked   All time blocked
ReadStage                     0         0     4919185         0                  0
RequestResponseStage          0         0    16869994         0                  0
MutationStage                 0         0    16764910         0                  0
ReadRepairStage               0         0        3703         0                  0
ReplicateOnWriteStage         0         0           0         0                  0
GossipStage                   0         0      845225         0                  0
AntiEntropyStage              0         0       52441         0                  0
MigrationStage                0         0        4362         0                  0
MemtablePostFlusher           0         0         952         0                  0
StreamStage                   0         0          24         0                  0
FlushWriter                   0         0         960         0                  5
MiscStage                     0         0        3592         0                  0
AntiEntropySessions           4         4         121         0                  0
InternalResponseStage         0         0           0         0                  0
HintedHandoff                 1         2          55         0                  0

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR         150597
BINARY                   0
READ                781490
MUTATION            853846
REQUEST_RESPONSE         0

The numbers of READ_REPAIR, READ and MUTATION operations are non-negligible. The nodes in Europe/North America have effectively zero dropped messages. This suggests network latency is probably a significant factor? [the network ping from Europe to a HK node is ~250ms, so I wouldn’t have expected it to be such a problem?]
It would, but the INFO logging for the AES is pretty good. I would hold off for now.
Ok.
[AES session logging] Yes, I see the expected start/end logs, so that's another thing off the list. On 10 Feb 2013, at 20:12, aaron morton aa...@thelastpickle.com wrote: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned
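Not part of the thread, but the scale of the problem falls out of simple ratios on the tpstats output above, comparing dropped message counts to the completed stage counts (a rough measure, since dropped and completed are counted separately):

```python
# Figures taken from the tpstats output quoted in the message above.
completed = {"MutationStage": 16764910, "ReadStage": 4919185}
dropped = {"MUTATION": 853846, "READ": 781490}

mutation_rate = dropped["MUTATION"] / completed["MutationStage"]  # ~5.1%
read_rate = dropped["READ"] / completed["ReadStage"]              # ~15.9%
print(f"mutations dropped: {mutation_rate:.1%}, reads dropped: {read_rate:.1%}")
```

The read drop ratio is roughly three times the mutation drop ratio, which supports Aaron's diagnosis that the HK node is overloaded rather than merely far away.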
Re: Cassandra 1.1.2 - 1.1.8 upgrade
You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The NEWS.txt file is your friend there. As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so, just to catch any craziness. I then upgrade all the nodes and run upgradesstables through the cluster. You can stagger upgradesstables to run on every RF'th node in the cluster to reduce the impact. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 8:05 PM, Michal Michalski mich...@opera.com wrote: 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2/1.1.9 configuration for a number of days. I'm about to upgrade my 1.1.0 cluster and http://www.datastax.com/docs/1.1/install/upgrading#info says: "If you are upgrading to Cassandra 1.1.9 from a version earlier than 1.1.7, all nodes must be upgraded before any streaming can take place. Until you upgrade all nodes, you cannot add version 1.1.7 nodes or later to a 1.1.7 or earlier cluster." Which one is correct then? Can I run a mixed 1.1.2 (in my case 1.1.0) / 1.1.9 cluster or not? M.
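Aaron's "every RF'th node" staggering can be sketched (illustrative Python, not a real tool): with ring-ordered nodes and SimpleStrategy-style placement, replicas of a row sit on adjacent nodes, so a batch of every RF'th node never contains two replicas of the same data; at most one replica per row is busy running upgradesstables at a time.

```python
def staggered_batches(nodes, rf):
    """Split ring-ordered nodes into rf batches of every rf'th node, so any
    rf adjacent nodes (one replica set under SimpleStrategy) never share a batch."""
    return [nodes[i::rf] for i in range(rf)]

batches = staggered_batches(["n1", "n2", "n3", "n4", "n5", "n6"], rf=3)
# -> [["n1", "n4"], ["n2", "n5"], ["n3", "n6"]]
```

With RF=3 and QUORUM access, the two untouched replicas in each set keep serving reads and writes while the third churns through its sstables.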
Re: Cassandra jmx stats ReadCount
Are you using counters? They require a read before write. Also secondary index CFs require a read before write. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 1:26 PM, Daning Wang dan...@netseer.com wrote: We have an 8 node cluster on Cassandra 1.1.0, with replication factor 3. We found that when you just insert data, not only does WriteCount increase, the ReadCount also increases. How could this happen? I am under the impression that ReadCount only counts the reads from clients. Thanks, Daning
Re: Directory structure after upgrading 1.0.8 to 1.2.1
I think it's a little more subtle than that: https://issues.apache.org/jira/browse/CASSANDRA-5242 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 8/02/2013, at 10:21 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote: Yes, they are new directories. I did some debugging … The Cassandra code is org.apache.cassandra.db.Directories::migrateFile. It is detecting that it is a manifest (based on the .json extension), but then it does not take into account that something like MyColumnFamily-old.json can exist. It then uses MyColumnFamily-old as a directory name in a call to a function destDir = getOrCreate(ksDir, dirname, additionalPath), while it should be MyColumnFamily. So I guess that the cfname computation should be adapted to include the “-old.json” manifest files. Ignace From: aaron morton [mailto:aa...@thelastpickle.com] Sent: vrijdag 8 februari 2013 03:09 To: user@cassandra.apache.org Subject: Re: Directory structure after upgrading 1.0.8 to 1.2.1 the -old.json is an artefact of Levelled Compaction. You should see a non -old file in the current CF folder. I'm not sure what would have created the -old CF dir. Does the timestamp indicate it was created at the time the server first started as a 1.2 node? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 7/02/2013, at 10:39 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote: After upgrading from 1.0.8 I see that the directory structure has changed and now has a structure like keyspace/columnfamily (part of the 1.1.x migration). But I also see that directories appear like keyspace/columnfamily-old, and the content of that ‘old’ directory is only one file, columnfamily-old.json. Questions : Should this xxx-old.json file be in the other directory? Should the extra directory xxx-old not be created? Or was that intentionally done and is it allowed to remove these directories ( manually … )? Thanks
Re: Healthy JVM GC
-Xms8049M -Xmx8049M -Xmn800M

That's a healthy amount of memory for the JVM. If you are using Row Caches, reduce their size and/or ensure you are using Serializing (off heap) caches. Also consider changing the yaml conf flush_largest_memtables_at from 0.75 to 0.80 so it is different to the CMS occupancy setting. If you have a lot of rows, 100's of millions, consider reducing the bloom filter false positive ratio. Or just upgrade to 1.2 which uses less JVM memory. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 9/02/2013, at 7:46 AM, André Cruz andre.c...@co.sapo.pt wrote: Hello. I've noticed I get the frequent JVM warning in the logs about the heap being full:

WARN [ScheduledTasks:1] 2013-02-08 18:14:20,410 GCInspector.java (line 145) Heap is 0.731554347747841 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:14:20,418 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FilesPerBlock') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:14:20,418 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-FilesPerBlock@1804403938(6275300/63189158 serialized/live bytes, 52227 ops)
INFO [FlushWriter:4500] 2013-02-08 18:14:20,419 Memtable.java (line 264) Writing Memtable-FilesPerBlock@1804403938(6275300/63189158 serialized/live bytes, 52227 ops)
INFO [FlushWriter:4500] 2013-02-08 18:14:21,059 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/FilesPerBlock/Disco-FilesPerBlock-he-6154-Data.db (6332375 bytes) for commitlog position ReplayPosition(segmentId=1357730625412, position=10756636)
WARN [ScheduledTasks:1] 2013-02-08 18:23:31,970 GCInspector.java (line 145) Heap is 0.6835904101057064 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:23:31,971 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='BlocksKnownPerUser') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:23:31,972 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-BlocksKnownPerUser@2072550435(1834642/60143054 serialized/live bytes, 67010 ops)
INFO [FlushWriter:4501] 2013-02-08 18:23:31,972 Memtable.java (line 264) Writing Memtable-BlocksKnownPerUser@2072550435(1834642/60143054 serialized/live bytes, 67010 ops)
INFO [FlushWriter:4501] 2013-02-08 18:23:32,827 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/BlocksKnownPerUser/Disco-BlocksKnownPerUser-he-484930-Data.db (7404407 bytes) for commitlog position ReplayPosition(segmentId=1357730625413, position=6093472)
WARN [ScheduledTasks:1] 2013-02-08 18:29:46,198 GCInspector.java (line 145) Heap is 0.6871977390878024 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:29:46,199 StorageService.java (line 2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FileRevision') to relieve memory pressure
INFO [ScheduledTasks:1] 2013-02-08 18:29:46,200 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-FileRevision@1526026442(7245147/63711465 serialized/live bytes, 23779 ops)
INFO [FlushWriter:4502] 2013-02-08 18:29:46,201 Memtable.java (line 264) Writing Memtable-FileRevision@1526026442(7245147/63711465 serialized/live bytes, 23779 ops)
INFO [FlushWriter:4502] 2013-02-08 18:29:46,769 Memtable.java (line 305) Completed flushing /servers/storage/cassandra-data/Disco/FileRevision/Disco-FileRevision-he-5438-Data.db (5480642 bytes) for commitlog position ReplayPosition(segmentId=1357730625413, position=29816878)
INFO [ScheduledTasks:1] 2013-02-08 18:34:13,442 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 352 ms for 1 collections, 5902597760 used; max is 8357150720
WARN [ScheduledTasks:1] 2013-02-08 18:34:13,442 GCInspector.java (line 145) Heap is 0.7062930845406603 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-02-08 18:34:13,443 StorageService.java (line 2855) Flushing CFS(Keyspace
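The emergency-flush trigger discussed above is a simple occupancy check against the flush_largest_memtables_at threshold from cassandra.yaml. A minimal sketch of that arithmetic (illustrative only, not the actual Cassandra code), using the numbers from the ConcurrentMarkSweep log line above:

```python
def heap_occupancy(used_bytes, capacity_bytes):
    """Fraction of the JVM heap in use, as GCInspector reports it."""
    return used_bytes / capacity_bytes

def should_emergency_flush(used_bytes, capacity_bytes, flush_largest_memtables_at=0.75):
    """True when heap occupancy crosses the flush threshold."""
    return heap_occupancy(used_bytes, capacity_bytes) >= flush_largest_memtables_at

# 5902597760 used; max is 8357150720, per the GC log line above
occ = heap_occupancy(5902597760, 8357150720)
print(round(occ, 3))  # 0.706, matching the "Heap is 0.7062930845406603 full" warning
```

Raising the threshold to 0.80 (as suggested above) moves it further away from heap levels the cluster normally sits at, so these flushes fire less often.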
Re: Bootstrapping a new node to a virtual node cluster
Just checking if this sorted itself out? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 10/02/2013, at 1:15 AM, Jouni Hartikainen jouni.hartikai...@reaktor.fi wrote: Hello all, I have a cluster of three nodes running 1.2.1 and I'd like to increase the capacity by adding a new node. I'm using virtual nodes with 256 tokens and planning to use the same configuration for the new node as well. My cluster looks like this before adding the new node:

Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load     Tokens  Owns (effective)  Host ID                               Rack
UN 192.168.154.11  1.49 GB  256     100.0%            234b82a4-3812-4261-adab-deb805942d63  rack1
UN 192.168.154.12  1.6 GB   256     100.0%            577db21e-81ef-45fd-a67b-cfd39455c0f6  rack1
UN 192.168.154.13  1.64 GB  256     100.0%            6187cc5d-d44c-45cb-b738-1b87f5ae3dff  rack1

And corresponding gossipinfo:
/192.168.154.12 RPC_ADDRESS:192.168.154.12 DC:datacenter1 STATUS:NORMAL,-1072164398478041156 LOAD:1.719425018E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:577db21e-81ef-45fd-a67b-cfd39455c0f6 SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1
/192.168.154.11 RPC_ADDRESS:192.168.154.11 DC:datacenter1 STATUS:NORMAL,-1158837144480089281 LOAD:1.514343678E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:234b82a4-3812-4261-adab-deb805942d63 SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1
/192.168.154.13 RPC_ADDRESS:192.168.154.13 DC:datacenter1 STATUS:NORMAL,-1135137292201587328 LOAD:1.765093695E9 SCHEMA:ef2c294e-1a74-32c1-b169-3a6465b2053d NET_VERSION:6 HOST_ID:6187cc5d-d44c-45cb-b738-1b87f5ae3dff SEVERITY:0.0 RELEASE_VERSION:1.2.1 RACK:rack1

I have now set the correct network addresses and seeds in the cassandra.yaml of the new node (.14) and then started it with num_tokens set to 256 and initial_token commented out.
Everything seems to go OK as I get the following prints on the log: On node 192.168.154.11: INFO [GossipStage:1] 2013-02-09 12:30:28,126 Gossiper.java (line 784) Node /192.168.154.14 is now part of the cluster INFO [GossipStage:1] 2013-02-09 12:30:28,128 Gossiper.java (line 750) InetAddress /192.168.154.14 is now UP INFO [MiscStage:1] 2013-02-09 12:30:59,255 StreamOut.java (line 114) Beginning transfer to /192.168.154.14 And on node 192.168.154.14 (the new node): INFO 12:30:26,843 Loading persisted ring state INFO 12:30:26,846 Starting up server gossip WARN 12:30:26,853 No host ID found, created a4a0b918-a1c8-4acc-a050-672a96a5f110 (Note: This should happen exactly once per node). INFO 12:30:26,979 Starting Messaging Service on port 7000 INFO 12:30:27,014 JOINING: waiting for ring information INFO 12:30:28,602 Node /192.168.154.11 is now part of the cluster INFO 12:30:28,603 InetAddress /192.168.154.11 is now UP INFO 12:30:28,675 Node /192.168.154.12 is now part of the cluster INFO 12:30:28,678 InetAddress /192.168.154.12 is now UP INFO 12:30:28,751 Node /192.168.154.13 is now part of the cluster INFO 12:30:28,751 InetAddress /192.168.154.13 is now UP INFO 12:30:29,015 JOINING: schema complete, ready to bootstrap INFO 12:30:29,015 JOINING: getting bootstrap token INFO 12:30:29,157 JOINING: sleeping 3 ms for pending range setup INFO 12:30:59,159 JOINING: Starting to bootstrap... 
However, the new node does not show up in nodetool status (even if queried from the new node itself):

Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load     Tokens  Owns (effective)  Host ID                               Rack
UN 192.168.154.11  1.49 GB  256     100.0%            234b82a4-3812-4261-adab-deb805942d63  rack1
UN 192.168.154.12  1.6 GB   256     100.0%            577db21e-81ef-45fd-a67b-cfd39455c0f6  rack1
UN 192.168.154.13  1.64 GB  256     100.0%            6187cc5d-d44c-45cb-b738-1b87f5ae3dff  rack1

It shows up in the gossip still:
/192.168.154.12 RPC_ADDRESS:192.168.154.12 DC:datacenter1 STATUS:NORMAL,-1072164398478041156 LOAD:1.719430632E9 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:577db21e-81ef-45fd-a67b-cfd39455c0f6 SEVERITY:0.0 RELEASE_VERSION:1.2.1-SNAPSHOT RACK:rack1
/192.168.154.14 RPC_ADDRESS:192.168.154.14 DC:datacenter1 STATUS:BOOT,8077752099299332137 LOAD:105101.0 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:a4a0b918-a1c8-4acc-a050-672a96a5f110 RELEASE_VERSION:1.2.1-SNAPSHOT RACK:rack1
/192.168.154.11 RPC_ADDRESS:192.168.154.11 DC:datacenter1 STATUS:NORMAL,-1158837144480089281 LOAD:1.596505929E9 SCHEMA:19657c82-a7eb-37a8-b436-0ea712c57db2 NET_VERSION:6 HOST_ID:234b82a4-3812-4261-adab-deb805942d63 SEVERITY:0.0 RELEASE_VERSION:1.2.1-SNAPSHOT
Re: Deleting old items
So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive delete. In some cases, I don't know the row key and would like to delete all the rows starting by, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
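The range-scan-then-delete approach Aaron describes can be sketched in pure Python, with a dict standing in for a column family and `del` standing in for a row-level delete (all names here are illustrative, this is not a Cassandra client API):

```python
def delete_rows_with_prefix(cf, prefix):
    """Range-scan stand-in: find row keys matching a prefix, then issue
    a row-level delete for each. `cf` models a column family as
    {row_key: columns}."""
    doomed = sorted(key for key in cf if key.startswith(prefix))
    for key in doomed:
        del cf[key]  # in Cassandra this would be a row delete (a tombstone write)
    return doomed

cf = {"1050#a": {"c": 1}, "1050#b": {"c": 2}, "2000#x": {"c": 3}}
print(delete_rows_with_prefix(cf, "1050#"))  # ['1050#a', '1050#b']
```

The real cost is the scan: without knowing the row keys up front, every row has to be visited to decide whether it matches.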
Re: RuntimeException during leveled compaction
Snapshot all nodes so you have a backup: nodetool snapshot -t corrupt

Run nodetool scrub on the errant CF. Look for messages such as "Out of order row detected…" and "%d out of order rows found while scrubbing %s; Those have been written (in order) to a new sstable (%s)" in the logs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 6:13 AM, Andre Sprenger andre.spren...@getanet.de wrote: Hi, I'm running a 6 node Cassandra 1.1.5 cluster on EC2. We switched to leveled compaction a couple of weeks ago, this has been successful. Some days ago 3 of the nodes started to log the following exception during compaction of a particular column family:

ERROR [CompactionExecutor:726] 2013-02-11 13:02:26,582 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:726,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(84590743047470232854915142878708713938, 3133353533383530323237303130313030303232313537303030303132393832) >= current key DecoratedKey(28357704665244162161305918843747894551, 31333430313336313830333831303130313030303230313632303030303036363338) writing into /var/cassandra/data/AdServer/EventHistory/Adserver-EventHistory-tmp-he-68638-Data.db
 at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
 at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
 at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

Compaction does not happen any more for the column family and read performance gets worse because of the growing number of data files accessed during reads. Looks like one or more of the data files are corrupt and have keys that are stored out of order. Any help to resolve this situation would be greatly appreciated. Thanks Andre
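The RuntimeException in the trace above comes from SSTableWriter.beforeAppend, which enforces that keys are appended in increasing order; a corrupt input sstable with out-of-order keys violates that invariant mid-compaction. A toy model of the check (illustrative only, not the real implementation):

```python
class SSTableWriterModel:
    """Toy model of the append-order check that raises in the trace above."""
    def __init__(self):
        self.last_key = None

    def append(self, decorated_key):
        # Keys must arrive in strictly increasing order, as in beforeAppend()
        if self.last_key is not None and self.last_key >= decorated_key:
            raise RuntimeError(
                f"Last written key {self.last_key} >= current key {decorated_key}")
        self.last_key = decorated_key

w = SSTableWriterModel()
w.append(100)
try:
    w.append(50)  # out of order, like the keys from the corrupt data file
except RuntimeError as e:
    print("rejected:", e)
```

This is why scrub is the suggested fix: it rewrites the out-of-order rows into a new, correctly sorted sstable so the invariant holds again.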
Re: Cassandra becnhmark
I see the same keys in both nodes. Replication is not enabled. Why do you say that ? Check the schema for Keyspace1 using the cassandra-cli. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 9:31 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I am trying to do benchmark using the Cassandra-stress tool. They have given an example to insert data across 2 nodes – /tools/stress/bin/stress -d 192.168.1.101,192.168.1.102 -n 1000 But when I run this across my 2 node cluster, I see the same keys in both nodes. Replication is not enabled. Should it not have unique keys in both nodes ? Thanks, Kanwar
Re: what addresses to use in EC2 cluster (whenever an instance restarts it gets a new private ip)?
Cassandra handles nodes changing IP. The important thing to Cassandra is the token, not the IP. In your case did the replacement node have the same token as the failed one? You can normally work around these issues using commands like nodetool removetoken. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 10:04 AM, Andrey Ilinykh ailin...@gmail.com wrote: You have to use private IPs, but if an instance dies you have to bootstrap it with the replace token flag. If you use EC2 I'd recommend Netflix's Priam tool. It manages all that stuff, plus you have S3 backup. Andrey On Mon, Feb 11, 2013 at 11:35 AM, Brian Tarbox tar...@cabotresearch.com wrote: How do I configure my cluster to run in EC2? In my cassandra.yaml I have IP addresses under seed_provider, listen_address and rpc_address. I tried setting up my cluster using just the EC2 private addresses but when one of my instances failed and I restarted it there was a new private address. Suddenly my cluster thought it had five nodes rather than four. Then I tried using Elastic IP addresses (permanent addresses) but it turns out you get charged for network traffic between elastic addresses even if they are within the cluster. So...how do you configure the cluster when the IP addresses can change out from under you? Thanks. Brian Tarbox
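The point that the token, not the IP, identifies a node can be sketched as a ring keyed by token: a node that rejoins with the same token takes over the same range even if its private IP changed, while rejoining with a new token (as happened to Brian) looks like a fifth node. This is a schematic, not Cassandra's membership code:

```python
def join(ring, token, ip):
    """Register a node in the ring. The token is the identity; a node that
    comes back with the same token simply updates its IP."""
    ring[token] = ip

ring = {}
join(ring, -9223372036854775808, "10.0.0.1")
join(ring, 0, "10.0.0.2")
# The token-0 node dies and restarts on a new EC2 private IP:
join(ring, 0, "10.0.0.9")       # same token: still 2 nodes, IP updated
join(ring, 42, "10.0.0.9")      # new token instead: ring grows to 3 "nodes"
print(len(ring), ring[0])
```

Bootstrapping a replacement with the replace-token flag (or letting Priam manage tokens) is what keeps you in the "same token, new IP" case.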
Re: Cassandra 1.1.2 - 1.1.8 upgrade
You have linked to the 1.2 news file, which branched from 1.1 at some point. Look at the news file in the distribution you are installing or here https://github.com/apache/cassandra/blob/cassandra-1.1/NEWS.txt Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/02/2013, at 11:14 PM, Michal Michalski mich...@opera.com wrote: OK, thanks Aaron. I ask because NEWS.txt is not a big help in case of 1.1.5 versions because there's no info on them in it (especially on 1.1.7 which seems to be the most important one in this case, according to the DataStax' upgrade instructions) ;-) https://github.com/apache/cassandra/blob/trunk/NEWS.txt M. W dniu 11.02.2013 11:05, aaron morton pisze: You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The news.txt file is your friend there. As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so. Just to catch any crazy. I then upgrade all the nodes and run through the upgrade table. You can stagger upgrade table to be every RF'th node in the cluster to reduce the impact.
Re: Cassandra 1.1.2 - 1.1.8 upgrade
Does anyone know the impact of not running upgrade sstables? Or possibly not running it for several days? nodetool repair will not work. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 11:54 AM, Mike mthero...@yahoo.com wrote: So the upgrade sstables is recommended as part of the upgrade to 1.1.3 if you are using counter columns Also, there was a general recommendation (in another response to my question) to run upgrade sstables because of: upgradesstables always needs to be done between majors. While 1.1.2 - 1.1.8 is not a major, due to an unforeseen bug in the conversion to microseconds you'll need to run upgradesstables. Is this referring to: https://issues.apache.org/jira/browse/CASSANDRA-4432 Does anyone know the impact of not running upgrade sstables? Or possibly not running it for several days? Thanks, -Mike On 2/10/2013 3:27 PM, aaron morton wrote: I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike mthero...@yahoo.com wrote: Thank you, Another question on this topic. Upgrading from 1.1.2-1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2-1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: always check NEWS.txt for instance for cassandra 1.1.3 you need to run nodetool upgradesstables if your cf has counter.
On Wed, Jan 16, 2013 at 11:58 PM, Mike mthero...@yahoo.com wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 - 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra is supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone had any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, is the only thing that needs to change are the JAR files? Can everything remain as-is? Thanks, -Mike
Re: Upgrade to Cassandra 1.2
Were you upgrading to 1.2 AND running the shuffle or just upgrading to 1.2? If you have not run shuffle I would suggest reverting the changes to num_tokens and initial_token. This is a guess because num_tokens is only used at bootstrap. Just get upgraded to 1.2 first, then do the shuffle when things are stable. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote: Thanks Aaron. I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed. - I followed http://www.datastax.com/docs/1.2/install/upgrading, have merged cassandra.yaml, with the following parameters: num_tokens: 256 #initial_token: 0 The initial_token is commented out, the current token should be obtained from the system schema. - I did a rolling upgrade; during the upgrade, I got Broken Pipe errors from the nodes with the old version, is that normal? - After I upgraded 3 nodes (still have 5 to go), I found it is totally wrong, the first node upgraded owns 99.2% of the ring:

[cassy@d5:/usr/local/cassy conf]$ ~/bin/nodetool -h localhost status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load      Tokens  Owns   Host ID                               Rack
DN 10.210.101.117  45.01 GB  254     99.2%  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN 10.210.101.120  45.43 GB  256     0.4%   0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN 10.210.101.111  27.08 GB  256     0.4%   bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1

What was wrong? Please help. I could provide more information if you need. Thanks, Daning

On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote: There is a command line utility in 1.2 to shuffle the tokens… http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

$ ./cassandra-shuffle --help
Missing sub-command argument.
Usage: shuffle [options] <sub-command>

Sub-commands:
  create     Initialize a new shuffle operation
  ls         List pending relocations
  clear      Clear pending relocations
  en[able]   Enable shuffling
  dis[able]  Disable shuffling

Options:
  -dc, --only-dc        Apply only to named DC (create only)
  -tp, --thrift-port    Thrift port number (Default: 9160)
  -p, --port            JMX port number (Default: 7199)
  -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
  -en, --and-enable     Immediately enable shuffling (create only)
  -H, --help            Print help information
  -h, --host            JMX hostname or IP address (Default: localhost)
  -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote: On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote: I'd like to upgrade from 1.1.6 to 1.2.1, one big feature in 1.2 is that it can have multiple tokens in one node. but there is only one token in 1.1.6. how can I upgrade to 1.2.1 then breaking the token to take advantage of this feature? I went through this doc but it does not say how to change the num_token http://www.datastax.com/docs/1.2/install/upgrading Is there other doc about this upgrade path? Thanks, Daning I think for each node you need to change the num_token option in conf/cassandra.yaml (this only splits the current range into num_token parts) and run the bin/cassandra-shuffle command (this spreads it all over the ring).
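Manu's description, that raising num_tokens only splits the node's current range into num_token contiguous parts (shuffle then spreads them around the ring), can be sketched as a range split. This is an illustration of the idea, not shuffle's actual algorithm:

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` roughly equal,
    contiguous slices, as setting num_tokens on an existing node does."""
    width = (end - start) // parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# One node's former single range, split into 256 vnode-sized slices
slices = split_range(0, 2**64, 256)
print(len(slices))  # 256
```

Because the 256 slices are still contiguous, the node owns the same data until shuffle relocates slices to other nodes, which is why upgrading with num_tokens: 256 but without shuffling left Daning's first upgraded node owning almost the whole ring.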
Re: Cassandra 1.2.1 key cache error
This looks like a bug in 1.2 beta https://issues.apache.org/jira/browse/CASSANDRA-4553 Can you confirm you are running 1.2.1 and if you can re-create this with a clean install please create a ticket on https://issues.apache.org/jira/browse/CASSANDRA Thanks - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 13/02/2013, at 1:22 AM, Ahmed Guecioueur ahme...@gmail.com wrote: Hi I am currently evaluating Cassandra on a single node. Running the node seems fine, it responds to Thrift (via Hector) and CQL3 requests to create/delete keyspaces. I have not yet tested any data operations. However, I get the following each time the node is started. This is using the latest production jars (v 1.2.1) downloaded from the Apache website:

INFO [main] 2013-02-07 19:48:55,610 AutoSavingCache.java (line 139) reading saved cache C:\Cassandra\saved_caches\system-local-KeyCache-b.db
WARN [main] 2013-02-07 19:48:55,614 AutoSavingCache.java (line 160) error reading saved cache C:\Cassandra\saved_caches\system-local-KeyCache-b.db
java.io.EOFException
 at java.io.DataInputStream.readInt(Unknown Source)
 at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:349)
 at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:378)
 at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:144)
 at org.apache.cassandra.db.ColumnFamilyStore.init(ColumnFamilyStore.java:277)
 at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
 at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:364)
 at org.apache.cassandra.db.Table.initCf(Table.java:337)
 at org.apache.cassandra.db.Table.init(Table.java:280)
 at org.apache.cassandra.db.Table.open(Table.java:110)
 at org.apache.cassandra.db.Table.open(Table.java:88)
 at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:421)
 at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:177)
 at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
 at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
INFO [SSTableBatchOpen:1] 2013-02-07 19:48:56,212 SSTableReader.java (line 164) Opening C:\Cassandra\data\system_auth\users\system_auth-users-ib-1 (72 bytes)
INFO [main] 2013-02-07 19:48:56,242 CassandraDaemon.java (line 224) completed pre-loading (3 keys) key cache.

That binary file exists, though ofc the content is unreadable. Deleting the file and letting it be recreated doesn't help either. Can anyone suggest any other solutions? Cheers Ahmed
Re: Upgrade to Cassandra 1.2
Restore the settings for num_tokens and initial_token to what they were before you upgraded. They should not be changed just because you are upgrading to 1.2; they are used to enable virtual nodes, which are not necessary to run 1.2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 8:02 AM, Daning Wang dan...@netseer.com wrote: No, I did not run shuffle since the upgrade was not successful. What do you mean by reverting the changes to num_tokens and initial_token? Set num_tokens=1? initial_token should be ignored since it is not bootstrap, right? Thanks, Daning On Tue, Feb 12, 2013 at 10:52 AM, aaron morton aa...@thelastpickle.com wrote: Were you upgrading to 1.2 AND running the shuffle or just upgrading to 1.2? If you have not run shuffle I would suggest reverting the changes to num_tokens and initial_token. This is a guess because num_tokens is only used at bootstrap. Just get upgraded to 1.2 first, then do the shuffle when things are stable. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote: Thanks Aaron. I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed. - I followed http://www.datastax.com/docs/1.2/install/upgrading, have merged cassandra.yaml, with the following parameters: num_tokens: 256 #initial_token: 0 The initial_token is commented out, the current token should be obtained from the system schema. - I did a rolling upgrade; during the upgrade, I got Broken Pipe errors from the nodes with the old version, is that normal?
- After I upgraded 3 nodes (still have 5 to go), I found it is totally wrong, the first node upgraded owns 99.2% of the ring:

[cassy@d5:/usr/local/cassy conf]$ ~/bin/nodetool -h localhost status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address         Load      Tokens  Owns   Host ID                               Rack
DN 10.210.101.117  45.01 GB  254     99.2%  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN 10.210.101.120  45.43 GB  256     0.4%   0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN 10.210.101.111  27.08 GB  256     0.4%   bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1

What was wrong? Please help. I could provide more information if you need. Thanks, Daning

On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote: There is a command line utility in 1.2 to shuffle the tokens… http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

$ ./cassandra-shuffle --help
Missing sub-command argument.

Usage: shuffle [options] <sub-command>

Sub-commands:
  create     Initialize a new shuffle operation
  ls         List pending relocations
  clear      Clear pending relocations
  en[able]   Enable shuffling
  dis[able]  Disable shuffling

Options:
  -dc, --only-dc        Apply only to named DC (create only)
  -tp, --thrift-port    Thrift port number (Default: 9160)
  -p, --port            JMX port number (Default: 7199)
  -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
  -en, --and-enable     Immediately enable shuffling (create only)
  -H, --help            Print help information
  -h, --host            JMX hostname or IP address (Default: localhost)
  -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote: On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote: I'd like to upgrade from 1.1.6 to 1.2.1, one big feature in 1.2 is that it can have multiple tokens in one node. but there is only one token in 1.1.6.
how can I upgrade to 1.2.1 then breaking the token to take advantage of this feature? I went through this doc but it does not say how to change the num_token http://www.datastax.com/docs/1.2/install/upgrading Is there other doc about this upgrade path? Thanks, Daning I think for each node you need to change the num_token option in conf/cassandra.yaml (this only split the current range into num_token parts) and run the bin/cassandra-shuffle command (this spread it all over the ring).
Re: RuntimeException during leveled compaction
That sounds like something wrong with the way the rows are merged during compaction then. Can you run the compaction with DEBUG logging and raise a ticket? You may want to do this with the node not in the ring. Five minutes after it starts, it will run pending compactions, so if compactions are not running they should start again. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 8:11 PM, Andre Sprenger andre.spren...@getanet.de wrote: Aaron, thanks for your help. I ran 'nodetool scrub' and it finished after a couple of hours. But there is no info about out of order rows in the logs and the compaction on the column family still raises the same exception. With the row key I could identify some of the errant SSTables and removed them during a node restart. On some nodes compaction is working for the moment but there are likely more corrupt data files and then I would be in the same situation as before. So I still need some help to resolve this issue! Cheers Andre 2013/2/12 aaron morton aa...@thelastpickle.com snapshot all nodes so you have a backup: nodetool snapshot -t corrupt Run nodetool scrub on the errant CF. Look for messages such as: Out of order row detected… %d out of order rows found while scrubbing %s; Those have been written (in order) to a new sstable (%s) in the logs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 6:13 AM, Andre Sprenger andre.spren...@getanet.de wrote: Hi, I'm running a 6 node Cassandra 1.1.5 cluster on EC2. We switched to leveled compaction a couple of weeks ago, this has been successful.
Some days ago 3 of the nodes started to log the following exception during compaction of a particular column family:

ERROR [CompactionExecutor:726] 2013-02-11 13:02:26,582 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:726,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(84590743047470232854915142878708713938, 3133353533383530323237303130313030303232313537303030303132393832) >= current key DecoratedKey(28357704665244162161305918843747894551, 31333430313336313830333831303130313030303230313632303030303036363338) writing into /var/cassandra/data/AdServer/EventHistory/Adserver-EventHistory-tmp-he-68638-Data.db
 at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
 at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
 at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

Compaction does not happen any more for the column family and read performance gets worse because of the growing number of data files accessed during reads. Looks like one or more of the data files are corrupt and have keys that are stored out of order. Any help to resolve this situation would be greatly appreciated. Thanks Andre
Re: Deleting old items
Is that a feature that could possibly be developed one day ? No. Timestamps are essentially internal implementation used to resolve different values for the same column. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantage/inconvenient of reducing this value ? Yes, min_compaction_threshold, my bad. If you have a wide row and delete a lot of values you will end up with a lot of tombstones. These may dramatically reduce the read performance until they are purged. Reducing the compaction threshold makes compaction happen more frequently. Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later.. How to ensure that I'll always keep a small amount of SSTables then ? AFAIK it's not. There may be some confusion about the location of the settings in CLI vs CQL. Can you point to the docs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 10:14 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Aaron, once again thanks for this answer. So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. Why is there no way of deleting or getting data using the internal timestamp stored alongside of any inserted column (as described here: http://www.datastax.com/docs/1.1/ddl/column_family#standard-columns) ? Is that a feature that could possibly be developed one day ? It could be useful to perform delete of old data or to bring to a dev cluster just the last week of data for example. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantage/inconvenient of reducing this value ? Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later.. How to ensure that I'll always keep a small amount of SSTables then ? Why is this deprecated ? 
Alain 2013/2/12 aaron morton aa...@thelastpickle.com So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive delete. In some cases, I don't know the row key and would like to delete all the rows starting by, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
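Why Aaron says tombstones from wide-row deletes "may dramatically reduce the read performance until they are purged" can be seen in a toy read path: the read has to scan past every tombstone to find the live columns, so work is proportional to live plus deleted data. This is a schematic, not Cassandra's actual read code:

```python
def read_live_columns(row):
    """Scan a wide row, skipping tombstones. Work done is proportional to
    live columns PLUS tombstones, which is why many deletes hurt reads
    until compaction purges the tombstones."""
    scanned, live = 0, []
    for name, value in row:
        scanned += 1
        if value is not None:       # None marks a tombstone in this model
            live.append((name, value))
    return scanned, live

row = [("c1", "v1"), ("c2", None), ("c3", None), ("c4", "v4")]
scanned, live = read_live_columns(row)
print(scanned, len(live))  # 4 columns scanned for only 2 live results
```

More frequent compaction (e.g. lowering min_compaction_threshold to 2, as suggested) shortens the window in which those dead entries are still being scanned.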
Re: Deleting old items during compaction (WAS: Deleting old items)
That's what the TTL does. Manually delete all the older data now, then start using TTL. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 11:08 PM, Ilya Grebnov i...@metricshub.com wrote: Hi, We are looking for a solution to the same problem. We have a wide column family with counters and we want to delete old data, such as data more than 1 month old. One potential idea was to implement a hook in the compaction code and drop the columns which we don't need. Is this a viable option? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, February 12, 2013 9:01 AM To: user@cassandra.apache.org Subject: Re: Deleting old items So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
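The TTL semantics Aaron refers to can be modelled in a few lines. This is a toy, not Cassandra internals: a column written with a TTL records an expiry time, and reads treat expired columns as deleted. It also shows why data written without a TTL (like the old data here) never expires and must be deleted manually.

```python
# Toy model of per-column TTL. The function names and the tuple layout
# are illustrative only.
import time

def write(row, name, value, ttl=None, now=None):
    now = time.time() if now is None else now
    expiry = now + ttl if ttl is not None else None
    row[name] = (value, expiry)

def read(row, name, now=None):
    now = time.time() if now is None else now
    value, expiry = row.get(name, (None, None))
    if expiry is not None and expiry <= now:
        return None  # expired: treated as deleted
    return value

row = {}
write(row, "temp", 21, ttl=30, now=1000)   # expires at t=1030
write(row, "count", 5, now=1000)           # no TTL: never expires
# read(row, "temp", now=1010) -> 21, read(row, "temp", now=1031) -> None
```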
Re: Mutation dropped
You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the CL the client requested was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real-world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL = 1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair. Thanks, Kanwar
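The write path Aaron describes can be sketched as a toy coordinator (simplified, not Cassandra code): the mutation is sent to every replica, but the client call succeeds as soon as CL replicas acknowledge in time. Replicas that drop the mutation are fixed up later by Read Repair or Anti Entropy Repair.

```python
# Toy sketch of why CL=1 writes succeed even while mutations are dropped.
def coordinator_write(replica_acks, consistency_level):
    """replica_acks: one boolean per replica, True if that replica
    acknowledged the mutation before rpc_timeout."""
    acked = sum(replica_acks)
    dropped = len(replica_acks) - acked
    success = acked >= consistency_level
    return success, dropped

# RF=2, CL=1: one replica falls behind and drops the mutation,
# yet the client still sees success.
success, dropped = coordinator_write([True, False], consistency_level=1)
```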
Re: [nodetool] repair with vNodes
I'm a bit late, but for reference. Repair runs in two stages: first, differences are detected. You can monitor the validation compaction with nodetool compactionstats. Then the differences are streamed between the nodes; you can monitor that with nodetool netstats. Nodetool repair command has been running for almost 24hours and I can’t see any activity from the logs or JMX. Grep for session completed Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 11:38 PM, Haithem Jarraya haithem.jarr...@struq.com wrote: Hi, I am new to Cassandra and I would like to hear your thoughts on this. We are running our tests with Cassandra 1.2.1, on a relatively small dataset ~60GB. The nodetool repair command has been running for almost 24 hours and I can’t see any activity from the logs or JMX. What am I missing? Or is there a problem with nodetool repair? What other commands can I run to do a sanity check on the cluster? Can I run nodetool repair on different nodes at the same time? Here is the current test deployment of Cassandra $ nodetool status Datacenter: ams01 (Replication Factor 2) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.70.48.23 38.38 GB 256 19.0% 7c5fdfad-63c6-4f37-bb9f-a66271aa3423 RAC1 UN 10.70.6.78 58.13 GB 256 18.3% 94e7f48f-d902-4d4a-9b87-81ccd6aa9e65 RAC1 UN 10.70.47.126 53.89 GB 256 19.4% f36f1f8c-1956-4850-8040-b58273277d83 RAC1 Datacenter: wdc01 (Replication Factor 1) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.24.116.66 65.81 GB 256 22.1% f9dba004-8c3d-4670-94a0-d301a9b775a8 RAC1 Datacenter: sjc01 (Replication Factor 1) = Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.55.104.90 63.31 GB 256 21.2% 4746f1bd-85e1-4071-ae5e-9c5baac79469 RAC1 Many Thanks, Haithem
Re: Question on Cassandra Snapshot
With incremental_backup turned OFF in cassandra.yaml - Are all SSTables under /data/TestKeySpace/ColumnFamily at all times? No. They are deleted when they are compacted and no internal operations are referencing them. With incremental_backup turned ON in cassandra.yaml - Are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups? Yes, sort of. *All* SSTables ever created are in the backups directory. Not just the ones currently live. Let's say I have taken a snapshot and moved the /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape, at what point should I be backing up *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach? Back up all files in the snapshot. There may be files with non-.db extensions if you use levelled compaction. When you are finished with the snapshot delete it. If the inode is no longer referenced from the live data dir it will be deleted. I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ what are these timestamp directories? Probably automatic snapshots from dropping KS or CFs. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 4:41 AM, S C as...@outlook.com wrote: I appreciate any advice or pointers on this. Thanks in advance. From: as...@outlook.com To: user@cassandra.apache.org Subject: Question on Cassandra Snapshot Date: Thu, 14 Feb 2013 20:47:14 -0600 I have been looking at incremental backups and snapshots. I have done some experimentation but could not come to a conclusion. Can somebody please help me understand it right? /data is my data partition With incremental_backup turned OFF in cassandra.yaml - Are all SSTables under /data/TestKeySpace/ColumnFamily at all times?
With incremental_backup turned ON in cassandra.yaml - Are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups? Let's say I have taken a snapshot and moved the /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape, at what point should I be backing up *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach? I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ what are these timestamp directories? Thanks in advance. SC
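The hardlink behaviour behind snapshots and the inode-matching question above can be demonstrated directly. A snapshot file is a hardlink, so it shares an inode with the live SSTable, and the data blocks stay on disk until the last link is removed. The paths below are a temp-dir stand-in for the data/KeySpace/ColumnFamily layout, not real Cassandra paths.

```python
# Demonstrate hardlinks and inode sharing, as used by snapshots.
import os
import tempfile

datadir = tempfile.mkdtemp()
live = os.path.join(datadir, "ColumnFamily-1-Data.db")
snapdir = os.path.join(datadir, "snapshots")
os.makedirs(snapdir)

with open(live, "w") as f:
    f.write("sstable contents")

snap = os.path.join(snapdir, "ColumnFamily-1-Data.db")
os.link(live, snap)  # what taking a snapshot does per SSTable

# The snapshot shares an inode with the live file...
same_inode = os.stat(live).st_ino == os.stat(snap).st_ino

# ...so when compaction removes the live file, the snapshot copy
# still holds the data until the snapshot itself is cleared.
os.remove(live)
snapshot_survives = os.path.exists(snap)
```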
Re: odd production issue today 1.1.4
There is always this old chestnut http://wiki.apache.org/cassandra/FAQ#ubuntu_hangs A - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 8:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote: With hyper threading a core can show up as two or maybe even four physical system processors, this is something the kernel does. On Fri, Feb 15, 2013 at 11:41 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We ran into an issue today where the website became around 10 times slower. We found out node 5 out of our 6 nodes was hitting 2100% cpu (cat /proc/cpuinfo reveals a 16 processor machine). I am really not sure how to hit 2100% unless we had 21 processors. It bounces between 300% and 2100% so I tried to do a thread dump and had to use -F, and then HotSpot hit a NullPointerException :(. I copied off all my logs after restarting (should have done it before restarting it). Any ideas what I could even look for as to what went wrong with this node? Also, we know our astyanax for some reason is not set up properly yet so we probably would not have seen an issue had we had all nodes in the seed list (which we changed today) as astyanax is supposed to be measuring time per request and changing which nodes it hits, but we know it only hits nodes in our seed list right now as we have not fixed that yet. Our astyanax was hitting 3,4,5,6 and did not have 1 and 2 in the seed list (we roll out a new version next Wed. with the new seed list including the last two, delaying the dynamic discovery config we need to look at). Thanks, Dean Commands I ran with jstack that didn't work out too well…. [cassandra@a5 ~]$ jstack -l 20907 > threads.txt 20907: Unable to open socket file: target process not responding or HotSpot VM not loaded The -F option can be used when the target process is not responding [cassandra@a5 ~]$ jstack -l -F 20907 > threads.txt Attaching to process ID 20907, please wait... Debugger attached successfully. Server compiler detected.
JVM version is 20.7-b02 java.lang.NullPointerException at sun.jvm.hotspot.oops.InstanceKlass.computeSubtypeOf(InstanceKlass.java:426) at sun.jvm.hotspot.oops.Klass.isSubtypeOf(Klass.java:137) at sun.jvm.hotspot.oops.Oop.isA(Oop.java:100) at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:93) at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:39) at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:52) at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45) at sun.jvm.hotspot.tools.JStack.run(JStack.java:60) at sun.jvm.hotspot.tools.Tool.start(Tool.java:221) at sun.jvm.hotspot.tools.JStack.main(JStack.java:86) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.tools.jstack.JStack.runJStackTool(JStack.java:118) at sun.tools.jstack.JStack.main(JStack.java:84) [cassandra@a5 ~]$ java -version java version 1.6.0_32
Re: cassandra vs. mongodb quick question
If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, a JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, and the memory considerations for 100s of millions of rows. If the performance of those operations is acceptable to you, then go crazy. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:05 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So I found out mongodb varies their node size from 1T to 42T per node depending on the profile. So if I was going to be writing a lot but rarely changing rows, could I also use cassandra with a per node size of +20T or is that not advisable? Thanks, Dean
Re: can we pull rows out compressed from cassandra(lots of rows)?
No. The rows are uncompressed deep down in the IO stack. There is compression in the binary protocol http://www.datastax.com/dev/blog/binary-protocol https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol.spec;hb=refs/heads/cassandra-1.2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Thanks, Dean
Re: Deleting old items
I'll email the docs people. I believe they are saying to use compaction throttling rather than this setting, not that this setting does nothing. Although I used this in the last month on a machine with very little RAM to limit compaction memory use. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 17/02/2013, at 7:05 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Can you point to the docs. http://www.datastax.com/docs/1.1/configuration/storage_configuration#max-compaction-threshold And thanks about the rest of your answers, once again ;-). Alain 2013/2/16 aaron morton aa...@thelastpickle.com Is that a feature that could possibly be developed one day ? No. Timestamps are essentially an internal implementation detail used to resolve different values for the same column. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantages/inconveniences of reducing this value ? Yes, min_compaction_threshold, my bad. If you have a wide row and delete a lot of values you will end up with a lot of tombstones. These may dramatically reduce the read performance until they are purged. Reducing the compaction threshold makes compaction happen more frequently. Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later. How to ensure that I'll always keep a small amount of SSTables then ? AFAIK it's not. There may be some confusion about the location of the settings in CLI vs CQL. Can you point me to the docs? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 10:14 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Aaron, once again thanks for this answer. So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No.
Why is there no way of deleting or getting data using the internal timestamp stored alongside of any inserted column (as described here: http://www.datastax.com/docs/1.1/ddl/column_family#standard-columns) ? Is that a feature that could possibly be developed one day ? It could be useful to perform deletes of old data or to bring to a dev cluster just the last week of data, for example. With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantages/inconveniences of reducing this value ? Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later. How to ensure that I'll always keep a small amount of SSTables then ? Why is this deprecated ? Alain 2013/2/12 aaron morton aa...@thelastpickle.com So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
Re: Is there any consolidated literature about Read/Write and Data Consistency in Cassandra ?
If you want the underlying ideas try the Dynamo paper, the Big Table paper and the original Cassandra paper from facebook. Start here http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 17/02/2013, at 7:40 AM, mateus mat...@tripleoxygen.net wrote: Like articles with tests and conclusions about it, and such, and not like the documentation in DataStax, or the Cassandra Books. Thank you.
Re: nodetool repair with vnodes
…so it seems to me that it is running on all vnodes ranges. Yes. Also, whatever the node which I launch the command on is, only one node log is moving and is always the same node. Not sure what you mean here. So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. Is there anything I'm missing ? Look for messages with session completed in the log from the AntiEntropyService. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 12:51 AM, Marco Matarazzo marco.matara...@hexkeep.com wrote: Greetings. I'm trying to run nodetool repair on a Cassandra 1.2.1 cluster of 3 nodes with 256 vnodes each. On a pre-1.2 cluster I used to launch a nodetool repair on every node every 24hrs. Now I'm getting a different behavior, and I'm sure I'm missing something. What I see on the command line is: [2013-02-17 10:20:15,186] Starting repair command #1, repairing 768 ranges for keyspace goh_master [2013-02-17 10:48:13,401] Repair session 3d140e10-78e3-11e2-af53-d344dbdd69f5 for range (6556914650761469337,6580337080281832001] finished (…repeat the last line 767 times) …so it seems to me that it is running on all vnodes ranges. Also, whatever the node which I launch the command on is, only one node log is moving and is always the same node. So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. I'm sure I'm making some mistakes, and I just can't find any clue of what's wrong with my nodetool usage in the documentation (if anything is wrong, btw). Is there anything I'm missing ? -- Marco Matarazzo
Re: Deleting old items during compaction (WAS: Deleting old items)
Sorry, missed the Counters part. You are probably interested in this one https://issues.apache.org/jira/browse/CASSANDRA-5228 Add your need to the ticket to help it along. IMHO if you have write once, read many time series data the SSTables are effectively doing horizontal partitioning for you. So being able to drop a partition would make life easier. If you can delete the entire row then the deletes have less impact than per column. However the old rows will not be purged from disk unless all fragments of the row are involved in a compaction process. So it may take some time to purge from disk, depending on the workload. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:43 AM, Ilya Grebnov i...@metricshub.com wrote: According to https://issues.apache.org/jira/browse/CASSANDRA-2103 there is no support for time to live (TTL) on counter columns. Did I miss something? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Sunday, February 17, 2013 9:16 AM To: user@cassandra.apache.org Subject: Re: Deleting old items during compaction (WAS: Deleting old items) That's what the TTL does. Manually delete all the older data now, then start using TTL. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/02/2013, at 11:08 PM, Ilya Grebnov i...@metricshub.com wrote: Hi, We are looking for a solution to the same problem. We have a wide column family with counters and we want to delete old data, such as data more than 1 month old. One potential idea was to implement a hook in the compaction code and drop the columns which we don't need. Is this a viable option? Thanks, Ilya From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, February 12, 2013 9:01 AM To: user@cassandra.apache.org Subject: Re: Deleting old items So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? No. You need to issue row level deletes.
If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, I would like to know if there is a way to delete old/unused data easily ? I know about TTL but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete but it seems inappropriate for such a massive deletion. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understood that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ? Alain
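Aaron's point above — that a deleted row is purged from disk only when every SSTable holding a fragment of it takes part in a compaction — can be captured in a toy rule. This is not Cassandra code (and it ignores gc_grace_seconds), just the condition he states.

```python
# Toy model: a row's tombstone can be purged only when the compaction
# includes *every* SSTable containing a fragment of that row.
def can_purge(row_key, compacting, all_sstables):
    """True if all SSTables holding this row take part in the compaction."""
    holding = {name for name, keys in all_sstables.items() if row_key in keys}
    return holding <= set(compacting)

sstables = {
    "sst1": {"rowA", "rowB"},  # newest rowA fragment + tombstone
    "sst2": {"rowA"},          # older rowA fragment
    "sst3": {"rowC"},
}

# A compaction that leaves out sst2 cannot purge rowA; one that
# includes both rowA SSTables can.
```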
Re: nodetool repair with vnodes
So, running it periodically on just one node is enough for cluster maintenance ? In the special case where you have RF == number of nodes. The recommended approach is to use -pr and run it on each node periodically. Also: running it with -pr does output: That does not look right. There should be messages about requesting and receiving Merkle trees from other nodes, and that certain CFs are in sync. These are all logged from the AntiEntropyService. Is there a way to run it only for all vnodes on a single physical node ? It should be doing that. Look for messages like this in the log: logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames))); They say how much is going to be synced, and with what. Try running repair with -pr on one of the nodes not already repaired. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 11:12 AM, Marco Matarazzo marco.matara...@hexkeep.com wrote: So, to me, it's like the nodetool repair command is running always on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. That's correct, 3 nodes and RF 3. Sorry for not specifying it in the beginning. So, running it periodically on just one node is enough for cluster maintenance ? Does this depend on the fact that every vnode's data is related to the previous and next vnode, and this particular setup making it enough to cover every physical node?
Also: running it with -pr does output: [2013-02-17 12:29:25,293] Nothing to repair for keyspace 'system' [2013-02-17 12:29:25,301] Starting repair command #2, repairing 1 ranges for keyspace keyspace_test [2013-02-17 12:29:28,028] Repair session 487d0650-78f5-11e2-a73a-2f5b109ee83c for range (-9177680845984855691,-9171525326632276709] finished [2013-02-17 12:29:28,028] Repair command #2 finished … that, as far as I can understand, works on the first vnode on the specified node, or so it seems from the output range. Am I right? Is there a way to run it only for all vnodes on a single physical node ? Thank you! -- Marco Matarazzo
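The difference between plain repair and repair -pr in this thread can be illustrated with a toy token ring (a sketch, not Cassandra's real replication code): without -pr a repair covers every range the node replicates, so with RF == number of nodes one run covers the whole cluster; with -pr each node repairs only its primary ranges, so every node must run it for the ring to be covered exactly once.

```python
# Toy ring: 3 nodes, RF 3, one primary range per node for simplicity
# (a vnode cluster just has many more ranges per node).
nodes = ["A", "B", "C"]
rf = 3
ranges = list(range(len(nodes)))  # primary range i is owned by nodes[i]

def replicas(rng):
    """Replicas are the owner plus the next rf-1 nodes on the ring."""
    return {nodes[(rng + k) % len(nodes)] for k in range(rf)}

def repair(node, primary_only):
    if primary_only:  # nodetool repair -pr
        return [r for r in ranges if nodes[r] == node]
    return [r for r in ranges if node in replicas(r)]

# Without -pr and RF == cluster size, one node repairs every range.
full = repair("A", primary_only=False)
# With -pr, running it on every node covers each range exactly once.
covered = sorted(r for n in nodes for r in repair(n, primary_only=True))
```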
Re: Cassandra on Red Hat 6.3
Nothing jumps out. Check /var/log/cassandra/output.log, that's where stdout and stderr are directed. Check file permissions. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 9:08 PM, amulya rattan talk2amu...@gmail.com wrote: I followed the step-by-step instructions for installing Cassandra on Red Hat Linux Server 6.3 from the datastax site, without much success. Apparently it installs fine but starting the cassandra service does nothing (no ports are bound so opscenter/cli doesn't work). When I check the service's status, it shows Cassandra dead but pid file exists. When I try launching Cassandra from /usr/sbin, it throws Error opening zip file or JAR manifest missing : /lib/jamm-0.2.5.jar and stops, so clearly that's why the service isn't running. While I investigate it further, I thought it'd be worthwhile to put this on the list and see if anybody else saw a similar issue. I must point out that this is a fresh machine with a fresh Cassandra installation so no conflicts with any previous installations are possible. So, anybody else came across something similar? ~Amulya
Re: NPE in running ClientOnlyExample
And you can never go wrong relying on the documentation for the python pycassa library; it has some handy tutorials for getting started. http://pycassa.github.com/pycassa/ cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 9:51 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: I hope you have already gone through this link https://github.com/zznate/hector-examples. If not I will suggest you go through it, and you can also refer to http://hector-client.github.com/hector/build/html/documentation.html. Best Regards, On Mon, Feb 18, 2013 at 12:15 AM, Jain Rahul ja...@ivycomptech.com wrote: Thanks Edward, My Bad. I was confused as it does seem to create the keyspace also, as I understand (although I'm not sure): List<CfDef> cfDefList = new ArrayList<CfDef>(); CfDef columnFamily = new CfDef(KEYSPACE, COLUMN_FAMILY); cfDefList.add(columnFamily); try { client.system_add_keyspace(new KsDef(KEYSPACE, "org.apache.cassandra.locator.SimpleStrategy", 1, cfDefList)); int magnitude = client.describe_ring(KEYSPACE).size(); Can I request you to please point me to some examples I can start with. I tried to look at some examples from hector but they seem to be in line with Cassandra's 1.1 version. Regards, Rahul -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: 17 February 2013 21:49 To: user@cassandra.apache.org Subject: Re: NPE in running ClientOnlyExample This is a bad example to follow. This is the internal client the Cassandra nodes use to talk to each other (fat client); usually you do not use this unless you want to write some embedded code on the Cassandra server. Typically clients use thrift/native transport. But you are likely getting the error you are seeing because the keyspace or column family is not created yet.
On Sat, Feb 16, 2013 at 11:41 PM, Jain Rahul ja...@ivycomptech.com wrote: Hi All, I am newbie to Cassandra and trying to run an example program ClientOnlyExample taken from https://raw.github.com/apache/cassandra/cassandra-1.2/examples/client_only/src/ClientOnlyExample.java. But while executing the program it gives me a null pointer exception. Can you guys please help me out what I am missing. I am using Cassandra 1.2.1 version. I have pasted the logs at http://pastebin.com/pmADWCYe Exception in thread main java.lang.NullPointerException at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:71) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:66) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:61) at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:56) at org.apache.cassandra.db.RowMutation.add(RowMutation.java:183) at org.apache.cassandra.db.RowMutation.add(RowMutation.java:204) at ClientOnlyExample.testWriting(ClientOnlyExample.java:78) at ClientOnlyExample.main(ClientOnlyExample.java:135) Regards, Rahul This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised. Any actions taken on the basis of this email are at the recipient's own risk.
-- Abhijit Chanda +91-974395
Re: cassandra vs. mongodb quick question
My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non-compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations, then go with it. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity: When using compression, does this affect this one way or another? Is 300G (compressed) SSTable size, or the total size of the data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, a JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, and the memory considerations for 100s of millions of rows. If the performance of those operations is acceptable to you, then go crazy.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:05 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So I found out mongodb varies their node size from 1T to 42T per node depending on the profile. So if I was going to be writing a lot but rarely changing rows, could I also use cassandra with a per node size of +20T or is that not advisable? Thanks, Dean
Re: Mutation dropped
Does the rpc_timeout not control the client timeout ? No. It is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client-requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? 
Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL = 1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
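The accounting Aaron describes — a request is called complete once CL replicas have acknowledged, whatever the remaining replicas do — can be sketched as a toy model (this is an illustration of the rule, not Cassandra's actual coordinator code):

```python
def request_succeeds(acks_received, consistency_level):
    """A coordinator calls a request complete once CL replicas have
    acknowledged, even if the remaining replicas later drop the
    mutation (which shows up as 'mutation dropped' on those nodes)."""
    return acks_received >= consistency_level

# RF=2, CL=1: the client sees success after one ack, while the second
# replica may still drop its copy under load.
print(request_succeeds(acks_received=1, consistency_level=1))  # True
print(request_succeeds(acks_received=1, consistency_level=2))  # False
```

This is why the YCSB client reports no failures even while nodes log dropped mutations: success is defined by CL, not by RF.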
Re: Testing compaction strategies on a single production server?
I *think* it will work. The steps in the blog post change the compaction strategy before RING_DELAY expires to ensure no sstables are created before the strategy is changed. But I think you will be venturing into uncharted territory where there might be dragons. And not the fun Disney kind. While it may be more work, I personally would use one node in write survey mode to test LCS. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 6:28 AM, Henrik Schröder skro...@gmail.com wrote: Well, that answer didn't really help. I know how to make a survey node, and I know how to simulate reads to it, it's just that that's a lot of work, and I wouldn't be sure that the simulated load is the same as the production load. We gather a lot of metrics from our production servers, so we know exactly how they perform over long periods of time. Changing a single server to run a different compaction strategy would allow us to know in detail how a different strategy would impact the cluster. So, is it possible to modify org.apache.cassandra.db.[keyspace].[column family].CompactionStrategyClass through jmx on a production server without any ill effects? Or is this only possible to do on a survey node while it is in a specific state? /Henrik On Tue, Feb 19, 2013 at 3:09 PM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: Just turn off dynamic snitch on the survey node and make read requests from it directly with CL.ONE, watch histograms, compare. Regarding switching compaction strategy there's a lot of info already. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063, Fax +370 5 261 0453 J. 
Jasinskio 16C, LT-01112 Vilnius, Lithuania From: Henrik Schröder [mailto:skro...@gmail.com] Sent: Tuesday, February 19, 2013 15:57 To: user Subject: Testing compaction strategies on a single production server? Hey, Version 1.1 of Cassandra introduced live traffic sampling, which allows you to measure the performance of a node without it really joining the cluster: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling That page mentions that you can change the compaction strategy through jmx if you want to test out a different strategy on your survey node. That's great, but it doesn't give you a complete view of how your performance would change, since you're not doing reads from the survey node. But what would happen if you used jmx to change the compaction strategy of a column family on a single *production* node? Would that be a safe way to test it out, or are there side-effects of doing that live? And if you do that, would running a major compaction transform the entire column family to the new format? Finally, if the test was a success, how do you proceed from there? Just change the schema? /Henrik
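For reference, the runtime switch Henrik asks about is done by setting the CompactionStrategyClass attribute on the column family's MBean. A hedged sketch using the jmxterm CLI — the jar path, host, keyspace and column family names are placeholders, and the MBean/attribute names should be verified against your Cassandra version before touching a production node:

```shell
# Placeholders throughout; verify the MBean and attribute names on your build.
java -jar jmxterm.jar -l localhost:7199 <<'EOF'
set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=MyKeyspace,columnfamily=MyCF \
    CompactionStrategyClass org.apache.cassandra.db.compaction.LeveledCompactionStrategy
EOF
```

Note this changes the in-memory strategy only; a change made through the schema is what persists across restarts.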
Re: Cassandra network latency tuning
I would like to understand how we can capture network latencies between a 1GbE and 10GbE for example. Cassandra reports two latencies. The CF latencies reported by nodetool cfstats, nodetool cfhistograms and the CF MBeans cover the local time it takes to read or write the data. This does not include any local wait times, network latency or coordinator overhead. The Storage Proxy latency from nodetool proxyhistograms and the StorageProxy MBean is the total latency for a request on a coordinator. Under load, with a consistent workload, the CF latency should not vary too much, while the request latency can increase as wait time becomes more of a factor. Additionally, streaming is throttled; you may want to increase the throttle, see the yaml file. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required. You may want to increase both concurrent_writes and concurrent_reads in the yaml file to take advantage of the extra IO. Same for the compaction settings; the comments in the yaml file will help. With SSD and 10GbE you can easily hold more data on each node. Typically we advise 300GB to 500GB per node with HDD and 1GbE, because of the time repair and node replacement take. With SSD and 10GbE those operations will take less time, so you can go higher. If you feel like being thorough, add repair and node replacement (all under load) to your test lineup. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:44 PM, Brandon Walsh brandon_9021...@yahoo.com wrote: I have a 5 node cluster and currently running ver 1.2. Prior to full scale deployment, I'm running some benchmarks using YCSB. From a hadoop cluster deployment we saw an excellent improvement using higher speed networks. 
However Cassandra does not include network latencies and I would like to understand how we can capture network latencies between a 1GbE and 10GbE for ex. As of now all the graphs look the same. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required.
Re: How to limit query results like from row 50 to 100
CQL does not support offset but does have limit. See http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas mateus.ffrei...@hotmail.com wrote: With CQL or an API.
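Since there is no OFFSET, the usual workaround is to page by key: fetch LIMIT n rows, remember the last key returned, and start the next query after it. A minimal sketch that only builds the CQL strings — the column family and the `KEY > …` predicate are illustrative, and with RandomPartitioner rows come back in token order, so key-range paging semantics depend on your partitioner:

```python
def page_query(column_family, last_key=None, page_size=50):
    """Build a CQL query for one page of rows; pass the last key of
    the previous page to fetch the next page. String-building only --
    KEY comparison semantics depend on the partitioner in use."""
    where = f" WHERE KEY > '{last_key}'" if last_key else ""
    return f"SELECT * FROM {column_family}{where} LIMIT {page_size}"

print(page_query("users"))         # first page of 50
print(page_query("users", "bob"))  # next page, starting after 'bob'
```

So "rows 50 to 100" is reached by walking pages, not by jumping to an offset.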
Re: Heap is N.N full. Immediately on startup
My first guess would be the bloom filter and index sampling from lots-o-rows Check the row count in cfstats Check the bloom filter size in cfstats. Background on memory requirements http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides row cache : not persisted and is at 0 keys when this warning is produced Memtables : no write traffic at startup, my app's column families are durable_writes:false Pending tasks : no pending tasks, except for 928 compactions ( not sure where those are coming from ) I drew these conclusions from the StatusLogger output below: INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool NameActive Pending Blocked INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0-1 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage0 0 0 INFO [ScheduledTasks:1] 2013-02-20 
05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingServicen/a 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 00 all org.apache.cassandra.cache.SerializingCacheProvider INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) ColumnFamilyMemtable ops,data INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_1.CF0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_2.CF 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) HiveMetaStore.MetaStore 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.NodeIdInfo 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.IndexInfo 0,0 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.LocationInfo 0,0 INFO
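To see why lots of rows translate into heap at startup, the textbook Bloom filter sizing formula gives a feel for the numbers. A back-of-envelope sketch — the 1% false-positive rate below is illustrative, not necessarily what this Cassandra version targets, and index samples add more on top:

```python
import math

def bloom_filter_bytes(n_keys, fp_rate):
    """Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_rate) / (math.log(2) ** 2)
    return int(math.ceil(bits / 8))

# e.g. 100 million rows at an illustrative 1% false-positive rate:
mb = bloom_filter_bytes(100_000_000, 0.01) / 1024 / 1024
print(f"~{mb:.0f} MB of bloom filter")  # ~114 MB, before index samples
```

Multiply by the number of column families and sstable generations and a large fraction of an 8GB heap being occupied at startup stops being surprising.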
Re: SSTable Num
Hi – I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair, and that can take a long time with a lot of data. Also you may need to replace a node one day, and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable ? Or will the sstable compaction continue and eventually we will have 1 file ? Thanks, Kanwar
Re: how to debug slowdowns from these log snippets-more info 2
Some things to consider: Check for contention around the switch lock. This can happen if you get a lot of tables flushing at the same time, or if you have a lot of secondary indexes. It shows up as a pattern in the logs: as soon as the writer starts flushing a memtable, another will be queued. Probably not happening here, but it can be a pain when a lot of memtables are flushed. I would turn on GC logging in cassandra-env.sh and watch that. After a full CMS collection, how full / empty is the tenured heap ? If it still has a lot in it then you are running with too much cache / bloom filter / index sampling. You can also experiment with the Max Tenuring Threshold; try turning it up to 4 to start with. The GC logs will show you how much data is at each tenuring level. You can then see how much data is being tenured, and whether premature tenuring was an issue. I've seen premature tenuring cause issues with wide rows / long reads. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Oh, and my startup command that cassandra logged was a2.bigde.nrel.gov: xss = -ea -javaagent:/opt/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8021M -Xmx8021M -Xmn1600M -XX:+HeapDumpOnOutOfMemoryError -Xss128k And I remember from the docs you don't want to go above 8G or java GC doesn't work out so well. I am not sure why this is not working out though. Dean On 2/20/13 7:16 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Here is the printout before that log, which is probably important as well…
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 3618 ms for 2 collections, 7038159096 used; max is 8243904512 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 57) Pool NameActive Pending Blocked INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 72) ReadStage11 264 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) RequestResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReadRepairStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) MutationStage1288 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) GossipStage 1 7 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) AntiEntropyStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MigrationStage0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) StreamStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) FlushWriter 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MiscStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) commitlog_archiver0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) InternalResponseStage 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) HintedHandoff 0 0 0 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 77) CompactionManager 4 5 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 89) MessagingServicen/a10,127 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 
StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 100) KeyCache1310719 1310719 all INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 106) RowCache 00 all org.apache.cassandra.cache.SerializingCacheProvider INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 113) ColumnFamilyMemtable ops,data INFO [ScheduledTasks:1] 2013-02-20 07:14:00,379 StatusLogger.java
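The GC-logging and tenuring suggestions above amount to a few JVM flags appended in cassandra-env.sh. A sketch assuming a HotSpot JVM — the flag names are the standard HotSpot ones, and MaxTenuringThreshold=4 is the starting point suggested above, not a recommendation for every workload:

```shell
# Append to cassandra-env.sh (HotSpot flags; verify against your JVM)
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"  # shows data per tenuring level
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"      # survive 4 young GCs before tenuring
echo "$JVM_OPTS"
```

With PrintTenuringDistribution on, the GC log shows how many bytes sit at each age, which is what lets you judge whether objects are being tenured prematurely.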
Re: Mutation dropped
What does rpc_timeout control? Only the reads/writes? Yes. like data stream, streaming_socket_timeout_in_ms in the yaml merkle tree request? Either no time out or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, plus the time it takes the coordinator to do its thing. You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is a reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No. It is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client-requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ? 
Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL =1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
Re: very confused by jmap dump of cassandra
Cannot comment too much on the jmap, but I can add my general "compaction is hurting" strategy. Try any or all of the following to get to a stable setup, then increase until things go bang. Set concurrent compactors to 2. Reduce compaction throughput by half. Reduce in_memory_compaction_limit. If you see compactions using a lot of sstables in the logs, reduce max_compaction_threshold. I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. More JVM memory is not the answer. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 7:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I took this jmap dump of cassandra (in production). Before I restarted the whole production cluster, I had some nodes running compaction and it looked like all memory had been consumed (kind of like cassandra is not clearing out the caches or memtables fast enough). I am still trying to debug why compaction causes slowness on the cluster since all cassandra.yaml files are pretty much the defaults with size tiered compaction. The weird thing is I dump and get a 5.4G heap.bin file and load that into Eclipse, which tells me the total is 142.8MB… what? So low, when top was showing 1.9G at the time (and I took this top snapshot later, 2 hours after)… (how is the Eclipse profiler telling me the jmap showed 142.8MB in use instead of 1.9G in use?) 
Tasks: 398 total, 1 running, 397 sleeping, 0 stopped, 0 zombie Cpu(s): 2.8%us, 0.5%sy, 0.0%ni, 96.5%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 32854680k total, 31910708k used, 943972k free, 89776k buffers Swap: 33554424k total, 18288k used, 33536136k free, 23428596k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20909 cassandr 20 0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java 22455 cassandr 20 0 15288 1340 824 R 3.9 0.0 0:00.02 top It almost seems like cassandra is not being good about memory management here, as we slowly get into a situation where compaction is run which takes out our memory (configured for 8G). I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. Has anyone else taken a jmap dump of cassandra? Thanks, Dean
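Aaron's four knobs mostly live in cassandra.yaml (max_compaction_threshold is set per column family in the schema). A sketch of the yaml side with illustrative values — halve or reduce from whatever your current settings are rather than copying these numbers, and check the setting names against your version's yaml comments:

```yaml
# cassandra.yaml -- illustrative conservative values, not defaults
concurrent_compactors: 2
compaction_throughput_mb_per_sec: 8    # half the usual 16
in_memory_compaction_limit_in_mb: 32   # down from the usual 64
```

The idea is to cap how much memory and IO compaction can claim at once, then raise the limits gradually once the cluster is stable.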
Re: cassandra vs. mongodb quick question(good additional info)
If you are lazy like me wolfram alpha can help http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbEa=UnitClash_*TB.*Tebibytes-- 10 hours 15 minutes 43.59 seconds Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote: you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link 19 lut 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote: I thought about this more, and even with a 10Gbit network, it would take 40 days to bring up a replacement node if mongodb did truly have 42T / node like I had heard. I wrote the below email to the person I heard this from going back to basics, which really puts some perspective on it… (and a lot of people don't even have a 10Gbit network like we do) Nodes are hooked up by a 10G network at most right now where that is 10 gigabit. We are talking about 10 Terabytes on disk per node recently. Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second (yes I could have divided by 8 in my head but eh… of course when I saw the number, I went duh) So trying to transfer 10 Terabytes or 10,000 Gigabytes to a node that we are bringing online to replace a dead node would take approximately 5 days??? This means no one else is using the bandwidth too ;). 10,000 Gigabytes * 1 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days. This is more likely 11 days if we only use 50% of the network. So bringing a new node up to speed is more like 11 days once it is crashed. I think this is the main reason the 1 Terabyte limit exists to begin with, right? From an ops perspective, this could sound like a nightmare scenario of waiting 10 days… maybe it is livable though. Either way, I thought it would be good to share the numbers. ALSO, that is assuming the bus with its 10 disks can keep up with 10G. Can it? 
What is the limit of throughput on a bus / second on the computers we have, as on wikipedia there is a huge variance? What is the rate of the disks too (multiplied by 10 of course)? Will they keep up with a 10G rate for bringing a new node online? This all comes into play even more so when you want to double the size of your cluster, of course, as all nodes have to transfer half of what they have to all the new nodes that come online (cassandra actually has a very data center/rack aware topology to transfer data correctly to not use up all bandwidth unnecessarily… I am not sure mongodb has that). Anyways, just food for thought. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Monday, February 18, 2013 1:39 PM To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no Subject: Re: cassandra vs. mongodb quick question My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations, then go with it. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity : When using compression, does this affect this one way or another? Is 300G (compressed) SSTable size, or total size of data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, the memory considerations for 100's of millions of rows. If you
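The wolfram result and Wojciech's figure are easy to reproduce; Dean's 5.55-day number appears to come from a unit slip in the conversion (1 hr / 60 secs where 1 min / 60 secs was meant). A quick sketch of the raw wire-speed arithmetic — real transfers will be slower because of protocol overhead, disk throughput, and shared bandwidth:

```python
def transfer_hours(data_tb, link_gbit, efficiency=1.0):
    """Raw transfer time for data_tb terabytes over a link_gbit link.
    Uses decimal units (1 TB = 1000 GB; 10 Gbit/s = 1.25 GB/s)."""
    gbytes_per_sec = link_gbit / 8 * efficiency
    return data_tb * 1000 / gbytes_per_sec / 3600

print(f"{transfer_hours(42, 10):.1f} h")       # ~9.3 h at full wire speed
print(f"{transfer_hours(10, 10, 0.5):.1f} h")  # 10 TB at 50% utilisation
```

So 10 TB over 10GbE is hours, not days; the days-long figures in practice come from the bottlenecks the thread goes on to list, not the wire.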
Re: key cache size
This is the key cache entry https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cache/KeyCacheKey.java Note that the Descriptor is re-used. If you want to see key cache metrics, including bytes used, use nodetool info. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 3:45 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – What is the approximate overhead of the key cache ? Say each key is 50 bytes. What would be the overhead for this key in the key cache ? Thanks, Kanwar
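As a very rough way to reason about the question: each entry carries the key bytes plus a fixed chunk of JVM overhead (object headers, references, the map entry, and the cached value). The fixed figure below is an assumption for illustration, not measured from Cassandra — use nodetool info for real numbers on your cluster:

```python
def key_cache_entry_bytes(key_len, fixed_overhead=112):
    """Rough heap cost of one key cache entry: the key bytes plus an
    ASSUMED ~112 bytes for object headers, references, the map entry
    and the cached position. Illustrative only -- measure, don't trust."""
    return key_len + fixed_overhead

print(key_cache_entry_bytes(50))  # 162
```

On that assumption a 50-byte key costs on the order of 150-200 bytes cached, so overhead dominates the key itself for short keys.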
Re: Read IO
AFAIK this is still roughly correct http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ It includes information on the page size read from disk. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 5:45 AM, Jouni Hartikainen jouni.hartikai...@reaktor.fi wrote: Hi, On Feb 21, 2013, at 7:52 , Kanwar Sangha kan...@mavenir.com wrote: Hi – Can someone explain the worst case IOPS for a read ? No key cache, no row cache, sampling rate say 512. 1) Bloom filter will be checked to see existence of key (in RAM) 2) Index file sample (in RAM) will be checked to find approx. location in index file on disk 3) 1 IOP to read the actual index file on disk (DISK) 4) 1 IOP to get the data from the location in the sstable (DISK) Is this correct ? As you were asking for the worst case, I would still add one step that would be a seek inside an SSTable from the row start to the queried columns using the column index. However, this applies only if you are querying a subset of columns in the row (not all) and the total row size exceeds column_index_size_in_kb (defaults to 64kB). So, as far as I have understood, the worst case steps (without any caches) are: 1. Check the SSTable bloom filters (in memory) 2. Use index samples to find approx. correct place in the key index file (in memory) 3. Read the key index file until the correct key is found (1st disk seek) 4. Seek to the start of the row in the SSTable file and read the row headers, possibly including the column index (2nd disk seek) 5. Using the column index, seek to the correct place inside the SSTable file to actually read the columns (3rd disk seek) If the row is very wide and you are asking for a random bunch of columns from here and there, the last step might even be needed multiple times. Also, if your row has spread over many SSTables, each of them needs to be accessed (at least up to reading the row headers) to get the complete results for the query. 
All this in mind, if your node has any reasonable amount of reads, I'd say that in practice key index files will be page cached by the OS very quickly and thus a normal read would end up being either one seek (for small rows without the column index) or two (for wider rows). Of course, as Peter already pointed out, the more columns you ask for, the more the disk needs to read. For a continuous set of columns the read should be linear, however. -Jouni
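The worst-case read path Jouni describes can be sketched as a back-of-the-envelope seek counter. This is an illustrative model only; the function name and parameters are mine, not from the Cassandra source:

```python
# Rough model of the worst-case disk seeks for a single cold-cache read,
# following the steps described above. Illustrative only.

def worst_case_seeks(sstables_with_row, row_wider_than_column_index=False,
                     column_index_hits=1):
    """Estimate disk seeks for one read with cold caches.

    Per SSTable holding the row:
      1 seek to scan the key index file,
      1 seek to the row start (headers + column index),
      plus, for wide rows, 1 seek per distinct column-index page touched.
    """
    seeks_per_sstable = 2  # key index file read + row header read
    if row_wider_than_column_index:
        seeks_per_sstable += column_index_hits
    return sstables_with_row * seeks_per_sstable

# A small row in one SSTable: index read + data read.
print(worst_case_seeks(1))                                   # 2
# A wide row spread over 3 SSTables, touching 2 index pages in each.
print(worst_case_seeks(3, row_wider_than_column_index=True,
                       column_index_hits=2))                 # 12
```

Once the key index files are page cached, the first seek per SSTable disappears, which matches Jouni's one-or-two-seek estimate for a warm node.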
Re: SSTable Num
Ok. So for 10 TB, I could have at least 4 SStables files each of 2.5 TB ? You will have many sstables, in your case 32. Each bucket of files (files that are within 50% of the average size of files in a bucket) will contain 3 or fewer files. This article provides some background, but it's working correctly as you have described it http://www.datastax.com/dev/blog/when-to-use-leveled-compaction Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 6:39 AM, Kanwar Sangha kan...@mavenir.com wrote: No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Ok. So for 10 TB, I could have at least 4 SStables files each of 2.5 TB ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 21 February 2013 11:01 To: user@cassandra.apache.org Subject: Re: SSTable Num Hi – I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair and that can take a long time with a lot of data. Also you may need to replace a node one day and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable ? Or will the sstable compaction continue and eventually we will have 1 file ? Thanks, Kanwar
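The size-tiered bucketing rule Aaron describes (files within 50% of a bucket's average size share a bucket, and a bucket is only compacted once min_threshold files, default 4, accumulate) can be sketched roughly like this. It is a simplification, not the actual compaction code:

```python
# Illustrative sketch of size-tiered bucketing as described above.
# Files within 50% of a bucket's average size join that bucket; a bucket
# becomes a compaction candidate once it holds min_threshold (default 4)
# files. Simplified, not the real Cassandra implementation.

def bucket_sstables(sizes):
    buckets = []  # each bucket is a list of sstable sizes
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 1.5 * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def compaction_candidates(sizes, min_threshold=4):
    return [b for b in bucket_sstables(sizes) if len(b) >= min_threshold]

# 32 similarly sized files: one big bucket, eligible for compaction.
print(len(compaction_candidates([100] * 32)))  # 1
# Only 3 similar files: below the threshold, nothing to compact.
print(len(compaction_candidates([100] * 3)))   # 0
```

This is why a node can sit at 32 sstables with no compaction running: each bucket holds fewer than 4 similar-sized files, so nothing is eligible.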
Re: Heap is N.N full. Immediately on startup
To get a good idea of how GC is performing turn on the GC logging in cassandra-env.sh. After a full cms GC event, see how big the tenured heap is. If it's not reducing enough then GC will never get far enough ahead. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 8:37 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Thank you- indeed my index interval is 64 with a CF of 300M rows + bloom filter false positive chance was default. Raising the index interval to 512 didn't fix this alone, so I guess I'll have to set the bloom filter to some reasonable value and scrub. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org user@cassandra.apache.org Date: Thursday 21 February 2013 17:58 To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: Heap is N.N full. Immediately on startup My first guess would be the bloom filter and index sampling from lots-o-rows Check the row count in cfstats Check the bloom filter size in cfstats. 
Background on memory requirements http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides:
- row cache : not persisted and is at 0 keys when this warning is produced
- Memtables : no write traffic at startup, my app's column families are durable_writes:false
- Pending tasks : no pending tasks, except for 928 compactions ( not sure where those are coming from )
I drew these conclusions from the StatusLogger output below:
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name Active Pending Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0 -1 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingService n/a 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 0 0
Re: Mutation dropped
If you are running repair, using QUORUM, and there are no dropped writes you should not be getting DigestMismatch during reads. If everything else looks good, but the request latency is higher than the CF latency, I would check that client load is evenly distributed. Then start looking to see if the request throughput is at its maximum for the cluster. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 8:15 PM, Wei Zhu wz1...@yahoo.com wrote: Thanks Aaron for the great information as always. I just checked cfhistograms and only a handful of read latencies are bigger than 100ms, but for proxyhistograms there are 10 times as many greater than 100ms. We are using QUORUM for reading with RF=3, and I understand the coordinator needs to get the digest from the other nodes and read repair on a mismatch etc. But is it normal to see the latency from proxyhistograms go beyond 100ms? Is there any way to improve that? We are tracking the metrics from the client side and we see the 95th percentile response time averages at 40ms which is a bit high. Our 50th percentile was great, under 3ms. Any suggestion is very much appreciated. Thanks. -Wei - Original Message - From: aaron morton aa...@thelastpickle.com To: Cassandra User user@cassandra.apache.org Sent: Thursday, February 21, 2013 9:20:49 AM Subject: Re: Mutation dropped What does rpc_timeout control? Only the reads/writes? Yes. Like data streams? streaming_socket_timeout_in_ms in the yaml. Merkle tree requests? Either no timeout or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, and the time it takes the coordinator to do its thing.
You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is the reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No, it is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster.
The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL =1. Load balanced between the two
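The rpc_timeout semantics discussed in this thread can be modelled as a toy coordinator: a request is a client-visible success once CL replicas respond within rpc_timeout, even if the remaining replicas drop the mutation. Names and numbers here are illustrative, not from the Cassandra source:

```python
# Toy model of the coordinator behaviour described above: a write
# succeeds once CL replicas acknowledge within rpc_timeout; slower
# replicas count as dropped mutations but do not fail the request.

def coordinate_write(replica_latencies_ms, cl, rpc_timeout_ms=10000):
    acked = [l for l in replica_latencies_ms if l <= rpc_timeout_ms]
    success = len(acked) >= cl
    dropped = len(replica_latencies_ms) - len(acked)
    return success, dropped

# RF=3, QUORUM (CL=2): two fast replicas ack, one overloaded replica
# exceeds rpc_timeout -> request succeeds, one mutation dropped.
print(coordinate_write([2, 5, 20000], cl=2))                    # (True, 1)
# With rpc_timeout set unrealistically low, everything times out.
print(coordinate_write([2, 5, 20000], cl=2, rpc_timeout_ms=1))  # (False, 3)
```

This is why dropped mutations and successful client requests can coexist: the client only waits for CL acknowledgements, not all RF of them.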
Re: Adding new nodes in a cluster with virtual nodes
So, it looks like the repair is required if we want to add new nodes in our platform, but I don't understand why. Bootstrapping should take care of it. But new seed nodes do not bootstrap. Check the logs on the nodes you added to see what messages have bootstrap in them. Anytime you are worried about things like this throw in a nodetool repair. If you are using QUORUM for reads and writes you will still be getting consistent data, so long as you have only added one node. Or one node every RF'th node. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 9:55 PM, Jean-Armel Luce jaluc...@gmail.com wrote: Hi Aaron, Thanks for your answer. I apologize, I made a mistake in my 1st mail. The cluster was only 12 nodes instead of 16 (it is a test cluster). There are 2 datacenters b1 and s1. Here is the result of nodetool status after adding a new node in the 1st datacenter (dc s1):
root@node007:~# nodetool status
Datacenter: b1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.234.72.135 10.71 GB 256 44.6% 2fc583b2-822f-4347-9fab-5e9d10d548c9 c01
UN 10.234.72.134 16.74 GB 256 63.7% f209a8c5-7e1b-45b5-aa80-ed679bbbdbd1 e01
UN 10.234.72.139 17.09 GB 256 62.0% 95661392-ccd8-4592-a76f-1c99f7cdf23a e07
UN 10.234.72.138 10.96 GB 256 42.9% 0d6725f0-1357-423d-85c1-153fb94257d5 e03
UN 10.234.72.137 11.09 GB 256 45.7% 492190d7-3055-4167-8699-9c6560e28164 e03
UN 10.234.72.136 11.91 GB 256 41.1% 3872f26c-5f2d-4fb3-9f5c-08b4c7762466 c01
Datacenter: s1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.98.255.139 16.94 GB 256 43.8% 3523e80c-8468-4502-b334-79eabc3357f0 g10
UN 10.98.255.138 12.62 GB 256 42.4% a2bcddf1-393e-453b-9d4f-9f7111c01d7f i02
UN 10.98.255.137 10.59 GB 256 38.4% f851b6ee-f1e4-431b-8beb-e7b173a77342 i02
UN 10.98.255.136 11.89 GB 256 42.9% 36fe902f-3fb1-4b6d-9e2c-71e601fa0f2e a09
UN 10.98.255.135 10.29 GB 256 40.4% e2d020a5-97a9-48d4-870c-d10b59858763 a09
UN 10.98.255.134 16.19 GB 256 52.3% 73e3376a-5a9f-4b8a-a119-c87ae1fafdcb h06
UN 10.98.255.140 127.84 KB 256 39.9% 3d5c33e6-35d0-40a0-b60d-2696fd5cbf72 g10
We can see that the new node (10.98.255.140) contains only 127.84 KB. We saw also that there was no network traffic between the nodes. Then we added a new node in the 2nd datacenter (dc b1):
root@node007:~# nodetool status
Datacenter: b1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.234.72.135 12.95 GB 256 42.0% 2fc583b2-822f-4347-9fab-5e9d10d548c9 c01
UN 10.234.72.134 20.11 GB 256 53.1% f209a8c5-7e1b-45b5-aa80-ed679bbbdbd1 e01
UN 10.234.72.140 122.25 KB 256 41.9% 501ea498-8fed-4cc8-a23a-c99492bc4f26 e07
UN 10.234.72.139 20.46 GB 256 40.2% 95661392-ccd8-4592-a76f-1c99f7cdf23a e07
UN 10.234.72.138 13.21 GB 256 40.9% 0d6725f0-1357-423d-85c1-153fb94257d5 e03
UN 10.234.72.137 13.34 GB 256 42.9% 492190d7-3055-4167-8699-9c6560e28164 e03
UN 10.234.72.136 14.16 GB 256 39.0% 3872f26c-5f2d-4fb3-9f5c-08b4c7762466 c01
Datacenter: s1
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.98.255.139 19.19 GB 256 43.8% 3523e80c-8468-4502-b334-79eabc3357f0 g10
UN 10.98.255.138 14.9 GB 256 42.4% a2bcddf1-393e-453b-9d4f-9f7111c01d7f i02
UN 10.98.255.137 12.49 GB 256 38.4% f851b6ee-f1e4-431b-8beb-e7b173a77342 i02
UN 10.98.255.136 14.13 GB 256 42.9% 36fe902f-3fb1-4b6d-9e2c-71e601fa0f2e a09
UN 10.98.255.135 12.16 GB 256 40.4% e2d020a5-97a9-48d4-870c-d10b59858763 a09
UN 10.98.255.134 18.85 GB 256 52.3% 73e3376a-5a9f-4b8a-a119-c87ae1fafdcb h06
UN 10.98.255.140 2.24 GB 256 39.9% 3d5c33e6-35d0-40a0-b60d-2696fd5cbf72 g10
We can see that the 2nd new node (10.234.72.140) contains only 122.25 KB. The new node in the 1st datacenter contains now 2.24 GB because we
Re: operations progress on DBA operations?
nodetool compactionstats Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 3:44 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I am used to systems running a first phase calculating how many files it will need to go through and then logging out the percent done, or X files out of total files done. I ran this command and it is logging nothing: nodetool upgradesstables databus5 nreldata; I have 130Gigs of data on my node and not all of it in that one column family above. How can I tell how far it is in its process? It has been running for about 10 minutes already. I don't see anything in the log files either. Thanks, Dean
Re: ReverseIndexExample
We are trying to answer client library specific questions on the client-dev list, see the link at the bottom here http://cassandra.apache.org/ If you can ask a more specific question I'll answer it there. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 3:44 AM, Everton Lima peitin.inu...@gmail.com wrote: Hello, Has anyone already used ReverseIndexQuery from Astyanax? I was trying to understand it, but I executed the example from the Astyanax site and could not understand it. Can someone help me, please? Thanks -- Everton Lima Aleixo Master's student in Computer Science at UFG Programmer at LUPA
Re: disabling bloomfilter not working? or did I do this wrong?
Bloom Filter Space Used: 2318392048 Just to be sane do a quick check of the -Filter.db files on disk for this CF. If they are very small try a restart on the node. Number of Keys (estimate): 1249133696 Hey a billion rows on a node, what an age we live in :) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So in the cli, I ran update column family nreldata with bloom_filter_fp_chance=1.0; Then I ran nodetool upgradesstables databus5 nreldata; But my bloom filter size is still around 2 gig (and I want to free up this heap) According to the nodetool cfstats command… Column Family: nreldata SSTable count: 10 Space used (live): 96841497731 Space used (total): 96841497731 Number of Keys (estimate): 1249133696 Memtable Columns Count: 7066 Memtable Data Size: 4286174 Memtable Switch Count: 924 Read Count: 19087150 Read Latency: 0.595 ms. Write Count: 21281994 Write Latency: 0.013 ms. Pending Tasks: 0 Bloom Filter False Positives: 974393 Bloom Filter False Ratio: 0.8 Bloom Filter Space Used: 2318392048 Compacted row minimum size: 73 Compacted row maximum size: 446 Compacted row mean size: 143
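As a sanity check on the numbers above, the standard Bloom filter sizing formula (bits = -n ln p / (ln 2)^2) applied to the ~1.25 billion keys from cfstats lands close to the reported 2318392048 bytes when using a false positive chance around the old pre-1.2 default (~0.000744). Cassandra's filter differs in implementation detail, so treat this as an estimate only:

```python
import math

# Standard Bloom filter sizing: bits = -n * ln(p) / (ln 2)^2.
# An estimate, not Cassandra's exact filter layout.

def bloom_filter_bytes(num_keys, fp_chance):
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return int(bits / 8)

n = 1_249_133_696  # "Number of Keys (estimate)" from cfstats above

# With an fp chance around the old default (~0.000744, i.e. ~15 bits
# per key) the estimate lines up with the reported ~2.3 GB.
est = bloom_filter_bytes(n, 0.000744)
print(est)  # roughly 2.3e9 bytes
```

The same formula shows why raising bloom_filter_fp_chance shrinks the filter: fewer bits per key are needed as the acceptable false positive rate grows.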
Re: How wide rows are structured in CQL3
Does this effectively create the same storage structure? Yes. SELECT Value FROM X WHERE RowKey = 'RowKey1' AND TimeStamp BETWEEN 100 AND 1000; select value from X where RowKey = 'foo' and timestamp >= 100 and timestamp <= 1000; I also don't understand some of the things like WITH COMPACT STORAGE and CLUSTERING. Some info here, does not cover compact storage http://thelastpickle.com/2013/01/11/primary-keys-in-cql/ Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:36 AM, Boris Solovyov boris.solov...@gmail.com wrote: Hi, My impression from reading docs is that in old versions of Cassandra, you could create very wide rows, say with timestamps as column names for time series data, and read an ordered slice of the row. So:
RowKey    Columns
RowKey1   1:val1 2:val2 3:val3 ... N:valN
With this data I think you could say get RowKey1, cols 100 to 1000 and get a slice of values. (I have no experience with this, just from reading about it.) In CQL3 it looks like this is kind of normalized so I would have CREATE TABLE X ( RowKey text, TimeStamp int, Value text, PRIMARY KEY(RowKey, TimeStamp) ); Does this effectively create the same storage structure? Now, in CQL3, it looks like I should access it like this, SELECT Value FROM X WHERE RowKey = 'RowKey1' AND TimeStamp BETWEEN 100 AND 1000; Does this do the same thing? I also don't understand some of the things like WITH COMPACT STORAGE and CLUSTERING. I'm having a hard time figuring out how this maps to the underlying storage. It is a little more abstract. I feel like the new CQL stuff isn't really explained clearly to me -- is it just a query language that accesses the same underlying structures, or is Cassandra's storage and access model fundamentally different now?
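The mapping Boris asks about can be illustrated with a toy model: one wide row per partition key, cells kept sorted by the clustering column, so a timestamp range predicate becomes one contiguous slice. This is a sketch of the storage idea only, not Cassandra's actual cell encoding:

```python
import bisect

# Illustrative model of the wide-row layout discussed above: cells are
# kept sorted by the clustering column (TimeStamp), so a range predicate
# is answered by one contiguous slice of the row.

class WideRow:
    def __init__(self):
        self.cells = []  # sorted list of (timestamp, value)

    def insert(self, ts, value):
        bisect.insort(self.cells, (ts, value))

    def slice(self, lo, hi):
        """All values with lo <= timestamp <= hi, in clustering order."""
        i = bisect.bisect_left(self.cells, (lo,))
        j = bisect.bisect_right(self.cells, (hi, chr(0x10FFFF)))
        return [v for _, v in self.cells[i:j]]

row = WideRow()
for ts in (50, 100, 200, 1000, 1500):
    row.insert(ts, f"val{ts}")

# Equivalent of: SELECT Value FROM X WHERE RowKey = 'RowKey1'
#                AND TimeStamp >= 100 AND TimeStamp <= 1000
print(row.slice(100, 1000))  # ['val100', 'val200', 'val1000']
```

This is the sense in which CQL3 is "just a query language over the same structures": the clustering column becomes the sorted cell name inside one wide row, and the range query is the old column slice.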
Re: Q on schema migrations
dropped this secondary index after a while. I assume you used UPDATE COLUMN FAMILY in the CLI. How can I avoid this secondary index building on node join? Check the schema using show schema in the cli. Check that all nodes in the cluster have the same schema, using describe cluster in the cli. If they are in disagreement see this http://wiki.apache.org/cassandra/FAQ#schema_disagreement Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 5:17 AM, Igor i...@4friends.od.ua wrote: Hello Cassandra 1.0.7 Some time ago we used a secondary index on one of our CFs. Due to performance reasons we dropped this secondary index after a while. But now, each time I add and bootstrap a new node I see cassandra again build this secondary index on this node (which takes a huge time), and when the index is built it is not used anymore, so I can safely delete the files from disk. How can I avoid this secondary index building on node join? Thanks for your answers!
Re: is there a way to drain node(and prevent reads) and upgrade sstables offline?
To stop all writes and reads disable thrift and gossip via nodetool. This will not stop any in-progress repair sessions nor disconnect fat clients if you have them. There are also cmd line args cassandra.start_rpc and cassandra.join_ring which do the same thing. You can also change the compaction throughput using nodetool. Regarding "multithreaded_compaction = true temporarily": unless you have SSD, leave this guy alone. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 6:04 AM, Michael Kjellman mkjell...@barracuda.com wrote: Couldn't you just disable thrift and leave gossip active? On 2/22/13 9:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We would like to take a node out of the ring and upgradesstables while it is not doing any writes nor reads with the ring. Is this possible? I am thinking from the documentation: 1. nodetool drain 2. ANYTHING to stop reads here 3. Modify cassandra.yaml with compaction_throughput_mb_per_sec = 0 and multithreaded_compaction = true temporarily 4. Restart cassandra and run nodetool upgradesstables keyspace CF 5. Modify cassandra.yaml to revert changes 6. Restart cassandra to join the cluster again. Is this how it should be done? Thanks, Dean
Re: Size Tiered - Leveled Compaction
If you did not use LCS until after the upgrade to 1.1.9 I think you are ok. If in doubt, the steps here look like they helped https://issues.apache.org/jira/browse/CASSANDRA-4644?focusedCommentId=13456137&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13456137 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 6:56 AM, Mike mthero...@yahoo.com wrote: Hello, Still doing research before we potentially move one of our column families from Size Tiered to Leveled compaction this weekend. I was doing some research around some of the bugs that were filed against leveled compaction in Cassandra and I found this: https://issues.apache.org/jira/browse/CASSANDRA-4644 The bug mentions: You need to run the offline scrub (bin/sstablescrub) to fix the sstable overlapping problem from early 1.1 releases. (Running with -m to just check for overlaps between sstables should be fine, since you already scrubbed online which will catch out-of-order within an sstable.) We recently upgraded from 1.1.2 to 1.1.9. Does anyone know if an offline scrub is recommended when switching from STCS to LCS after upgrading from 1.1.2? Any insight would be appreciated, Thanks, -Mike On 2/17/2013 8:57 PM, Wei Zhu wrote: We doubled the SSTable size to 10M. It still generates a lot of SSTables and we don't see much difference in the read latency. We are able to finish the compactions after repair within several hours. We will increase the SSTable size again if we feel the number of SSTables hurts the performance. - Original Message - From: Mike mthero...@yahoo.com To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered - Leveled Compaction Hello Wei, First, thanks for this response. Out of curiosity, what SSTable size did you choose for your use case, and what made you decide on that number?
Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithreaded compaction. We even tried that, it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That is more than 150K files in one directory. (200G/5M = 40K SSTables and each SSTable creates 4 files on disk.) You might want to watch that and decide the SSTable size. By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among the different levels. -Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help.
Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered - Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple
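Wei's file-count arithmetic from earlier in the thread is easy to check (200G of data at LCS's old 5MB default sstable size, with ~4 files per sstable on disk):

```python
# Quick check of the file-count arithmetic quoted above: a 200 GB column
# family at a given sstable size, with ~4 on-disk files per sstable.

def lcs_file_count(data_gb, sstable_mb=5, files_per_sstable=4):
    sstables = (data_gb * 1024) // sstable_mb
    return sstables * files_per_sstable

print(lcs_file_count(200))                  # 163840 files at 5 MB sstables
print(lcs_file_count(200, sstable_mb=10))   # 81920, halved by doubling size
```

Which matches the "more than 150K files in one directory" observation, and shows why doubling the sstable size (as Wei did) directly halves the file count.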
Re: Bulk Loading-Unable to select from CQL3 tables with NO COMPACT STORAGE option after Bulk Loading - Cassandra version 1.2.1
CQL 3 tables that do not use compact storage use Composite Types, which other code may not be expecting. Take a look at the CQL 3 table definitions through cassandra-cli and you may see the changes you need to make when creating the SSTables. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 26/02/2013, at 3:44 AM, praveen.akun...@wipro.com wrote: Hi All, I am using the bulk loader program provided on the Datastax website. http://www.datastax.com/dev/blog/bulk-loading I am able to load data into tables created with the COMPACT STORAGE option and also into tables created without this option. However, I am unable to read data from the table created without COMPACT STORAGE. I created 2 tables as below: CREATE TABLE TABLE1( field1 text PRIMARY KEY, field2 text, field3 text, field4 text ) WITH COMPACT STORAGE; CREATE TABLE TABLE2( field1 text PRIMARY KEY, field2 text, field3 text, field4 text ); Now, I loaded these 2 tables using the Java bulk loader program (create SSTables and load them using the SSTableloader utility). I can read the data from TABLE1, but when I try to read data from TABLE2, I am getting a timeout from both cqlsh and the cli. Is this expected behavior, or am I doing something wrong? Can anyone please help. Thanks Best Regards, Praveen
Re: disabling bloomfilter not working? memory numbers don't add up?
1. Can I stop the node, delete the *Filter.db files and restart the node (is this safe)??? No. 2. Why do I have 5 gig being eaten up by cassandra? nodetool info memory 5.2 gig, key cache: 11 meg and row cache 0 bytes. All bloom filters are also small, 1 meg. If this is the heap memory reported by the JVM, then all you can say is that since the server was started it has allocated at least 5.2 GB of memory; it's not that there is 5.2 GB of live memory in use. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 9:32 AM, Hiller, Dean dean.hil...@nrel.gov wrote: H, my upgrade completed and then I added the node back in and ran my repair. What is weird is that my nreldata column family still shows 156 meg of memory in use (down from 2 gig though!!) and a false positive ratio of .99576 when I have the filter completely disabled (i.e. set to 1.0). I see the *Filter.db files on disk (and their size approximately matches the in-memory size). I tried restarting the node as well. 1. Can I stop the node, delete the *Filter.db files and restart the node (is this safe)??? 2. Why do I have 5 gig being eaten up by cassandra? nodetool info memory 5.2 gig, key cache: 11 meg and row cache 0 bytes. All bloom filters are also small, 1 meg. The exception to #2 is I have nreldata still using 156MB for some reason, but still nowhere close to the 5.2 gig that nodetool shows in use. Thanks, Dean Bloom Filter Space Used: 2318392048 Just to be sane do a quick check of the -Filter.db files on disk for this CF. If they are very small try a restart on the node.
Number of Keys (estimate): 1249133696 Hey a billion rows on a node, what an age we live in :) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 23/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So in the cli, I ran update column family nreldata with bloom_filter_fp_chance=1.0; Then I ran nodetool upgradesstables databus5 nreldata; But my bloom filter size is still around 2 gig (and I want to free up this heap) According to the nodetool cfstats command… Column Family: nreldata SSTable count: 10 Space used (live): 96841497731 Space used (total): 96841497731 Number of Keys (estimate): 1249133696 Memtable Columns Count: 7066 Memtable Data Size: 4286174 Memtable Switch Count: 924 Read Count: 19087150 Read Latency: 0.595 ms. Write Count: 21281994 Write Latency: 0.013 ms. Pending Tasks: 0 Bloom Filter False Positives: 974393 Bloom Filter False Ratio: 0.8 Bloom Filter Space Used: 2318392048 Compacted row minimum size: 73 Compacted row maximum size: 446 Compacted row mean size: 143
Re: Retrieving local data
Take a look at the token function with the select statement http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:06 AM, Everton Lima peitin.inu...@gmail.com wrote: Hi people, I need to retrieve some data from a local machine that is running Cassandra. I start the Cassandra daemon with my Java process, so now I need to execute a CQL query, but only against data that is stored on that machine. Is that possible? How? Thanks -- Everton Lima Aleixo BSc in Computer Science from UFG, MSc student in Computer Science at UFG, programmer at LUPA
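For readers finding this thread later: the token() restriction Aaron mentions can be used to scan only the rows a node owns, once you know the node's token ranges. A minimal sketch that just builds the CQL statements (the table and key column names are made up):

```python
def local_token_queries(table, ranges):
    """Build CQL statements restricting a scan to the given token
    ranges (list of (start, end) pairs, exclusive start, inclusive
    end). A wrapping range is split at the ends of the ring."""
    stmts = []
    for start, end in ranges:
        if start < end:
            stmts.append(f"SELECT * FROM {table} "
                         f"WHERE token(key) > {start} AND token(key) <= {end}")
        else:
            # range wraps around the ring: issue two statements
            stmts.append(f"SELECT * FROM {table} WHERE token(key) > {start}")
            stmts.append(f"SELECT * FROM {table} WHERE token(key) <= {end}")
    return stmts

for s in local_token_queries("users", [(0, 100)]):
    print(s)
```

The follow-up in this thread notes the same idea works from Astyanax on 1.1.x by fetching the ring's token ranges from the client.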
Re: cluster with cross data center and local
I assume my only options are to create another cluster or to create another keyspace using LocalStrategy strategy? You do need another key space, but you can still use the NetworkTopologyStrategy. Just set the strategy options to be dc1: 2 and dc2: 0. (check the docs for CLI and CQL for exact strategy options). What's the difference between LocalStrategy and SimpleStrategy? LocalStrategy is used by System keyspaces and secondary indexes to store data on a local node only. You do not want that. IMHO better to use NetworkTopologyStrategy as above than simple. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:41 AM, Keith Wright kwri...@nanigans.com wrote: Hi all, I have a cluster with 2 data centers with an RF 2 keyspace using network topology on 1.1.10. I would like to configure it such that some of the data is not cross data center replicated but is replicated between the nodes of the local data center. I assume my only options are to create another cluster or to create another keyspace using LocalStrategy strategy? What's the difference between LocalStrategy and SimpleStrategy? Thanks!
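The dc1: 2 / dc2: 0 suggestion can be sanity-checked with a toy view of how NetworkTopologyStrategy assigns per-DC replica counts (the option names here are illustrative, not an exact API):

```python
def placement(strategy_options):
    """Toy model of NetworkTopologyStrategy options: each DC receives
    the replica count named in the options; a DC set to 0 gets no
    copies of the keyspace's data."""
    per_dc = {dc: int(rf) for dc, rf in strategy_options.items()}
    receiving = [dc for dc, rf in per_dc.items() if rf > 0]
    return receiving, sum(per_dc.values())

# Keyspace kept local to dc1 while still using NetworkTopologyStrategy:
dcs, total_rf = placement({"dc1": 2, "dc2": 0})
print(dcs, total_rf)
```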
Re: please explain read path when key not in database
This is my understanding from using cassandra for probably around 2 years Sounds about right. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 26/02/2013, at 7:43 AM, Hiller, Dean dean.hil...@nrel.gov wrote: This is my understanding from using cassandra for probably around 2 years….(though I still make mistakes sometimes)…. For CL.ONE read Depending on the client, the client may go through one of its known nodes (co-ordinating node) which goes to the real node (clients like astyanax/hector read in the ring information and usually go direct, so for CL_ONE no co-ordination is really needed). The node it finally gets to may not have the data yet and will return no row while the other 2 nodes might have data. For CL.QUORUM read and RF=3 Client goes to the node with data (again depending on client) and that node sends off a request to one of the other 2. Let's say A does not have the row yet, but B has the row: the results are compared, latest wins, and a repair for that row is kicked off to get all nodes in sync for that row. If the local node responsible for the key replied that it has no data for this key, will the coordinator send digest commands? It looks like CL_ONE does trigger a read repair according to this doc (found googling CL_ONE read repair cassandra) http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CL-ONE-reads-RR-badness-threshold-interaction-td6247418.html http://wiki.apache.org/cassandra/ReadRepair Later, Dean Please explain how this works when I request a key which is not in the database * The closest node (as determined by proximity sorting as described above) will be sent a command to perform an actual data read (i.e., return data to the co-ordinating node). * As required by consistency level, additional nodes may be sent digest commands, asking them to perform the read locally but send back the digest only. 
* For example, at replication factor 3 a read at consistency level QUORUM would require one digest read in addition to the data read sent to the closest node. (See ReadCallback http://wiki.apache.org/cassandra/ReadCallback, instantiated by StorageProxy http://wiki.apache.org/cassandra/StorageProxy) I have multi-DC with NetworkTopologyStrategy and RF:1 per datacenter, and reads are at consistency level ONE. If the local node responsible for the key replied that it has no data for this key, will the coordinator send digest commands? Thanks!
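The data-read-plus-digest-read flow described above can be sketched as a toy coordinator. The function and structure names are invented for illustration; this is not Cassandra's API, just the shape of the comparison:

```python
import hashlib

def read_with_digests(replicas, consistency):
    """Toy coordinator read: ask the closest replica for data, the
    next (consistency - 1) replicas for digests, and flag a mismatch
    that would force a full read plus read repair."""
    nodes = list(replicas)
    data_node, digest_nodes = nodes[0], nodes[1:consistency]
    data = replicas[data_node]
    digest = hashlib.md5(repr(data).encode()).hexdigest()
    mismatch = any(
        hashlib.md5(repr(replicas[n]).encode()).hexdigest() != digest
        for n in digest_nodes
    )
    return data, mismatch

# RF=3, QUORUM (2): one data read plus one digest read, as the wiki
# text above says. A replica with no row (None) still answers with a
# digest, which is how a mismatch is detected for a missing key.
replicas = {"A": {"col": "v1"}, "B": {"col": "v1"}, "C": None}
print(read_with_digests(replicas, consistency=2))
```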
Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure)
Dean, Is this an issue with tables created using CQL 3? OR… An issue with tables created in 1.1.4 using the CLI not being readable after an in-place upgrade to 1.2.2? I did a quick test and it worked. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/03/2013, at 8:18 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Your other option is to create tables 'WITH COMPACT STORAGE'. Basically, if you use COMPACT STORAGE you can create tables as you did before. https://issues.apache.org/jira/browse/CASSANDRA-2995 From an application standpoint, if you can't do sparse, wide rows, you break compatibility with 90% of Cassandra applications. So that rules out almost everything; if you can't provide the same data model, you're creating fragmentation, not pluggability. I now call Cassandra compact storage 'c*' storage, and I call CQL3 storage 'c*++' storage. See debates on c vs C++ to understand why :). On Sun, Mar 3, 2013 at 9:39 PM, Michael Kjellman mkjell...@barracuda.com wrote: Dean, I think if you look back through previous mailing list items you'll find answers to this already, but to summarize: Tables created prior to 1.2 will continue to work after upgrade. New tables created are not exposed by the Thrift API. It is up to client developers to upgrade the client to pull the required metadata for serialization and deserialization of the data from the System column family instead. I don't know Netflix's timetable for an update to Astyanax but I'm sure they are working on it. Alternatively, you can also use the Datastax java driver in your QA environment for now. If you only need to access existing column families this shouldn't be an issue. On 3/3/13 6:31 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I remember huge discussions on backwards compatibility and we have a ton of code using thrift (as do many people out there). We happen to have a startup bean for development that populates data in cassandra for us. 
We cleared out our QA completely (no data) and ran this… it turns out there seems to be no backwards compatibility, as it utterly fails. From the Astyanax point of view, we simply get this (when going back to 1.1.4, everything works fine). I can go down the path of finding out where backwards compatibility breaks, but does this mean essentially everyone has to rewrite their applications? OR is there a list of breaking changes that we can't do anymore? Has anyone tried the latest Astyanax client with 1.2.2? An unexpected error occured caused by exception RuntimeException: com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0]No hosts to borrow from Thanks, Dean Copy, by Barracuda, helps you store, protect, and share all your amazing things. Start today: www.copy.com.
Re: Select X amount of column families in a super column family in Cassandra using PHP?
You'll probably have better luck asking the author directly. Check the tutorial http://cassandra-php-client-library.com/tutorial/fetching-data and tell them what you have tried. For future reference we are trying to direct client specific queries to the client-dev list. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 2/03/2013, at 2:10 PM, Crocker Jordan jcrocker.115...@students.smu.ac.uk wrote: I'm using Kallaspriit's Cassandra/PHP library ( https://github.com/kallaspriit/Cassandra-PHP-Client-Library). I'm trying to select the first x amount of column families within the super column family, however, I'm having absolutely no luck, and google searches don't seem to bring up much. I'm using Random Partitioning, and don't particularly wish to change to OPP as I have read there is a lot more work involved. Any help would be much appreciated.
Re: Column Slice Query performance after deletions
I need something to keep the deleted columns away from my query fetch. Not only the tombstones. It looks like the min compaction might help on this. But I'm not sure yet on what would be a reasonable value for its threshold. Your tombstones will not be purged in a compaction until after gc_grace, and only if all fragments of the row are in the compaction. You're right that you would probably want to run repair during the day if you are going to dramatically reduce gc_grace, to avoid deleted data coming back to life. If you are using a single cassandra row as a queue, you are going to have trouble. Levelled compaction may help a little. If you are reading the most recent entries in the row, assuming the columns are sorted by some time stamp, use the Reverse Comparator and issue slice commands to get the first X cols. That will remove tombstones from the problem. (Am guessing this is not something you do, just mentioning it). Your next option is to change the data model so you don't use the same row all day. After that, consider a message queue. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 2/03/2013, at 12:03 PM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Tombstones stay around until gc grace so you could lower that to see if that fixes the performance issues. If the tombstones get collected, the column will live again, causing data inconsistency since I can't run a repair during the regular operations. Not sure if I got your thoughts on this. Size tiered or leveled compaction? I'm actually running on Size Tiered Compaction, but I've been looking into changing it for Leveled. It seems to be the case. Although even if I achieve some performance, I would still have the same problem with the deleted columns. I need something to keep the deleted columns away from my query fetch. Not only the tombstones. It looks like the min compaction might help on this. 
But I'm not sure yet on what would be a reasonable value for its threshold. On Sat, Mar 2, 2013 at 4:22 PM, Michael Kjellman mkjell...@barracuda.com wrote: Tombstones stay around until gc grace so you could lower that to see if that fixes the performance issues. Size tiered or leveled compaction? On Mar 2, 2013, at 11:15 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: What is your gc_grace set to? Sounds like as the number of tombstone records increases your performance decreases. (Which I would expect) gc_grace is default. Cassandra's data files are write once. Deletes are another write. Until compaction they all live on disk. Making really big rows has this problem. Oh, so it looks like I should lower the min_compaction_threshold for this column family. Right? What does this threshold value really mean? Guys, thanks for the help so far. On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman mkjell...@barracuda.com wrote: What is your gc_grace set to? Sounds like as the number of tombstone records increases your performance decreases. (Which I would expect) On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: I have a daily maintenance of my cluster where I truncate this column family, because its data doesn't need to be kept more than a day. Since all the regular operations on it finish around 4 hours before the end of the day, I regularly run a truncate on it followed by a repair at the end of the day. And every day, when the operations are started (when there are only a few deleted columns), the performance looks pretty good. Unfortunately it degrades throughout the day. On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman mkjell...@barracuda.com wrote: When is the last time you did a cleanup on the cf? On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Hello guys. 
I'm investigating the reasons for performance degradation in my scenario, which is as follows: - I have a column family filled with thousands of columns inside a single row (varies between 10k ~ 200k). I also have thousands of rows, not much more than 15k. - These rows are constantly updated, but the write load is not that intensive; I estimate it at 100 writes/sec on the column family. - Each column represents a message which is read and processed by another process. After reading it, the column is marked for deletion in order to keep it out of the next query on this row. Ok, so I've figured out that after many insertions plus deletion updates, my queries (column slice queries) are taking more time to perform, even if there are only a few columns, fewer than 100. So it looks like the larger the number of columns being deleted, the longer the time spent on a query. - Internally at C*, does column slice query ranges among deleted
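Aaron's Reverse Comparator suggestion earlier in this thread can be illustrated with a toy model of a queue-like row: a forward slice must scan past all the consumed (deleted) columns before reaching live data, while a reversed slice reaches the recent live columns immediately. This is purely illustrative, not C* internals:

```python
def slice_first_live(columns, count, reversed_order=False):
    """Toy slice over a queue-like row: columns are (sort_key, value)
    pairs where value None stands for a tombstone. Returns the first
    `count` live values and how many columns had to be scanned."""
    ordered = sorted(columns, reverse=reversed_order)
    live, scanned = [], 0
    for _, value in ordered:
        scanned += 1
        if value is not None:
            live.append(value)
            if len(live) == count:
                break
    return live, scanned

# 1000 consumed (deleted) entries followed by 10 fresh ones:
row = [(i, None) for i in range(1000)] + \
      [(i, f"msg{i}") for i in range(1000, 1010)]
print(slice_first_live(row, 5))                       # wades through tombstones
print(slice_first_live(row, 5, reversed_order=True))  # hits live columns first
```

This is why the query gets slower through the day: the forward scan cost grows with the number of deleted-but-not-yet-purged columns, not with the number of live ones.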
Re: reading the updated values
my question is how do I get the updated data in cassandra for the last 1 hour or so to be indexed in elasticsearch. You cannot. The best approach is to update Elasticsearch at the same time you update Cassandra. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 11:57 PM, subhankar biswas neo20iit...@gmail.com wrote: hi, I'm trying to use Cassandra as the main data store and Elasticsearch for realtime queries. My question is how do I get the updated data in Cassandra for the last 1 hour or so to be indexed in Elasticsearch. Once I get the updated data from Cassandra I can index it into ES. Is there any specific data model I have to follow to get the recent updates of any CF? thanks subhankar
Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure)
ok, we are talking about all thrift / cli / hector / non-CQL tables not being readable after an upgrade. If you can get some repro steps that would be handy. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 5:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: For us, this was an issue creating tables in 1.1.4 using thrift, then upgrading to 1.2.2. We did not use the cli to create anything. I will try the complete test again today and hopefully get more detail (I didn't know I could not run the same thrift code in 1.2.2 for keyspace creation/table creation). Thanks, Dean From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Sunday, March 3, 2013 11:09 PM To: user@cassandra.apache.org Subject: Re: no backwards compatibility for thrift in 1.2.2? (we get utter failure) Dean, Is this an issue with tables created using CQL 3? OR… An issue with tables created in 1.1.4 using the CLI not being readable after an in-place upgrade to 1.2.2? I did a quick test and it worked. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 3/03/2013, at 8:18 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Your other option is to create tables 'WITH COMPACT STORAGE'. Basically, if you use COMPACT STORAGE you can create tables as you did before. https://issues.apache.org/jira/browse/CASSANDRA-2995 From an application standpoint, if you can't do sparse, wide rows, you break compatibility with 90% of Cassandra applications. So that rules out almost everything; if you can't provide the same data model, you're creating fragmentation, not pluggability. 
I now call Cassandra compact storage 'c*' storage, and I call CQL3 storage 'c*++' storage. See debates on c vs C++ to understand why :). On Sun, Mar 3, 2013 at 9:39 PM, Michael Kjellman mkjell...@barracuda.com wrote: Dean, I think if you look back through previous mailing list items you'll find answers to this already, but to summarize: Tables created prior to 1.2 will continue to work after upgrade. New tables created are not exposed by the Thrift API. It is up to client developers to upgrade the client to pull the required metadata for serialization and deserialization of the data from the System column family instead. I don't know Netflix's timetable for an update to Astyanax but I'm sure they are working on it. Alternatively, you can also use the Datastax java driver in your QA environment for now. If you only need to access existing column families this shouldn't be an issue. On 3/3/13 6:31 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I remember huge discussions on backwards compatibility and we have a ton of code using thrift (as do many people out there). We happen to have a startup bean for development that populates data in cassandra for us. We cleared out our QA completely (no data) and ran this… it turns out there seems to be no backwards compatibility as it utterly fails. From the Astyanax point of view, we simply get this (when going back to 1.1.4, everything works fine). I can go down the path of finding out where backwards compatibility breaks but does this mean essentially everyone has to rewrite their applications? OR is there a list of breaking changes that we can't do anymore? Has anyone tried the latest astyanax client with 1.2.2 version? 
An unexpected error occured caused by exception RuntimeException: com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0]No hosts to borrow from Thanks, Dean
Re: Unable to instantiate cache provider org.apache.cassandra.cache.SerializingCacheProvider
What version are you using ? As of 1.1 off heap caches no longer require JNA https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L327 Also the row and key caches are now set globally, not per CF https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L324 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 1:33 AM, Jason Wee peich...@gmail.com wrote: This happened some time ago, but for the sake of helping others who encounter it: each column family has a row cache provider; you can see it in the schema, for example : ... and row_cache_provider = 'SerializingCacheProvider' ... If it cannot start the cache provider for some reason, it defaults to the ConcurrentLinkedHashCacheProvider. The SerializingCacheProvider requires the JNA lib; if you place the library into the Cassandra lib directory, this warning should not happen again.
Re: backing up and restoring from only 1 replica?
That would be OK only if you never had a node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh defmike...@gmail.com wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: Retrieving local data
Yes. You can get the token ranges via Astyanax and only ask for rows that are within the token ranges. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 2:25 PM, Everton Lima peitin.inu...@gmail.com wrote: Ok Aaron. But the problem is that I am running Cassandra 1.1.8. I am using it for compatibility with Astyanax 1.56. So, is it possible in Cassandra 1.1.8, too? 2013/2/28 aaron morton aa...@thelastpickle.com Take a look at the token function with the select statement http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 25/02/2013, at 10:06 AM, Everton Lima peitin.inu...@gmail.com wrote: Hi people, I need to retrieve some data from a local machine that is running Cassandra. I start the Cassandra daemon with my Java process, so now I need to execute a CQL query, but only against data that is stored on that machine. Is that possible? How? Thanks -- Everton Lima Aleixo BSc in Computer Science from UFG, MSc student in Computer Science at UFG, programmer at LUPA
Re: Unable to instantiate cache provider org.apache.cassandra.cache.SerializingCacheProvider
Details are here https://issues.apache.org/jira/browse/CASSANDRA-3271 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 8:04 AM, Jason Wee peich...@gmail.com wrote: version 1.0.8 Just curious, what is the mechanism for off heap in 1.1? Thank you. /Jason On Mon, Mar 4, 2013 at 11:49 PM, aaron morton aa...@thelastpickle.com wrote: What version are you using ? As of 1.1 off heap caches no longer require JNA https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L327 Also the row and key caches are now set globally not per CF https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L324 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 1/03/2013, at 1:33 AM, Jason Wee peich...@gmail.com wrote: This happened sometime ago, but for the sake of helping others if they encounter, each column family has a row cache provider, you can read into the schema, for example : ... and row_cache_provider = 'SerializingCacheProvider' ... it cannot start the cache provider for a reason and as a result, default to the ConcurrentLinkedHashCacheProvider. the serializing cache provider require jna lib, and if you place the library into cassandra lib directory, then this warning should not happen again.
Re: backing up and restoring from only 1 replica?
Hinted Handoff works well. But it's an optimisation with certain safety valves, configuration and throttling that mean it is still not considered the way to ensure on-disk consistency. In general, if a node restarts or drops mutations, HH should get the message there eventually. In specific cases it may not. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 10:40 AM, Mike Koh defmike...@gmail.com wrote: Thanks for the response. Could you elaborate more on the bad things that happen during a restart or message drops that would cause a 1-replica restore to fail? I'm completely on board with not using a restore process that nobody else uses, but I need to convince somebody else who thinks that it will work that it is not a good idea. On 3/4/2013 7:54 AM, aaron morton wrote: That would be OK only if you never had a node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh defmike...@gmail.com wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: anyone see this user-cassandra thread get answered...
Was probably this https://issues.apache.org/jira/browse/CASSANDRA-4597 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 2:05 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I was reading http://mail-archives.apache.org/mod_mbox/cassandra-user/201208.mbox/%3CCAGZm5drRh3VXNpHefR9UjH8H=dhad2y18s0xmam5cs4yfl5...@mail.gmail.com%3E As we are having the same issue in 1.2.2. We modify to LCS and cassandra-cli shows us at LCS on any node we run cassandra cli on, but then looking at cqlsh, it is showing us at SizeTieredCompactionStrategy :(. Thanks, Dean
Re: Consistent problem when solve Digest mismatch
Otherwise, it means the version conflict resolution strongly depends on a global sequence id (timestamp) which needs to be provided by the client? Yes. If you have an area of your data model that has a high degree of concurrency, C* may not be the right match. In 1.1 we have atomic updates so clients see either the entire write or none of it. And sometimes you can design a data model that does not mutate shared values but writes ledger entries instead. See Matt Dennis's talk here http://www.datastax.com/events/cassandrasummit2012/presentations or this post http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans/ Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 4:30 PM, Jason Tang ares.t...@gmail.com wrote: Hi The timestamp provided by my client is unix timestamp (with ntp), and as I said, due to the ntp drift, the local unix timestamp is not accurately synchronized (compared to my case). So in short, the client cannot provide a global sequence number to indicate the event order. But I wonder: I configured Cassandra consistency level as write QUORUM. So for one record, I suppose Cassandra has the ability to decide the final update result. Otherwise, it means the version conflict resolution strongly depends on a global sequence id (timestamp) which needs to be provided by the client? //Tang 2013/3/4 Sylvain Lebresne sylv...@datastax.com The problem is, what exactly is the sequence number you are talking about? Or let me put it another way: if you do have a sequence number that provides a total ordering of your operations, then that is exactly what you should use as your timestamp. What Cassandra calls the timestamp is exactly what you call seqID; it's the number Cassandra uses to decide the order of operations. Except that in real life, provided you have more than one client talking to Cassandra, then providing a total ordering of operations is hard, and in fact not doable efficiently. 
So in practice, people use unix timestamp (with ntp), which provides a very good yet cheap approximation of the real-life order of operations. But again, if you do know how to assign a more precise timestamp, Cassandra lets you use that: you can provide your own timestamp (using unix timestamp is just the default). The point being, unix timestamp is the best approximation we have in practice. -- Sylvain On Mon, Mar 4, 2013 at 9:26 AM, Jason Tang ares.t...@gmail.com wrote: Hi, previously I met a consistency problem; you can refer to the link below for the whole story. http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3CCAFb+LUxna0jiY0V=AvXKzUdxSjApYm4zWk=ka9ljm-txc04...@mail.gmail.com%3E And after checking the code, it seems I found some clue to the problem. Maybe someone can check this. In short, I have a Cassandra cluster (1.0.3), the consistency level is read/write quorum, replication_factor is 3. Here is the event sequence (seqID: NodeA / NodeB / NodeC): 1. New / New / New; 2. Update / Update / Update; 3. Delete / Delete / -. When trying to read from NodeB and NodeC, a Digest mismatch exception is triggered, so Cassandra tries to resolve this version conflict. But the result is the value Update. Here is the suspected root cause: the version conflict is resolved based on time stamp. Node C's local time is a bit earlier than node A's. The Update request was sent from node C with time stamp 00:00:00.050, the Delete from node A with time stamp 00:00:00.020, which is not the same as the event sequence. So the version conflict was resolved incorrectly. Is it true? If yes, then it means consistency level can ensure the conflict is found, but solving it correctly depends on the accuracy of time synchronization, e.g. NTP?
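The conflict Jason describes can be reproduced with a toy last-write-wins model: the surviving version is simply the one with the highest client-supplied timestamp, so a skewed clock can make an earlier Update beat a later Delete (illustrative sketch only, not Cassandra's code):

```python
def reconcile(versions):
    """Last-write-wins reconciliation: the cell with the highest
    client-supplied timestamp survives, regardless of the real-world
    order in which the writes happened."""
    return max(versions, key=lambda v: v[0])

# Node C's Update is stamped 50 ms, node A's later Delete only 20 ms
# (clock skew), so the Delete loses and the value comes back.
update = (50, "Update")
delete = (20, None)  # None stands for a tombstone
print(reconcile([update, delete]))
```

With synchronized clocks the Delete carries the larger timestamp and wins, which is the behaviour the consistency level alone cannot guarantee.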
Re: hinted handoff disabling trade-offs
The advantage of HH is that it reduces the probability of a DigestMismatch when using CL ONE. A DigestMismatch means the read has to run a second time before returning to the client. - No risk of hinted-handoffs building up - No risk of hinted-handoffs flooding a node that just came up See the yaml config settings for the max hint window and the throttling. Can anyone suggest any other factors that I'm missing here, specifically reasons not to do this? If you are doing this for performance, first make sure your data model is efficient, that you are doing the most efficient reads (see my presentation here http://www.datastax.com/events/cassandrasummit2012/presentations), and your caching is bang on. Then consider if you can tune the CL, and if your client is token aware so it directs traffic to a node that has the data. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/03/2013, at 9:19 PM, Michael Kjellman mkjell...@barracuda.com wrote: Also, if you have enough hints being created that it's significantly impacting your heap, I have a feeling things are going to get out of sync very quickly. On Mar 4, 2013, at 9:17 PM, Wz1975 wz1...@yahoo.com wrote: Why do you think disabling hinted handoff will improve memory usage? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: hinted handoff disabling trade-offs From: Michael Kjellman mkjell...@barracuda.com To: user@cassandra.apache.org CC: Repair is slow. On Mar 4, 2013, at 8:07 PM, Matt Kap matvey1...@gmail.com wrote: I am looking to get a second opinion about disabling hinted-handoffs. I have an application that can tolerate a fair amount of inconsistency (advertising domain), and so I'm weighing the pros and cons of hinted handoffs. I'm running Cassandra 1.0, looking to upgrade to 1.1 soon. 
Pros of disabling hinted handoffs: - Reduces heap - Improves GC performance - No risk of hinted-handoffs building up - No risk of hinted-handoffs flooding a node that just came up Cons - Some writes can be lost, at least until repair runs Can anyone suggest any other factors that I'm missing here. Specifically reasons not to do this. Cheers! -Matt
Re: Replacing dead node when num_tokens is used
AFAIK you just fire up the new one and let nature take its course :) http://www.datastax.com/docs/1.2/operations/add_replace_nodes#replace-node i.e. you do not need to use -Dcassandra.replace_token. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/03/2013, at 1:06 AM, Jan Kesten j.kes...@enercast.de wrote: Hello, while trying out cassandra I read about the steps necessary to replace a dead node. In my test cluster I used a setup using num_tokens instead of initial_tokens. How do I replace a dead node in this scenario? Thanks, Jan
Re: old data / tombstones are not deleted after ttl
If you have a data model with long-lived and frequently updated rows, you can get around the all-fragments problem by running a user defined compaction. Look for the CompactionManagerMBean on the JMX API https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManagerMBean.java#L67 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/03/2013, at 1:52 AM, Michal Michalski mich...@opera.com wrote: I have read in the documentation that after a major compaction, minor compactions are no longer automatically triggered. Does this mean that I have to run nodetool compact regularly? Or is there a way to get back to the automatic minor compactions? I think it's one of the most confusing parts of the C* docs. There's nothing like a switch for minor compactions that gets magically turned off when you trigger major compaction. Minor compactions won't get triggered automatically for _some_ time, because you'll only have one gargantuan SSTable, and unless you get enough new (smaller) SSTables to get them compacted together (4 by default), no compactions will kick in. Of course you'll still have one huge SSTable and it will take a lot of time to get another 3 of similar size to get them compacted. I think that it will be a problem for your TTL-based data model, as you'll have tons of Tombstones in the newer/smaller SSTables that you won't be able to compact together with the huge SSTable containing data. BTW: As far as I remember, there was an external tool (I don't remember the name) that allows splitting SSTables - I didn't use it, so I can't vouch for it, but you may want to give it a try. M. On 05.03.2013 09:46, Matthias Zeilinger wrote: Short question afterwards: I have read in the documentation that after a major compaction, minor compactions are no longer automatically triggered. Does this mean that I have to run nodetool compact regularly? 
Or is there a way to get back to automatic minor compactions?

Thx,
Br,
Matthias Zeilinger
Production Operation – Shared Services
P: +43 (0) 50 858-31185
M: +43 (0) 664 85-34459
E: matthias.zeilin...@bwinparty.com
bwin.party services (Austria) GmbH
Marxergasse 1B
A-1030 Vienna
www.bwinparty.com

-----Original Message-----
From: Matthias Zeilinger [mailto:matthias.zeilin...@bwinparty.com]
Sent: Tuesday, 05 March 2013 08:03
To: user@cassandra.apache.org
Subject: RE: old data / tombstones are not deleted after ttl

Yes, it was a major compaction. I know it's not a great solution, but I needed something to get rid of the old data, because I ran out of disk space.

Br,
Matthias Zeilinger

-----Original Message-----
From: Michal Michalski [mailto:mich...@opera.com]
Sent: Tuesday, 05 March 2013 07:47
To: user@cassandra.apache.org
Subject: Re: old data / tombstones are not deleted after ttl

Was it a major compaction? I ask because it's definitely a solution that had to work, but it's also a solution that, in general, probably no-one here would suggest using.

M.

On 05.03.2013 07:08, Matthias Zeilinger wrote:

Hi,

I have done a manual compaction via nodetool and this worked. But thanks for the explanation of why it wasn't compacted.

Br,
Matthias Zeilinger

From: Bryan Talbot [mailto:btal...@aeriagames.com]
Sent: Monday, 04 March 2013 23:36
To: user@cassandra.apache.org
Subject: Re: old data / tombstones are not deleted after ttl

Those older files won't be included in a compaction until there are min_compaction_threshold (4) files of that size.
When you get another SSTable -Data.db file that is about 12-18 GB, you'll have 4, and they will be compacted together into one new file. At that time, if there are any rows with only tombstones that are all older than gc_grace, the row will be removed (assuming the row exists exclusively in the 4 input SSTables). Columns whose data is more than TTL seconds old will be written as tombstones. If the row does have column values in SSTables that are not being compacted, the row will not be removed.

-Bryan

On Sun, Mar 3, 2013 at 11:07 PM, Matthias Zeilinger matthias.zeilin...@bwinparty.com wrote:

Hi,

I'm running Cassandra 1.1.5 and have the following issue. I'm using a 10 day TTL on my CF. I can see a lot of tombstones in there, but they aren't deleted
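Bryan's conditions for dropping a row during compaction can be condensed into a small predicate. This is a deliberately simplified model for illustration, not Cassandra's actual purge logic (which also consults bloom filters and per-column timestamps):

```python
def tombstone_purgeable(tombstone_age_s, gc_grace_s, row_in_other_sstables):
    """Simplified model: a tombstone can be dropped during compaction only
    when it is older than gc_grace_seconds AND the row does not also exist
    in an SSTable excluded from this compaction."""
    return tombstone_age_s > gc_grace_s and not row_in_other_sstables

GC_GRACE = 864_000  # 10 days, Cassandra's default gc_grace_seconds

# Tombstone is 10.4 days old and the row lives only in the input SSTables:
print(tombstone_purgeable(900_000, GC_GRACE, row_in_other_sstables=False))  # True
# Same age, but the row also exists in an SSTable outside the compaction:
print(tombstone_purgeable(900_000, GC_GRACE, row_in_other_sstables=True))   # False
```

This is why Matthias's expired columns survive: even tombstones well past gc_grace stay on disk while fragments of the same row sit in SSTables the compaction never touches.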
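Michal's point about the lull after a major compaction can also be sketched in code. The following is a toy version of size-tiered bucketing; the grouping rule and thresholds are illustrative assumptions, not the real SizeTieredCompactionStrategy implementation:

```python
def buckets(sstable_sizes_mb, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group SSTables into similar-size buckets; only buckets holding at
    least min_threshold tables are candidates for a minor compaction
    (simplified sketch of size-tiered compaction)."""
    groups = []
    for size in sorted(sstable_sizes_mb):
        for g in groups:
            avg = sum(g) / len(g)
            if bucket_low * avg <= size <= bucket_high * avg:
                g.append(size)
                break
        else:
            groups.append([size])
    return [g for g in groups if len(g) >= min_threshold]

# One 60 GB table left by a major compaction, plus a trickle of ~50 MB flushes:
print(buckets([60_000, 50, 52, 48, 51]))  # [[48, 50, 51, 52]]
```

The small flushes reach the threshold and compact among themselves, but the 60 GB giant sits alone in its bucket until three more tables of comparable size appear, which for a TTL workload may be never.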
Re: what size file for LCS is best for 300-500G per node?
Don't forget you can test things: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 5/03/2013, at 7:37 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Thanks!
Dean

On 3/4/13 7:12 PM, Wei Zhu wz1...@yahoo.com wrote:

We have 200G and ended up going with 10M. The compaction after repair takes a day to finish. Try to run a repair and see how it goes.

-Wei

----- Original Message -----
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Monday, March 4, 2013 10:52:27 AM
Subject: what size file for LCS is best for 300-500G per node?

Should we really be going with 5MB when it compresses to 3MB? That seems to be on the small side, right? We have ulimit cranked up, so too many open files shouldn't be an issue, but maybe we should go to 10MB or 100MB or something in between? Does anyone have any experience with changing the LCS sizes? I did read somewhere that startup times from opening 100,000 files could be slow, which implies a larger size (so fewer files) might be better?

Thanks,
Dean
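For reference, the LCS file size Wei and Dean are discussing is set per column family. A hedged config sketch, assuming a CQL3-capable node (Cassandra 1.2+) and a hypothetical keyspace/table name; on 1.1 the equivalent is done through cassandra-cli's compaction_strategy_options:

```cql
-- Hypothetical table; sstable_size_in_mb is the LCS target file size.
ALTER TABLE my_ks.my_cf
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 10};
```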