Secondary index query + 2 Datacenters + Row Cache + Restart = 0 rows

2013-02-01 Thread Alexei Bakanov

I've found a combination that doesn't work:
A column family that has a secondary index and caching='ALL', with
data in two datacenters. After I restart the nodes, my
secondary index queries start returning 0 rows.
It happens when the amount of data goes over a certain threshold, so I
suspect that compactions are involved as well.
Taking out any one of the ingredients fixes the problem and my queries
return rows from the secondary index.
I suspect that this guy is struggling with the same thing

Here is a sequence of actions that reproduces it with help of CCM:

$ ccm create --cassandra-version 1.2.1 --nodes 2 -p RandomPartitioner
$ ccm updateconf 'endpoint_snitch: PropertyFileSnitch'
$ ccm updateconf 'row_cache_size_in_mb: 200'
$ cp ~/Downloads/
~/.ccm/testRowCacheDC/node1/conf/  (please find .properties file
$ cp ~/Downloads/ ~/.ccm/testRowCacheDC/node2/conf/
$ ccm start
$ ccm cli
 -create keyspace and column family(please find schema below)
$ python
$ ccm stop  (I tried flush first, doesn't help)
$ ccm start
$ ccm cli
Connected to: testRowCacheDC on
Welcome to Cassandra CLI version 1.2.1-SNAPSHOT

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use testks;
Authenticated to keyspace: testks
[default@testks] get cf1 where 'indexedColumn'='userId_75';

0 Row Returned.
Elapsed time: 68 msec(s).

My cassandra instances run with -Xms1927M -Xmx1927M -Xmn400M
Thanks for help.

Best regards,

-- START --
-- FINISH --

-- START cassandra-cli schema ---
create keyspace testks
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 1}
  and durable_writes = true;

use testks;

create column family cf1
  with column_type = 'Standard'
  and comparator = 'org.apache.cassandra.db.marshal.AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 1.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
  and caching = 'ALL'
  and column_metadata = [
{column_name : 'indexedColumn',
validation_class : UTF8Type,
index_name : 'INDEX1',
index_type : 0}]
  and compression_options = {'sstable_compression' :
---FINISH cassandra-cli schema ---

-- START ---
from pycassa.batch import Mutator

import pycassa

pool = pycassa.ConnectionPool('testks', timeout=5)
cf = pycassa.ColumnFamily(pool, 'cf1')

for userId in xrange(0, 1000):
    print userId
    b = Mutator(pool, queue_size=200)
    for itemId in xrange(20):
        rowKey = 'userId_%s:itemId_%s'%(userId, itemId)
        for message_number in xrange(10):
            b.insert(cf, rowKey, {'indexedColumn': 'userId_%s'%userId,
                                  str(message_number): str(message_number)})

-- FINISH ---

Re: Start token sorts after end token

2013-02-01 Thread Jeremy Hanna
See - should be fixed in 
1.1.10 and 1.2.2.

On Jan 30, 2013, at 9:18 AM, Tejas Patil wrote:

 While reading data from Cassandra in map-reduce, I am getting 
 InvalidRequestException(why:Start token sorts after end token)
 Below is the code snippet that I used and the entire stack trace.
 (I am using Cassandra 1.2.0 and hadoop 0.20.2)
 Can you point out the issue here ?
 Code snippet:
 SlicePredicate predicate = new SlicePredicate();
 SliceRange sliceRange = new SliceRange();
 sliceRange.start = ByteBuffer.wrap(("1".getBytes()));
 sliceRange.finish = ByteBuffer.wrap(("100".getBytes()));
 sliceRange.reversed = false;
 //predicate.slice_range = sliceRange;
 List<ByteBuffer> colNames = new ArrayList<ByteBuffer>();
 predicate.column_names = colNames;
 ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
 Full stack trace:
 java.lang.RuntimeException: InvalidRequestException(why:Start token sorts 
 after end token)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(

Re: too many warnings of Heap is full

2013-02-01 Thread Guillermo Barbero
 What is the cardinality like on these indexes? Can you provide the
 schema creation for these two column families?

This is the schema of the CFs:

create column family CF_users
with comparator = UTF8Type
and column_metadata = [
{column_name: userSBCode, validation_class: UTF8Type, index_type: KEYS},
{column_name: userEmail, validation_class: UTF8Type, index_type: KEYS},
{column_name: userName, validation_class: UTF8Type},
{column_name: userLastName, validation_class: UTF8Type},
{column_name: userOwnPhoneKey, validation_class: UTF8Type, index_type: KEYS},
{column_name: userOwnPhone, validation_class: UTF8Type, index_type: KEYS},
{column_name: userPasswordMD5, validation_class: UTF8Type},
{column_name: userDOB, validation_class: UTF8Type},
{column_name: userGender, validation_class: UTF8Type},
{column_name: userProfilePicMD5, validation_class: UTF8Type},
{column_name: userAbout, validation_class: UTF8Type},
{column_name: userLastSession, validation_class: UTF8Type},
{column_name: userMasterKey, validation_class: UTF8Type}
];

create column family CF_SBMessages
with comparator = UTF8Type
and column_metadata =

{column_name: SBMessageId, validation_class:
UTF8Type, index_type: KEYS},
{column_name: fromSBCode, validation_class:
UTF8Type, index_type: KEYS},
{column_name: SBMessageDate, validation_class:
UTF8Type, index_type: KEYS},
{column_name: SBMessageType, validation_class:
{column_name: SBMessageText, validation_class:
{column_name: SBMessageAttachments,
validation_class: UTF8Type},

I've read about the importance of keeping the cardinality of the
secondary indexes low (great article at,
and I'm afraid that we did completely the opposite (we did consider
the secondary indexes as alternate indexes).
I guess there is some work to do to create other CFs to store these
secondary indexes.
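
The "other CFs" idea can be sketched as a hand-maintained lookup table: one extra column family per indexed field, keyed by the field value and pointing back to the user row key. The sketch below is only an illustration. Plain Python dicts stand in for the column families, and the names `users`, `users_by_email`, `put_user` and `find_by_email` are hypothetical, not part of the schema above.

```python
# Plain dicts stand in for column families (an assumption for illustration;
# a real implementation would write both rows through a Cassandra client).
users = {}            # row key -> {column name: value}
users_by_email = {}   # userEmail value -> row key of the user


def put_user(row_key, columns):
    """Write the user row and keep the lookup row in step with it."""
    old = users.get(row_key, {})
    old_email = old.get('userEmail')
    new_email = columns.get('userEmail', old_email)
    # Drop the stale lookup entry when the indexed value changes.
    if old_email is not None and old_email != new_email:
        users_by_email.pop(old_email, None)
    merged = dict(old)
    merged.update(columns)
    users[row_key] = merged
    if new_email is not None:
        users_by_email[new_email] = row_key


def find_by_email(email):
    """One lookup by key instead of a secondary index query."""
    row_key = users_by_email.get(email)
    return None if row_key is None else users[row_key]


put_user('user1', {'userEmail': 'a@example.com', 'userName': 'Ann'})
put_user('user1', {'userEmail': 'b@example.com'})
print(find_by_email('b@example.com'))
print(find_by_email('a@example.com'))
```

The cost of this design is that every write to an indexed column becomes two writes, and the application has to clean up stale lookup entries itself.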

Anyway, I still don't understand why these peaks appeared (by the
way, there weren't any last night).

neither 'nodetool repair' nor 'hinted handoff/read repair' work for secondary indexes

2013-02-01 Thread Alexei Bakanov
Hi again,

Once you start playing with CCM it's hard to stop; it's such a great tool.
My issue with secondary indexes is the following: neither an explicit
'nodetool repair' nor the implicit 'hinted handoffs/read repairs' resolve
inconsistencies in the data I get from secondary indexes.
I observe this for both one- and two-datacenter deployments, independent
of caching settings. Rebuilding the index, dropping and recreating it, or
restarting nodes doesn't help.

In the following scenario I start up 2 nodes and insert some rows with
CL.ONE. During this process I deliberately stop and start the nodes in
order to trigger inconsistencies.
I then query all data by its index with read CL.ONE and stop if I see
that data is missing. I see that none of the C* repair mechanisms work for
secondary indexes.

$ ccm create --cassandra-version 1.2.1 --nodes 2 -p RandomPartitioner
$ ccm start
$ ccm node1 cli
- create keyspace and column family  (please find schemas attached)
$ python (in first terminal)
$ ccm node1 stop; sleep 10; ccm node1 start   (in second terminal,
while runs)
$ ccm node2 stop; sleep 10; ccm node2 start   (in second terminal,
while runs. Hinted Handoffs do the work but
unfortunately not on Secondary Indexes)

$ python

Traceback (most recent call last):
  File, line 19, in module
raise Exception('missing rows for userId %s, data length is
%d'%(userId, len(data)))
Exception: missing rows for userId 256, data length is 0

$ ccm cli
[default@unknown] use testks;
Authenticated to keyspace: testks
[default@testks] get cf1 where 'indexedColumn'='userId_256';

0 Row Returned.
Elapsed time: 47 msec(s).

$ python  (running one more time in hope that 'read
repair' kicked in after the last query, but unfortunately no)

Traceback (most recent call last):
  File, line 19, in module
raise Exception('missing rows for userId %s, data length is
%d'%(userId, len(data)))
Exception: missing rows for userId 256, data length is 0

$ ccm node1 repair
$ ccm node2 repair
$ ccm cli

[default@unknown] use testks;
Authenticated to keyspace: testks
[default@testks] get cf1 where 'indexedColumn'='userId_256';

0 Row Returned.

Both cassandra instances run with -Xms1927M -Xmx1927M -Xmn400M

Thanks for help.

Best regards,

--START cassandra-cli schemas 
create keyspace testks
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {datacenter1 : 2}
  and durable_writes = true;

use testks;

create column family cf1
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 1.0
  and dclocal_read_repair_chance = 1.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
  and caching = 'KEYS_ONLY'
  and column_metadata = [
{column_name : 'indexedColumn',
validation_class : UTF8Type,
index_name : 'INDEX1',
index_type : 0}]
  and compression_options = {'sstable_compression' :
--FINISH cassandra-cli schemas 

--START --
import datetime
from pycassa.batch import Mutator

import pycassa

pool = pycassa.ConnectionPool('testks', timeout=5,
server_list=['', ''])
cf = pycassa.ColumnFamily(pool, 'cf1')

for userId in xrange(0, 2000):
    print userId
    b = Mutator(pool, queue_size=200)
    for itemId in xrange(20):
        rowKey = 'userId_%s:itemId_%s'%(userId, itemId)
        for message_number in xrange(10):
            b.insert(cf, rowKey, {'indexedColumn': 'userId_%s'%userId,
                                  str(message_number): str(message_number)})


--START --
import pycassa
from pycassa.columnfamily import ColumnFamily
from pycassa.pool import ConnectionPool
from pycassa.index import *

pool = pycassa.ConnectionPool('testks', server_list=['', ''])
cf = pycassa.ColumnFamily(pool, 'cf1')

for userId in xrange(2000):
    print userId
    index_expr = create_index_expression('indexedColumn', 'userId_%s'%userId)
    index_clause = create_index_clause([index_expr], count=1000)
    data = list(cf.get_indexed_slices(index_clause=index_clause))
    if len(data) != 20:
        raise Exception('missing rows for userId %s, data length is %d'%(userId, len(data)))


Re: Inserting via thrift interface to column family created with Compound Key via cql3

2013-02-01 Thread aaron morton
What's the full error stack on the client?

Are you using a pre-built Thrift client or your own? If the latter, try using a
pre-built client first, like Hector or pycassa. If it works there, look into how
that code works and go from there.


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 5:24 AM, Oleksandr Petrov wrote:

 BTW, thanks for chiming in!
 No-no, I'm using Thrift client, not inserting via cql.
 I'm serializing via CompositeType, actually. 
 CompositeType.getInstance(UTF8Type, UTF8Type).decompose([firstkeypart, 
 Hm... From what you say I understand that it's technically possible :/ 
 So I must be wrong somewhere,

Re: cluster issues

2013-02-01 Thread aaron morton
For DataStax Enterprise specific questions, try the support forums


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 8:27 AM, S C wrote:

 I am using DseDelegateSnitch
 Subject: Re: cluster issues
 Date: Tue, 29 Jan 2013 20:15:45 +1300
   • We can always be proactive in keeping the time sync. But, Is there 
 any way to recover from a time drift (in a reactive manner)? Since it was a 
 lab environment, I dropped the KS (deleted data directory)
 There is a way to remove future dated columns, but it not for the faint 
 1) Drop the gc_grace_seconds to 0
 2) Delete the column with a timestamp way in the future, so it is guaranteed 
 to be higher than the value you want to delete. 
 3) Flush the CF
 4) Compact all the SSTables that contain the row. The easiest way to do that 
 is a major compaction, but we normally advise not to do that because it 
 creates one big file. You can also do a user defined compaction. 
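
Why step 2 needs a timestamp "way in the future" follows from how Cassandra reconciles versions of a column: the version with the highest timestamp wins. Below is a minimal sketch of that rule. The `reconcile` function and the timestamps are made up for illustration and are not Cassandra code.

```python
def reconcile(versions):
    """Return the winning (timestamp, value) pair: highest timestamp wins.

    A value of None stands for a tombstone (a delete marker).
    """
    return max(versions, key=lambda v: v[0])


now = 1000            # hypothetical "correct" clock, arbitrary units
drifted = now + 25    # write stamped by a node whose clock ran ahead

versions = [(drifted, 'bad value')]

# A normal delete stamped with the correct clock loses to the drifted write.
versions.append((now + 1, None))
print(reconcile(versions))

# A delete stamped beyond the drifted timestamp wins.
versions.append((drifted + 1000, None))
print(reconcile(versions))
```

Once the tombstone shadows the future-dated column, steps 1, 3 and 4 are what should allow compaction to drop both of them.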
   • Are there any other scenarios that would lead a cluster look like 
 below? Note:Actual topology of the cluster - ONE Cassandra node and TWO 
 Analytic nodes.
 What snitch are you using?
 If you have the property file snitch do all nodes have the same configuration 
 There is a lot of sickness there. If possible I would scrub and start again. 
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 On 29/01/2013, at 6:29 AM, S C wrote:
 One of our node in a 3 node cluster drifted by ~ 20-25 seconds. While I 
 figured this pretty quickly, I had few questions that am looking for some 
   • We can always be proactive in keeping the time sync. But, Is there 
 any way to recover from a time drift (in a reactive manner)? Since it was a 
 lab environment, I dropped the KS (deleted data directory).
   • Are there any other scenarios that would lead a cluster look like 
 below?Note:Actual topology of the cluster - ONE Cassandra node and TWO 
 Analytic nodes.
 Address DC  RackStatus State   LoadOwns   
  113427455640312821154458202477256070485  Cassandra   rack1   Up Normal  601.34 MB   33.33%  
 0  Analytics   rack1   Down   Normal  149.75 MB   33.33%  
 56713727820156410577229101238628035242  Analytics   rack1   Down   Normal  ?   33.33%  
 Address DC  RackStatus State   LoadOwns   
  113427455640312821154458202477256070485  Analytics   rack1   Down   Normal  ?   33.33%  
 0  Analytics   rack1   Up Normal  158.59 MB   33.33%  
 56713727820156410577229101238628035242  Analytics   rack1   Down   Normal  ?   33.33%  
 Address DC  RackStatus State   LoadOwns   
  113427455640312821154458202477256070485  Analytics   rack1   Down   Normal  ?   33.33%  
 0  Analytics   rack1   Down   Normal  ?   33.33%  
 56713727820156410577229101238628035242  Analytics   rack1   Up Normal  117.02 MB   33.33%  
 Appreciate your valuable inputs.


2013-02-01 Thread aaron morton
Can you update the ticket with your experiences ?


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 11:13 AM, wrote:

 I had the same problem with 1.2.0.  The problem went away after readline was 
 Yen-Fen Hsu

Re: why set replica placement strategy at keyspace level ?

2013-02-01 Thread aaron morton
Many of my mental models bother people :)

This particular one came from my understanding of Big Table and the code. 

For me this works, I think of (internal) rows as roughly containing the CF's. 

In the CQL world it works for me as well, the partition key (first part of the 
primary key) is important and identifies the storage container that has the 

Your mileage may vary
Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 4:43 PM, Edward Capriolo wrote:

 That should not bother you.
 For example, if you're doing an hbase scan that crosses two column families,
 that could end up being two (disk) seeks.
 Having an API that hides the seeks from you does not give you better
 performance, it only helps you when you're debating with people that do not
 understand the fundamentals.

Re: Cassandra pending compaction tasks keeps increasing

2013-02-01 Thread aaron morton
 Will that cause  the symptom of no data streamed from other nodes? Other 
 nodes still think the node had all the data?
AFAIK they will not make assumptions like that. 

  Can I just change it in yaml and restart C* and it will correct itself?
It's a schema config change, check the help for the CLI or the CQL docs. 

 Any side effect? Since we are using SSD, a bit bigger SSTable won't slow down the 
 read too much, I suppose that is the main concern for bigger size of SSTable?
Do some experiments to see how it works, and let others know :) 
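
As a starting point for such experiments, a rough estimate shows how the SSTable size drives the number of files that LCS has to work through. The 150G figure comes from the quoted message below. The 5 MB size is an assumption (I believe it was the LCS default for `sstable_size_in_mb` at the time), and `sstable_count` is a hypothetical helper.

```python
def sstable_count(data_gb, sstable_size_mb):
    """Approximate number of SSTables needed to hold data_gb of data."""
    return int(data_gb * 1024 // sstable_size_mb)


# 150G streamed in by repair, at a few candidate SSTable sizes (in MB).
for size_mb in (5, 10, 50):
    print(size_mb, sstable_count(150, size_mb))
```

With 5 MB SSTables this gives 30720, which lines up with the "about 30K" SSTables reported; doubling the size halves the count.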


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 5:30 PM, Wei Zhu wrote:

 Some updates:
 Since we still have not fully turned on the system. We did something crazy 
 today. We tried to treat the node as dead one. (My boss wants us to practice 
 replacing a dead node before going to full production) and boot strap it. 
 Here is what we did:
   • drain the node
   • check nodetool on other nodes, and this node is marked down (the 
 token for this node is 100)
   • clear the data, commit log, saved cache
   • change initial_token from 100 to 99 in the yaml file
   • start the node
   • check nodetool, the down node of 100 disappeared by itself (!!) and 
 new node with token 99 showed up
   • checked log, see the message saying bootstrap completed. But only a 
 couple of MB streamed. 
   • nodetool movetoken 98
   • nodetool, see the node with token 98 comes up. 
   • check log, see the message saying bootstrap completed. But still only 
 a couple of MB streamed.
 The only reason I can think of is that the new node has the same IP as the 
 dead node we tried to replace? Will that cause  the symptom of no data 
 streamed from other nodes? Other nodes still think the node had all the data?
 We had to do nodetool repair -pr to bring in the data. After 3 hours, 150G  
 transferred. And no surprise, pending compaction tasks are now at 30K. There 
 are about 30K SStable transferred and I guess all of them needs to be 
 compacted since we use LCS.
 My concern is that if we did nothing wrong, replacing a dead node will cause 
 such a huge backlog of pending compactions. It might take a week to clear 
 that off. And we have RF = 3, so we still need to bring in the data for the 
 other two replicas since we use -pr for nodetool repair. It will take 
 about 3 weeks to fully replace a 200G node using LCS? We tried everything we 
 can to speed up the compaction and no luck. The only thing I can think of is 
 to increase the default size of SSTable, so fewer compactions will be 
 needed. Can I just change it in yaml and restart C* and it will correct 
 itself? Any side effect? Since we are using SSD, a bit bigger SSTable won't slow 
 down the read too much, I suppose that is the main concern for bigger size of 
 I think 1.2 comes with parallel LC which should help the situation. But we 
 are not going to upgrade for a little while.
 Did I miss anything? It might not be practical to use LCS for 200G node? But 
 if  we use Sized compaction, we need to have at least 400G for the 
 HD...Although SSD is cheap now, still hard to convince the management. three 
 replicates + double the Disk for compaction? that is 6 times of the real data 
 Sorry for the long email. Any suggestion or advice?
 From: aaron morton
 To: Cassandra User
 Sent: Tuesday, January 29, 2013 12:59:42 PM
 Subject: Re: Cassandra pending compaction tasks keeps increasing
 * Will try it tomorrow. Do I need to restart server to change the log level?
 You can set it via JMX, and supposedly log4j is configured to watch the 
 config file. 
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 On 29/01/2013, at 9:36 PM, Wei Zhu wrote:
 Thanks for the reply. Here is some information:
 Do you have wide rows ? Are you seeing logging about Compacting wide rows ? 
 * I don't see any log about wide rows
 Are you seeing GC activity logged or seeing CPU steal on a VM ? 
 * There is some GC, but CPU general is under 20%. We have heap size of 8G, 
 RAM is at 72G.
 Have you tried disabling multithreaded_compaction ? 
 * By default, it's disabled. We enabled it, but doesn't see much difference. 
 Even a little slower with it's enabled. Is it bad to enable it? We have SSD, 
 according to comment in yaml, it should help while using SSD.
 Are you using Key Caches ? Have you tried disabling 
 * We have fairly big Key caches, we set as 10% of Heap which is 800M. Yes, 
 compaction_preheat_key_cache is disabled. 
 Can you enabled DEBUG level logging and make them available ? 
 * Will try it tomorrow. Do I need to restart server to change the log 

Re: CPU hotspot at BloomFilterSerializer#deserialize

2013-02-01 Thread aaron morton
 5. the problematic Data file contains only 5 to 10 keys data but large(2.4G)
So very large rows ? 
What does nodetool cfstats or cfhistograms say about the row sizes ? 

 1. what is happening?

I think this is partly due to large rows and partly due to the query pattern. This is 
only roughly correct and my talk here

 3. any more info required to proceed?

Do some tests with different query techniques…

Get a single named column. 
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order. 
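
The three tests above could be driven by a small helper like the one below. It is a sketch that assumes a pycassa-style `ColumnFamily` object whose `get` accepts the `columns`, `column_count` and `column_reversed` keyword arguments; `run_query_tests` itself is a hypothetical name.

```python
import time


def run_query_tests(cf, key, named_column):
    """Time the three read patterns against one row and return the timings."""
    queries = [
        ('single named column',
         lambda: cf.get(key, columns=[named_column])),
        ('first 10 columns',
         lambda: cf.get(key, column_count=10)),
        ('last 10 columns',
         lambda: cf.get(key, column_count=10, column_reversed=True)),
    ]
    timings = {}
    for label, query in queries:
        start = time.time()
        query()
        timings[label] = time.time() - start
    return timings
```

Running it against one of the large rows and against a small row should show whether the by-name read is the expensive one.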

Hope that helps. 

Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 7:20 PM, Takenori Sato wrote:

 Hi all,
 We have a situation where the CPU load on some of the nodes in our cluster has 
 spiked occasionally since last November, triggered by requests 
 for rows that reside on two specific sstables.
 We confirmed the following (when spiked):
 version: 1.0.7(current) - 0.8.6 - 0.8.5 - 0.7.8
 jdk: Oracle 1.6.0
 1. a profiling showed that BloomFilterSerializer#deserialize was the 
 hotspot(70% of the total load by running threads)
 * the stack trace looked like this(simplified)
 90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
 89.5% -
 79.9% -
 68.9% -
 66.7% -
 2. Usually, 1 should be so fast that a profiling by sampling can not detect
 3. no pressure on Cassandra's VM heap nor on the machine overall
 4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by iostat 1 
 5. the problematic Data file contains only 5 to 10 keys data but large(2.4G)
 6. the problematic Filter file size is only 256B(could be normal)
 So now, I am trying to read the Filter file in the same way 
 BloomFilterSerializer#deserialize does, as closely as I can, in order to see 
 if something is wrong with the file.
 Could you give me some advice on:
 1. what is happening?
 2. the best way to simulate the BloomFilterSerializer#deserialize
 3. any more info required to proceed?

Re: initial_token

2013-02-01 Thread Víctor Hugo Oliveira Molinar
Do not set initial_token when using Murmur3Partitioner.
Instead, set num_tokens.

For example, if you have 3 hosts with the same hardware setup, set the same
num_tokens on each one.
Now consider adding another, better host; this time I'd suggest you
set it to the previous num_tokens * 2.

num_tokens: 128 (worse machines)
num_tokens: 256 (twice better machine)

This is the setup of virtual nodes.
Check the current DataStax docs for it.
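
To see what those two settings mean for load, here is a small worked example. It assumes that a host's share of the ring is roughly proportional to its number of tokens, which only holds approximately with randomly chosen tokens; the host names and the helper `expected_ownership` are hypothetical.

```python
def expected_ownership(num_tokens_by_host):
    """Share of the ring per host, assuming ownership follows token count."""
    total = float(sum(num_tokens_by_host.values()))
    return dict((host, n / total) for host, n in num_tokens_by_host.items())


# Three equal hosts at 128 tokens plus one stronger host at 256.
cluster = {'host1': 128, 'host2': 128, 'host3': 128, 'host4': 256}
for host, share in sorted(expected_ownership(cluster).items()):
    print('%s %.0f%%' % (host, share * 100))
```

With 640 tokens in total, each of the three older hosts ends up with about 20% of the ring and the stronger host with about 40%.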

On Thu, Jan 31, 2013 at 8:43 PM, Edward Capriolo wrote:

 This is the bad side of changing a default. There are going to be a few
 groups of unfortunates.

 The first group, who only can not setup their cluster, and eventually
 figure out their tokens. (this thread)
 The second group, who assume their tokens were correct and run around
 with an unbalanced cluster thinking the performance sucks. (the
 threads for the next few months)
 The third group, who will google how to balance my ring and find a
 page with random partitioner instructions. (the occasional thread for
 the next N years)
 The fourth group, because as of now map reduce is highly confused by this.

 On Thu, Jan 31, 2013 at 4:52 PM, Rob Coli wrote:
  On Thu, Jan 31, 2013 at 12:17 PM, Edward Capriolo
  Now by default a new partitioner is chosen, Murmur3.
  Now = as of 1.2, to be unambiguous.
  =Robert Coli
  YAHOO - rcoli.palominob
  SKYPE - rcoli_palominodb

Re: Understanding Virtual Nodes on Cassandra 1.2

2013-02-01 Thread aaron morton
 Are there tickets/documents explain how data be replicated on Virtual Nodes?
Check the changes.txt file, they link to tickets. 

Not many people use BOP, so you may be exploring new'ish territory. Try asking 
someone on the IRC channel. 

Aaron Morton
Freelance Cassandra Developer
New Zealand


On 31/01/2013, at 11:47 PM, Manu Zhang wrote:

 On Thu 31 Jan 2013 03:43:32 AM CST, Zhong Li wrote:
 Are there tickets/documents explain how data be replicated on Virtual
 Nodes? If there are multiple tokens on one physical host, may a chance
 two or more tokens chosen by replication strategy located on same
 host? If move/remove/add a token
 manually, does Cassandra Engine validate the case?
 On Jan 30, 2013, at 12:46 PM, Zhong Li wrote:
 You add a physical node and that in turn adds num_token tokens to
 the ring.
 No, I am talking about Virtual Nodes with the order preserving
 partitioner. Take an existing host with multiple tokens set as a list
 in cassandra.initial_token. After initial bootstrapping, the host is
 not aware of changes to cassandra.initial_token. If I want to add a new
 token (virtual node), I have to rebuild the host with the new token list.
 My question is whether there is a way to add a virtual node without rebuilding it?
 On Jan 30, 2013, at 10:21 AM, Manu Zhang wrote:
 On Wed 30 Jan 2013 02:29:27 AM CST, Zhong Li wrote:
 One more question, can I add a virtual node manually without reboot
 and rebuild a host data?
 I checked nodetool command, there is no option to add a node.
 On Jan 29, 2013, at 11:09 AM, Zhong Li wrote:
 I misunderstood this,
 If you want to get started with vnodes on a fresh cluster, however,
 that is fairly straightforward. Just don’t set the
 |initial_token| parameter in your|conf/cassandra.yaml| and instead
 enable the |num_tokens| parameter. A good default value for this
 is 256
 Also I couldn't find a document about setting multiple tokens
 for cassandra.initial_token
 Anyway, I just tested, it does work to set  comma separated list of
 On Jan 29, 2013, at 3:06 AM, aaron morton wrote:
 After I searched some document on Datastax website and some old
 ticket, seems that it works for random partitioner only, and leaves
 order preserved partitioner out of the luck.
 Links ?
 or allow add Virtual Nodes manually?
 I've not looked into it, but there is a cassandra.initial_token startup
 param that takes a comma separated list of tokens for the node.
 There also appears to be support for the ordered partitions to
 generate random tokens.
 But you would still have the problem of having to balance your row
 keys around the token space.
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 On 29/01/2013, at 10:31 AM, Zhong Li wrote:
 Hi All,
 Virtual Nodes is great feature. After I searched some document on
 Datastax website and some old ticket, seems that it works for
 random partitioner only, and leaves order preserved partitioner out
 of the luck. I may misunderstand, please correct me. if it doesn't
 love order preserved partitioner, would be possible to add support
 multiple initial_token(s) for  order preserved partitioner  or
 allow add Virtual Nodes manually?
 You add a physical node and that in turn adds num_token tokens to
 the ring.
 no, those tokens will be skipped
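
The "those tokens will be skipped" answer can be pictured as a walk around the ring that ignores any token whose host has already been picked as a replica. The sketch below is a simplified illustration of that idea, not Cassandra's actual replication strategy code; the ring layout and the function `replicas_for` are made up.

```python
import bisect


def replicas_for(token, ring, rf):
    """Pick rf distinct hosts for a token.

    ring is a sorted list of (token, host) pairs. The walk starts at the
    first ring token >= the given token, goes clockwise, and skips tokens
    whose host has already been chosen.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    chosen = []
    for i in range(len(ring)):
        host = ring[(start + i) % len(ring)][1]
        if host not in chosen:
            chosen.append(host)
        if len(chosen) == rf:
            break
    return chosen


# Host A holds three tokens (vnodes); B and C hold one each.
ring = sorted([(10, 'A'), (20, 'A'), (30, 'B'), (40, 'A'), (50, 'C')])
print(replicas_for(5, ring, 2))
print(replicas_for(35, ring, 3))
```

In the first call the walk meets host A twice in a row; the second A token is skipped, so the replicas land on two different physical hosts.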

Re: JDBC : CreateresultSet fails with null column in CqlResultSet

2013-02-01 Thread aaron morton
I think 
is the place to raise the issue. 

Can you update the mail thread with the ticket as well?


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 1/02/2013, at 3:25 AM, Andy Cobley wrote:

 As you may be aware I've been trying to track down a problem using JDBC 1.1.2 
 with Cassandra 1.2.0  I was getting a null pointer exception in the result 
 set.  I've done some digging into the JDBC driver  and found the following.
 In the new result set is Instantiated in 
 CassandraResultSet(Statement statement, CqlResult resultSet, String keyspace)
 I decided to trace the result set with the following code:
 rowsIterator = resultSet.getRowsIterator();
   CqlRow row = rowsIterator.next();
   curRowKey = row.getKey();
   System.out.println("Row Key " + curRowKey);
   List<Column> cols = row.getColumns();
   Iterator<Column> iterator;
   iterator = cols.iterator();
   while (iterator.hasNext()) {
       Column col = (Column) iterator.next();
       String Name = new String(col.getName());
       String Value = new String(col.getValue());
       System.out.println("Col " + Name + " : " + Value);
   }
 This produced the following output:
 Row Key [B@617e53c9
 Col key : jsmith
 Col : 
 Col password : ch@ngem3a
 Row Key [B@2caee320
 Col key : jbrown
 Col : 
 Col gender : male
 As you can see there is a blank column at position 2 in each of the rows.  As 
 this result set has come from the Cassandra Thrift client (I believe), the 
 problem may lie there.  There is no blank column defined by my SQL create 
 statements, I believe. 
 If I'm correct here, should I raise a ticket with JDBC or Cassandra ? (for 
 now I've patched my local JDBC driver so it doesn't create a TypedColumn if 
 the result set produces a null column)
 The University of Dundee is a Scottish Registered Charity, No. SC015096.

Re: JDBC : CreateresultSet fails with null column in CqlResultSet

2013-02-01 Thread Andy Cobley

Ticket is at


On 1 Feb 2013, at 18:01, aaron morton wrote:

 I think is 
 the place to raise the issue. 
 Can you update the mail thread with the ticket as well?
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand


conditional update or insert

2013-02-01 Thread Jay Svc
Hi All,

On each row I have a column which maintains a timestamp, like
lastUpdated etc.

While inserting such a row I want to make sure that the row is only
updated if the stored lastUpdated is older than the new one I am inserting.

One way to do this is:

Read the record first and check the timestamp; if the new one is later, then update.

Since I have a high volume of read and write load, this additional read
will add to it.

Is there any alternative to achieve this?
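
For reference, the read-then-write approach described above looks roughly like this. It is a sketch in which a plain dict stands in for the column family; `store` and `update_if_newer` are hypothetical names.

```python
# A dict stands in for the column family: row key -> row (a dict of columns).
store = {}


def update_if_newer(row_key, new_row):
    """Write new_row only if its lastUpdated is later than the stored one."""
    current = store.get(row_key)   # this is the extra read
    if current is not None and current['lastUpdated'] >= new_row['lastUpdated']:
        return False               # stored row is as new or newer, skip
    store[row_key] = new_row
    return True


print(update_if_newer('r1', {'lastUpdated': 200, 'value': 'b'}))
print(update_if_newer('r1', {'lastUpdated': 100, 'value': 'a'}))
```

Besides the extra read, the check and the write are two separate steps, so two concurrent writers can still interleave between them.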


Re: initial_token

2013-02-01 Thread Edward Capriolo
You do not just want to use vnodes without being sure. Some queries are
not optimized for vnodes and issue 128 slices to solve some

On Fri, Feb 1, 2013 at 12:55 PM, Víctor Hugo Oliveira Molinar wrote:
 Do not set initial_token when using murmur3partitioner.
 instead, set num_tokens.

 For example, u have 3 hosts with the same hardware setup, then, for each one
 set the same num_tokens.
 But now consider adding another better host, this time i'd suggest you to
 set previous num_tokens * 2.

 num_tokens: 128 (worse machines)
 num_tokens: 256(twice better machine)

 This is the setup of virtual nodes.
 Check current datastax docs for it.

 On Thu, Jan 31, 2013 at 8:43 PM, Edward Capriolo

 This is the bad side of changing a default. There are going to be a few
 groups of unfortunates.

 The first group, who only can not setup their cluster, and eventually
 figure out their tokens. (this thread)
 The second group, who assume their tokens were correct and run around
 with an unbalanced cluster thinking the performance sucks. (the
 threads for the next few months)
 The third group, who will google how to balance my ring and find a
 page with random partitioner instructions. (the occasional thread for
 the next N years)
 The fourth group, because as of now map reduce is highly confused by this.

 On Thu, Jan 31, 2013 at 4:52 PM, Rob Coli wrote:
  On Thu, Jan 31, 2013 at 12:17 PM, Edward Capriolo wrote:
  Now by default a new partitioner is chosen, Murmur3.
  Now = as of 1.2, to be unambiguous.
  =Robert Coli
  YAHOO - rcoli.palominob
  SKYPE - rcoli_palominodb

Re: Cassandra pending compaction tasks keeps increasing

2013-02-01 Thread Derek Williams
Did the node list itself as a seed node in cassandra.yaml? Unless something
has changed, a node that considers itself a seed will not auto bootstrap.
Although I haven't tried it, I think running 'nodetool rebuild' will cause
it to stream in the data it needs without doing a repair.

On Wed, Jan 30, 2013 at 9:30 PM, Wei Zhu wrote:

 Some updates:
 Since we still have not fully turned on the system. We did something crazy
 today. We tried to treat the node as dead one. (My boss wants us to
 practice replacing a dead node before going to full production) and boot
 strap it. Here is what we did:

- drain the node
- check nodetool on other nodes, and this node is marked down (the
token for this node is 100)
- clear the data, commit log, saved cache
- change initial_token from 100 to 99 in the yaml file
- start the node
- check nodetool, the down node of 100 disappeared by itself (!!) and
new node with token 99 showed up
- checked log, see the message saying bootstrap completed. But only a
couple of MB streamed.
- nodetool movetoken 98
- nodetool, see the node with token 98 comes up.
- check log, see the message saying bootstrap completed. But still
only a couple of MB streamed.

 The only reason I can think of is that the new node has the same IP as the
 dead node we tried to replace. Would that cause the symptom of no data
 being streamed from other nodes? Do other nodes still think the node had
 all the data?

 We had to run nodetool repair -pr to bring in the data. After 3 hours,
 150G had been transferred. And no surprise, pending compaction tasks are
 now at 30K. About 30K SSTables were transferred, and I guess all of them
 need to be compacted since we use LCS.

 My concern is that, even if we did nothing wrong, replacing a dead node
 will cause such a huge backlog of pending compactions. It might take a
 week to clear that off. And since we have RF = 3, we still need to bring
 in the data for the other two replicas, because we used -pr for nodetool
 repair. So it will take about 3 weeks to fully replace a 200G node using
 LCS? We tried everything we can to speed up the compaction, with no luck.
 The only thing I can think of is to increase the default SSTable size, so
 that fewer compactions are needed. Can I just change it in the yaml and
 restart C*, and it will correct itself? Any side effects? Since we are
 using SSDs, a somewhat bigger SSTable won't slow down reads too much; I
 suppose that is the main concern with a bigger SSTable size?

 I think 1.2 comes with parallel leveled compaction, which should help the
 situation. But we are not going to upgrade for a little while.

 Did I miss anything? Might it not be practical to use LCS for a 200G node?
 But if we use size-tiered compaction, we need to have at least 400G of
 disk. Although SSDs are cheap now, it is still hard to convince the
 management. Three replicas plus double the disk for compaction? That is 6
 times the real data size!
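
 The numbers in this message can be checked with a little arithmetic. A rough sketch, assuming a leveled-compaction SSTable size of 5 MB (an assumption on my part about the configured `sstable_size_in_mb`; the thread does not state it) and the 2x disk headroom for size-tiered compaction mentioned above; both function names are hypothetical:

```python
def sstable_count(data_gb, sstable_size_mb):
    """Approximate number of fixed-size SSTables needed to hold data_gb."""
    return data_gb * 1024 // sstable_size_mb


def raw_disk_needed_gb(data_gb, replication_factor, compaction_headroom):
    """Cluster-wide disk needed for the replicated data plus headroom."""
    return data_gb * replication_factor * compaction_headroom


if __name__ == "__main__":
    # 150G streamed by repair, split into 5 MB SSTables (assumed size).
    print(sstable_count(150, 5))
    # Doubling the SSTable size halves the number of files to compact.
    print(sstable_count(150, 10))
    # 200G of real data, RF = 3, 2x headroom for size-tiered compaction.
    print(raw_disk_needed_gb(200, 3, 2))
```

 Under the 5 MB assumption, 150G works out to roughly 30K SSTables, which matches the pending compaction count reported above, and 200G at RF = 3 with 2x headroom is 1200G, the "6 times" figure.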

 Sorry for the long email. Any suggestion or advice?


 From: aaron morton
 To: Cassandra User
 Sent: Tuesday, January 29, 2013 12:59:42 PM
 Subject: Re: Cassandra pending compaction tasks keeps increasing

 * Will try it tomorrow. Do I need to restart the server to change the log
 level?

 You can set it via JMX, and supposedly log4j is configured to watch the
 config file.


 Aaron Morton
 Freelance Cassandra Developer
 New Zealand


 On 29/01/2013, at 9:36 PM, Wei Zhu wrote:

 Thanks for the reply. Here is some information:

 Do you have wide rows ? Are you seeing logging about Compacting wide
 rows ?

 * I don't see any log about wide rows

 Are you seeing GC activity logged or seeing CPU steal on a VM ?

 * There is some GC, but CPU is generally under 20%. We have a heap size of
 8G, and RAM is 72G.

 Have you tried disabling multithreaded_compaction ?

 * By default, it's disabled. We enabled it, but didn't see much
 difference. It is even a little slower when enabled. Is it bad to enable it?
 We have SSDs; according to the comment in the yaml, it should help with SSDs.

 Are you using Key Caches ? Have you tried disabling

 * We have fairly big Key caches, we set as 10% of Heap which is 800M. Yes,
 compaction_preheat_key_cache is disabled.

 Can you enable DEBUG level logging and make the logs available ?

 * Will try it tomorrow. Do I need to restart the server to change the log
 level?



 From: aaron morton
 Sent: Monday, January 28, 2013 11:31:42 PM
 Subject: Re: Cassandra pending compaction tasks keeps increasing

 * Why does nodetool repair increase the data size that much? It's not likely
 that that much data needs to be repaired. Will that happen for all
 subsequent repairs?
 Repair only 

Not enough replicas???

2013-02-01 Thread Stephen.M.Thompson
I need to offer my profound thanks to this community which has been so helpful 
in trying to figure this system out.

I've set up a simple ring with two nodes and I'm trying to insert data into
them. The inserts fail 100% of the time with this error:

me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough 
replicas present to handle consistency level.

I'm not doing anything fancy - this is just from setting up the cluster 
following the basic instructions from DataStax for a simple single data center 
cluster.  My config is basically the default except for the changes they 
discuss (and except that I have configured my own IP addresses... my two boxes 
are .126 and .127)

cluster_name: 'MyDemoCluster'
num_tokens: 256
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
 - seeds:
endpoint_snitch: RackInferringSnitch

Nodetool shows both nodes active in the ring, status = up, state = normal.

For the CF:

    ColumnFamily: SystemEvent
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.1
      DC Local Read repair chance: 0.0
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: default
      Built indexes: [SystemEvent.IdxName]
      Column Metadata:
        Column Name: eventTimeStamp
          Validation Class: org.apache.cassandra.db.marshal.DateType
        Column Name: name
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
          Index Name: IdxName
          Index Type: KEYS
      Compaction Strategy: 
      Compression Options:

Any ideas?

Re: Not enough replicas???

2013-02-01 Thread Edward Capriolo
Please include the information on how your keyspace was created. This
may indicate you set the replication factor to 3, when you only have 1
node, or some similar condition.
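
To illustrate the kind of mismatch described above, here is a minimal sketch of how the number of replicas a request needs relates to the replication factor. It is a simplification under stated assumptions (single datacenter, only the basic consistency levels), and the function names are hypothetical rather than part of Hector or Cassandra:

```python
def replicas_required(consistency_level, replication_factor):
    """Number of live replicas a request needs at the given level."""
    level = consistency_level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %s" % consistency_level)


def is_available(consistency_level, replication_factor, live_replicas):
    """True if enough replicas are alive to satisfy the request."""
    return live_replicas >= replicas_required(consistency_level,
                                              replication_factor)


if __name__ == "__main__":
    # Replication factor 3 with a single live replica, as in the example above.
    print(is_available("QUORUM", 3, 1))
    print(is_available("ONE", 3, 1))
```

With a replication factor of 3 and one live replica, QUORUM needs two replicas and cannot be met, while ONE can.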


Re: Cassandra pending compaction tasks keeps increasing

2013-02-01 Thread Wei Zhu
That must be it.
Yes, it happens to be a seed. I should have tried rebuild. Instead I ran
repair, and now I am sitting here waiting for the compaction to finish...



Re: Cassandra behavior on single node

2013-02-01 Thread Edward Capriolo
You are likely hitting the point where compaction is running all the time
and consuming all of the weak cloud I/O. EBS is not recommended for
performance; you should use the ephemeral drives.

On Friday, February 1, 2013, Marcelo Elias Del Valle wrote:


  I am trying to figure out why the following behavior happened. Any
 help would be highly appreciated.
  This graph shows the server resource usage of my single
 Cassandra machine (running on Amazon EC2):
  I ran a Hadoop process that reads a CSV file and writes data to
 Cassandra. For about 1 h, the process ran fine, but took about 100% of
 CPU. After 1 h, my Hadoop process started to have its connection attempts
 refused by Cassandra, as shown below.
  Since then, Cassandra has been taking 100% of the machine's IO. IO has
 been at 100% on the machine running Cassandra for 2 h already.
  I am running Cassandra on Amazon EBS, which is slow, but I didn't
 think it would be that slow. Just wondering, is it normal for Cassandra to
 use a high amount of CPU? I am guessing all the writes were going to the
 memtables, and when it was time to flush, the server went down.
  Does that make sense? I am still learning Cassandra, as it's the first
 time I use it in production, so I am not sure if I am missing something
 really basic here.

 2013-02-01 16:44:43,741 ERROR com.s1mbi0se.dmp.input.service.InputService 
 (Thread-18): EXCEPTION:PoolTimeoutException: [host=(, 
 latency=5005(5005), attempts=1] Timed out waiting for connection 
 PoolTimeoutException: [, 
 latency=5005(5005), attempts=1] Timed out waiting for connection

 2013-02-01 16:44:43,743 ERROR com.s1mbi0se.dmp.input.service.InputService 
 (Thread-15): EXCEPTION:PoolTimeoutException:

 Best regards,

 Marcelo Elias Del Valle - @mvallebr

Re: CQL binary protocol

2013-02-01 Thread aaron morton
The spec for the protocol is here;a=blob_plain;f=doc/native_protocol.spec;hb=refs/heads/cassandra-1.2


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 1/02/2013, at 6:42 AM, Gabriel Ciuloaica wrote:

 You may take a look at the java-driver project. It has a connection pool 
 implementation.
 On 1/31/13 6:48 PM, Vivek Mishra wrote:
 Any connection pool API available for cassandra transport 

Re: rangeQuery to traverse keys backward?

2013-02-01 Thread aaron morton
There is no facility to do a get_range in reverse. 

Rows are ordered by their token, and using the Random or Murmur3 partitioner 
this means they are randomly ordered. So there is not much need to go 
backwards, or get 10 rows from either side of a particular row. 
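
To make the point about token ordering concrete, here is a small sketch. It assumes, as I understand RandomPartitioner, that a row's token is the absolute value of the MD5 digest of its key read as a signed big-endian integer; the keys are made up for illustration:

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """Approximate a RandomPartitioner token for a row key.

    Assumption: token = abs(MD5(key) interpreted as a signed big-endian
    integer). This is a simplified model, not Cassandra's own code.
    """
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


if __name__ == "__main__":
    # Hypothetical row keys in their natural (lexical) order.
    keys = [("user_%d" % i).encode() for i in range(10)]
    # The order in which a range scan would return them: by token.
    by_token = sorted(keys, key=random_partitioner_token)
    for key in by_token:
        print(key.decode(), random_partitioner_token(key))
```

Running this shows the keys coming back in an order unrelated to their names, so the rows "adjacent" to a given key in a range scan are not meaningful neighbours.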

Can you change your data model to not require precise range scans ? 


Aaron Morton
Freelance Cassandra Developer
New Zealand


On 1/02/2013, at 1:36 PM, Yuhan Zhang wrote:

 Hi all,
 I'm trying to use get_range to page through the rows by providing a 
 :start_key and a :finish_key.
 This works fine when I traverse forward with :start_key=last_key, 
 However, when I tried to traverse backward with :start_key=, 
 :finish_key=first_key, it always gave me the first few rows in the column 
 family (my goal is to get the rows adjacent to my first_key).
 It looks like :start_key always takes priority over :finish_key.
 For column ranges there is an option to reverse the order, but there is 
 no such option for traversing rows,
 so I'm wondering whether Cassandra is capable of doing this with the 
 current API.
 I tried both the Twitter Cassandra client and the Hector client, but 
 couldn't find a way to do it.
 Has anyone been able to do this?
 Thank you