CQL Clarification

2013-04-28 Thread Michael Theroux
Hello,

Just wondering if I can get a quick clarification on some simple CQL.  We 
utilize Thrift CQL queries to access our cassandra setup.  As clarified in a 
previous question I had, when using CQL and Thrift, timestamps on the cassandra 
column data are assigned by the server, not the client, unless AND TIMESTAMP 
is utilized in the query, for example:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

According to the Datastax documentation, this timestamp should be:

Values serialized with the timestamp type are encoded as 64-bit signed 
integers representing a number of milliseconds since the standard base time 
known as the epoch: January 1 1970 at 00:00:00 GMT.

However, my testing showed that updates didn't work when I used a timestamp of 
this format.  Looking at the Cassandra code, it appears that cassandra will 
assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not 
specified, which would be the number of microseconds since the standard base 
time.  In my test environment, setting the timestamp to be the current time in 
milliseconds * 1000 seems to work.  It seems that if you have an older 
installation without TIMESTAMP being specified in the CQL, or a mixed 
environment, the timestamp should be * 1000.
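
For example, a minimal sketch of such an update in CQL 2 (the column, value and 
row key here are made up; the point is that the TIMESTAMP value is the current 
time in milliseconds multiplied by 1000, i.e. microseconds):

-- 1367143200000 ms since the epoch * 1000 = 1367143200000000 microseconds
UPDATE users USING CONSISTENCY QUORUM AND TIMESTAMP 1367143200000000
  SET 'status' = '2' WHERE KEY = 'some-row-key';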

Just making sure I'm reading everything properly... improperly setting the 
timestamp could cause us some serious damage.

Thanks,
-Mike




Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
Hello,

We've done some additional monitoring, and I think we have more information.  
We've been collecting vmstat information every minute, attempting to catch a 
node with issues.

So it appears that the cassandra node runs fine.  Then suddenly, without any 
correlation to any event that I can identify, the I/O wait time goes way up, 
and stays up indefinitely.  Even non-cassandra I/O activities (such as 
snapshots and backups) start causing large I/O wait times when they typically 
would not.  Previous to an issue, we would typically see I/O wait times of 3-4% 
with very few blocked processes on I/O.  Once this issue manifests itself, I/O 
wait times for the same activities jump to 30-40% with many blocked processes.  
The I/O wait times do go back down when there is literally no activity.

-  Updating the node to the latest Amazon Linux patches and rebooting the 
instance doesn't correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O 
wait times return to normal.

One relatively recent change we've made is we upgraded to m1.xlarge instances, 
which have 4 ephemeral drives available.  We create a logical volume from the 4 
drives with the idea that we should be able to get increased I/O throughput.  
When we ran m1.large instances, we had the same setup, although it was only 
using 2 ephemeral drives.  We chose to use LVM vs. mdadm because we were having 
issues having mdadm create the RAID volume reliably on restart (and research 
showed that this was a common problem).  LVM just worked (and had worked for 
months before this upgrade).

For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had 
to replace DB nodes within a single availability zone within us-east.  Other 
availability zones, in the same region, have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice 
would be appreciated.

Thanks,
-Mike


On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this particular 
 issue has not appeared for a couple of days (knock on wood).  Will keep an 
 eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 



Re: Secondary Index on table with a lot of data crashes Cassandra

2013-04-28 Thread aaron morton
 What are we doing wrong? Can it be that Cassandra is actually trying to read 
 all the CF data rather than just the keys! (actually, it doesn't need to go 
 to the users CF at all - all the data it needs is in the index CF)
  
Data is not stored as a BTree; that's the RDBMS approach. We hit the in-memory 
bloom filter, then perhaps the -index.db and finally the -data.db. While in 
this edge case it may be possible to serve your query just from the -index.db, 
there is no optimisation in place for that. 

  
 Select user_name from users where status = 2; 
  
 Always crashes.
  
What is the error ? 

 2. understand if there is something in this use case which indicates that we 
 are not using Cassandra the way it is meant. 
Just like an RDBMS, these queries are fastest when you use the primary key, a 
bit slower when you use a secondary (non-primary) index, and slowest when you 
do not use an index at all. 
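
If the query against the low-cardinality status index has to stay, one way to 
keep it from pulling every matching row in a single pass (a sketch only, using 
the table from the original message) is to bound the result set explicitly and 
page through it from the application:

Select user_name from users where status = 2 limit 1000;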

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 25/04/2013, at 8:32 PM, moshe.kr...@barclays.com wrote:

 IMHO: user_name is not a column, it is the row key. Therefore, according 
 to http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ , the row does 
 not contain a relevant column index, which causes the iterator to read each 
 column (including value) of each row.
  
 I believe that instead of referring to user_name as if it were a column, you 
 need to refer to it via the reserved word “KEY”, e.g.:
  
 Select KEY from users where status = 2; 
  
 Always glad to share a theory with a friend….
  
  
 From: Tamar Rosen [mailto:ta...@correlor.com] 
 Sent: Thursday, April 25, 2013 11:04 AM
 To: user@cassandra.apache.org
 Subject: Secondary Index on table with a lot of data crashes Cassandra
  
 Hi,
  
 We have a case of a reproducible crash, probably due to out of memory, but I 
 don't understand why. 
  
 The installation is currently single node. 
  
 We have a column family with approx 5 rows. 
  
 In cql, the CF definition is:
  
  
 CREATE TABLE users (
   user_name text PRIMARY KEY,
   big_json text,
   status int
 );
  
 Each big_json can have 500K or more of data.
  
 There is also a secondary index on the status column. 
 Status can have various values, over 90% of all rows have status = 2. 
  
  
 Calling:
  
 Select user_name from users limit 8;
  
 Is pretty fast
  
  
  
 Calling:
  
 Select user_name from users where status = 1; 
 is slower, even though much less data is returned.
  
 Calling:
  
 Select user_name from users where status = 2; 
  
 Always crashes.
  
  
 What are we doing wrong? Can it be that Cassandra is actually trying to read 
 all the CF data rather than just the keys! (actually, it doesn't need to go 
 to the users CF at all - all the data it needs is in the index CF)
  
  
 Also, in the code I am doing the same using Astyanax index query with 
 pagination, and the behavior is the same. 
 
 
 Please help me:
  
 1. solve the immediate issue
  
 2. understand if there is something in this use case which indicates that we 
 are not using Cassandra the way it is meant. 
  
 
 
 Thanks,
  
 
 
 Tamar Rosen
  
 Correlor.com
  
 
 
  



Re: 1.2.3 and 1.2.4 memory usage growth on idle cluster

2013-04-28 Thread aaron morton
 INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; 
 max is 10630070272
It depends on the settings. It looks like you are using non-default JVM 
settings. 

I'd recommend restoring the default JVM settings as a start. 
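
If you are not sure what was changed, the heap is normally set in 
conf/cassandra-env.sh. Lines like the following (the values here are only an 
illustration, not a recommendation) override the calculated defaults, and 
commenting them out restores those defaults:

MAX_HEAP_SIZE="10G"
HEAP_NEWSIZE="2G"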

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 25/04/2013, at 9:30 PM, Igor i...@4friends.od.ua wrote:

 Hello
 
 Does  anybody seen memory problems on idle cluster?
 I have 8-node ring with cassandra 1.2.3 which never been used and stay idle 
 for several weeks. Yesterday when I decided to upgrade it to 1.2.4 I found 
 lot of messages like
 
 INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; 
 max is 10630070272
 INFO 11:10:56,273 Pool NameActive   Pending Blocked
 INFO 11:10:56,275 ReadStage 0 0 0
 INFO 11:10:56,276 RequestResponseStage  0 0 0
 INFO 11:10:56,276 ReadRepairStage   0 0 0
 INFO 11:10:56,277 MutationStage 0 0 0
 INFO 11:10:56,277 ReplicateOnWriteStage 0 0 0
 INFO 11:10:56,278 GossipStage   0 0 0
 INFO 11:10:56,278 AntiEntropyStage  0 0 0
 INFO 11:10:56,278 MigrationStage0 0 0
 INFO 11:10:56,279 MemtablePostFlusher   0 0 0
 INFO 11:10:56,279 FlushWriter   0 0 0
 INFO 11:10:56,280 MiscStage 0 0 0
 INFO 11:10:56,280 commitlog_archiver0 0 0
 INFO 11:10:56,280 InternalResponseStage 0 0 0
 INFO 11:10:56,281 HintedHandoff 0 0 0
 INFO 11:10:56,281 CompactionManager 0 0
 INFO 11:10:56,281 MessagingService  n/a   0,0
 INFO 11:10:56,281 Cache Type   Size     Capacity    KeysToSave   Provider
 INFO 11:10:56,281 KeyCache     736810   104857600   all
 INFO 11:10:56,281 RowCache     0        0           all          org.apache.cassandra.cache.SerializingCacheProvider
 INFO 11:10:56,281 ColumnFamilyMemtable ops,data
 INFO 11:10:56,281 system.local 4,52
 INFO 11:10:56,281 system.peers  30,6093
 INFO 11:10:56,282 system.batchlog   0,0
 INFO 11:10:56,282 system.NodeIdInfo 0,0
 INFO 11:10:56,282 system.LocationInfo   0,0
 INFO 11:10:56,282 system.Schema 0,0
 INFO 11:10:56,282 system.Migrations 0,0
 INFO 11:10:56,282 system.schema_keyspaces   0,0
 INFO 11:10:56,282 system.schema_columns 0,0
 INFO 11:10:56,282 system.schema_columnfamilies 0,0
 INFO 11:10:56,282 system.IndexInfo  0,0
 INFO 11:10:56,282 system.range_xfers0,0
 INFO 11:10:56,282 system.peer_events0,0
 INFO 11:10:56,283 system.hints  0,0
 INFO 11:10:56,283 system.HintsColumnFamily  0,0
 INFO 11:10:56,283 system_auth.users 0,0
 INFO 11:10:56,283 system_traces.sessions0,0
 INFO 11:10:56,283 system_traces.events  0,0
 INFO 11:11:21,205 GC for ParNew: 1035 ms for 1 collections, 6633037168 used; 
 max is 10630070272
 
 So you can see - there is no activity at all. And what I can see from the java 
 heap graph is that it constantly grows. I plan to use this ring in prod, but 
 this strange behaviour confuses me.
 



Re: CQL indexing

2013-04-28 Thread aaron morton
This discussion belongs on the user list; also, please only email one list at a 
time. 

The article discusses improvements in secondary indexes in 1.2 
http://www.datastax.com/dev/blog/improving-secondary-index-write-performance-in-1-2

If you have some more specific questions let us know. 
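
As a quick illustration (a sketch only; the table, column and index names are 
made up), a secondary index is created and then queried like this:

CREATE INDEX users_status_idx ON users (status);
SELECT * FROM users WHERE status = 2;

Under the hood Cassandra maintains a hidden, node-local index column family: 
its rows are keyed by the indexed value (the status here), and the columns in 
each row point back at the keys of the rows holding that value. The article 
above covers how maintaining that hidden CF changed in 1.2.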

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/04/2013, at 7:01 PM, Sri Ramya ramya.1...@gmail.com wrote:

 HI
 
 In CQL, to perform a query based on a column you have to create an index on
 that column. What exactly happens when we create an index on a column?
 What might the index column family contain?



Re: Many creation/inserts in parallel

2013-04-28 Thread aaron morton
 At first many CF are being created in parallel (about 1000 CF).
 
 
Can you explain this in a bit more detail? By "in parallel" do you mean multiple 
threads creating CFs at the same time?

I would also recommend taking a second look at your data model; you probably do 
not want to create so many CFs. 
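
For example (a sketch only, assuming the CFs differ only by a numeric 
identifier), a single table with that identifier folded into the partition key 
usually works better than 1000 separate CFs, because the per-CF memtable and 
metadata overhead stays constant no matter how many logical tables the 
application defines:

CREATE TABLE table_data (
  table_id int,
  row_key text,
  col_name text,
  value text,
  PRIMARY KEY ((table_id, row_key), col_name)
);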

  During tests we're receiving some exceptions from driver, e.g.:
 
 

The CF you are trying to read / write from does not exist. Check if the table 
exists using cqlsh / cassandra-cli. 

Check your code to make sure it was created. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/04/2013, at 10:49 PM, Sasha Yanushkevich yanus...@gmail.com wrote:

 Hi All
 
 We are testing Cassandra 1.2.3 (3 nodes with RF:2) with the FluentCassandra 
 driver. At first, many CFs are created in parallel (about 1000 CFs). After 
 creation is done, many insertions of small amounts of data into the DB follow. 
 During tests we're receiving some exceptions from the driver, e.g.:
 
 FluentCassandra.Operations.CassandraOperationException: unconfigured 
 columnfamily table_78_9
 and
 FluentCassandra.Operations.CassandraOperationException: Connection to 
 Cassandra has timed out
 
 Though in Cassandra's logs there are no exceptions.
 
 What should we do to handle these exceptions?
 
 -- 
 Best regards,
 Alexander



Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80-95% range.  I 
restarted cassandra once when a node was in this situation, and it took 45 
minutes to start (primarily reading SSTables).  Typically, a node would start 
in about 5 minutes.

Thanks,
-Mike
 
On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

 Hello,
 
 We've done some additional monitoring, and I think we have more information.  
 We've been collecting vmstat information every minute, attempting to catch a 
 node with issues.
 
 So, it appears, that the cassandra node runs fine.  Then suddenly, without 
 any correlation to any event that I can identify, the I/O wait time goes way 
 up, and stays up indefinitely.  Even non-cassandra  I/O activities (such as 
 snapshots and backups) start causing large I/O Wait times when they typically 
 would not.  Previous to an issue, we would typically see I/O wait times 3-4% 
 with very few blocked processes on I/O.  Once this issue manifests itself, 
 i/O wait times for the same activities jump to 30-40% with many blocked 
 processes.  The I/O wait times do go back down when there is literally no 
 activity.   
 
 -  Updating the node to the latest Amazon Linux patches and rebooting the 
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.  
 I/O wait times return to normal.
 
 One relatively recent change we've made is we upgraded to m1.xlarge instances 
 which has 4 ephemeral drives available.  We create a logical volume from the 
 4 drives with the idea that we should be able to get increased I/O 
 throughput.  When we ran m1.large instances, we had the same setup, although 
 it was only using 2 ephemeral drives.  We chose to use LVM vs. mdadm because 
 we were having issues having mdadm create the RAID volume reliably on restart 
 (and research showed that this was a common problem).  LVM just worked (and 
 had worked for months before this upgrade).
 
 For reference, this is the script we used to create the logical volume:
 
 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3
 
 Another tidbit... thus far (and this maybe only a coincidence), we've only 
 had to replace DB nodes within a single availability zone within us-east.  
 Other availability zones, in the same region, have yet to show an issue.
 
 It looks like I'm going to need to replace a third DB node today.  Any advice 
 would be appreciated.
 
 Thanks,
 -Mike
 
 
 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
 
 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this 
 particular issue has not appeared for a couple of days (knock on wood).  
 Will keep an eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at 
 about the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We 
 use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients 
 would see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 
 



Re: Deletes, null values

2013-04-28 Thread aaron morton
What's your table definition? 

 select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
 from myCF where key = 'all';

The output looks correct to me. CQL tables return values, including null, for 
all of the selected columns.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 12:48 AM, Sorin Manolache sor...@gmail.com wrote:

 On 2013-04-26 11:55, Alain RODRIGUEZ wrote:
 Of course:
 
 From CQL 2 (cqlsh -2):
 
 delete '183#16684','183#16714','183#16717' from myCF where key = 'all';
 
 And selecting this data as follow gives me the result above:
 
 select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
 from myCF where key = 'all';
 
 From thrift (phpCassa client):
 
 $pool = new
 ConnectionPool('myKeyspace',array('192.168.100.201'),6,0,3,3);
 $my_cf= new ColumnFamily($pool, 'myCF', true, true,
 ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
 $my_cf-remove('all', array('1228#16857','1228#16866','1228#16875'));
 
 
 I see. I'm sorry, I know nothing about phpCassa. I use batch_mutation with 
 deletions and it works. But I guess phpCassa must use the same thrift 
 primitives.
 
 Sorin
 
 
 
 
 2013/4/25 Sorin Manolache sor...@gmail.com
 
On 2013-04-25 11:48, Alain RODRIGUEZ wrote:
 
Hi, I tried to delete some columns using cql2 as well as thrift on
C*1.2.2 and instead of being unreachable, deleted columns have a
null value.
 
I am using no value in this CF, the only information I use is the
existence of the column. So when I select all the column for a
given key
I have the following returned:
 
    1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
   ------------+------------+------------+------------+------------
          null |       null |       null |            |
 
 
This is quite annoying since my app thinks that I have 5 columns
there
when I should have 2 only.
 
I first thought that this was a visible marker of tombstones but
they
didn't vanish after a major compaction.
 
How can I get rid of these null/ghost columns and why does it
happen ?
 
 
I do something similar but I don't see null values. Could you please
post the code where you delete the columns?
 
Sorin
 
 
 



Re: Is Cassandra oversized for this kind of use case?

2013-04-28 Thread aaron morton
Sounds like something C* would be good at. 

I would do some searching on time series data in cassandra, such as 
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra and 
definitely consider storing data at the smallest level of granularity. 

On the analytics side there is good news and not so good news. First, the good 
news: reads do not block writes, as they do in a traditional RDBMS (without 
MVCC) running with a transaction isolation level of Repeatable Read or higher. 

The not so good news: it's not as easy to support the wide range of analytical 
queries that you are used to with SQL using the standard Thrift/CQL API. If you 
need very flexible analysis, I recommend looking into Hive / Pig with Hadoop. 
DataStax Enterprise is a commercial product but free for development and a 
great way to learn without having to worry about the setup: 
http://www.datastax.com/

You may also be interested in http://www.pentaho.com/ or 
http://www.karmasphere.com/

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 5:26 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 I would at least start with 3 cheap nodes with RF=3 and start with CL=TWO on 
 writes and reads, most likely getting your feet wet.  Don't buy very expensive 
 computers like a lot do getting into the game for the first time… every time I 
 walk into a new gig, they seem to think they need to spend 6-10k per node.  I 
 think this kind of scenario sounds fine for cassandra.  When you say 
 virtualize, I believe you mean use VMs… many use Amazon VMs and there is stuff 
 to configure if you are on Amazon specifically for this.
 
 If you are on your own VMs, you do need to worry about whether two nodes end 
 up on the same hardware stealing resources from each other, or if hardware 
 fails as well.  I.e. the idea in noSQL is you typically have 3 copies of all 
 data, so if one node goes down you are still live with CL=TWO.
 
 Also, plan on doing ~300GB per node typically depending on how it works out 
 in testing.
 
 Later,
 Dean
 
 From: Marc Teufel teufel.m...@googlemail.com
 Reply-To: user@cassandra.apache.org
 Date: Friday, April 26, 2013 10:59 AM
 To: user@cassandra.apache.org
 Subject: Re: Is Cassandra oversized for this kind of use case?
 
 Okay, one billion rows of data is a lot; compared to that I am far, far away - 
 which means I can stay with Oracle? Maybe.
 But you're right when you say it's not only about big data but also about your 
 need.
  
 So storing the data is one part, doing analytical analysis is the second. I 
 do a lot of calculations and queries to generate management criteria about 
 how the production is going currently, and how the production went the last 
 week, month, years and so on. Saving in a 5 minute rhythm is only a 
 compromise to reduce the amount of data - maybe in the future the use case 
 will change and is about to store the status of each machine as soon as it 
 changes. This will of course increase the amount of data and the complexity 
 of my queries again. And sure, I show live data today... 5 minute old live 
 data... but if I tell the CEO that I am also able to work with real live 
 data, I am sure this is what he wants to get  ;-)
 
 Can you recommend using Cassandra for this kind of scenario, or is this 
 oversized?
  
 Does it make sense to start with 2 nodes?
  
 Can I virtualize these two nodes?
 
 
 Thx a lot for your assistance.
 
 Marc
 
 
 
 
 2013/4/26 Hiller, Dean dean.hil...@nrel.gov
 Well, it depends more on what you will do with the data.  I know I was on a 
 Sybase (RDBMS) with 1 billion rows but it was getting close to not being able 
 to handle more (constraints had to be turned off, all sorts of optimizations 
 done, expert consultants brought in, and everything).
 
 BUT there are other use cases that noSQL is great for (i.e. it is not just 
 great for big-data systems).  It is great for really high write throughput, 
 as you can add more nodes and handle more writes/second than an RDBMS very 
 easily, yet you may be doing so many deletes that the system constantly stays 
 at a small data set.
 
 You may want to analyze the data constantly or near real time involving huge 
 amounts of reads / second in which case noSQL can be better as well.
 
 Ie. Nosql is not just for big data.  I know with PlayOrm for cassandra, we 
 have handled many different use cases out there.
 
 Later,
 Dean
 
 From: Marc Teufel teufel.m...@googlemail.com
 Reply-To: 
 

question about internode_compression

2013-04-28 Thread John Sanda
When internode_compression is enabled, will the compression algorithm used
be the same as whatever I am using for sstable_compression?


- John


Re: cost estimate about some Cassandra patchs

2013-04-28 Thread aaron morton
 Does anyone know enough of the inner working of Cassandra to tell me how much 
 work is needed to patch Cassandra to enable such communication 
 vectorization/batch ?
  
Assuming you mean having the coordinator send multiple row read/write requests 
in a single message to replicas...

Pretty sure this has been raised as a ticket before but I cannot find one now. 

It would be a significant change and I'm not sure how big the benefit is. To 
send the messages the coordinator places them in a queue, so there is little 
delay in sending. Then it waits on them asynchronously. So there may be some 
saving on networking, but from the coordinator's point of view I think the 
impact is minimal. 

What is your use case?

Cheers


-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 4:04 AM, DE VITO Dominique dominique.dev...@thalesgroup.com 
wrote:

 Hi,
  
 We have created a new partitioner that groups some rows with **different** row 
 keys on the same replicas.
  
 But neither batch_mutate nor multiget_slice is able to take advantage of this 
 partitioner-defined placement to vectorize/batch communications between the 
 coordinator and the replicas.
  
 Does anyone know enough of the inner working of Cassandra to tell me how much 
 work is needed to patch Cassandra to enable such communication 
 vectorization/batch ?
  
 Thanks.
  
 Regards,
 Dominique
  
  



Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-28 Thread aaron morton
 We're going to try running a shuffle before adding a new node again... maybe 
 that will help
I don't think it will hurt, but I doubt it will help. 


 It seems when new nodes join, they are streamed *all* sstables in the 
 cluster.

 

How many nodes did you join, and what was num_tokens set to? 
Did you notice streaming from all nodes (in the logs), or are you saying this in 
response to the cluster load increasing? 

 The purple line machine, I just stopped the joining process because the main 
 cluster was dropping mutation messages at this point on a few nodes (and it 
 still had dozens of sstables to stream.)
Which were the new nodes ?
Can you show the output from nodetool status?


Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 9:35 AM, Bryan Talbot btal...@aeriagames.com wrote:

 I believe that nodetool rebuild is used to add a new datacenter, not just a 
 new host to an existing cluster.  Is that what you ran to add the node?
 
 -Bryan
 
 
 
 On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote:
 Small relief we're not the only ones that had this issue.
 
 We're going to try running a shuffle before adding a new node again... maybe 
 that will help
 
 - John
 
 
 On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
 fsob...@igcorp.com.br wrote:
 I am using the same version and observed something similar.
 
 I've added a new node, but the instructions from Datastax did not work for 
 me. Then I ran nodetool rebuild on the new node. After this command finished, 
 it contained twice the load of the other nodes. Even when I ran 
 nodetool cleanup on the older nodes, the situation was the same.
 
 The problem only seemed to disappear when nodetool repair was applied to 
 all nodes.
 
 Regards,
 Francisco Sobral.
 
 
 
 
 On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:
 
 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running 
 upgradesstables, I figured it would be safe to start adding nodes to the 
 cluster. Guess not?
 
 It seems when new nodes join, they are streamed *all* sstables in the 
 cluster.
 
 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png
 
 The gray line machine ran out of disk space and for some reason this cascaded 
 into errors in the cluster about 'no host id' when trying to store hints for 
 it (even though it hadn't joined yet).
 The purple line machine, I just stopped the joining process because the main 
 cluster was dropping mutation messages at this point on a few nodes (and it 
 still had dozens of sstables to stream.)
 
 I followed this: 
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes
 
 Is there something missing in that documentation?
 
 Thanks,
 
 John
 
 
 



cassandra-shuffle time to completion and required disk space

2013-04-28 Thread John Watson
The amount of time/space cassandra-shuffle requires when upgrading to using
vnodes should really be apparent in documentation (when some is made).

The only semi-noticeable remark about the exorbitant amount of time is a bullet
point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance

Shuffling will entail moving a lot of data around the cluster and so has
the potential to consume a lot of disk and network I/O, and to take a
considerable amount of time. For this to be an online operation, the
shuffle will need to operate on a lower priority basis to other streaming
operations, and should be expected to take days or weeks to complete.

We tried running shuffle on a QA version of our cluster and 2 things were
brought to light:
 - Even with no reads/writes it was going to take 20 days
 - Each machine needed enough free diskspace to potentially hold the entire
cluster's sstables on disk

Regards,

John


Re: Really odd issue (AWS related?)

2013-04-28 Thread Alex Major
Hi Mike,

We had issues with the ephemeral drives when we first got started, although
we never got to the bottom of it, so I can't help much with troubleshooting
unfortunately. Contrary to a lot of the comments on the mailing list, we've
actually had a lot more success with EBS drives (PIOPS!). I'd definitely
suggest trying to stripe 4 EBS drives (RAID 0) and using PIOPS.
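
For reference, a minimal sketch of that setup with mdadm (the device names are
an assumption and depend on how the volumes are attached):

mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.xfs /dev/md0
mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data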

You could be having a noisy neighbour problem; I don't believe that
m1.large or m1.xlarge instances get all of the actual hardware, and
virtualisation on EC2 still sucks at isolating resources.

We've also had more success with Ubuntu on EC2, not so much with our
Cassandra nodes but some of our other services didn't run as well on Amazon
Linux AMIs.

Alex



On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux mthero...@yahoo.comwrote:

 I forgot to mention,

 When things go really bad, I'm seeing I/O waits in the 80-95% range.  I
 restarted cassandra once when a node is in this situation, and it took 45
 minutes to start (primarily reading SSTables).  Typically, a node would
 start in about 5 minutes.

 Thanks,
 -Mike

 On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

 Hello,

 We've done some additional monitoring, and I think we have more
 information.  We've been collecting vmstat information every minute,
 attempting to catch a node with issues.

 So, it appears, that the cassandra node runs fine.  Then suddenly, without
 any correlation to any event that I can identify, the I/O wait time goes
 way up, and stays up indefinitely.  Even non-cassandra  I/O activities
 (such as snapshots and backups) start causing large I/O Wait times when
 they typically would not.  Previous to an issue, we would typically see I/O
 wait times 3-4% with very few blocked processes on I/O.  Once this issue
 manifests itself, i/O wait times for the same activities jump to 30-40%
 with many blocked processes.  The I/O wait times do go back down when there
 is literally no activity.

 -  Updating the node to the latest Amazon Linux patches and rebooting the
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.
  I/O wait times return to normal.

 One relatively recent change we've made is we upgraded to m1.xlarge
 instances which has 4 ephemeral drives available.  We create a logical
 volume from the 4 drives with the idea that we should be able to get
 increased I/O throughput.  When we ran m1.large instances, we had the same
 setup, although it was only using 2 ephemeral drives.  We chose to use LVM
 vs. mdadm because we were having issues having mdadm create the RAID volume
 reliably on restart (and research showed that this was a common problem).
  LVM just worked (and had worked for months before this upgrade).

 For reference, this is the script we used to create the logical volume:

 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3

 Another tidbit... thus far (and this maybe only a coincidence), we've only
 had to replace DB nodes within a single availability zone within us-east.
  Other availability zones, in the same region, have yet to show an issue.

 It looks like I'm going to need to replace a third DB node today.  Any
 advice would be appreciated.

 Thanks,
 -Mike


 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

 Thanks.

 We weren't monitoring this value when the issue occurred, and this
 particular issue has not appeared for a couple of days (knock on wood).
  Will keep an eye out though,

 -Mike

 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

 top command? st : time stolen from this vm by the hypervisor

 jason


 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.comwrote:

 Sorry, Not sure what CPU steal is :)

 I have AWS console with detailed monitoring enabled... things seem to
 track close to the minute, so I can see the CPU load go to 0... then jump
 at about the minute Cassandra reports the dropped messages,

 -Mike

 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.

 Are you tracking CPU steal ?

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 Another related question.  Once we see messages being dropped on one
 node, our cassandra client appears to see this, reporting errors.  We use
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see
 an error?  If only one node reports an error, shouldn't the consistency
 level prevent the client from seeing an issue?


 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to 

setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread John Watson
Running these 2 commands is a noop IO-wise:
  nodetool setcompactionthroughput 0
  nodetool setstreamtrhoughput 0

If trying to recover or rebuild nodes, it would be super helpful to get
more than ~120mbit/s of streaming throughput (per session or ~500mbit
total) and ~5% IO utilization in (8) 15k disk RAID10 (per cf).

Even enabling multithreaded_compaction gives marginal improvements (1
additional thread doesn't help all that much and was only measurable in CPU
usage).

I understand that these processes should take lower priority than servicing
reads and writes. However, in emergencies it would be a nice feature to
have a switch to recover a cluster ASAP.

Thanks,

John


Re: CQL Clarification

2013-04-28 Thread aaron morton
I think this is some confusion about the two different usages of timestamp. 

The timestamp stored with the column value (not a column of timestamp type) 
uses a microsecond scale; it's just a 64-bit int, and we do not use it as a 
time value. Each mutation in a single request will have a different timestamp, 
as per 
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/QueryState.java#L48
 

A column of type timestamp is internally stored as a DateType, which is 
milliseconds past the epoch: 
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/db/marshal/DateType.java
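
In other words, a rough sketch of the two scales (illustration only, not the 
actual Cassandra code):

// write timestamp attached to a column when the query supplies none: microseconds
long writeTimestamp = System.currentTimeMillis() * 1000;
// value stored in a CQL 'timestamp' column: milliseconds since the epoch
long timestampColumnValue = System.currentTimeMillis();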

Does that help ? 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 3:42 AM, Michael Theroux mthero...@yahoo.com wrote:

 Hello,
 
 Just wondering if I can get a quick clarification on some simple CQL.  We 
 utilize Thrift CQL Queries to access our cassandra setup.  As clarified in a 
 previous question I had, when using CQL and Thrift, timestamps on the 
 cassandra column data is assigned by the server, not the client, unless AND 
 TIMESTAMP is utilized in the query, for example:
 
 http://www.datastax.com/docs/1.0/references/cql/UPDATE
 
 According to the Datastax documentation, this timestamp should be:
 
 Values serialized with the timestamp type are encoded as 64-bit signed 
 integers representing a number of milliseconds since the standard base time 
 known as the epoch: January 1 1970 at 00:00:00 GMT.
 
 However, my testing showed that updates didn't work when I used a timestamp 
 of this format.  Looking at the Cassandra code, it appears that cassandra 
 will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp 
 is not specified, which would be the number of microseconds since the standard 
 base time.  In my test environment, setting the timestamp to be the current 
 time in milliseconds * 1000 seems to work.  It seems that if you have an older 
 installation without TIMESTAMP being specified in the CQL, or a mixed 
 environment, the timestamp should be * 1000.
 
 Just making sure I'm reading everything properly... improperly setting the 
 timestamp could cause us some serious damage.
 
 Thanks,
 -Mike
 
 



Re: setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread Edward Capriolo
Out of curiosity, why did you decide to set it to 0 rather than 9? Does
any documentation anywhere say that setting it to 0 disables the feature? I
have set streamthroughput higher and seen node join improvements. The
features do work; however, they are probably not your limiting factor.
Remember, for streaming you are setting megabytes per second, but network cards
are measured in megabits per second.
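
For example (the numbers are only illustrative), raising the caps well above
the defaults instead of disabling them with 0 would be:

nodetool setstreamthroughput 400
nodetool setcompactionthroughput 999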


On Sun, Apr 28, 2013 at 5:28 PM, John Watson j...@disqus.com wrote:

 Running these 2 commands are noop IO wise:
   nodetool setcompactionthroughput 0
   nodetool setstreamtrhoughput 0

 If trying to recover or rebuild nodes, it would be super helpful to get
 more than ~120mbit/s of streaming throughput (per session or ~500mbit
 total) and ~5% IO utilization in (8) 15k disk RAID10 (per cf).

 Even enabling multithreaded_compaction gives marginal improvements (1
 additional thread doesn't help all that much and was only measurable in CPU
 usage).

 I understand that these processes should take lower priority to servicing
 reads and writes. However, in emergencies it would be a nice feature to
 have a switch to recover a cluster ASAP.

 Thanks,

 John



Re: question about internode_compression

2013-04-28 Thread aaron morton
It uses Snappy Compression with the default block size. 

There may be a case for allowing configuration, for example so the 
LZ4Compressor can be used. Feel free to raise a ticket at 
https://issues.apache.org/jira/browse/CASSANDRA
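
For reference, the switch itself lives in cassandra.yaml and only controls when 
inter-node traffic is compressed, not which algorithm is used (a sketch of the 
relevant fragment):

# compress traffic between all nodes, only between data centers, or never
internode_compression: all   # all | dc | none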

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 8:39 AM, John Sanda john.sa...@gmail.com wrote:

 When internode_compression is enabled, will the compression algorithm used be 
 the same as whatever I am using for sstable_compression?
 
 
 - John



Re: setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread John Watson
The help command says 0 to disable:
  setcompactionthroughput value_in_mb - Set the MB/s throughput cap for
compaction in the system, or 0 to disable throttling.
  setstreamthroughput  value_in_mb - Set the MB/s throughput cap for
streaming in the system, or 0 to disable throttling.

I also set both to 1000 and it also had no effect (just in case the
documentation was incorrect.)



On Sun, Apr 28, 2013 at 2:43 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Out of curiosity. Why did you decide to set it to 0 rather then 9.
 Does any documentation anywhere say that setting to 0 disables the feature?
 I have set streamthroughput higher and seen node join improvements. The
 features do work however they are probably not your limiting factor.
 Remember for stream you are setting Mega Bytes per second but network cards
 are measured in Mega Bits per second.


 On Sun, Apr 28, 2013 at 5:28 PM, John Watson j...@disqus.com wrote:

 Running these 2 commands are noop IO wise:
   nodetool setcompactionthroughput 0
   nodetool setstreamtrhoughput 0

 If trying to recover or rebuild nodes, it would be super helpful to get
 more than ~120mbit/s of streaming throughput (per session or ~500mbit
 total) and ~5% IO utilization in (8) 15k disk RAID10 (per cf).

 Even enabling multithreaded_compaction gives marginal improvements (1
 additional thread doesn't help all that much and was only measurable in CPU
 usage).

 I understand that these processes should take lower priority to servicing
 reads and writes. However, in emergencies it would be a nice feature to
 have a switch to recover a cluster ASAP.

 Thanks,

 John





Re: cassandra-shuffle time to completion and required disk space

2013-04-28 Thread aaron morton
Can you provide some info on the number of nodes, node load, cluster load etc.?

AFAIK shuffle was not an easy thing to test and does not get much real-world 
use, as only some people will run it and they (normally) use it once.

Any info you can provide may help improve the process. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 9:21 AM, John Watson j...@disqus.com wrote:

 The amount of time/space cassandra-shuffle requires when upgrading to using 
 vnodes should really be apparent in documentation (when some is made).
 
 Only semi-noticeable remark about the exorbitant amount of time is a bullet 
 point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance
 
 Shuffling will entail moving a lot of data around the cluster and so has the 
 potential to consume a lot of disk and network I/O, and to take a 
 considerable amount of time. For this to be an online operation, the shuffle 
 will need to operate on a lower priority basis to other streaming operations, 
 and should be expected to take days or weeks to complete.
 
 We tried running shuffle on a QA version of our cluster and 2 things were 
 brought to light:
  - Even with no reads/writes it was going to take 20 days
  - Each machine needed enough free diskspace to potentially hold the entire 
 cluster's sstables on disk
 
 Regards,
 
 John



Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-28 Thread John Watson
On Sun, Apr 28, 2013 at 2:19 PM, aaron morton aa...@thelastpickle.comwrote:

  We're going to try running a shuffle before adding a new node again...
 maybe that will help

 I don't think it will hurt, but I doubt it will help.


We had to bail on shuffle since we need to add capacity ASAP and not in 20
days.



It seems when new nodes join, they are streamed *all* sstables in the
 cluster.



 How many nodes did you join, what was the num_tokens ?
 Did you notice streaming from all nodes (in the logs) or are you saying
 this in response to the cluster load increasing ?


We were only adding 2 nodes at the time (planning to add a total of 12),
starting with a cluster of 12, but now 11 since 1 node entered some weird
state when one of the new nodes ran out of disk space.
num_tokens is set to 256 on all nodes.
Yes, nearly all current nodes were streaming to the new ones (which was
great until disk space became an issue).

 The purple line machine, I just stopped the joining process because
 the main cluster was dropping mutation messages at this point on a few
 nodes (and it still had dozens of sstables to stream.)

 Which were the new nodes ?
 Can you show the output from nodetool status?


The new nodes are the purple and gray lines above all the others.

nodetool status doesn't show joining nodes. I think I saw a bug already
filed for this but I can't seem to find it.



 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 27/04/2013, at 9:35 AM, Bryan Talbot btal...@aeriagames.com wrote:

 I believe that nodetool rebuild is used to add a new datacenter, not
 just a new host to an existing cluster.  Is that what you ran to add the
 node?

 -Bryan



 On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote:

 Small relief we're not the only ones that had this issue.

 We're going to try running a shuffle before adding a new node again...
 maybe that will help

 - John


 On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
 fsob...@igcorp.com.br wrote:

 I am using the same version and observed something similar.

 I've added a new node, but the instructions from Datastax did not work
 for me. Then I ran nodetool rebuild on the new node. After finished this
 command, it contained two times the load of the other nodes. Even when I
 ran nodetool cleanup on the older nodes, the situation was the same.

 The problem only seemed to disappear when nodetool repair was applied
 to all nodes.

 Regards,
 Francisco Sobral.




 On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:

 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and
 running upgradesstables, I figured it would be safe to start adding nodes
 to the cluster. Guess not?

 It seems when new nodes join, they are streamed *all* sstables in the
 cluster.


 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png

 The gray the line machine ran out disk space and for some reason
 cascaded into errors in the cluster about 'no host id' when trying to store
 hints for it (even though it hadn't joined yet).
 The purple line machine, I just stopped the joining process because the
 main cluster was dropping mutation messages at this point on a few nodes
 (and it still had dozens of sstables to stream.)

 I followed this:
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes

 Is there something missing in that documentation?

 Thanks,

 John








Re: CQL Clarification

2013-04-28 Thread Michael Theroux
Yes, that does help,

So, in the link I provided:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

It states:

You can specify these options:

Consistency level
Time-to-live (TTL)
Timestamp for the written columns.

Where timestamp is a link to Working with dates and times and mentions the 
64-bit millisecond value.  Is that incorrect?

-Mike

On Apr 28, 2013, at 11:42 AM, Michael Theroux wrote:

 Hello,
 
 Just wondering if I can get a quick clarification on some simple CQL.  We 
 utilize Thrift CQL Queries to access our cassandra setup.  As clarified in a 
 previous question I had, when using CQL and Thrift, timestamps on the 
 cassandra column data is assigned by the server, not the client, unless AND 
 TIMESTAMP is utilized in the query, for example:
 
 http://www.datastax.com/docs/1.0/references/cql/UPDATE
 
 According to the Datastax documentation, this timestamp should be:
 
 Values serialized with the timestamp type are encoded as 64-bit signed 
 integers representing a number of milliseconds since the standard base time 
 known as the epoch: January 1 1970 at 00:00:00 GMT.
 
 However, my testing showed that updates didn't work when I used a timestamp 
 of this format.  Looking at the Cassandra code, it appears that cassandra 
 will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp 
 is not specified, which would be the number of microseconds since the standard 
 base time.  In my test environment, setting the timestamp to be the current 
 time in milliseconds * 1000 seems to work.  It seems that if you have an older 
 installation without TIMESTAMP being specified in the CQL, or a mixed 
 environment, the timestamp should be * 1000.
 
 Just making sure I'm reading everything properly... improperly setting the 
 timestamp could cause us some serious damage.
 
 Thanks,
 -Mike
 
 



Re: cassandra-shuffle time to completion and required disk space

2013-04-28 Thread John Watson
11 nodes
1 keyspace
256 vnodes per node
upgraded 1.1.9 to 1.2.3 a week ago

These are taken just before starting shuffle (ran repair/cleanup the day
before).
During the shuffle we disabled all reads/writes to the cluster.

nodetool status keyspace:

Load   Tokens  Owns (effective)  Host ID
80.95 GB   256 16.7% 754f9f4c-4ba7-4495-97e7-1f5b6755cb27
87.15 GB   256 16.7% 93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256 16.7% ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.6 GB   253 100.0%    339c474f-cf19-4ada-9a47-8b10912d5eb3
77.64 GB   256 16.7% e59a02b3-8b91-4abd-990e-b3cb2a494950
194.31 GB  256 25.0% 6d726cbf-147d-426e-a735-e14928c95e45
221.94 GB  256 33.3% 83ca527c-60c5-4ea0-89a8-de53b92b99c8
87.61 GB   256 16.7% c3ea4026-551b-4a14-a346-480e8c1fe283
101.02 GB  256 16.7% df7ba879-74ad-400b-b371-91b45dcbed37
172.44 GB  256 25.0% 78192d73-be0b-4d49-a129-9bec0770efed
108.5 GB   256 16.7% 9889280a-1433-439e-bb84-6b7e7f44d761

nodetool status:

Load   Tokens  Owns   Host ID
142.6 GB   253 97.5%  339c474f-cf19-4ada-9a47-8b10912d5eb3
172.44 GB  256 0.1%   78192d73-be0b-4d49-a129-9bec0770efed
221.94 GB  256 0.4%   83ca527c-60c5-4ea0-89a8-de53b92b99c8
194.31 GB  256 0.1%   6d726cbf-147d-426e-a735-e14928c95e45
77.64 GB   256 0.3%   e59a02b3-8b91-4abd-990e-b3cb2a494950
87.15 GB   256 0.4%   93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256 0.1%   ff821e8e-b2ca-48a9-ac3f-8234b16329ce
87.61 GB   256 0.3%   c3ea4026-551b-4a14-a346-480e8c1fe283
80.95 GB   256 0.4%   754f9f4c-4ba7-4495-97e7-1f5b6755cb27
108.5 GB   256 0.1%   9889280a-1433-439e-bb84-6b7e7f44d761
101.02 GB  256 0.3%   df7ba879-74ad-400b-b371-91b45dcbed37

Here's an image of the actual disk usage during shuffle:

https://dl.dropbox.com/s/bx57j1z5c2spqo0/shuffle%20disk%20space.png

A little after 00:00 I disabled/cleared the xfers and restarted the cluster
(those drops around 00:15 are the restarts) before starting to run cleanup. The
disks are only 540G and whenever cassandra runs out of disk space, bad things
seem to happen. I was just barely able to run cleanup without running out of
space after the failed shuffle.

After the restart:

Load   Tokens  Owns (effective)  Host ID
131.73 GB  256 16.7% 754f9f4c-4ba7-4495-97e7-1f5b6755cb27
418.88 GB  255 16.7% 93f4400a-09d9-4ca0-b6a6-9bcca2427450
171.19 GB  255 8.5%  ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.61 GB  253 100.0%    339c474f-cf19-4ada-9a47-8b10912d5eb3
178.83 GB  257 24.9% e59a02b3-8b91-4abd-990e-b3cb2a494950
442.32 GB  257 25.0% 6d726cbf-147d-426e-a735-e14928c95e45
185.28 GB  257 16.7% c3ea4026-551b-4a14-a346-480e8c1fe283
274.47 GB  255 33.3% 83ca527c-60c5-4ea0-89a8-de53b92b99c8
210.73 GB  256 16.7% df7ba879-74ad-400b-b371-91b45dcbed37
274.49 GB  256 25.0% 78192d73-be0b-4d49-a129-9bec0770efed
106.47 GB  256 16.7% 9889280a-1433-439e-bb84-6b7e7f44d761

It's currently still running cleanup, so taking the output from status will
be a little inaccurate.

I have everything instrumented with Metrics being pushed into Graphite. So if
there are graphs/data from there that may help, please let me know.

Thanks,

John


On Sun, Apr 28, 2013 at 2:52 PM, aaron morton aa...@thelastpickle.comwrote:

 Can you provide some info on the number of nodes, node load, cluster load
 etc ?

 AFAIK shuffle was not an easy thing to test and does not get much real
 world use as only some people will run it and they (normally) use it once.

 Any info you can provide may help improve the process.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 29/04/2013, at 9:21 AM, John Watson j...@disqus.com wrote:

 The amount of time/space cassandra-shuffle requires when upgrading to
 using vnodes should really be apparent in documentation (when some is made).

 Only semi-noticeable remark about the exorbitant amount of time is a
 bullet point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance

 Shuffling will entail moving a lot of data around the cluster and so has
 the potential to consume a lot of disk and network I/O, and to take a
 considerable amount of time. For this to be an online operation, the
 shuffle will need to operate on a lower priority basis to other streaming
 operations, and should be expected to take days or weeks to complete.

 We tried running shuffle on a QA version of our cluster and 2 things were
 brought to light:
  - Even with no reads/writes it was going to take 20 days
  - Each machine needed enough free diskspace to potentially hold the
 entire cluster's sstables on disk

 Regards,

 John