Re: Getting NullPointerException while executing query

2013-04-11 Thread Kuldeep Mishra
I am using cassandra 1.2.0,


Thanks
Kuldeep


On Wed, Apr 10, 2013 at 10:40 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On which version of Cassandra are you? I can't reproduce the
 NullPointerException on Cassandra 1.2.3.

 That being said, that query is not valid, so you will get an error
 message. There are two reasons why it's not valid:
   1) in token(deep), deep is not a valid term, so you should have
 something like: token('deep').
   2) the name column is not the partition key, so the token method cannot
 be applied to it.

 A valid query with that schema would be for instance:
   select * from CQLUSER where token(id) > token(4)
 though I don't know if that helps in any way for what you aimed to do.
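 For what it's worth, a minimal sketch of how token() is typically used, namely
 paging over the whole table. It assumes the DataStax java-driver (used
 elsewhere in this digest), a node reachable at 127.0.0.1 and a hypothetical
 keyspace named demo holding CQLUSER:

 import com.datastax.driver.core.Cluster;
 import com.datastax.driver.core.ResultSet;
 import com.datastax.driver.core.Row;
 import com.datastax.driver.core.Session;

 public class TokenScan {
     public static void main(String[] args) throws Exception {
         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("demo");

         Integer lastId = null;
         while (true) {
             // First page has no WHERE clause; later pages restart after the
             // token of the last id seen, whatever the partitioner is.
             String cql = (lastId == null)
                 ? "SELECT id, name, age FROM CQLUSER LIMIT 100"
                 : "SELECT id, name, age FROM CQLUSER WHERE token(id) > token(" + lastId + ") LIMIT 100";
             ResultSet rs = session.execute(cql);
             int seen = 0;
             for (Row row : rs) {
                 lastId = row.getInt("id");   // remember where this page ended
                 seen++;
             }
             if (seen < 100) {
                 break;                       // a short page means the scan is done
             }
         }
         cluster.shutdown();
     }
 }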

 --
 Sylvain


 On Wed, Apr 10, 2013 at 9:42 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:

 Hi ,
  TABLE -
 CREATE TABLE CQLUSER (
   id int PRIMARY KEY,
   age int,
   name text
 )
 Query -
   select * from CQLUSER where token(name) > token(deep);

 ERROR -
 Bad Request: Failed parsing statement: [select * from CQLUSER where
 token(name) > token(deep);] reason: NullPointerException null
 text could not be lexed at line 1, char 15

 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Re: Added column does not sort as the last column at

2013-04-11 Thread aaron morton
To reduce possibilities, have you changed a super CF to a standard CF recently?

Can you isolate this to a specific CF?

Have you changed the comparators / schema recently?

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 1:26 AM, Sam Hodgson hodgson_...@hotmail.com wrote:

 Hi All,
 
 
 
 I just upgraded from Cassandra 1.1.7 to 1.2.3 and I'm now seeing a lot of the 
 following error in my output.log; I can't find much on the web about it:
 
 
 
 ERROR 11:56:01,317 Exception in thread Thread[ReadStage:7236,5,main] 
 java.lang.AssertionError: Added column does not sort as the last column at 
 org.apache.cassandra.db.ArrayBackedSortedColumns.addColumn(ArrayBackedSortedColumns.java:131)
  at 
 org.apache.cassandra.db.AbstractColumnContainer.addColumn(AbstractColumnContainer.java:109)
  at 
 org.apache.cassandra.db.AbstractColumnContainer.addColumn(AbstractColumnContainer.java:104)
  at 
 org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:171)
  at 
 org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
  at 
 org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
  at 
 org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
  at 
 org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
  at 
 org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
  at 
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
  at 
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
  at org.apache.cassandra.db.Table.getRow(Table.java:348) at 
 org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
  at 
 org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
  at 
 org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 
 
 
 I don't know Java and I'm struggling to identify exactly which query is causing 
 this. Is there a way to trace it back to a specific query?
 
 
 
 Any help appreciated.
 
 
 
 Sam
 



Re: Column index vs Row index vs Denormalizing

2013-04-11 Thread aaron morton
 Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
 the streamTweets columnFamily. 
If you normally want to get the most recent items, use a reverse comparator on 
the column name; 
see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
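
As a sketch in CQL3 terms (the table and column names are made up for the
example, and it is run here through the java-driver, though cqlsh works the
same), a reversed clustering order gives you the newest tweets first, so the
latest 1000 of a day bucket becomes a plain LIMIT query:

import com.datastax.driver.core.Session;

public class ReversedStreamSketch {
    // Sketch only: hypothetical stream_tweets table with reversed clustering order.
    static void createAndQuery(Session session) {
        session.execute(
            "CREATE TABLE stream_tweets (" +
            "  stream_id text, day text, tweet_time timeuuid, tweet_json text," +
            "  PRIMARY KEY ((stream_id, day), tweet_time)" +
            ") WITH CLUSTERING ORDER BY (tweet_time DESC)");

        // Newest-first by default thanks to the reversed clustering order.
        session.execute(
            "SELECT tweet_time, tweet_json FROM stream_tweets " +
            "WHERE stream_id = '123' AND day = '2013-04-02' LIMIT 1000");
    }
}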

 Getting the latest tweets for a given hashtag would mean you have to get the 
 TimeUUIDs from the streamHashTagTweets first, and then do a second get call 
 on the streamTweets with the former TimeUUIDs as the list of columns we like 
 to retrieve (column index).
Your choices here depend on what sort of queries are the most frequent and how 
much disk space you have. 

Your current model makes sense if the stream by day is the most frequent query, 
and you want to conserve disk space. If disk space is not an issue you can 
denormalise further and store the tweet JSON. 

If you have potentially many streamHashTagTweets rows where a single tweet is 
replicated it may make sense to stick with the current design to reduce disk 
use. 

 (we want to get up to 1000 tweets). 
If you want to get 1000 of anything from Cassandra, please break the multiget up 
into multiple calls. Each row request becomes a task in the thread pools on RF 
nodes. If you have a smallish cluster, one client asking for 1000 rows will 
temporarily block other clients and hurt request throughput. 
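
As a rough illustration (plain Java, independent of the client library; 
fetchRows is just a stand-in for whatever multiget call your client exposes), 
splitting a big key list into smaller batches could look like:

import java.util.ArrayList;
import java.util.List;

public final class KeyBatcher {
    // Sketch only: split a large list of row keys into batches of batchSize
    // so no single multiget turns into hundreds of read tasks on the replicas.
    static <K> List<List<K>> partition(List<K> keys, int batchSize) {
        List<List<K>> batches = new ArrayList<List<K>>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(keys.subList(i, Math.min(i + batchSize, keys.size())));
        }
        return batches;
    }
}

// Usage (hypothetical): for (List<String> batch : KeyBatcher.partition(allKeys, 50)) { fetchRows(batch); }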

  Referencing key values requires another columnFamily for tweets (key: 
 tweetId, columns: 1 column with data).
This will be a more efficient (aka faster) read than reading from a wide 
row. 

 Next to that we will request tweets by these secondary indexes quite 
 infrequently, while the tweets by timestamp will be requested heavily.
If the hot path is the streamTweets calls, denormalise into that, normalise 
the tweet storage into its own CF, and reference the tweets from the 
streamHashTagTweets. Having a canonical store of the events / tweets / entities 
addressable by their business key can give you more flexibility. 

 Given we are estimating to store many TBs of tweets, we would prefer setting 
 up machines with spinning disks (2TB per node) to save costs.
If you have spinning disks and 1G networking the rule of thumb is 300GB to 
500GB per node. See previous discussions about size per node.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:00 AM, Coen Stevens beatle...@gmail.com wrote:

 Hi all, 
 
 We are working on a data model for storing tweets for multiple streams (where 
 a stream is defined by a number of keyword filters on the full twitter 
 firehose), and retrieving the tweets by timestamp and hashtag. My question is 
 whether the following data model would a good way for doing that, where I'm 
 creating a column name index for the hashtags.
 
 ColumnFamily: streamTweets
  key: streamID + dayTimestamp (creating daily buckets for each stream)
  columns = name: TimeUUID, value: tweet json (storing all the tweets for 
 this stream in a wide row with a TimeUUID)
 
 ColumnFamily: streamHashTagTweets
  key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
  columns = name: TimeUUID (referencing the TimeUUID value in the 
 streamTweets ColumnFamily), value: tweetID
 
 Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
 the streamTweets columnFamily. Getting the latest tweets for a given hashtag 
 would mean you have to get the TimeUUIDs from the streamHashTagTweets first, 
 and then do a second get call on the streamTweets with the former TimeUUIDs 
 as the list of columns we like to retrieve (column index).
 
 Is referencing column names (TimeUUIDs) a smart thing to do when we have wide 
 rows spanning millions of columns? It seems easier (one reference call) to do 
 this than it is to reference key values and run a multi-get to get all 
 the rows (we want to get up to 1000 tweets). Referencing key values requires 
 another columnFamily for tweets (key: tweetId, columns: 1 column with data).
 
 Of course we could instead denormalize the data and store the tweet also in 
 the streamHashTagTweet columns, but we want to do the same thing for other 
 indexes as well (topics, twitter usernames, links, etc), so it quickly adds 
 up in required storage space. Next to that we will request tweets by these 
 secondary indexes quite infrequently, while the tweets by timestamp will be 
 requested heavily.
 
 Given we are estimating to store many TBs of tweets, we would prefer setting 
 up machines with spinning disks (2TB per node) to save costs.
 
 We would love to hear your feedback.
 
 Cheers,
 Coen



Re: data modeling from batch_mutate point of view

2013-04-11 Thread aaron morton
 b) the batch_mutate advantages are better, for the communication 
 client => coordinator node __and__ for the communications coordinator 
 node => replicas.
Yes. A single row mutation can write to many CFs. 

 Is there any experience out there about such data modeling (option_a vs 
 option_b) from the batch_mutate perspective ?
 Thanks.
I would not worry about the internal network lag as much as creating hot rows 
in the model. Sometimes it makes sense for an entity to map to rows in several 
CF's that use the same key, e.g. user info or a blog post. However it is 
normally bad when many entities require storing data on the same row, e.g. all 
blog posts have to update one row. 

From my understanding of what you are doing I would look to spread out the 
index entries to use different row keys. If the indexes are small you may get 
away with using the same key, but I would start with spreading it out. 

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:27 AM, DE VITO Dominique dominique.dev...@thalesgroup.com 
wrote:

 Thanks Aaron.
  
 It helped.
  
 Let me rephrase my questions a little bit. It's about the impact of data 
 modeling on the batch_mutate advantages.
  
 I have one CF for storing data, and ~10 (all different) CF used for indexing 
 that data.
  
 when adding a piece of data, I need to add indexes too, and then I need to 
 add columns to one row for each of the 10 indexing CFs => 2 main designs are 
 possible for adding these new indexes.
  
 a) all the updated 10 rows of indexing CF have different rowkeys
 b) all the updated 10 rows of indexing CF have all the same rowkey
  
 AFAIK, this has effect on batch_mutate:
  
 a) the batch_mutate advantages stop at the coordinator node. The advantage 
 appears for the communication client => coordinator node
 b) the batch_mutate advantages are better, for the communication 
 client => coordinator node __and__ for the communications coordinator 
 node => replicas.
  
 So, to sum up:
  
 a) CFs with few data repeats (good), but the coordinator node needs to 
 communicate with different replicas according to the different rowkeys
 b) CFs with more denormalization, repeating some data again and again over 
 composite columns, but batch_mutate performs better (good) all the way to the 
 replicas, and not only up to the coordinator node.
  
 Each option has one pro and one con.
  
 Is there any experience out there about such data modeling (option_a vs 
 option_b) from the batch_mutate perspective ?
 Thanks.
  
 Dominique
  
  
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Tuesday, 9 April 2013 10:12
 To: user@cassandra.apache.org
 Subject: Re: data modeling from batch_mutate point of view
  
 So, one alternative design for indexing CF could be:
 rowkey = folder_id
 colname = (indexed value, timestamp, file_id)
 colvalue = 
  
 If you always search in a folder what about 
 rowkey = folder_id : property_name : property_value
 colname = file_id
  
 (That's closer to secondary indexes in cassandra with the addition of the 
 folder_id)
  
 According to pro vs con, is the alternative design more or less interesting ?
 IMHO it's normally better to spread the rows and consider how they grow over 
 time. 
 You can send updates for multiple rows in the same batch mutation. 
  
 Hope that helps. 
  
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
  
 @aaronmorton
 http://www.thelastpickle.com
  
 On 9/04/2013, at 3:57 AM, DE VITO Dominique 
 dominique.dev...@thalesgroup.com wrote:
 
 
 Hi,
  
 I have a use case that sounds like storing data associated with files. So, I 
 store them with the CF:
 rowkey = (folder_id, file_id)
 colname = property name (about the file corresponding to file_id)
 colvalue = property value
  
 And I have CF for manual indexing:
 rowkey = (folder_id, indexed value)
 colname = (timestamp, file_id)
 colvalue = 
  
 like
 rowkey = (folder_id, note_of_5) or (folder_id, some_status)
 colname = (some_date, some_filename)
 colvalue = 
  
 I have many CF for indexing, as I index according to different (file) 
 properties.
  
 So, one alternative design for indexing CF could be:
 rowkey = folder_id
 colname = (indexed value, timestamp, file_id)
 colvalue = 
  
 Alternative design :
 * pro: same rowkey for all indexing CFs => **all** indexing CFs could be 
 updated through one batch_mutate
 * con: repeating the indexed value (first colname part) again and again (= a 
 string of up to 20 chars)
  
 According to pro vs con, is the alternative design more or less interesting ?
  
 Thanks.
  
 Dominique
  
  



Re: other questions about // RE: batch_mutate

2013-04-11 Thread aaron morton
 Is it true the coordinator node treats them as __independent__ 
 communications/requests to replicas (even if in that case, the replicas are 
 the same for every request) ?
A row mutation is a request to store columns in one or more CF's using one row 
key. It is treated as indivisible by the commit log and a single message is 
sent to each replica that contains all the CF updates. When the mutation is 
applied it is applied to each CF in turn, and the row level isolation does not 
apply across CF's. 

I assume your "them" means the many CFs the row mutation updates; if so, the 
answer is no. 
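
For illustration, a rough Thrift-level sketch (the CF names idx_a and idx_b are 
made up, and client is assumed to be an already connected, keyspace-set 
Cassandra.Client) of a batch_mutate that updates two CFs under one row key:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;

public class BatchMutateSketch {
    // Sketch only: one row key, two column families, one batch_mutate call.
    static void writeIndexes(Cassandra.Client client) throws Exception {
        ByteBuffer rowKey = ByteBufferUtil.bytes("folder-42");
        long now = System.currentTimeMillis() * 1000; // microsecond timestamps

        Map<String, List<Mutation>> byCf = new HashMap<String, List<Mutation>>();
        byCf.put("idx_a", singleColumn("note_of_5", "file-1", now));
        byCf.put("idx_b", singleColumn("some_status", "file-1", now));

        // Outer key: the row key. Inner key: the column family name.
        Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap =
                new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
        mutationMap.put(rowKey, byCf);

        client.batch_mutate(mutationMap, ConsistencyLevel.QUORUM);
    }

    private static List<Mutation> singleColumn(String name, String value, long ts) {
        Column col = new Column();
        col.setName(ByteBufferUtil.bytes(name));
        col.setValue(ByteBufferUtil.bytes(value));
        col.setTimestamp(ts);
        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(col);
        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(cosc);
        List<Mutation> mutations = new ArrayList<Mutation>();
        mutations.add(m);
        return mutations;
    }
}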

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 3:26 AM, DE VITO Dominique dominique.dev...@thalesgroup.com 
wrote:

 When the coordinator node receives a batch_mutate for __one__ row key 
 associated with different mutations for different CF :
  
 Is it true the coordinator node treats them as __independent__ 
 communications/requests to replicas (even if in that case, the replicas are 
 the same for every request) ?
  
 Thanks,
 Dominique
  
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Friday, 6 July 2012 01:21
 To: user@cassandra.apache.org
 Subject: Re: batch_mutate
  
 Does it mean that the popular use case is when we need to update multiple 
 column families using the same key?
 Yes. 
  
 Shouldn’t we design our space in such a way that those columns live in the 
 same column family?
 Design a model where the data for common queries is stored in one row+cf. You 
 can also take into consideration the workload, e.g. things that are updated 
 frequently often live together, and things that are updated infrequently often 
 live together.
  
 cheers
  
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
  
 On 6/07/2012, at 3:16 AM, Leonid Ilyevsky wrote:
  
 
 I actually found an answer to my first question at 
 http://wiki.apache.org/cassandra/API. So I got it wrong: actually the outer 
 key is the key in the table, and the inner key is the table name (this was 
 somewhat counter-intuitive). Does it mean that the popular use case is when 
 we need to update multiple column families using the same key? Shouldn’t we 
 design our space in such a way that those columns live in the same column 
 family?
  
 From: Leonid Ilyevsky [mailto:lilyev...@mooncapital.com] 
 Sent: Thursday, July 05, 2012 10:39 AM
 To: 'user@cassandra.apache.org'
 Subject: batch_mutate
  
 My current way of inserting rows one by one is too slow (I use cql3 prepared 
 statements) , so I want to try batch_mutate.
  
 Could anybody give me more details about the interface? In the javadoc it 
 says:
  
 public void batch_mutate(java.util.Map<java.nio.ByteBuffer, java.util.Map<java.lang.String, java.util.List<Mutation>>> mutation_map,
                          ConsistencyLevel consistency_level)
     throws InvalidRequestException,
            UnavailableException,
            TimedOutException,
            org.apache.thrift.TException
 Description copied from interface: Cassandra.Iface
 Mutate many columns or super columns for many row keys. See also: Mutation. 
 mutation_map maps key to column family to a list of Mutation objects to take 
 place at that scope. *
  
  
 I need to understand the meaning of the elements of the mutation_map parameter.
 My guess is, the key in the outer map is the columnfamily name, is this correct?
 The key in the inner map is, probably, a key to the columnfamily (it is 
 somewhat confusing that it is String while the outer key is ByteBuffer; I 
 wonder what the rationale is). If this is correct, how should I do it if my 
 key is a composite one? Does anybody have an example?
  
 Thanks,
  
 Leonid
  
 This email, along with any attachments, is confidential and may be legally 
 privileged or otherwise protected from disclosure. Any unauthorized 
 dissemination, copying or use of the contents of this email is strictly 
 prohibited and may be in violation of law. If you are not the intended 
 recipient, any disclosure, copying, forwarding or distribution of this email 
 is strictly prohibited and this email and any attachments should be deleted 
 immediately. This email and any attachments do not constitute an offer to 
 sell or a solicitation of an offer to purchase any interest in any investment 
 vehicle sponsored by Moon Capital Management LP (“Moon Capital”). Moon 
 Capital does not provide legal, accounting or tax advice. Any statement 
 regarding legal, accounting or tax matters was not intended or written to be 
 relied upon by any person as advice. Moon Capital does not waive 
 confidentiality or privilege as a result of this email.
  

Re: Cassandra 1.2.2 cluster + raspberry

2013-04-11 Thread aaron morton
 I've already tried to set internode_compression: none in my yaml files. 

What version are you on?

If you've set internode_compression to none, have you restarted? Can you 
double check?
The stack trace shows Cassandra deciding that the connection should be 
compressed. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 12:54 PM, murat migdisoglu murat.migdiso...@gmail.com wrote:

 Hi, 
 
 I'm trying to set up a cassandra cluster for some experiments on my raspberry 
 pies but I'm still having trouble to join my nodes to the cluster.
 
 I started with two nodes (192.168.2.3 and 192.168.2.7) and when I start the 
 cassandra, I see the following exception on the node 192.168.2.7
 ERROR [WRITE-/192.168.2.3] 2013-04-10 02:10:24,524 CassandraDaemon.java (line 
 132) Exception in thread Thread[WRITE-/192.168.2.3,5,main]
 java.lang.NoClassDefFoundError: Could not initialize class 
 org.xerial.snappy.Snappy
 at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79)
 at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:66)
 at 
 org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:322)
 at 
 org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)
 
 I suspect that the lack of native Snappy libraries is causing this exception 
 during the internode communication. 
 I have not tried to compile the native Snappy for ARM yet, but I wonder whether 
 it is possible to use Cassandra without Snappy. 
 
 I've already tried to set internode_compression: none in my yaml files. 
 
 nodetool outputs:
 
 nodetool -h pi1 ring
 
 Datacenter: dc1
 ==
 Replicas: 1
 
 Address RackStatus State   LoadOwns   
  Token   
 
 192.168.2.7 RAC1Up Normal  92.35 KB100.00%
  0  
   
 nodetool -h pi2 ring
 
 Datacenter: dc1
 ==
 Replicas: 1
 
 Address RackStatus State   LoadOwns   
  Token   
 
 192.168.2.3 RAC1Up Normal  92.42 KB100.00%
  85070591730234615865843651857942052864  
 
 
 
 Kind Regards
 
 
 



Re: CDH4 + Cassandra 1.2 Integration Issue

2013-04-11 Thread aaron morton
cqlsh in cassandra 1.2 defaults to cql 3. 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 6:55 PM, Gurminder Gill gurminde...@hotmail.com wrote:

 Ah ha. So, the client defaults to CQL 2. Any way of changing that? I tried 
 libthrift 0.9 as well, but it doesn't work.
 
 Thanks.
 
 
 On Tue, Apr 9, 2013 at 11:29 PM, Shamim sre...@yandex.ru wrote:
 Hello,
   if you created your table user with cql then you have to add COMPACT 
 STORAGE as follows:
 CREATE TABLE user (
   id int PRIMARY KEY,
   age int,
   fname text,
   lname text
 ) WITH COMPACT STORAGE
 
 
 
 
 --
 Best regards
   Shamim A.
 
 
 
 10.04.2013, 08:22, Gurminder Gill gurminde...@hotmail.com:
  I was able to start a MR job after patching Cassandra.Hadoop as per 
  CASSANDRA-5201.
 
  But then, ColumnFamilyRecordReader pukes within the MapTask. It is unable 
  to read the CF definition in the sample keyspace. The CF user does exist.
  How can cf_defs below be possibly empty? Any pointers?
 
  KsDef.toString() during Read Operation from within the MapTask :-
 
  KsDef(name:wordcount, 
  strategy_class:org.apache.cassandra.locator.SimpleStrategy, 
  strategy_options:{replication_factor=1}, cf_defs:[], durable_writes:true)
 
  Output from cqlsh :-
 
  cqlsh describe keyspace wordcount;
 
  CREATE KEYSPACE wordcount WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
  };
 
  USE wordcount;
 
  CREATE TABLE user (
id int PRIMARY KEY,
age int,
fname text,
lname text
  ) WITH
bloom_filter_fp_chance=0.01 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.00 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.10 AND
replicate_on_write='true' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
 
 
 



Re: Blobs in CQL?

2013-04-11 Thread Gabriel Ciuloaica

Hi Brian,

I'm using the blobs to store images in cassandra(1.2.3) using the 
java-driver version 1.0.0-beta1.

There is no need to convert a byte array into hex.

Br,
Gabi

On 4/11/13 3:21 PM, Brian O'Neill wrote:


I started playing around with the CQL driver.
Has anyone used blobs with it yet?

Are you forced to convert a byte[] to hex?
(e.g. I have a photo that I want to store in C* using the java-driver API)

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42




Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Great!

Thanks Gabriel.  Do you have an example? (are you using QueryBuilder?)
I couldn't find the part of  the API that allowed you to pass in the byte
array.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or
the person responsible to deliver it to the intended recipient, please
contact the sender at the email above and delete this email and any
attachments and destroy any copies thereof. Any review, retransmission,
dissemination, copying or other use of, or taking any action in reliance
upon, this information by persons or entities other than the intended
recipient is strictly prohibited.
 






On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

Hi Brian,

I'm using the blobs to store images in cassandra(1.2.3) using the
java-driver version 1.0.0-beta1.
There is no need to convert a byte array into hex.

Br,
Gabi

On 4/11/13 3:21 PM, Brian O'Neill wrote:

 I started playing around with the CQL driver.
 Has anyone used blobs with it yet?

 Are you forced to convert a byte[] to hex?
 (e.g. I have a photo that I want to store in C* using the java-driver
API)

 -brian

 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





Re: Blobs in CQL?

2013-04-11 Thread Gabriel Ciuloaica

I'm not using the query builder but the PreparedStatement.

Here is the sample code: https://gist.github.com/devsprint/5363023

Gabi
On 4/11/13 3:27 PM, Brian O'Neill wrote:

Great!

Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
I couldn't find the part of  the API that allowed you to pass in the byte
array.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive € King of Prussia, PA € 19406
M: 215.588.6024 € @boneill42 http://www.twitter.com/boneill42  €
healthmarketscience.com

  







On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:


Hi Brian,

I'm using the blobs to store images in cassandra(1.2.3) using the
java-driver version 1.0.0-beta1.
There is no need to convert a byte array into hex.

Br,
Gabi

On 4/11/13 3:21 PM, Brian O'Neill wrote:

I started playing around with the CQL driver.
Has anyone used blobs with it yet?

Are you forced to convert a byte[] to hex?
(e.g. I have a photo that I want to store in C* using the java-driver
API)

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42






Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Cool.  That might be it.  I'll take a look at PreparedStatement.

For query building, I took a look under the covers, and even when I was
passing in a ByteBuffer, it runs through the following code in the
java-driver:

Utils.java:
   if (value instanceof ByteBuffer) {
       sb.append("0x");
       sb.append(ByteBufferUtil.bytesToHex((ByteBuffer) value));
   }

Hopefully, the prepared statement doesn't do the conversion.
(I'm not sure if it is a limitation of the CQL protocol itself)

thanks again,
-brian



---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or
the person responsible to deliver it to the intended recipient, please
contact the sender at the email above and delete this email and any
attachments and destroy any copies thereof. Any review, retransmission,
dissemination, copying or other use of, or taking any action in reliance
upon, this information by persons or entities other than the intended
recipient is strictly prohibited.
 






On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

I'm not using the query builder but the PreparedStatement.

Here is the sample code: https://gist.github.com/devsprint/5363023

Gabi
On 4/11/13 3:27 PM, Brian O'Neill wrote:
 Great!

 Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
 I couldn't find the part of  the API that allowed you to pass in the
byte
 array.

 -brian

 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive € King of Prussia, PA € 19406
 M: 215.588.6024 € @boneill42 http://www.twitter.com/boneill42  €
 healthmarketscience.com

   






 On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 Hi Brian,

 I'm using the blobs to store images in cassandra(1.2.3) using the
 java-driver version 1.0.0-beta1.
 There is no need to convert a byte array into hex.

 Br,
 Gabi

 On 4/11/13 3:21 PM, Brian O'Neill wrote:
 I started playing around with the CQL driver.
 Has anyone used blobs with it yet?

 Are you forced to convert a byte[] to hex?
 (e.g. I have a photo that I want to store in C* using the java-driver
 API)

 -brian

 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42






Re: Blobs in CQL?

2013-04-11 Thread Sylvain Lebresne
 Hopefully, the prepared statement doesn't do the conversion.


It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the API that allowed you to pass in the
 byte
  array.
 
  -brian
 
  ---
  Brian O'Neill
  Lead Architect, Software Development
  Health Market Science
  The Science of Better Results
  2700 Horizon Drive € King of Prussia, PA € 19406
  M: 215.588.6024 € @boneill42 http://www.twitter.com/boneill42  €
  healthmarketscience.com
 
 
 
 
 
 
 
 
  On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:
 
  Hi Brian,
 
  I'm using the blobs to store images in cassandra(1.2.3) using the
  java-driver version 1.0.0-beta1.
  There is no need to convert a byte array into hex.
 
  Br,
  Gabi
 
  On 4/11/13 3:21 PM, Brian O'Neill wrote:
  I started playing around with the CQL driver.
  Has anyone used blobs with it yet?
 
  Are you forced to convert a byte[] to hex?
  (e.g. I have a photo that I want to store in C* using the java-driver
  API)
 
  -brian
 
  --
  Brian ONeill
  Lead Architect, Health Market Science (http://healthmarketscience.com
 )
  mobile:215.588.6024
  blog: http://brianoneill.blogspot.com/
  twitter: @boneill42
 
 





multiple Datacenter values in PropertyFileSnitch

2013-04-11 Thread Matthias Zeilinger
Hi,

I would like to create a big cluster for many applications.
Within this cluster I would like to separate the data for each application, 
which can easily be done via different virtual datacenters and the correct 
replication strategy.
What I would like to know is whether I can specify multiple values for one node 
in the PropertyFileSnitch configuration, so that I can use one node for more 
than one application.
For example:
6 nodes:
3 for App A
3 for App B
4 for App C

I want to have such a configuration:
Node 1 - DC-A DC-C
Node 2 - DC-B  DC-C
Node 3 - DC-A  DC-C
Node 4 - DC-B  DC-C
Node 5 - DC-A
Node 6 - DC-B

Is this possible or does anyone have another solution for this?


Thx  br matthias


Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Yep, it worked like a charm.  (PreparedStatement avoided the hex conversion)

But now, I'm seeing a few extra bytes come back in the select….
(I'll keep digging, but maybe you have some insight?)

I see this:
ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
repository.add() byte.length()=[259804]

ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
repository.get() [foo.jpeg] byte.length()=[259861]


(Notice the length's don't match up)

Using this code:
public void addContent(String key, byte[] data)
        throws NoHostAvailableException {
    LOG.error("repository.add() byte.length()=[" + data.length + "]");
    String statement = "INSERT INTO " + KEYSPACE + "." + TABLE + "(key, data) VALUES (?, ?)";
    PreparedStatement ps = session.prepare(statement);
    BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));
    session.execute(bs);
}

public byte[] getContent(String key) throws NoHostAvailableException {
    Query select = select("data").from(KEYSPACE, TABLE).where(eq("key", key));
    ResultSet resultSet = session.execute(select);
    byte[] data = resultSet.one().getBytes("data").array();
    LOG.error("repository.get() [" + key + "] byte.length()=[" + data.length + "]");
    return data;
}


---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com


This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.
 


From:  Sylvain Lebresne sylv...@datastax.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, April 11, 2013 8:48 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Cc:  Gabriel Ciuloaica gciuloa...@gmail.com
Subject:  Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.

It does not.
 
 (I'm not sure if it is a limitation of the CQL protocol itself)
 
 thanks again,
 -brian
 
 
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 tel:215.588.6024  • @boneill42
 http://www.twitter.com/boneill42  •
 healthmarketscience.com http://healthmarketscience.com
 
 
 
 
 
 
 
 
 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:
 
 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the API that allowed you to pass in the
 byte
  array.
 
  -brian
 
  ---
  Brian O'Neill
  Lead Architect, Software Development
  Health Market Science
  The Science of Better Results
  2700 Horizon Drive € King of Prussia, PA € 19406
  M: 215.588.6024 tel:215.588.6024  € @boneill42
 http://www.twitter.com/boneill42  €
  healthmarketscience.com http://healthmarketscience.com
 
 
 
 
 
 
 
 
  On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:
 
  Hi Brian,
 
  I'm using 

Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Sylvain,

Interesting, when I look at the actual bytes returned, I see the byte array
is prefixed with the keyspace and table name.

I assume I'm doing something wrong in the select.  Am I incorrectly using
the ResultSet?

-brian

On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.eduwrote:

 Yep, it worked like a charm.  (PreparedStatement avoided the hex
 conversion)

 But now, I'm seeing a few extra bytes come back in the select….
 (I'll keep digging, but maybe you have some insight?)

 I see this:

 ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
 repository.add() byte.length()=[259804]

 ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
 repository.get() [foo.jpeg] byte.length()=[259861]

 (Notice the length's don't match up)

 Using this code:

 public void addContent(String key, byte[] data)

 throws NoHostAvailableException {

 LOG.error(repository.add() byte.length()=[ + data.length + ]);

 String statement = INSERT INTO  + KEYSPACE + . + TABLE + (key,
 data) VALUES (?, ?);

 PreparedStatement ps = session.prepare(statement);

 BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

 session.execute(bs);

 }


 public byte[] getContent(String key) throws NoHostAvailableException {

 Query select = select(data).from(KEYSPACE, TABLE).where(eq(key,
 key));

 ResultSet resultSet = session.execute(select);

 byte[] data = resultSet.one().getBytes(data).array();

 LOG.error(repository.get() [ + key + ] byte.length()=[ + data.
 length + ]);

 return data;

 }

 ---

 Brian O'Neill

 Lead Architect, Software Development

 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive • King of Prussia, PA • 19406

 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •

 healthmarketscience.com



 ** **


 From: Sylvain Lebresne sylv...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, April 11, 2013 8:48 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Gabriel Ciuloaica gciuloa...@gmail.com
 Subject: Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.


 It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the API that allowed you to pass in the
 byte
  array.
 
  -brian
 
  ---
  Brian O'Neill
  Lead Architect, Software Development
  Health Market Science
  The Science of Better Results
  2700 Horizon Drive € King of Prussia, PA € 19406
  M: 215.588.6024 € @boneill42 http://www.twitter.com/boneill42  €
  healthmarketscience.com
 

Re: Blobs in CQL?

2013-04-11 Thread Gabriel Ciuloaica

That's right, there is some padding there...
So, instead of calling array(), you have to do something like:

ByteBuffer data = resultSet.one().getBytes("data");
int length = data.remaining();
byte[] blobBytes = new byte[length];
data.get(blobBytes, 0, length);


Gabi


On 4/11/13 4:09 PM, Brian O'Neill wrote:
Yep, it worked like a charm.  (PreparedStatement avoided the hex 
conversion)


But now, I'm seeing a few extra bytes come back in the select….
(I'll keep digging, but maybe you have some insight?)

I see this:

ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao: 
repository.add() byte.length()=[259804]


ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao: 
repository.get() [foo.jpeg] byte.length()=[259861]



(Notice the length's don't match up)

Using this code:

public void addContent(String key, byte[] data)

throws NoHostAvailableException {

LOG.error(repository.add() byte.length()=[+ data.length+ ]);

  String statement = INSERT INTO + KEYSPACE+ .+ TABLE+ (key, 
data) VALUES (?, ?);


PreparedStatement ps = session.prepare(statement);

BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

session.execute(bs);

}


public byte[] getContent(String key) throws NoHostAvailableException {

Query select = select(data).from(KEYSPACE, 
TABLE).where(eq(key, key));


ResultSet resultSet = session.execute(select);

byte[] data = resultSet.one().getBytes(data).array();

LOG.error(repository.get() [+ key + ] byte.length()=[+ 
data.length+ ]);


return data;

}


---

Brian O'Neill

Lead Architect, Software Development

*Health Market Science*

/The Science of Better Results/

2700 Horizon Drive •King of Prussia, PA •19406

M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •

healthmarketscience.com





From: Sylvain Lebresne sylv...@datastax.com 
mailto:sylv...@datastax.com

Reply-To: user@cassandra.apache.org mailto:user@cassandra.apache.org
Date: Thursday, April 11, 2013 8:48 AM
To: user@cassandra.apache.org mailto:user@cassandra.apache.org 
user@cassandra.apache.org mailto:user@cassandra.apache.org

Cc: Gabriel Ciuloaica gciuloa...@gmail.com mailto:gciuloa...@gmail.com
Subject: Re: Blobs in CQL?


Hopefully, the prepared statement doesn't do the conversion.


It does not.

(I'm not sure if it is a limitation of the CQL protocol itself)

thanks again,
-brian



---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 tel:215.588.6024 • @boneill42
http://www.twitter.com/boneill42  •
healthmarketscience.com http://healthmarketscience.com








On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com
mailto:gciuloa...@gmail.com wrote:

I'm not using the query builder but the PreparedStatement.

Here is the sample code: https://gist.github.com/devsprint/5363023

Gabi
On 4/11/13 3:27 PM, Brian O'Neill wrote:
 Great!

 Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
 I couldn't find the part of  the API that allowed you to pass
in the
byte
 array.

 -brian

 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive € King of Prussia, PA € 19406
 M: 215.588.6024 tel:215.588.6024 € @boneill42
http://www.twitter.com/boneill42  €
 healthmarketscience.com http://healthmarketscience.com


Re: Blobs in CQL?

2013-04-11 Thread Sylvain Lebresne
 I assume I'm doing something wrong in the select.  Am I incorrectly using
 the ResultSet?


You're incorrectly using the returned ByteBuffer. But you should not feel
bad, that API kinda
sucks.

The short version is that .array() returns the backing array of the
ByteBuffer. But there is no
guarantee that you'll have a one-to-one correspondence between the valid
content of the
ByteBuffer and the backing array, the backing array can be bigger in
particular (long story short,
this allows multiple ByteBuffer to share the same backing array, which can
avoid doing copies).

I also note that there is no guarantee that .array() will work unless
you've called .hasArray().

Anyway, what you could do is:
ByteBuffer bb = resultSet.one().getBytes("data");
byte[] data = new byte[bb.remaining()];
bb.get(data);

Alternatively, you can use the result of .array(), but you should only
consider the bb.remaining()
bytes starting at bb.arrayOffset() + bb.position() (where bb is the
returned ByteBuffer).
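
Concretely, that second option would look something like this (a sketch; it is 
only valid when bb.hasArray() returns true, and resultSet is the driver 
ResultSet from the earlier code):

ByteBuffer bb = resultSet.one().getBytes("data");
byte[] backing = bb.array();                  // shared backing array, may be larger than the blob
int start = bb.arrayOffset() + bb.position(); // first valid byte of this blob
byte[] blob = java.util.Arrays.copyOfRange(backing, start, start + bb.remaining());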

--
Sylvain




 -brian

 On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.eduwrote:

 Yep, it worked like a charm.  (PreparedStatement avoided the hex
 conversion)

 But now, I'm seeing a few extra bytes come back in the select….
 (I'll keep digging, but maybe you have some insight?)

 I see this:

 ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
 repository.add() byte.length()=[259804]

 ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
 repository.get() [foo.jpeg] byte.length()=[259861]

 (Notice the length's don't match up)

 Using this code:

 public void addContent(String key, byte[] data)

 throws NoHostAvailableException {

 LOG.error(repository.add() byte.length()=[ + data.length + ]
 );

 String statement = INSERT INTO  + KEYSPACE + . + TABLE + (key,
 data) VALUES (?, ?);

 PreparedStatement ps = session.prepare(statement);

 BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

 session.execute(bs);

 }


 public byte[] getContent(String key) throws NoHostAvailableException
 {

 Query select = select(data).from(KEYSPACE, TABLE).where(eq(
 key, key));

 ResultSet resultSet = session.execute(select);

 byte[] data = resultSet.one().getBytes(data).array();

 LOG.error(repository.get() [ + key + ] byte.length()=[ +
 data.length + ]);

 return data;

 }

 ---

 Brian O'Neill

 Lead Architect, Software Development

 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive • King of Prussia, PA • 19406

 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •

 healthmarketscience.com



 ** **


 From: Sylvain Lebresne sylv...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, April 11, 2013 8:48 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Gabriel Ciuloaica gciuloa...@gmail.com
 Subject: Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.


 It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the 

Re: multiple Datacenter values in PropertyFileSnitch

2013-04-11 Thread Jabbar Azam
Hello,

I'm not an expert but I don't think you can do what you want. The way to
separate data for applications on the same cluster is to use different
tables for different applications or use multiple keyspaces, a keyspace per
application. The replication factor you specify for each keyspace specifies
how many copies of the data are stored in each datacenter.
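
As a sketch (the keyspace names, contact point and replication factors are made 
up; DC-A and DC-C are the datacenter names from the snitch configuration), 
per-application keyspaces could be created like this, through the java-driver 
or just as easily from cqlsh:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class PerAppKeyspaces {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // app_a is replicated only into the DC-A virtual datacenter, app_c only into DC-C.
        session.execute("CREATE KEYSPACE app_a WITH replication = " +
                "{'class': 'NetworkTopologyStrategy', 'DC-A': 3}");
        session.execute("CREATE KEYSPACE app_c WITH replication = " +
                "{'class': 'NetworkTopologyStrategy', 'DC-C': 3}");

        cluster.shutdown();
    }
}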

You can't specify that data for a particular application is stored on a
specific node, unless that node is in its own cluster.

I think of a cassandra cluster as a shared resource where all the
applications have access to all the nodes in the cluster.


Thanks

Jabbar Azam


On 11 April 2013 14:13, Matthias Zeilinger matthias.zeilin...@bwinparty.com
 wrote:

  Hi,

 ** **

 I would like to create big cluster for many applications.

 Within this cluster I would like to separate the data for each
 application, which can be easily done via different virtual datacenters and
 the correct replication strategy.

 What I would like to know, if I can specify for 1 node multiple values in
 the PropertyFileSnitch configuration, so that I can use 1 node for more
 applications?

 For example:

 6 nodes:

 3 for App A

 3 for App B

 4 for App C

 ** **

 I want to have such a configuration:

 Node 1 – DC-A DC-C

 Node 2 – DC-B  DC-C

 Node 3 – DC-A  DC-C

 Node 4 – DC-B  DC-C

 Node 5 – DC-A

 Node 6 – DC-B

 ** **

 Is this possible or does anyone have another solution for this?

 ** **

 ** **

 Thx  br matthias



Compaction, truncate, cqlsh problems

2013-04-11 Thread Ondřej Černoš
Hi,

I use C* 1.2.3 and CQL3.

I integrated Cassandra into our testing environment. In order to make the tests
repeatable, I truncate all the tables that need to be empty before each test
run, via an ssh session to the host Cassandra runs on, where I run cqlsh and
issue the truncate.

It works, only sometimes it silently fails (1 in 400 runs of the truncate,
actually).

At the same time the truncate fails I see a system keyspace compaction.
Additionally, there seem to be quite a lot of these system keyspace compactions
(the numbers in the filenames go up pretty fast, into the thousands).

I googled truncate and found out there were some issues with race
conditions and with slowing down if truncate is used frequently (as is my
case, where truncate is run before each test in quite a big test suite).

Any hints?

Regards,
Ondřej Černoš


Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Bingo! Thanks to both of you.  (the C* community rocks)

A few hours' worth of work, and I've got a working REST-based photo
repository backed by C* using the CQL java driver. =)

rock on, thanks again,
-brian


On Thu, Apr 11, 2013 at 9:33 AM, Sylvain Lebresne sylv...@datastax.comwrote:


 I assume I'm doing something wrong in the select.  Am I incorrectly using
 the ResultSet?


 You're incorrectly using the returned ByteBuffer. But you should not feel
 bad, that API kinda
 sucks.

 The short version is that .array() returns the backing array of the
 ByteBuffer. But there is no
 guarantee that you'll have a one-to-one correspondence between the valid
 content of the
 ByteBuffer and the backing array, the backing array can be bigger in
 particular (long story short,
 this allows multiple ByteBuffer to share the same backing array, which can
 avoid doing copies).

 I also note that there is no guarantee that .array() will work unless
 you've called .hasArray().

 Anyway, what you could do is:
  ByteBuffer bb = resultSet.one().getBytes("data");
 byte[] data = new byte[bb.remaining()];
 bb.get(data);

 Alternatively, you can use the result of .array(), but you should only
 consider the bb.remaining()
 bytes starting at bb.arrayOffset() + bb.position() (where bb is the
 returned ByteBuffer).
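
 A standalone sketch of that alternative (i.e. instead of the snippet above,
 not after it; the copyOfRange at the end is just one way to consume the
 valid slice):

  ByteBuffer bb = resultSet.one().getBytes("data");
  if (bb.hasArray()) {
      byte[] backing = bb.array();
      int offset = bb.arrayOffset() + bb.position(); // first valid byte
      int length = bb.remaining();                   // number of valid bytes
      byte[] valid = java.util.Arrays.copyOfRange(backing, offset, offset + length);
  }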

 --
 Sylvain




 -brian

 On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.eduwrote:

 Yep, it worked like a charm.  (PreparedStatement avoided the hex
 conversion)

 But now, I'm seeing a few extra bytes come back in the select….
 (I'll keep digging, but maybe you have some insight?)

 I see this:

 ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
 repository.add() byte.length()=[259804]

 ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
 repository.get() [foo.jpeg] byte.length()=[259861]

 (Notice the length's don't match up)

 Using this code:

  public void addContent(String key, byte[] data)
          throws NoHostAvailableException {

      LOG.error("repository.add() byte.length()=[" + data.length + "]");

      String statement = "INSERT INTO " + KEYSPACE + "." + TABLE + "(key, data) VALUES (?, ?)";

      PreparedStatement ps = session.prepare(statement);

      BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

      session.execute(bs);
  }


  public byte[] getContent(String key) throws NoHostAvailableException {

      Query select = select("data").from(KEYSPACE, TABLE).where(eq("key", key));

      ResultSet resultSet = session.execute(select);

      byte[] data = resultSet.one().getBytes("data").array();

      LOG.error("repository.get() [" + key + "] byte.length()=[" + data.length + "]");

      return data;
  }

 ---

 Brian O'Neill

 Lead Architect, Software Development

 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive • King of Prussia, PA • 19406

 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •

 healthmarketscience.com


 This information transmitted in this email message is for the intended
 recipient only and may contain confidential and/or privileged material. If
 you received this email in error and are not the intended recipient, or the
 person responsible to deliver it to the intended recipient, please contact
 the sender at the email above and delete this email and any attachments and
 destroy any copies thereof. Any review, retransmission, dissemination,
 copying or other use of, or taking any action in reliance upon, this
 information by persons or entities other than the intended recipient is
 strictly prohibited.



 From: Sylvain Lebresne sylv...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, April 11, 2013 8:48 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Gabriel Ciuloaica gciuloa...@gmail.com
 Subject: Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.


 It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com

 This information transmitted in this email message is for the intended
 recipient only and may contain confidential and/or privileged material.
 If
 you received this email in error and are not the intended recipient, or
 the person responsible to deliver it to the intended recipient, please
 contact the sender at the email above and delete this email and any
 attachments and destroy any copies thereof. Any review, retransmission,
 dissemination, copying or other use of, or taking any action in reliance
 upon, this information by persons or entities other than the intended
 recipient is strictly prohibited.







 On 4/11/13 8:34 AM, Gabriel Ciuloaica 

Re: CorruptedBlockException

2013-04-11 Thread Alexis Rodríguez
Aaron,

It seems that we are in the same situation as Nury, we are storing a lot of
files of ~5MB in a CF.

This happens in a test cluster, with one node using cassandra 1.1.5; we
have the commitlog in a different partition than the data directory. Normally
our tests use nearly 13 GB in data, but when the exception on compaction
appears our disk space ramps up to:

# df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 440G  330G   89G  79% /
tmpfs 7.9G 0  7.9G   0% /lib/init/rw
udev  7.9G  160K  7.9G   1% /dev
tmpfs 7.9G 0  7.9G   0% /dev/shm
/dev/sdb1 459G  257G  179G  59% /cassandra

# cd /cassandra/data/Repository/

# ls Files/*tmp* | wc -l
1671

# du -ch Files | tail -1
257Gtotal

# du -ch Files/*tmp* | tail -1
34G total

We are using cassandra 1.1.5 with one node, our schema for that keyspace is:

[default@unknown] use Repository;
Authenticated to keyspace: Repository
[default@Repository] show schema;
create keyspace Repository
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {datacenter1 : 1}
  and durable_writes = true;

use Repository;

create column family Files
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and compaction_strategy_options = {'sstable_size_in_mb' : '120'}
  and compression_options = {'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};

In our logs:

ERROR [CompactionExecutor:1831] 2013-04-11 09:12:41,725
AbstractCassandraDaemon.java (line 135) Exception in thread
Thread[CompactionExecutor:1831,1,main]
java.io.IOError: org.apache.cassandra.io.compress.CorruptedBlockException:
(/cassandra/data/Repository/Files/Repository-Files-he-4533-Data.db):
corruption detected, chunk at 43325354 of length 65545.
at
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
at
org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
at
org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
at
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
at
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
at
org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
at
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at
com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at
org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
at
org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
at
org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)





On Thu, Jul 5, 2012 at 7:42 PM, aaron morton aa...@thelastpickle.comwrote:

  But I don't understand, how was all the available space taken away.
 Take a look on disk at /var/lib/cassandra/data/your_keyspace and
 /var/lib/cassandra/commitlog to see what is taking up a lot of space.

 Cassandra stores the column names as well as the values, so that can take
 up some space.

   it says that while compaction a CorruptedBlockException has occured.
 Are you able to reproduce this error ?

 Thanks


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 6/07/2012, at 12:04 AM, Nury Redjepow wrote:

  Hello to all,
 
   I have cassandra instance I'm trying to use to store millions of file
 with 

RE: CorruptedBlockException

2013-04-11 Thread moshe.kranc
I have formulated the following theory regarding C* 1.2.2 which may be 
relevant: Whenever there is a disk error during compaction of an SS table 
(e.g., bad block, out of disk space), that SStable's files stick around forever 
after, and do not subsequently get deleted by normal compaction (minor or 
major), long after all its records have been deleted. This causes disk usage to 
rise dramatically. The only way to make the SStable files disappear is to run 
nodetool cleanup (which takes hours to run).

Just a theory so far

From: Alexis Rodríguez [mailto:arodrig...@inconcertcc.com]
Sent: Thursday, April 11, 2013 5:31 PM
To: user@cassandra.apache.org
Subject: Re: CorruptedBlockException

Aaron,

It seems that we are in the same situation as Nury, we are storing a lot of 
files of ~5MB in a CF.

This happens in a test cluster, with one node using cassandra 1.1.5, we have 
commitlog in a different partition than the data directory. Normally our tests 
use nearly 13 GB in data, but when the exception on compaction appears our disk 
space ramp up to:

# df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 440G  330G   89G  79% /
tmpfs 7.9G 0  7.9G   0% /lib/init/rw
udev  7.9G  160K  7.9G   1% /dev
tmpfs 7.9G 0  7.9G   0% /dev/shm
/dev/sdb1 459G  257G  179G  59% /cassandra

# cd /cassandra/data/Repository/

# ls Files/*tmp* | wc -l
1671

# du -ch Files | tail -1
257Gtotal

# du -ch Files/*tmp* | tail -1
34G total

We are using cassandra 1.1.5 with one node, our schema for that keyspace is:

[default@unknown] use Repository;
Authenticated to keyspace: Repository
[default@Repository] show schema;
create keyspace Repository
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {datacenter1 : 1}
  and durable_writes = true;

use Repository;

create column family Files
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and compaction_strategy_options = {'sstable_size_in_mb' : '120'}
  and compression_options = {'sstable_compression' : 
'org.apache.cassandra.io.compress.SnappyCompressor'};

In our logs:

ERROR [CompactionExecutor:1831] 2013-04-11 09:12:41,725 
AbstractCassandraDaemon.java (line 135) Exception in thread 
Thread[CompactionExecutor:1831,1,main]
java.io.IOError: org.apache.cassandra.io.compress.CorruptedBlockException: 
(/cassandra/data/Repository/Files/Repository-Files-he-4533-Data.db): corruption 
detected, chunk at 43325354 of length 65545.
at 
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
at 
org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
at 
org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
at 
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
at 
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
at 
org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
at 
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at 
org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
at 
org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
at 
org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)




On Thu, Jul 5, 

Re: Compaction, truncate, cqlsh problems

2013-04-11 Thread Edward Capriolo
If you do not have JNA, truncate has to fork an 'ln -s' command for the
snapshots. I think that makes it unpredictable. Truncate has its own
timeout value now (separate from the other timeouts). If possible I think
it is better to make each test use its own CF and avoid truncate entirely.


On Thu, Apr 11, 2013 at 9:48 AM, Ondřej Černoš cern...@gmail.com wrote:

 Hi,

 I use C* 1.2.3 and CQL3.

 I integrated cassandra into our testing environment. In order to make the
 tests repeatable I truncate all the tables that need to be empty before the
 test run via ssh session to the host cassandra runs on and by running cqlsh
 where I issue the truncate.

 It works, only sometimes it silently fails (1 in 400 runs of the truncate,
 actually).

 At the same time the truncate fails I see system ks compaction.
 Additionally, it seems there is quite a lot of these system ks compactions
 (the numbers in the filenames go up pretty fast to thousands).

 I googled truncate and found out there were some issues with race
 conditions and with slowing down if truncate is used frequently (as is my
 case, where truncate is run before each test in quite a big test suite).

 Any hints?

 Regards,
 Ondřej Černoš



Re: Column index vs Row index vs Denormalizing

2013-04-11 Thread Coen Stevens
Thanks for the feedback! We will be going forward by implementing and
deploying the proposed model, and testing it out.

Cheers,
Coen


On Thu, Apr 11, 2013 at 12:21 PM, aaron morton aa...@thelastpickle.comwrote:

 Retrieving the latest 1000 tweets (of a given day) is trivial by
 requesting the streamTweets columnFamily.

 If you normally want to get the most recent items, use a reverse comparator
 on the column name;
 see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
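
 In CQL3 terms that is the clustering order; a rough sketch of the streamTweets
 table (the column names here are only guesses at your model):

  CREATE TABLE streamTweets (
    stream_day text,        -- streamID + dayTimestamp bucket
    tweet_id   timeuuid,
    tweet_json text,
    PRIMARY KEY (stream_day, tweet_id)
  ) WITH CLUSTERING ORDER BY (tweet_id DESC);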

 Getting the latest tweets for a given hashtag would mean you have to get
 the TimeUUIDs from the streamHashTagTweets first, and then do a second get
 call on the streamTweets with the former TimeUUIDs as the list of columns
 we like to retrieve (column index).

 Your choices here depend on what sort of queries are the most frequent and
 how much disk space you have.

 Your current model makes sense if the stream by day is the most frequent
 query, and you want to conserve disk space. If disk space is not an issue
 you can denormalise further and store the tweet JSON.

 If you have potentially many streamHashTagTweets rows where a single tweet
 is replicated it may make sense to stick with the current design to reduce
 disk use.

 (we want to get up to 1000 tweets).

 If you want to get 1000 of anything from cassandra please break the multiget
 up into multiple calls. Each row request becomes a task in the thread pools
 on RF nodes. If you have a smallish cluster, one client asking for 1000
 rows will temporarily block other clients and hurt request throughput.
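
 A rough sketch of the batching (fetchBatch() and rowKeys are hypothetical
 stand-ins for whatever multiget call and key list your client uses):

  import java.util.ArrayList;
  import java.util.List;

  public static <T> List<List<T>> batches(List<T> keys, int batchSize) {
      List<List<T>> result = new ArrayList<List<T>>();
      for (int i = 0; i < keys.size(); i += batchSize) {
          // each sub-list becomes one modestly sized multiget
          result.add(keys.subList(i, Math.min(i + batchSize, keys.size())));
      }
      return result;
  }

  // usage: for (List<String> batch : batches(rowKeys, 50)) { fetchBatch(batch); }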

  Referencing key values requires another columnFamily for tweets (key:
 tweetId, columns: 1 column with data).

 This will be a more efficient (aka faster) read than reading from a
 wide row.

 Next to that we will request tweets by these secondary indexes quite
 infrequently, while the tweets by timestamp will be requested heavily.

 If the hot path is the streamTweets calls, denormalise into that, and
 normalise the tweet storage into its own CF and reference them from
 the streamHashTagTweets. Having a canonical store of the events / tweets /
 entities addressable by their business key can give you more flexibility.

 Given we are estimating to store many TBs of tweets, we would prefer
 setting up machines with spinning disks (2TB per node) to save costs.

 If you have spinning disks and 1G networking the rule of thumb is 300GB to
 500GB per node. See previous discussions about size per node.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 10/04/2013, at 2:00 AM, Coen Stevens beatle...@gmail.com wrote:

 Hi all,

 We are working on a data model for storing tweets for multiple streams
 (where a stream is defined by a number of keyword filters on the full
 twitter firehose), and retrieving the tweets by timestamp and hashtag. My
 question is whether the following data model would a good way for doing
 that, where I'm creating a column name index for the hashtags.

 ColumnFamily: streamTweets
  key: streamID + dayTimestamp (creating daily buckets for each stream)
  columns = name: TimeUUID, value: tweet json (storing all the tweets
 for this stream in a wide row with a TimeUUID)

 ColumnFamily: streamHashTagTweets
  key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
  columns = name: TimeUUID (referencing the TimeUUID value in the
 streamTweets ColumnFamily), value: tweetID

 Retrieving the latest 1000 tweets (of a given day) is trivial by
 requesting the streamTweets columnFamily. Getting the latest tweets for a
 given hashtag would mean you have to get the TimeUUIDs from the
 streamHashTagTweets first, and then do a second get call on the
 streamTweets with the former TimeUUIDs as the list of columns we like to
 retrieve (column index).

 Is referencing column names (TimeUUIDs) a smart thing to do when we have
 wide rows spanning millions of columns? It seems easier (one reference
 call) to do this, then it is to reference key values and running a
 multi-get to get all the rows (we want to get up to 1000 tweets).
 Referencing key values requires another columnFamily for tweets (key:
 tweetId, columns: 1 column with data).

 Of course we could instead denormalize the data and store the tweet also
 in the streamHashTagTweet columns, but we want to do the same thing for
 other indexes as well (topics, twitter usernames, links, etc), so it
 quickly adds up in required storage space. Next to that we will request
 tweets by these secondary indexes quite infrequently, while the tweets by
 timestamp will be requested heavily.

 Given we are estimating to store many TBs of tweets, we would prefer
 setting up machines with spinning disks (2TB per node) to save costs.

 We would love to hear your feedback.

 Cheers,
 Coen





Re: Compaction, truncate, cqlsh problems

2013-04-11 Thread Ondřej Černoš
Hi,

I have JNA (cassandra only complains about an obsolete version - "Obsolete
version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later"
- I have the stock CentOS version 3.2.4).

Usage of separate CFs for each test run is difficult to set up.

Can you please elaborate on the specials of truncate?

regards,
Ondřej Černoš



On Thu, Apr 11, 2013 at 5:04 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 If you do not have JNA truncate has to fork an 'ln -s'' command for the
 snapshots. I think that makes it un-predicatable. Truncate has its own
 timeout value now (separate from the other timeouts). If possible I think
 it is better to make each test use it's own CF and avoid truncate entirely.


 On Thu, Apr 11, 2013 at 9:48 AM, Ondřej Černoš cern...@gmail.com wrote:

 Hi,

 I use C* 1.2.3 and CQL3.

 I integrated cassandra into our testing environment. In order to make the
 tests repeatable I truncate all the tables that need to be empty before the
 test run via ssh session to the host cassandra runs on and by running cqlsh
 where I issue the truncate.

 It works, only sometimes it silently fails (1 in 400 runs of the
 truncate, actually).

 At the same time the truncate fails I see system ks compaction.
 Additionally, it seems there is quite a lot of these system ks compactions
 (the numbers in the filenames go up pretty fast to thousands).

 I googled truncate and found out there were some issues with race
 conditions and with slowing down if truncate is used frequently (as is my
 case, where truncate is run before each test in quite a big test suite).

 Any hints?

 Regards,
 Ondřej Černoš





is the select result grouped by the value of the partition key?

2013-04-11 Thread Sorin Manolache

Hello,

Let us consider that we have a table t created as follows:

create table t(k1 varchar, k2 varchar, value varchar, primary key (k1, k2));

Its contents is

a m x
a n y
z 0 9
z 1 8

and I perform a

select * from t where k1 in ('a', 'z');

Is it guaranteed that the rows are grouped by the value of the partition 
key? That is, is it guaranteed that I'll get


a m x
a n y
z 0 9
z 1 8

or

a n y
a m x
z 1 8
z 0 9

or even

z 0 9
z 1 8
a n y
a m x

but NEVER

a m x
z 0 9
a n y
z 1 8


Thank you,
Sorin


[RELEASE] Apache Cassandra 1.2.4 released

2013-04-11 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the release of Apache Cassandra
version 1.2.4.

Cassandra is a highly scalable second-generation distributed database,
bringing together Dynamo's fully distributed design and Bigtable's
ColumnFamily-based data model. You can read more here:

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a maintenance/bug fix release[1] on the 1.2 series. As always,
please pay attention to the release notes[2] and let us know[3] if you were to
encounter any problem.

Enjoy!

[1]: http://goo.gl/t7x9f (CHANGES.txt)
[2]: http://goo.gl/6IEbR (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: Two Cluster each with 12 nodes- Cassandra database

2013-04-11 Thread Raihan Jamal
Folks, Any thoughts on this? I am still in the learning process. So any
guidance will be of great help.





*Raihan Jamal*


On Wed, Apr 10, 2013 at 10:39 PM, Raihan Jamal jamalrai...@gmail.comwrote:

 I have started working on a project in which I am using `Cassandra
 database`.

 Our production DBA's have setup `two cluster` and each cluster will have
 `12 nodes`.

 I will be using `Pelops client` to read the data from Cassandra database.
 Now I am thinking what's the best way to create `Cluster` using `Pelops
 client` like how many nodes I should add while creating cluster?

 My understanding was to create the cluster with all the `24 nodes` as I
 will be having two cluster each with 12 nodes? This is the right approach?


 *If not, then how we decide what nodes (from each cluster) I should add
 while creating the cluster using Pelops client?
 *

 String[] nodes = cfg.getStringArray(cassandra.servers);

 int port = cfg.getInt(cassandra.port);

 boolean dynamicND = true; // dynamic node discovery

 Config casconf = new Config(port, true, 0);

 Cluster cluster = new Cluster(nodes, casconf, dynamicND);

 Pelops.addPool(Const.CASSANDRA_POOL, cluster, Const.CASSANDRA_KS);


 Can anyone help me out with this?

 Any help will be appreciated.


 **



Re: multiple Datacenter values in PropertyFileSnitch

2013-04-11 Thread aaron morton
A node can only exist in one DC and one rack. 

Use different keyspaces as suggested. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 1:47 AM, Jabbar Azam aja...@gmail.com wrote:

 Hello,
 
 I'm not an expert but I don't think you can do what you want. The way to 
 separate data for applications on the same cluster is to use different tables 
 for different applications or use multiple keyspaces, a keyspace per 
 application. The replication factor you specify for each keyspace specifies 
 how many copies of the data are stored in each datacenter.
 
 You can't specify that data for a particular application is stored on a 
 specific node, unless that node is in its own cluster.
 
 I think of a cassandra cluster as a shared resource where all the 
 applications have access to all the nodes in the cluster.
 
 
 Thanks
 
 Jabbar Azam
 
 
 On 11 April 2013 14:13, Matthias Zeilinger matthias.zeilin...@bwinparty.com 
 wrote:
 Hi,
 
  
 
 I would like to create big cluster for many applications.
 
 Within this cluster I would like to separate the data for each application, 
 which can be easily done via different virtual datacenters and the correct 
 replication strategy.
 
 What I would like to know, if I can specify for 1 node multiple values in the 
 PropertyFileSnitch configuration, so that I can use 1 node for more 
 applications?
 
 For example:
 
 6 nodes:
 
 3 for App A
 
 3 for App B
 
 4 for App C
 
  
 
 I want to have such a configuration:
 
 Node 1 – DC-A & DC-C
 
 Node 2 – DC-B & DC-C
 
 Node 3 – DC-A & DC-C
 
 Node 4 – DC-B & DC-C
 
 Node 5 – DC-A
 
 Node 6 – DC-B
 
  
 
 Is this possible or does anyone have another solution for this?
 
  
 
  
 
 Thx & br matthias
 
 



Re: Two Cluster each with 12 nodes- Cassandra database

2013-04-11 Thread Jabbar Azam
Hello,

I don't know what pelops is. I'm not sure why you want two clusters. I
would have two clusters if I want to have data stored on totally separate
servers for perhaps security reasons.

If you are going to have the servers in one location then you might as well
have one cluster. You'll have the maximum aggregate io of all the servers.

If you're thinking of doing analytics as well then you can create two
virtual datacentres: one for realtime inserts and reads and the second for
analytics. You could have a 16/8 server split. Obviously you'll
have to work out what the optimum split is for your workload.

Not sure if I've answered your question...
On 11 Apr 2013 18:51, Raihan Jamal jamalrai...@gmail.com wrote:

 Folks, Any thoughts on this? I am still in the learning process. So any
 guidance will be of great help.





 *Raihan Jamal*


 On Wed, Apr 10, 2013 at 10:39 PM, Raihan Jamal jamalrai...@gmail.comwrote:

 I have started working on a project in which I am using `Cassandra
 database`.

 Our production DBA's have setup `two cluster` and each cluster will have
 `12 nodes`.

 I will be using `Pelops client` to read the data from Cassandra database.
 Now I am thinking what's the best way to create `Cluster` using `Pelops
 client` like how many nodes I should add while creating cluster?

 My understanding was to create the cluster with all the `24 nodes` as I
 will be having two cluster each with 12 nodes? This is the right approach?


 *If not, then how we decide what nodes (from each cluster) I should add
 while creating the cluster using Pelops client?
 *

 String[] nodes = cfg.getStringArray(cassandra.servers);

 int port = cfg.getInt(cassandra.port);

 boolean dynamicND = true; // dynamic node discovery

 Config casconf = new Config(port, true, 0);

 Cluster cluster = new Cluster(nodes, casconf, dynamicND);

 Pelops.addPool(Const.CASSANDRA_POOL, cluster, Const.CASSANDRA_KS);


 Can anyone help me out with this?

 Any help will be appreciated.


 **





Re: describe keyspace or column family query not working

2013-04-11 Thread aaron morton
tables created without COMPACT STORAGE are still visible in cassandra-cli.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/04/2013, at 5:40 AM, Tyler Hobbs ty...@datastax.com wrote:

 
 On Wed, Apr 10, 2013 at 11:09 AM, Vivek Mishra mishra.v...@gmail.com wrote:
 Ok. A column family and keyspace created via cqlsh using cql3 is visible via 
 cassandra-cli or thrift API?
 
 The column family will only be visible via cassandra-cli and the Thrift API 
 if it was created WITH COMPACT STORAGE: 
 http://www.datastax.com/docs/1.2/cql_cli/cql/CREATE_TABLE#using-compact-storage
 
 
 -- 
 Tyler Hobbs
 DataStax



Re: (info) Abort the seek op in SSTableIdentityIterator class.

2013-04-11 Thread aaron morton
When created by the SSTableScanner the dataStart passed in is the existing file 
position so it may not be necessary. But it may be sane to do it and the seek() 
call may not result in disk reads.

Cheers
  
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 1:48 AM, dong.yajun dongt...@gmail.com wrote:

 Hello, 
 
 I read the source code of SSTableIdentityIterator with v-1.0.9, and I thought 
 the following code is not necessary, did I miss anything? 
 
 RandomAccessReader file = (RandomAccessReader) input;
 file.seek(this.dataStart); 
 
 here, the value of dataStart is assigned in SSTableScanner: 
 
  long dataStart = file.getFilePointer();
 
  any suggestion about this issue? thanks. 
 
 Best, 
 
 -- 
 Rick Dong
 



running cassandra on 8 GB servers

2013-04-11 Thread Nikolay Mihaylov
For one project I will need to run cassandra on following dedicated servers:

Single CPU XEON 4 cores no hyper-threading, 8 GB RAM, 12 TB locally
attached HDD's in some kind of RAID, visible as single HDD.

I can do cluster of 20-30 such servers, may be even more.

The data will be huge, I am estimating 4-6 TB per server. I know this is not
ideal, but those are my resources.

Currently I am testing with one of such servers, except HDD is 300 GB.
Every 15-20 hours, I get out of heap memory, e.g. something like:

ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164)
Exception in thread Thread[Thrift:641,5,main]
...
 INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915
ThriftServer.java (line 116) Stop listening to thrift clients
 INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,943 Gossiper.java
(line 1077) Announcing shutdown
 INFO [StorageServiceShutdownHook] 2013-04-11 11:26:08,613
MessagingService.java (line 682) Waiting for messaging service to quiesce
 INFO [ACCEPT-/208.94.232.37] 2013-04-11 11:26:08,655 MessagingService.java
(line 888) MessagingService shutting down server thread.
ERROR [Thrift:721] 2013-04-11 11:26:37,709 CustomTThreadPoolServer.java
(line 217) Error occurred during processing of message.
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has
shut down

Anyone have some advices about better utilization of such servers?

Nick.


Re: Compaction, truncate, cqlsh problems

2013-04-11 Thread aaron morton
 Can you please elaborate on the specials of truncate?
I think ed was talking about this config setting in 1.2
https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L484


 It works, only sometimes it silently fails (1 in 400 runs of the truncate, 
 actually).

The data is left in place ? 
Are you making the calls very quickly? If you add a pause does it help ? 


 At the same time the truncate fails I see system ks compaction. Additionally, 
 it seems there is quite a lot of these system ks compactions (the numbers in 
 the filenames go up pretty fast to thousands).
In 1.2 truncate updates a system table, which always flushes after updates. 

Cheers
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 3:12 AM, Ondřej Černoš cern...@gmail.com wrote:

 Hi,
 
 I have JNA (cassandra only complains about obsolete version - Obsolete 
 version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later - 
 I have stock centos version 3.2.4).
 
 Usage of separate CFs for each test run is difficult to set up.
 
 Can you please elaborate on the specials of truncate?
 
 regards,
 Ondřej Černoš
 
 
 
 On Thu, Apr 11, 2013 at 5:04 PM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 If you do not have JNA truncate has to fork an 'ln -s'' command for the 
 snapshots. I think that makes it un-predicatable. Truncate has its own 
 timeout value now (separate from the other timeouts). If possible I think it 
 is better to make each test use it's own CF and avoid truncate entirely.
 
 
 On Thu, Apr 11, 2013 at 9:48 AM, Ondřej Černoš cern...@gmail.com wrote:
 Hi,
 
 I use C* 1.2.3 and CQL3.
 
 I integrated cassandra into our testing environment. In order to make the 
 tests repeatable I truncate all the tables that need to be empty before the 
 test run via ssh session to the host cassandra runs on and by running cqlsh 
 where I issue the truncate.
 
 It works, only sometimes it silently fails (1 in 400 runs of the truncate, 
 actually).
 
 At the same time the truncate fails I see system ks compaction. Additionally, 
 it seems there is quite a lot of these system ks compactions (the numbers in 
 the filenames go up pretty fast to thousands).
 
 I googled truncate and found out there were some issues with race conditions 
 and with slowing down if truncate is used frequently (as is my case, where 
 truncate is run before each test in quite a big test suite).
 
 Any hints?
 
 Regards,
 Ondřej Černoš
 
 



Re: is the select result grouped by the value of the partition key?

2013-04-11 Thread aaron morton
 Is it guaranteed that the rows are grouped by the value of the partition key? 
 That is, is it guaranteed that I'll get
Your primary key (k1, k2) is considered in two parts (partition_key, 
grouping_columns). In your case the partition key is k1 and the grouping column 
is k2. Columns are ordered by the grouping column, k2. 

See http://thelastpickle.com/2013/01/11/primary-keys-in-cql/
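
One quick way to see the two parts from cqlsh (a sketch against the table above; 
the token values themselves are not meaningful to you):

  select token(k1), k1, k2, value from t;

Rows that share the same k1 share the same token, i.e. they live in the same partition.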

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 3:19 AM, Sorin Manolache sor...@gmail.com wrote:

 Hello,
 
 Let us consider that we have a table t created as follows:
 
 create table t(k1 vachar, k2 varchar, value varchar, primary key (k1, k2));
 
 Its contents is
 
 a m x
 a n y
 z 0 9
 z 1 8
 
 and I perform a
 
 select * from p where k1 in ('a', 'z');
 
 Is it guaranteed that the rows are grouped by the value of the partition key? 
 That is, is it guaranteed that I'll get
 
 a m x
 a n y
 z 0 9
 z 1 8
 
 or
 
 a n y
 a m x
 z 1 8
 z 0 9
 
 or even
 
 z 0 9
 z 1 8
 a n y
 a m x
 
 but NEVER
 
 a m x
 z 0 9
 a n y
 z 1 8
 
 
 Thank you,
 Sorin



Re: CorruptedBlockException

2013-04-11 Thread aaron morton
 Whenever there is a disk error during compaction of an SS table (e.g., bad 
 block, out of disk space), that SStable’s files stick around forever after
 
Fixed in 1.1.1 https://issues.apache.org/jira/browse/CASSANDRA-2261

 We are using 1.1.5, besides that I have tried to run cleanup, with no success 
 :( 
That is not what cleanup is used for. It is used to delete data that a node is 
no longer responsible for. 

If you have a lot of -tmp- files created through a problem like this, a restart 
will delete them. 

If you have an error reading data off disk, use nodetool scrub. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 3:35 AM, Alexis Rodríguez arodrig...@inconcertcc.com wrote:

 Moshe, 
 
 We are using 1.1.5, besides that I have tried to run cleanup, with no success 
 :( 
 
 
 # nodetool -p 8080 cleanup
 Error occured during cleanup
 java.util.concurrent.ExecutionException: java.io.IOError: 
 org.apache.cassandra.io.compress.CorruptedBlockException: (/cass
 andra/data/Repository/Files/Repository-Files-he-4533-Data.db): corruption 
 detected, chunk at 43325354 of length 65545.
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
 at java.util.concurrent.FutureTask.get(FutureTask.java:111)
 at 
 org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:216)
 at 
 org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:252)
 at 
 org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:970)
 at 
 org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:1772)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:111)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:45)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:226)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:251)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:857)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:795)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1450)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:90)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1285)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1383)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:807)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 
 
 
 On Thu, Apr 11, 2013 at 11:45 AM, moshe.kr...@barclays.com wrote:
 I have formulated the following theory regarding C* 1.2.2 which may be 
 relevant: Whenever there is a disk error during compaction of an SS table 
 (e.g., bad block, out of disk space), that SStable’s files stick around 
 forever after, and do not subsequently get deleted by normal compaction 
 (minor or major), long after all its records have been deleted. This causes 
 disk usage to rise dramatically. The only way to make the SStable files 
 disappear is to run “nodetool cleanup” (which takes hours to run).
 
  
 
 Just a theory so far….
 
  
 
 From: Alexis Rodríguez [mailto:arodrig...@inconcertcc.com] 
 Sent: Thursday, April 11, 2013 5:31 PM
 To: user@cassandra.apache.org
 Subject: Re: CorruptedBlockException
 
  
 
 Aaron,
 
  
 
 It seems that we are in the same situation as Nury, we are storing a lot of 
 files of ~5MB in a CF.
 
  
 
 This happens in a test cluster, with one node using cassandra 1.1.5, we have 
 commitlog in a different partition than the data directory. Normally our 
 tests use nearly 13 GB in data, but when the exception on compaction appears 
 our disk space ramp up to:
 
  
 
 # df -h
 
 FilesystemSize  Used Avail Use% Mounted on
 
 /dev/sda1 440G  330G   89G  79% /
 
 tmpfs 7.9G 0  7.9G   0% /lib/init/rw
 
 udev  7.9G  160K  7.9G   1% /dev
 
 tmpfs 7.9G 0  7.9G   0% /dev/shm
 
 /dev/sdb1 459G  257G  179G  59% /cassandra
 
  
 
 # cd 

Re: Two Cluster each with 12 nodes- Cassandra database

2013-04-11 Thread aaron morton
 I will be using `Pelops client`
If you are starting out using Java I *strongly* suggest using this client 
https://github.com/Netflix/astyanax/ see the documentation here 
https://github.com/Netflix/astyanax/wiki


 My understanding was to create the cluster with all the `24 nodes` as I will 
 be having two cluster each with 12 nodes? This is the right approach?
As discussed, you can have two separate clusters but you probably want to have 
one cluster with two data centres, each with 12 nodes. See 
http://www.datastax.com/docs/1.2/initialize/cluster_init
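
If it helps, a minimal Astyanax setup sketch (cluster, keyspace, pool and seed 
names are placeholders; check the wiki above for the exact current API):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster("MyCluster")
    .forKeyspace("MyKeyspace")
    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))     // discover the rest of the ring
    .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
        .setPort(9160)
        .setMaxConnsPerHost(3)
        .setSeeds("dc1-node1:9160,dc1-node2:9160"))             // a few seeds from the local DC
    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
    .buildKeyspace(ThriftFamilyFactory.getInstance());

context.start();
Keyspace keyspace = context.getEntity();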

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 6:45 AM, Jabbar Azam aja...@gmail.com wrote:

 Hello,
 
 I don't know what pelops is. I'm not sure why you want two clusters. I would 
 have two clusters if I want to have data stored on totally separate servers 
 for perhaps security reasons.
 
 If you are going to have the servers in one location then you might as well 
 have one cluster. You'll have the maximum aggregate io of all the servers.
 
 If you're thinking of doing analytics as well then you can create two virtual 
 datacentres.  One for realtime inserts and reads and the second for 
  analytics. You could have a 16/8 server split. Obviously you'll 
 have to work out what the optimum split is for your workload.
 
 Not sure if I've answered your question...
 
 On 11 Apr 2013 18:51, Raihan Jamal jamalrai...@gmail.com wrote:
 Folks, Any thoughts on this? I am still in the learning process. So any 
 guidance will be of great help.
 
 
 
 
 
 Raihan Jamal
 
 
 On Wed, Apr 10, 2013 at 10:39 PM, Raihan Jamal jamalrai...@gmail.com wrote:
 I have started working on a project in which I am using `Cassandra database`. 
 
 Our production DBA's have setup `two cluster` and each cluster will have `12 
 nodes`.
 
 I will be using `Pelops client` to read the data from Cassandra database. Now 
 I am thinking what's the best way to create `Cluster` using `Pelops client` 
 like how many nodes I should add while creating cluster?
 
 My understanding was to create the cluster with all the `24 nodes` as I will 
 be having two cluster each with 12 nodes? This is the right approach?
 
 
 If not, then how we decide what nodes (from each cluster) I should add while 
 creating the cluster using Pelops client?
 
 
 String[] nodes = cfg.getStringArray(cassandra.servers); 
 
 int port = cfg.getInt(cassandra.port); 
 
 boolean dynamicND = true; // dynamic node discovery 
 
 Config casconf = new Config(port, true, 0); 
 
 Cluster cluster = new Cluster(nodes, casconf, dynamicND); 
 
 Pelops.addPool(Const.CASSANDRA_POOL, cluster, Const.CASSANDRA_KS);
 
 
 Can anyone help me out with this? 
 
 Any help will be appreciated.
 
 
 



Re: running cassandra on 8 GB servers

2013-04-11 Thread aaron morton
 The data will be huge, I am estimating 4-6 TB per server. I know this is 
 best, but those are my resources.
You will have a very unhappy time. 

The general rule of thumb / guideline for a HDD based system with 1G networking 
is 300GB to 500GB per node. See previous discussions on this topic for reasons. 
 
 ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164) 
 Exception in thread Thread[Thrift:641,5,main]
 ...
  INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915 ThriftServer.java 
 (line 116) Stop listening to thrift clients
What was the error ?

What version are you using?
If you have changed any defaults for memory in cassandra-env.sh or 
cassandra.yaml revert them. Generally C* will do the right thing and not OOM, 
unless you are trying to store a lot of data on a node that does not have 
enough memory. See this thread for background 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 7:35 AM, Nikolay Mihaylov n...@nmmm.nu wrote:

 For one project I will need to run cassandra on following dedicated servers:
 
 Single CPU XEON 4 cores no hyper-threading, 8 GB RAM, 12 TB locally attached 
 HDD's in some kind of RAID, visible as single HDD.
 
 I can do cluster of 20-30 such servers, may be even more.
 
 The data will be huge, I am estimating 4-6 TB per server. I know this is 
 best, but those are my resources.
 
 Currently I am testing with one of such servers, except HDD is 300 GB. Every 
 15-20 hours, I get out of heap memory, e.g. something like:
 
 ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164) 
 Exception in thread Thread[Thrift:641,5,main]
 ...
  INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915 ThriftServer.java 
 (line 116) Stop listening to thrift clients
  INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,943 Gossiper.java 
 (line 1077) Announcing shutdown
  INFO [StorageServiceShutdownHook] 2013-04-11 11:26:08,613 
 MessagingService.java (line 682) Waiting for messaging service to quiesce
  INFO [ACCEPT-/208.94.232.37] 2013-04-11 11:26:08,655 MessagingService.java 
 (line 888) MessagingService shutting down server thread.
 ERROR [Thrift:721] 2013-04-11 11:26:37,709 CustomTThreadPoolServer.java (line 
 217) Error occurred during processing of message.
 java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut 
 down
 
 Anyone have some advices about better utilization of such servers?
 
 Nick.



Re: running cassandra on 8 GB servers

2013-04-11 Thread Edward Capriolo
With that much data per node you have to raise the IndexInterval and adjust
the bloom filter settings. Although the bloom filters are off heap now,
having that much data can put a strain on physical memory.
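
For instance (the values below are only illustrative starting points, not 
recommendations): in cassandra.yaml raise

  index_interval: 512    # default 128; larger means a smaller index summary in memory

and per column family allow more bloom filter false positives, e.g. via 
cassandra-cli (MyCF is a placeholder):

  update column family MyCF with bloom_filter_fp_chance = 0.1;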


On Thu, Apr 11, 2013 at 4:26 PM, aaron morton aa...@thelastpickle.comwrote:

  The data will be huge, I am estimating 4-6 TB per server. I know this is
 best, but those are my resources.
 You will have a very unhappy time.

 The general rule of thumb / guideline for a HDD based system with 1G
 networking is 300GB to 500Gb per node. See previous discussions on this
 topic for reasons.

  ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line
 164) Exception in thread Thread[Thrift:641,5,main]
  ...
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915
 ThriftServer.java (line 116) Stop listening to thrift clients
 What was the error ?

 What version are you using?
 If you have changed any defaults for memory in cassandra-env.sh or
 cassandra.yaml revert them. Generally C* will do the right thing and not
 OOM, unless you are trying to store a lot of data on a node that does not
 have enough memory. See this thread for background
 http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 12/04/2013, at 7:35 AM, Nikolay Mihaylov n...@nmmm.nu wrote:

  For one project I will need to run cassandra on following dedicated
 servers:
 
  Single CPU XEON 4 cores no hyper-threading, 8 GB RAM, 12 TB locally
 attached HDD's in some kind of RAID, visible as single HDD.
 
  I can do cluster of 20-30 such servers, may be even more.
 
  The data will be huge, I am estimating 4-6 TB per server. I know this is
 best, but those are my resources.
 
  Currently I am testing with one of such servers, except HDD is 300 GB.
 Every 15-20 hours, I get out of heap memory, e.g. something like:
 
  ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line
 164) Exception in thread Thread[Thrift:641,5,main]
  ...
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915
 ThriftServer.java (line 116) Stop listening to thrift clients
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,943 Gossiper.java
 (line 1077) Announcing shutdown
   INFO [StorageServiceShutdownHook] 2013-04-11 11:26:08,613
 MessagingService.java (line 682) Waiting for messaging service to quiesce
   INFO [ACCEPT-/208.94.232.37] 2013-04-11 11:26:08,655
 MessagingService.java (line 888) MessagingService shutting down server
 thread.
  ERROR [Thrift:721] 2013-04-11 11:26:37,709 CustomTThreadPoolServer.java
 (line 217) Error occurred during processing of message.
  java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has
 shut down
 
  Anyone have some advices about better utilization of such servers?
 
  Nick.




Re: running cassandra on 8 GB servers

2013-04-11 Thread Nikolay Mihaylov
I am using 1.2.3, used default heap - 2 GB without JNA installed,
then modified heap to 4 GB / 400 MB young generation. + JNA installed.
bloom filter on the CF's is lowered (more false positives, less disk space).

 WARN [ScheduledTasks:1] 2013-04-11 11:09:41,899 GCInspector.java (line
142) Heap is 0.9885574036095974 full.  You may need to reduce memtable
and/or cache sizes.  Cassandra will now flush up to the two largest
memtables to free up memory.  Adjust flush_largest_memtables_at threshold
in cassandra.yaml if you don't want Cassandra to do this automatically
 WARN [ScheduledTasks:1] 2013-04-11 11:09:41,906 StorageService.java (line
3541) Flushing CFS(Keyspace='CRAWLER', ColumnFamily='counters') to relieve
memory pressure
 INFO [ScheduledTasks:1] 2013-04-11 11:09:41,949 ColumnFamilyStore.java
(line 637) Enqueuing flush of Memtable-counters@862481781(711504/6211531
serialized/live bytes, 11810 ops)
ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164)
Exception in thread Thread[Thrift:641,5,main]
java.lang.OutOfMemoryError: *Java heap space*



On Thu, Apr 11, 2013 at 11:26 PM, aaron morton aa...@thelastpickle.comwrote:

  The data will be huge, I am estimating 4-6 TB per server. I know this is
 best, but those are my resources.
 You will have a very unhappy time.

 The general rule of thumb / guideline for a HDD based system with 1G
 networking is 300GB to 500Gb per node. See previous discussions on this
 topic for reasons.

  ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line
 164) Exception in thread Thread[Thrift:641,5,main]
  ...
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915
 ThriftServer.java (line 116) Stop listening to thrift clients
 What was the error ?

 What version are you using?
 If you have changed any defaults for memory in cassandra-env.sh or
 cassandra.yaml revert them. Generally C* will do the right thing and not
 OOM, unless you are trying to store a lot of data on a node that does not
 have enough memory. See this thread for background
 http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 12/04/2013, at 7:35 AM, Nikolay Mihaylov n...@nmmm.nu wrote:

  For one project I will need to run cassandra on following dedicated
 servers:
 
  Single CPU XEON 4 cores no hyper-threading, 8 GB RAM, 12 TB locally
 attached HDD's in some kind of RAID, visible as single HDD.
 
  I can do cluster of 20-30 such servers, may be even more.
 
  The data will be huge, I am estimating 4-6 TB per server. I know this is
 best, but those are my resources.
 
  Currently I am testing with one of such servers, except HDD is 300 GB.
 Every 15-20 hours, I get out of heap memory, e.g. something like:
 
  ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line
 164) Exception in thread Thread[Thrift:641,5,main]
  ...
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915
 ThriftServer.java (line 116) Stop listening to thrift clients
   INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,943 Gossiper.java
 (line 1077) Announcing shutdown
   INFO [StorageServiceShutdownHook] 2013-04-11 11:26:08,613
 MessagingService.java (line 682) Waiting for messaging service to quiesce
   INFO [ACCEPT-/208.94.232.37] 2013-04-11 11:26:08,655
 MessagingService.java (line 888) MessagingService shutting down server
 thread.
  ERROR [Thrift:721] 2013-04-11 11:26:37,709 CustomTThreadPoolServer.java
 (line 217) Error occurred during processing of message.
  java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has
 shut down
 
  Anyone have some advices about better utilization of such servers?
 
  Nick.




Does Memtable resides in Heap?

2013-04-11 Thread Jay Svc
Hi Team,

I have got 8GB of RAM, out of which 4GB is allocated to the Java heap. My
question is: does the size of the Memtable contribute to heap size, or is it
kept off-heap?

Would a bigger Memtable have an impact on GC and overall memory management?

I am using DSE 3.0 / Cassandra 1.1.9.

Thanks,
Jay


Broken pipe when variating a lot number of connections

2013-04-11 Thread Rodrigo Felix
Hi,

I've been changing a benchmarking tool (YCSB) to vary the number of
clients throughout a workload execution and, for some reason, I believe
Cassandra is facing some problems handling the variation (both up and
down) in the number of connections. Each client has a connection and
clients are added or removed during execution.
For some reason, after some (variable) time, I get the following
exception, using the original Cassandra client implemented by YCSB, that
works for a constant number of clients.
Is there any reason or known issue to explain why cassandra does not
properly handle the varying number of connections and gives a broken pipe? I'm
supposing this can be a Cassandra problem, but feel free to let me know if
you think I'm wrong.
   Thanks in advance.

   *Cassandra version:* 1.1.5
   *Some property values:*

   - rpc_keepalive: true
   - rpc_min_threads: 16
   - rpc_max_threads: 2048

*Exception:*

*Changing clients from 192 to 240*
*org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe*
*org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe*
* at
org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
*
* at
org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
*
* at
org.apache.cassandra.thrift.Cassandra$Client.send_set_keyspace(Cassandra.java:436)
*
* at
org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:425)
*
* at com.yahoo.ycsb.db.CassandraClient10.scan(CassandraClient10.java:314)*
* at com.yahoo.ycsb.DBWrapper.scan(DBWrapper.java:106)*
* at
com.yahoo.ycsb.workloads.CoreWorkload.doTransactionScan(CoreWorkload.java:530)
*
* at
com.yahoo.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:431)*
* at com.yahoo.ycsb.ClientThread.run(ClientThread.java:105)*
Att.

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP


Re: CorruptedBlockException

2013-04-11 Thread Lanny Ripple
Saw this in earlier versions. Our workaround was disable; drain; snap; 
shutdown; delete; link from snap; restart;

  -ljr

On Apr 11, 2013, at 9:45, moshe.kr...@barclays.com wrote:

 I have formulated the following theory regarding C* 1.2.2 which may be 
 relevant: Whenever there is a disk error during compaction of an SS table 
 (e.g., bad block, out of disk space), that SStable’s files stick around 
 forever after, and do not subsequently get deleted by normal compaction 
 (minor or major), long after all its records have been deleted. This causes 
 disk usage to rise dramatically. The only way to make the SStable files 
 disappear is to run “nodetool cleanup” (which takes hours to run).
  
 Just a theory so far….
  
 From: Alexis Rodríguez [mailto:arodrig...@inconcertcc.com] 
 Sent: Thursday, April 11, 2013 5:31 PM
 To: user@cassandra.apache.org
 Subject: Re: CorruptedBlockException
  
 Aaron,
  
 It seems that we are in the same situation as Nury, we are storing a lot of 
 files of ~5MB in a CF.
  
 This happens in a test cluster, with one node using cassandra 1.1.5, we have 
 commitlog in a different partition than the data directory. Normally our 
 tests use nearly 13 GB in data, but when the exception on compaction appears 
 our disk space ramp up to:
  
 # df -h
 FilesystemSize  Used Avail Use% Mounted on
 /dev/sda1 440G  330G   89G  79% /
 tmpfs 7.9G 0  7.9G   0% /lib/init/rw
 udev  7.9G  160K  7.9G   1% /dev
 tmpfs 7.9G 0  7.9G   0% /dev/shm
 /dev/sdb1 459G  257G  179G  59% /cassandra
  
 # cd /cassandra/data/Repository/
  
 # ls Files/*tmp* | wc -l
 1671
  
 # du -ch Files | tail -1
 257Gtotal
  
 # du -ch Files/*tmp* | tail -1
 34G total
  
 We are using cassandra 1.1.5 with one node, our schema for that keyspace is:
  
 [default@unknown] use Repository;
 Authenticated to keyspace: Repository
 [default@Repository] show schema;
 create keyspace Repository
   with placement_strategy = 'NetworkTopologyStrategy'
   and strategy_options = {datacenter1 : 1}
   and durable_writes = true;
  
 use Repository;
  
 create column family Files
   with column_type = 'Standard'
   and comparator = 'UTF8Type'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'BytesType'
   and read_repair_chance = 0.1
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 864000
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = true
   and compaction_strategy = 
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
   and caching = 'KEYS_ONLY'
   and compaction_strategy_options = {'sstable_size_in_mb' : '120'}
   and compression_options = {'sstable_compression' : 
 'org.apache.cassandra.io.compress.SnappyCompressor'};
  
 In our logs:
  
 ERROR [CompactionExecutor:1831] 2013-04-11 09:12:41,725 
 AbstractCassandraDaemon.java (line 135) Exception in thread 
 Thread[CompactionExecutor:1831,1,main]
 java.io.IOError: org.apache.cassandra.io.compress.CorruptedBlockException: 
 (/cassandra/data/Repository/Files/Repository-Files-he-4533-Data.db): 
 corruption detected, chunk at 43325354 of length 65545.
 at 
 org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
 at 
 org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:99)
 at 
 org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
 at 
 org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
 at 
 org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
 at 
 com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
 at 
 org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
 at 
 org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
 at 
 org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)

RE: Does Memtable resides in Heap?

2013-04-11 Thread Viktor Jevdokimov
Memtables reside in the heap, and write rate impacts GC: more writes mean more frequent 
and longer ParNew GC pauses.
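
For reference, on 1.1.x the heap used by all memtables is bounded by a couple of
cassandra.yaml settings; a quick check might look like the sketch below (the file
path is an example, DSE packages may place cassandra.yaml elsewhere, and the
commented values are only illustrative):

# show the memtable heap bounds
grep -E 'memtable_total_space_in_mb|flush_largest_memtables_at' /etc/cassandra/conf/cassandra.yaml
# memtable_total_space_in_mb: 1365   # cap on heap used by all memtables; unset, it defaults to 1/3 of the heap
# flush_largest_memtables_at: 0.75   # emergency flush when heap usage crosses this fraction after a full GC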


From: Jay Svc [mailto:jaytechg...@gmail.com]
Sent: Friday, April 12, 2013 01:03
To: user@cassandra.apache.org
Subject: Does Memtable resides in Heap?

Hi Team,

I have 8GB of RAM, of which 4GB is allocated to the Java heap. My question is:
does the size of the Memtables contribute to the heap size, or are they kept
off-heap?

Would bigger Memtables have an impact on GC and overall memory management?

I am using DSE 3.0 / Cassandra 1.1.9.

Thanks,
Jay

Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania



Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.


RE: multiple Datacenter values in PropertyFileSnitch

2013-04-11 Thread Matthias Zeilinger
I'm using a separate keyspace for each application.
What I want is to split them up by load pattern, so that 2 apps with the same,
very high load pattern are not clashing.

For other load patterns I want to use a different split.

Is there any best practice, or should I scale out so that the complete load can
be distributed across all nodes?

Br,
Matthias Zeilinger
Production Operation - Shared Services

P: +43 (0) 50 858-31185
M: +43 (0) 664 85-34459
E: matthias.zeilin...@bwinparty.com

bwin.party services (Austria) GmbH
Marxergasse 1B
A-1030 Vienna

www.bwinparty.com

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Donnerstag, 11. April 2013 20:48
To: user@cassandra.apache.org
Subject: Re: multiple Datacenter values in PropertyFileSnitch

A node can only exist in one DC and one rack.

Use different keyspaces as suggested.
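
For reference, in the PropertyFileSnitch topology file each node appears exactly
once, mapped to a single DC and rack; a sketch, with placeholder addresses and names:

# conf/cassandra-topology.properties - one DC:rack entry per node
cat > conf/cassandra-topology.properties <<'EOF'
192.168.1.1=DC-A:RAC1
192.168.1.2=DC-B:RAC1
192.168.1.5=DC-A:RAC1
192.168.1.6=DC-B:RAC1
default=DC-A:RAC1
EOF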

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 1:47 AM, Jabbar Azam 
aja...@gmail.com wrote:


Hello,

I'm not an expert but I don't think you can do what you want. The way to 
separate data for applications on the same cluster is to use different tables 
for different applications or use multiple keyspaces, a keyspace per 
application. The replication factor you specify for each keyspace specifies how 
many copies of the data are stored in each datacenter.
You can't specify that data for a particular application is stored on a 
specific node, unless that node is in its own cluster.
I think of a cassandra cluster as a shared resource where all the applications 
have access to all the nodes in the cluster.
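
As a sketch of the keyspace-per-application approach (keyspace names, datacenter
names and replication factors are placeholders; the datacenter names must match
what your snitch defines):

# create one keyspace per application, each replicated only to "its" datacenter
cassandra-cli -h localhost <<'EOF'
create keyspace AppA
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DCA : 3};
create keyspace AppB
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DCB : 3};
EOF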


Thanks

Jabbar Azam

On 11 April 2013 14:13, Matthias Zeilinger 
matthias.zeilin...@bwinparty.com 
wrote:
Hi,

I would like to create a big cluster for many applications.
Within this cluster I would like to separate the data for each application, 
which can be easily done via different virtual datacenters and the correct 
replication strategy.
What I would like to know is whether I can specify multiple values for 1 node in
the PropertyFileSnitch configuration, so that I can use 1 node for more than one
application?
For example:
6 nodes:
3 for App A
3 for App B
4 for App C

I want to have such a configuration:
Node 1 - DC-A & DC-C
Node 2 - DC-B & DC-C
Node 3 - DC-A & DC-C
Node 4 - DC-B & DC-C
Node 5 - DC-A
Node 6 - DC-B

Is this possible or does anyone have another solution for this?


Thx  br matthias