Re: About Composite range queries
Thanks for the answer. One more thing: a Composite key is not hashed only once, I guess? It's hashed once per part of the composite? So this means there are two or three times (or more) as many keys as for normal column keys, is that true? On 31 May 2012 02:59, aaron morton aa...@thelastpickle.com wrote: Composite Columns compare each part in turn, so the values are ordered as you've shown them. However the rows are not ordered according to key value. They are ordered using the random token generated by the partitioner, see http://wiki.apache.org/cassandra/FAQ#range_rp What is the real advantage compared to super column families? They are faster. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote: How is it done in Cassandra to be able to range query on a composite key? key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C) like get_range(key1, start_column=(A,), end_column=(A, C)); will return [ (A:B:C), (A:C:C) ] (in pycassa) I mean, does the composite implementation add much overhead to make it work? Does it need to add other column families to be able to range query between the composite's simple parts (first, second and third part of the composite)? What is the real advantage compared to super column families? key1 = A: (A,C), (B,C), (C,C), (D,C) , B: (A,C) Thanks
cassandra-hadoop mapper
Hi, I'm working on some use cases to understand how cassandra-hadoop integration works. I have a very basic scenario: I have a column family that keeps the session id and some BSON data that contains the username in two separate columns. I want to go through all rows and dump the row to a file when the username matches a certain criterion. And I don't need any Reducer or Combiner for now. After I've written the following very simple Hadoop job, I see from the logs that my mapper function is called for each row. Is that normal? If that is the case, doing such a search operation on a big dataset would take hours if not days... Besides that, I see many small output files being created on HDFS. I guess I need a better understanding of how splitting the job into tasks works exactly..

@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) throws IOException, InterruptedException {
    String rowkey = ByteBufferUtil.string(key);
    String ip = context.getConfiguration().get("IP");
    IColumn column = columns.get(sourceColumn);
    if (column == null)
        return;
    ByteBuffer byteBuffer = column.value();
    ByteBuffer bb2 = byteBuffer.duplicate();
    DataConvertor convertor = fromBson(byteBuffer, DataConvertor.class);
    String username = convertor.getUsername();
    BytesWritable value = new BytesWritable();
    if (username != null && username.equals(cip)) {
        byte[] arr = convertToByteArray(bb2);
        value.set(new BytesWritable(arr));
        Text tkey = new Text(rowkey);
        context.write(tkey, value);
    } else {
        log.info("ip not match [" + ip + "]");
    }
}

Thanks in advance Kind Regards -- Find a job you enjoy, and you'll never work a day in your life. Confucius
Re: cassandra-hadoop mapper
Hi, yes, the work can be split between different mappers, but each one will process one row at a time. In fact, the method public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) processes one row, with the specified ByteBuffer key and the list of columns SortedMap<ByteBuffer, IColumn> columns. That doesn't mean you will make millions of requests to Cassandra to retrieve one row at a time, though. Requests are batched, and the parameter cassandra.range.batch.size determines "The number of rows to request with each get range slices request" (as per the javadoc). Performance-wise, that shouldn't be a problem… the operation you are doing is very simple, and Cassandra will be fast to retrieve such short rows. In any case, your business logic works well in parallel, so you can split the job between many concurrent mappers and distribute the work among them. -- Filippo On Thursday, 31 May 2012 at 09:59, murat migdisoglu wrote: Hi, I'm working on some use cases to understand how cassandra-hadoop integration works. I have a very basic scenario: I have a column family that keeps the session id and some BSON data that contains the username in two separate columns. I want to go through all rows and dump the row to a file when the username matches a certain criterion. And I don't need any Reducer or Combiner for now. After I've written the following very simple Hadoop job, I see from the logs that my mapper function is called for each row. Is that normal? If that is the case, doing such a search operation on a big dataset would take hours if not days... Besides that, I see many small output files being created on HDFS. I guess I need a better understanding of how splitting the job into tasks works exactly..

@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) throws IOException, InterruptedException {
    String rowkey = ByteBufferUtil.string(key);
    String ip = context.getConfiguration().get("IP");
    IColumn column = columns.get(sourceColumn);
    if (column == null)
        return;
    ByteBuffer byteBuffer = column.value();
    ByteBuffer bb2 = byteBuffer.duplicate();
    DataConvertor convertor = fromBson(byteBuffer, DataConvertor.class);
    String username = convertor.getUsername();
    BytesWritable value = new BytesWritable();
    if (username != null && username.equals(cip)) {
        byte[] arr = convertToByteArray(bb2);
        value.set(new BytesWritable(arr));
        Text tkey = new Text(rowkey);
        context.write(tkey, value);
    } else {
        log.info("ip not match [" + ip + "]");
    }
}

Thanks in advance Kind Regards -- Find a job you enjoy, and you'll never work a day in your life. Confucius
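As a rough illustration of the knobs mentioned above, here is a minimal job-setup sketch. The job name and the sizes are made-up example values, and the exact ConfigHelper method names should be checked against your Cassandra version:

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobSetupSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "session-dump");

            // Rows fetched per get_range_slices call issued by each mapper
            // (the cassandra.range.batch.size parameter discussed above).
            job.getConfiguration().set("cassandra.range.batch.size", "1024");

            // Rows per input split: fewer, larger splits mean fewer map tasks
            // and fewer small output files on HDFS.
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);
        }
    }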
Re: Retrieving old data version for a given row
-Is there any other way to extract the content of an SSTable, writing a Java program for example instead of using sstable2json? Look at the code in sstable2json and copy it :) -I tried to get tombstones using the Thrift API, but it seems to be not possible, is that right? When I try, the program throws an exception. No. Tombstones are not returned from the API (see ColumnFamilyStore.getColumnFamily()). You can see them if you use sstable2json. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 30/05/2012, at 9:53 PM, Felipe Schmidt wrote: I have further questions: -Is there any other way to extract the content of an SSTable, writing a Java program for example instead of using sstable2json? -I tried to get tombstones using the Thrift API, but it seems to be not possible, is that right? When I try, the program throws an exception. thanks in advance Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) 2012/5/24 aaron morton aa...@thelastpickle.com: Ok... it's really strange to me that Cassandra doesn't support data versioning because all the other key-value databases support it (at least those that I know). You can design it into your data model if you need it. I have one remaining question: -in the case that I have more than 1 SSTable on disk for the same column but with different data versions, is it possible to make a query to get the old version instead of the newest one? No. There is only ever 1 value for a column. The older copies of the column in the SSTables are artefacts of immutable on-disk structures. If you want to see what's inside an SSTable use bin/sstable2json Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/05/2012, at 9:42 PM, Felipe Schmidt wrote: Ok... it's really strange to me that Cassandra doesn't support data versioning because all the other key-value databases support it (at least those that I know). I have one remaining question: -in the case that I have more than 1 SSTable on disk for the same column but with different data versions, is it possible to make a query to get the old version instead of the newest one? Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) 2012/5/16 Dave Brosius dbros...@mebigfatguy.com: You're in for a world of hurt going down that rabbit hole. If you truly want versioned data then you should think about changing your keying to perhaps be a composite key where the key is of the form NaturalKey/VersionId. Or if you want the versioning at the column level, use composite columns with a ColumnName/VersionId format. On 05/16/2012 10:16 AM, Felipe Schmidt wrote: That was very helpful, thank you very much! I still have some questions: -Is it possible to make Cassandra keep old data values after flushing? The same question for the memtable, before flushing. It seems to me that when I update some tuple, the old data will be overwritten in the memtable, even before flushing. -Is it possible to scan values from the memtable, maybe using the so-called Thrift API? Using the client API I can just see the newest data version, I can't see what's really happening with the memtable. I ask because what I'll try to do is Change Data Capture for Cassandra, and the answers will define what kind of approaches I'm able to use. Thanks in advance. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) 2012/5/14 aaron morton aa...@thelastpickle.com: Cassandra does not provide access to multiple versions of the same column. It is essentially an implementation detail.
All mutations are written to the commit log in a binary format, see the o.a.c.db.RowMutation.getSerializedBuffer() (If you want to tail it for analysis you may want to change commitlog_sync in cassandra.yaml.) Here is a post about looking at multiple versions of columns in an sstable http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones/ Remember that not all versions of a column are written to disk (see http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/). Also compaction will compress multiple versions of the same column from multiple files into a single version in a single file. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 14/05/2012, at 9:50 PM, Felipe Schmidt wrote: Yes, I need this information just for academic purposes. So, to read old data values, I tried to open the commit log using tail -f and also the log file viewer of Ubuntu, but I cannot see much information inside the log! Is there any other way to open this log? I didn't find any Cassandra API for this purpose. Thanks everybody in advance. Regards, Felipe Mathias
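To make the column-level versioning Dave suggested concrete, here is a minimal sketch using Hector. It fakes the ColumnName/VersionId composite with a plain string column name, and the column family, row key and field names are invented for the example:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class VersionedWriteSketch {
        // Writes one version of the "email" field for row "felipe", keyed by a
        // version id, so older versions remain readable as ordinary columns.
        public static void writeVersion(Keyspace ks, long versionId, String value) {
            StringSerializer ss = StringSerializer.get();
            Mutator<String> m = HFactory.createMutator(ks, ss);
            String columnName = "email:" + versionId;        // ColumnName/VersionId
            m.addInsertion("felipe", "UserHistory",           // CF name is hypothetical
                    HFactory.createStringColumn(columnName, value));
            m.execute();
        }
    }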
Re: Renaming a keyspace in 1.1
Not directly.
* stop the cluster
* rename the /var/lib/cassandra/data/mykeyspace directory
* start the cluster
* create the keyspace with the new name
* drop the keyspace with the old name
Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 30/05/2012, at 11:13 PM, Oleg Dulin wrote: Is it possible? How?
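If the last two steps are driven from a client instead of the CLI, they might look roughly like this with Hector; the keyspace names are placeholders, and any column family definitions would need to be recreated under the new keyspace as well:

    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
    import me.prettyprint.hector.api.factory.HFactory;

    public class RenameKeyspaceSketch {
        // Run after the data directory has been renamed on disk and the
        // cluster restarted: register the new name, then drop the old one.
        public static void fixSchema(Cluster cluster) {
            KeyspaceDefinition newKs = HFactory.createKeyspaceDefinition("mykeyspace_new");
            cluster.addKeyspace(newKs);
            cluster.dropKeyspace("mykeyspace");
        }
    }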
Re: tokens and RF for multiple phases of deployment
Could you provide some guide on how to assign the tokens in these growing deployment phases? Background: http://www.datastax.com/docs/1.0/install/cluster_init#calculating-tokens-for-a-multi-data-center-cluster Start with tokens for a 4 node cluster. Add the next 4 between each of the ranges. Add 8 in the new DC to have the same tokens as the first DC, +1. Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? No. It will fail if there are not enough nodes available in the first DC. We'd like to keep both write and read on the same cluster. Writes go to all replicas. Using EACH_QUORUM means the client in the first DC will be waiting for the quorum from the second DC to ack the write. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 3:20 AM, Chong Zhang wrote: Hi all, We are planning to deploy a small cluster with 4 nodes in one DC first, and will expand that to 8 nodes, then add another DC with 8 nodes for failover (not active-active), so all the traffic will go to the 1st cluster, and switch to the 2nd cluster if the whole 1st cluster is down or on maintenance. Could you provide some guide on how to assign the tokens in these growing deployment phases? I looked at some docs but am not very clear on how to assign tokens in the fail-over case. Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? We'd like to keep both write and read on the same cluster. Thanks in advance, Chong
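For reference, the initial tokens themselves come from the standard RandomPartitioner formula in the page above; a small sketch (the node count is just an example, and the second DC would use each value plus one):

    import java.math.BigInteger;

    public class TokenCalc {
        // RandomPartitioner tokens for an N-node ring: token(i) = i * 2**127 / N
        public static void main(String[] args) {
            int nodes = 4;
            BigInteger range = BigInteger.valueOf(2).pow(127);
            for (int i = 0; i < nodes; i++) {
                BigInteger token = range.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(nodes));
                System.out.println("node " + i + " initial_token: " + token);
            }
        }
    }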
Re: commitlog_sync_batch_window_in_ms change in 0.7
Agree. Just happy to see people upgrade to something 1.X A - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 8:24 AM, Rob Coli wrote: On Tue, May 29, 2012 at 10:29 PM, Pierre Chalamet pie...@chalamet.net wrote: You'd better use version 1.0.9 (using this one in production) or 1.0.10. 1.1 is still a bit young to be ready for prod unfortunately. OP described himself as experimenting which I inferred to mean not-production. I agree with others, 1.0.x is what I'd currently recommend for production. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: java.net.SocketTimeoutException while Trying to Drop a Collection
There are two types of timeouts. The thrift TimedOutException occurs when the coordinator times out waiting for the CL level nodes to respond. The error is transmitted back to the client and raised. This is a client side socket timeout waiting for the coordinator to respond. See the CassandraHostConfigurator.setCassandraThriftSocketTimeout() setting. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 11:44 AM, Christof Bornhoevd wrote: Hello, We are using Cassandra 1.0.8 with Hector 1.0-5 on both Windows and Linux. In our development/test environment we always recreate the schema in Cassandra (first dropping all ColumnFamilies then recreating them) and then seed the test data. We simply use cluster.dropColumnFamily(keyspace.getKeyspaceName(), collectionName); to drop ColumnFamilies. The client is using ThriftFramedTransport (configurator.setUseThriftFramedTransport(true);). Every so often we run into the following exception (with different ColumnFamilies):

Caused by: me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:33)
at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:279)
at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:266)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:283)
at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:261)
at com.supervillains.plouton.cassandradatastore.CassandraDataStore.deleteCollection(CassandraDataStore.java:195)
... 57 more

Is this problem related to https://issues.apache.org/jira/browse/CASSANDRA-3551 (which should have been fixed with Cassandra 1.0.6) or could there be anything we do wrong here? Thanks in advance for any kind of help! Chris
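For completeness, a minimal sketch of where that client-side setting lives; the host, cluster name and the 12-second value are just examples:

    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.factory.HFactory;

    public class TimeoutSketch {
        public static Cluster connect() {
            CassandraHostConfigurator configurator =
                    new CassandraHostConfigurator("localhost:9160");
            // Client-side socket timeout in ms. Keep it above the server's
            // rpc_timeout (10 seconds by default) so the server gives up first.
            configurator.setCassandraThriftSocketTimeout(12000);
            return HFactory.getOrCreateCluster("TestCluster", configurator);
        }
    }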
Re: will compaction delete empty rows after all columns expired?
You can set the gc_grace_secs to a small value and force a major compaction after the row is expired. Then please check whether the row still exists.

There are some downsides to major compactions. (There have been some recent discussions.) You can provoke (some) minor compactions by:
* setting the min_compaction_threshold to 2 (not sure if nodetool in 0.7 supports this, you may need to make a schema change)
* using nodetool flush

If you have some larger sstables that do not get compacted, try the userDefinedCompaction() method on the CompactionManager MBean via JMX (I may have gotten the names wrong there in 0.7).

So if I understand... the empty row will only be removed after gc_grace if enough compactions have occurred so that all the column tombstones for the empty row are in a single SSTable file?

We need to know that all the fragments of the row are contained in all of the sstables in the compaction task. They don't have to be in the same SSTable. You need tombstones to stop columns written previously from appearing in the results. If we purge the tombstone and a previous column value is in another sstable the delete will be undone. If you cannot compact the tombstones away let us know. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 2:16 PM, Zhu Han wrote: On Thu, May 31, 2012 at 9:31 AM, Curt Allred c...@mediosystems.com wrote: No, these were not wide rows. They are rows that formerly had one or 2 columns. The columns are deleted but the empty rows don't go away, even after gc_grace_secs. The empty row goes away only during a compaction after the gc_grace_secs. You can set the gc_grace_secs to a small value and force a major compaction after the row is expired. Then please check whether the row still exists. So if I understand... the empty row will only be removed after gc_grace if enough compactions have occurred so that all the column tombstones for the empty row are in a single SSTable file? From: aaron morton [mailto:aa...@thelastpickle.com] Minor compaction will remove the tombstones if the row only exists in the sstable being compacted. Are these very wide rows that are constantly written to? Cheers p.s. cassandra 1.0 really does rock.
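If you do end up needing the MBean, the call can be scripted over JMX roughly as below. The MBean and operation names here are from memory and should be verified in jconsole (0.7 listens for JMX on port 8080 rather than 7199), and the keyspace and sstable file names are placeholders:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UserDefinedCompactionSketch {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // Compact a specific sstable so its tombstones can be purged.
            mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                    new Object[] { "MyKeyspace", "MyCF-hc-1234-Data.db" },
                    new String[] { "java.lang.String", "java.lang.String" });
            jmxc.close();
        }
    }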
Re: About Composite range queries
It is hashed once. To the partitioner it's just some bytes. Other parts of the code care about its structure. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 7:00 PM, Cyril Auburtin wrote: Thanks for the answer. One more thing: a Composite key is not hashed only once, I guess? It's hashed once per part of the composite? So this means there are two or three times (or more) as many keys as for normal column keys, is that true? On 31 May 2012 02:59, aaron morton aa...@thelastpickle.com wrote: Composite Columns compare each part in turn, so the values are ordered as you've shown them. However the rows are not ordered according to key value. They are ordered using the random token generated by the partitioner, see http://wiki.apache.org/cassandra/FAQ#range_rp What is the real advantage compared to super column families? They are faster. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote: How is it done in Cassandra to be able to range query on a composite key? key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C) like get_range(key1, start_column=(A,), end_column=(A, C)); will return [ (A:B:C), (A:C:C) ] (in pycassa) I mean, does the composite implementation add much overhead to make it work? Does it need to add other column families to be able to range query between the composite's simple parts (first, second and third part of the composite)? What is the real advantage compared to super column families? key1 = A: (A,C), (B,C), (C,C), (D,C) , B: (A,C) Thanks
How can we use composite indexes and secondary indexes together
We want to use Cassandra to store complex data, but we can't figure out how to organize the indexes. Our table (column family) looks like this:

Users = {
  RandomId int,
  Firstname varchar,
  Lastname varchar,
  Age int,
  Country int,
  ChildCount int
}

In our queries we have mandatory fields (Firstname, Lastname, Age) and extra search options (Country, ChildCount). How do we organize the indexes to make this kind of query fast? First I thought it would be natural to make a composite index on (Firstname, Lastname, Age) and add a separate secondary index on the remaining fields (Country and ChildCount). But I can't insert rows into the table after creating the secondary indexes, and I also can't query the table. I'm using Cassandra 1.1.0, and cqlsh with the --cql3 option. Any other suggestions to solve our problem (complex queries with mandatory and additional options) are welcome. The main point is, how can we join data in Cassandra? If I make a few index column families, do I need to intersect the values to get the rows that pass all the search criteria? Or should I use something based on Hadoop (Pig, Hive) to make such queries? Respectfully, Nury
Re: About Composite range queries
But sorry, I don't understand. If you hash 4 composite keys, let's say ('A','B','C'), ('A','D','C'), ('A','E','X'), ('A','R','X'), do you have only 4 hashes or do you have more? If it's 4, how come you are able to range query, for example, between start_column=('A','D') and end_column=('A','E') and get the column ('A','D','C')? The composites act like chapter markers across the whole key set; there must be intermediate keys added? 2012/5/31 aaron morton aa...@thelastpickle.com It is hashed once. To the partitioner it's just some bytes. Other parts of the code care about its structure. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 7:00 PM, Cyril Auburtin wrote: Thanks for the answer. One more thing: a Composite key is not hashed only once, I guess? It's hashed once per part of the composite? So this means there are two or three times (or more) as many keys as for normal column keys, is that true? On 31 May 2012 02:59, aaron morton aa...@thelastpickle.com wrote: Composite Columns compare each part in turn, so the values are ordered as you've shown them. However the rows are not ordered according to key value. They are ordered using the random token generated by the partitioner, see http://wiki.apache.org/cassandra/FAQ#range_rp What is the real advantage compared to super column families? They are faster. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote: How is it done in Cassandra to be able to range query on a composite key? key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C) like get_range(key1, start_column=(A,), end_column=(A, C)); will return [ (A:B:C), (A:C:C) ] (in pycassa) I mean, does the composite implementation add much overhead to make it work? Does it need to add other column families to be able to range query between the composite's simple parts (first, second and third part of the composite)? What is the real advantage compared to super column families? key1 = A: (A,C), (B,C), (C,C), (D,C) , B: (A,C) Thanks
RE: nodetool move 0 gets stuck in moving state forever
Let me elaborate a bit. Two-node cluster:
node1 has token 0
node2 has token 85070591730234615865843651857942052864
node1 goes down permanently. Do a nodetool move 0 on node2. Monitoring with ring... it is in the Moving state forever, it seems. From: Poziombka, Wade L Sent: Tuesday, May 29, 2012 4:29 PM To: user@cassandra.apache.org Subject: nodetool move 0 gets stuck in moving state forever If the node with token 0 dies and we just want it gone from the cluster we would do a nodetool move 0. Then when we monitor using nodetool ring it seems to be stuck on Moving forever. Any ideas?
Invalid Counter Shard errors?
Hi guys, We're running a three node cluster of cassandra 1.1 servers, originally 1.0.7 and immediately after the upgrade the error logs of all three servers began filling up with the following message: ERROR [ReplicateOnWriteStage:177] 2012-05-31 08:17:02,236 CounterContext.java (line 381) invalid counter shard detected; (3438afc0-7e71-11e1--da5a9d01e7f7, 3, 4) and (3438afc0-7e71-11e1--da5a9d01e7f7, 3, 7) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard ERROR [ValidationExecutor:20] 2012-05-31 08:17:01,570 CounterContext.java (line 381) invalid counter shard detected; (343cf580-7e71-11e1--ebc411012bff, 14, 27) and (343cf580-7e71-11e1--ebc411012bff, 14, 21) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard The counts change but the errors are constant. What is the best course of action? Google only turns up the source code for these errors. Thanks! Charles
Re: java.net.SocketTimeoutException while Trying to Drop a Collection
Thanks a lot Aaron for the very fast response! I have increased the CassandraThriftSocketTimeout from 5000 to 9000. Is this a reasonable setting? configurator.setCassandraThriftSocketTimeout(9000); Cheers, Christof 2012/5/31 aaron morton aa...@thelastpickle.com There are two types of timeouts. The thrift TimedOutException occurs when the coordinator times out waiting for the CL level nodes to respond. The error is transmitted back to the client and raised. This is a client side socket timeout waiting for the coordinator to respond. See the CassandraHostConfigurator.setCassandraThriftSocketTimeout() setting. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 11:44 AM, Christof Bornhoevd wrote: Hello, We are using Cassandra 1.0.8 with Hector 1.0-5 on both Windows and Linux. In our development/test environment we always recreate the schema in Cassandra (first dropping all ColumnFamilies then recreating them) and then seed the test data. We simply use cluster.dropColumnFamily(keyspace.getKeyspaceName(), collectionName); to drop ColumnFamilies. The client is using ThriftFramedTransport (configurator.setUseThriftFramedTransport(true);). Every so often we run into the following exception (with different ColumnFamilies):

Caused by: me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:33)
at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:279)
at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:266)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:283)
at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:261)
at com.supervillains.plouton.cassandradatastore.CassandraDataStore.deleteCollection(CassandraDataStore.java:195)
... 57 more

Is this problem related to https://issues.apache.org/jira/browse/CASSANDRA-3551 (which should have been fixed with Cassandra 1.0.6) or could there be anything we do wrong here? Thanks in advance for any kind of help! Chris
Re: tokens and RF for multiple phases of deployment
Thanks Aaron. I might use LOCAL_QUORUM to avoid the waiting on the ack from DC2. Another question: after I set up a new node with token +1 in a new DC, and updated a CF with RF {DC1:2, DC2:1}, when I update a column on one node in DC1 it's also updated in the new node in DC2. But all the other rows are not in the new node. Do I need to copy the data files from a node in DC1 to the new node? The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in DC2 is almost 0%.

Address     DC   Rack  Status  State   Load       Owns    Token
                                                          85070591730234615865843651857942052864
10.10.10.1  DC1  RAC1  Up      Normal  313.99 MB  50.00%  0
10.10.10.3  DC2  RAC1  Up      Normal  7.07 MB    0.00%   1
10.10.10.2  DC1  RAC1  Up      Normal  288.91 MB  50.00%  85070591730234615865843651857942052864

Thanks, Chong On Thu, May 31, 2012 at 5:48 AM, aaron morton aa...@thelastpickle.com wrote: Could you provide some guide on how to assign the tokens in these growing deployment phases? Background: http://www.datastax.com/docs/1.0/install/cluster_init#calculating-tokens-for-a-multi-data-center-cluster Start with tokens for a 4 node cluster. Add the next 4 between each of the ranges. Add 8 in the new DC to have the same tokens as the first DC, +1. Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? No. It will fail if there are not enough nodes available in the first DC. We'd like to keep both write and read on the same cluster. Writes go to all replicas. Using EACH_QUORUM means the client in the first DC will be waiting for the quorum from the second DC to ack the write. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 3:20 AM, Chong Zhang wrote: Hi all, We are planning to deploy a small cluster with 4 nodes in one DC first, and will expand that to 8 nodes, then add another DC with 8 nodes for failover (not active-active), so all the traffic will go to the 1st cluster, and switch to the 2nd cluster if the whole 1st cluster is down or on maintenance. Could you provide some guide on how to assign the tokens in these growing deployment phases? I looked at some docs but am not very clear on how to assign tokens in the fail-over case. Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? We'd like to keep both write and read on the same cluster. Thanks in advance, Chong
newbie question: got error 'org.apache.thrift.transport.TTransportException'
Hi, I am new to Cassandra. I started a Cassandra instance (Cassandra.bat), played with it for a while, and created a keyspace Zodiac. When I killed the Cassandra instance and restarted it, the keyspace was gone, but when I tried to recreate it, I got an 'org.apache.thrift.transport.TTransportException' error. What have I done wrong here? Following are screen shots:

C:\cassandra-1.1.0>bin\cassandra-cli -host localhost -f C:\NoSqlProjects\dropZ.txt
Starting Cassandra Client
Connected to: ssc2Cluster on localhost/9160
Line 1 => Keyspace 'Zodiac' not found.

C:\cassandra-1.1.0>bin\cassandra-cli -host localhost -f C:\NoSqlProjects\usageDB.txt
Starting Cassandra Client
Connected to: ssc2Cluster on localhost/9160
Line 1 => org.apache.thrift.transport.TTransportException

Following is part of the server error message:

INFO 11:09:56,761 Node localhost/127.0.0.1 state jump to normal
INFO 11:09:56,761 Bootstrap/Replace/Move completed! Now serving reads.
INFO 11:09:56,761 Will not load MX4J, mx4j-tools.jar is not in the classpath
INFO 11:09:56,781 Binding thrift service to localhost/127.0.0.1:9160
INFO 11:09:56,781 Using TFastFramedTransport with a max frame size of 15728640 bytes.
INFO 11:09:56,791 Using synchronous/threadpool thrift server on localhost/127.0.0.1 : 9160
INFO 11:09:56,791 Listening for thrift clients...
INFO 11:20:06,044 Enqueuing flush of Memtable-schema_keyspaces@1062244145(184/230 serialized/live bytes, 4 ops)
INFO 11:20:06,054 Writing Memtable-schema_keyspaces@1062244145(184/230 serialized/live bytes, 4 ops)
INFO 11:20:06,074 Completed flushing c:\cassandra_data\data\system\schema_keyspaces\system-schema_keyspaces-hc-62-Data.db (240 bytes)
ERROR 11:20:06,134 Exception in thread Thread[MigrationStage:1,5,main]
java.lang.AssertionError
at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441)
at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339)
at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269)
at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
ERROR 11:20:06,134 Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError

usageDB.txt:

create keyspace Zodiac
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};
use Zodiac;
create column family ServiceUsage
  with comparator = UTF8Type
  and default_validation_class = UTF8Type
  and key_validation_class = LongType
  AND column_metadata = [
    {column_name: 'TASK_ID', validation_class: IntegerType},
    {column_name: 'USAGE_COUNT', validation_class: IntegerType},
    {column_name: 'USAGE_TYPE', validation_class: UTF8Type}
  ];

From: Chong Zhang [mailto:chongz.zh...@gmail.com] Sent: Thursday, May 31, 2012 8:47 AM To: user@cassandra.apache.org Subject: Re: tokens and RF for multiple phases of deployment Thanks Aaron. I might use LOCAL_QUORUM to avoid the waiting on the ack from DC2. Another question, after I setup a new node with token +1 in a new DC, and updated a CF with RF {DC1:2, DC2:1}. When I update a column on one node in DC1, it's also updated in the new node in DC2. But all the other rows are not in the new node.
Do I need to copy the data files from a node in DC1 to the new node? The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in DC2 is almost 0%.

Address     DC   Rack  Status  State   Load       Owns    Token
                                                          85070591730234615865843651857942052864
10.10.10.1  DC1  RAC1  Up      Normal  313.99 MB  50.00%  0
10.10.10.3  DC2  RAC1  Up      Normal  7.07 MB    0.00%   1
10.10.10.2  DC1  RAC1  Up      Normal  288.91 MB  50.00%  85070591730234615865843651857942052864

Thanks, Chong On Thu, May 31, 2012 at 5:48 AM, aaron morton aa...@thelastpickle.com wrote: Could you provide some guide on how to assign the tokens in these growing deployment phases? Background: http://www.datastax.com/docs/1.0/install/cluster_init#calculating-tokens-for-a-multi-data-center-cluster Start with tokens for a 4 node cluster. Add the next 4 between each of the ranges. Add 8 in the new DC to have the same tokens as the first DC, +1. Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? No. It will fail if there are not enough nodes available in
Re: cassandra read latency help
Aaron, Thanks for your email. The test kinda resembles how the actual application will be. It is going to be a simple key-value store with 500 million keys per node. The traffic will be read heavy in steady state, and there will be some keys that will have a lot more traffic than others. The expected hot rows are estimated to be anywhere between 50 to 1 million keys. I have already populated this test system with 500 million keys, compacted it all to 1 file to check the size of the bloom filter and the index. This is how i am estimating my memory for 500 million keys. plz correct me if i am wrong or if i am missing any step. bloom filter: 1 gig index samples: Index file is 8.5 gig. I believe this index file is for all keys. Index interval is 128. Hence in RAM, this would be (8.5g / 128)*10 (factor for datastructure overhead) = 664 mb (lets say 1 gig) key cache size (3 million): 3 gigs memtable_total_space_mb : 2 gigs This totals 7 gig. my heap size is 8 gigs. Is there anything else that i am missing here? When i do top right now, it shows java as 96% memory, thats a concern because there is no write load. Should i be looking at any other number here? Off heap row cache: 500,000 - 750,000 ~ 3 and 5 gigs (avg row size = 250-500 bytes) My test system has 16 gigs RAM, production system will mostly have 32 gigs RAM and 12 spindles instead of 6 that i am testing with. I changed the underneath filesystem from xfs to ext2, and i am seeing better results, though not the best. The cfstats latency is down to 20 ms for 35 qps read load. row cache hit rate is 0.21, key cache = 0.75. Measuring from the client side, i am seeing roughly 10-15 ms per key, i would want even lesser though, any tips would greatly help. In production, i am hoping the row cache hit rate will be higher. The biggest thing that is affecting my system right now is the Invalid frame size of 0 error that cassandra server seems to be printing. Its causing read timeouts every minute or 2 minutes. I havent been able to figure out a way to fix this one. I see someone else also reported seeing this, but not sure where the problem is hector, cassandra or thrift. Thanks Gurpreet On Wed, May 30, 2012 at 4:38 PM, aaron morton aa...@thelastpickle.comwrote: 80 ms per request sounds high. I'm doing some guessing here, i am guessing memory usage is the problem.. * I assume you are not longer seeing excessive GC activity. * The key cache will not get used when you hit the row cache. I would disable the row cache if you have a random workload, which it looks like you do. * 500 million is a lot of keys to have on a single node. At the default index sample of every 128 keys it will have about 4 million samples, which is probably taking up a lot of memory. Is this testing a real world scenario or an abstract benchmark ? IMHO you will get more insight from testing something that resembles your application. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 8:48 PM, Gurpreet Singh wrote: Hi Aaron, Here is the latest on this.. i switched to a node with 6 disks and running some read tests, and i am seeing something weird. setup: 1 node, cassandra 1.0.9, 8 cpu, 16 gig RAM, 6 7200 rpm SATA data disks striped 512 kb, commitlog mirrored. 
1 keyspace with just 1 column family random partitioner total number of keys: 500 million (the keys are just longs from 1 to 500 million) avg key size: 8 bytes bloom filter size: 1 gig total disk usage: 70 gigs compacted 1 sstable mean compacted row size: 149 bytes heap size: 8 gigs keycache size: 2 million (takes around 2 gigs in RAM) rowcache size: 1 million (off-heap) memtable_total_space_mb : 2 gigs test: Trying to do 5 reads per second. Each read is a multigetslice query for just 1 key, 2 columns. observations: row cache hit rate: 0.4 key cache hit rate: 0.0 (this will increase later on as system moves to steady state) cfstats - 80 ms iostat (every 5 seconds): r/s : 400 %util: 20% (all disks are at equal utilization) await: 65-70 ms (for each disk) svctm : 2.11 ms (for each disk) r-kB/s - 35000 why this is weird is because.. 5 reads per second is causing a latency of 80 ms per request (according to cfstats). isnt this too high? 35 MB/s is being read from the disk. That is again very weird. This number is way too high, avg row size is just 149 bytes. Even index reads should not cause this high data being read from the disk. what i understand is that each read request translates to 2 disk accesses (because there is only 1 sstable). 1 for the index, 1 for the data. At such a low reads/second, why is the latency so high? would appreciate help debugging this issue. Thanks Gurpreet On Tue, May 22, 2012 at 2:46 AM, aaron morton aa...@thelastpickle.comwrote: With heap size = 4 gigs I would check for GC activity in the logs and consider setting it to 8 given you
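As a back-of-the-envelope check of the heap estimate in the previous message, the quoted figures add up like this (all inputs are the numbers from this thread, not measured values):

    public class HeapEstimateSketch {
        public static void main(String[] args) {
            double bloomFilterGb  = 1.0;
            double indexFileGb    = 8.5;
            int    indexInterval  = 128;
            double overheadFactor = 10.0;  // in-memory data structure overhead
            double indexSamplesGb = indexFileGb / indexInterval * overheadFactor; // ~0.66
            double keyCacheGb     = 3.0;   // ~3 million keys
            double memtablesGb    = 2.0;   // memtable_total_space_mb
            double totalGb = bloomFilterGb + indexSamplesGb + keyCacheGb + memtablesGb;
            System.out.printf("estimated steady-state heap use: %.1f GB of an 8 GB heap%n", totalGb);
        }
    }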
Re: cassandra read latency help
You may also consider disabling key/row cache at all. 1mm rows * 400 bytes = 400MB of data, can easily be in fs cache, and you will access your hot keys with thousands of qps without hitting disk at all. Enabling compression can make situation even better. On Thu, May 31, 2012 at 12:01 PM, Gurpreet Singh gurpreet.si...@gmail.comwrote: Aaron, Thanks for your email. The test kinda resembles how the actual application will be. It is going to be a simple key-value store with 500 million keys per node. The traffic will be read heavy in steady state, and there will be some keys that will have a lot more traffic than others. The expected hot rows are estimated to be anywhere between 50 to 1 million keys. I have already populated this test system with 500 million keys, compacted it all to 1 file to check the size of the bloom filter and the index. This is how i am estimating my memory for 500 million keys. plz correct me if i am wrong or if i am missing any step. bloom filter: 1 gig index samples: Index file is 8.5 gig. I believe this index file is for all keys. Index interval is 128. Hence in RAM, this would be (8.5g / 128)*10 (factor for datastructure overhead) = 664 mb (lets say 1 gig) key cache size (3 million): 3 gigs memtable_total_space_mb : 2 gigs This totals 7 gig. my heap size is 8 gigs. Is there anything else that i am missing here? When i do top right now, it shows java as 96% memory, thats a concern because there is no write load. Should i be looking at any other number here? Off heap row cache: 500,000 - 750,000 ~ 3 and 5 gigs (avg row size = 250-500 bytes) My test system has 16 gigs RAM, production system will mostly have 32 gigs RAM and 12 spindles instead of 6 that i am testing with. I changed the underneath filesystem from xfs to ext2, and i am seeing better results, though not the best. The cfstats latency is down to 20 ms for 35 qps read load. row cache hit rate is 0.21, key cache = 0.75. Measuring from the client side, i am seeing roughly 10-15 ms per key, i would want even lesser though, any tips would greatly help. In production, i am hoping the row cache hit rate will be higher. The biggest thing that is affecting my system right now is the Invalid frame size of 0 error that cassandra server seems to be printing. Its causing read timeouts every minute or 2 minutes. I havent been able to figure out a way to fix this one. I see someone else also reported seeing this, but not sure where the problem is hector, cassandra or thrift. Thanks Gurpreet On Wed, May 30, 2012 at 4:38 PM, aaron morton aa...@thelastpickle.comwrote: 80 ms per request sounds high. I'm doing some guessing here, i am guessing memory usage is the problem.. * I assume you are not longer seeing excessive GC activity. * The key cache will not get used when you hit the row cache. I would disable the row cache if you have a random workload, which it looks like you do. * 500 million is a lot of keys to have on a single node. At the default index sample of every 128 keys it will have about 4 million samples, which is probably taking up a lot of memory. Is this testing a real world scenario or an abstract benchmark ? IMHO you will get more insight from testing something that resembles your application. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 8:48 PM, Gurpreet Singh wrote: Hi Aaron, Here is the latest on this.. i switched to a node with 6 disks and running some read tests, and i am seeing something weird. 
setup: 1 node, cassandra 1.0.9, 8 cpu, 16 gig RAM, 6 7200 rpm SATA data disks striped 512 kb, commitlog mirrored. 1 keyspace with just 1 column family random partitioner total number of keys: 500 million (the keys are just longs from 1 to 500 million) avg key size: 8 bytes bloom filter size: 1 gig total disk usage: 70 gigs compacted 1 sstable mean compacted row size: 149 bytes heap size: 8 gigs keycache size: 2 million (takes around 2 gigs in RAM) rowcache size: 1 million (off-heap) memtable_total_space_mb : 2 gigs test: Trying to do 5 reads per second. Each read is a multigetslice query for just 1 key, 2 columns. observations: row cache hit rate: 0.4 key cache hit rate: 0.0 (this will increase later on as system moves to steady state) cfstats - 80 ms iostat (every 5 seconds): r/s : 400 %util: 20% (all disks are at equal utilization) await: 65-70 ms (for each disk) svctm : 2.11 ms (for each disk) r-kB/s - 35000 why this is weird is because.. 5 reads per second is causing a latency of 80 ms per request (according to cfstats). isnt this too high? 35 MB/s is being read from the disk. That is again very weird. This number is way too high, avg row size is just 149 bytes. Even index reads should not cause this high data being read from the disk. what i understand is that each read request translates to 2 disk accesses
Re: cassandra read latency help
But I think it's bad idea, since hot data will be evenly distributed between multiple sstables and filesystem pages. On Thu, May 31, 2012 at 1:08 PM, crypto five cryptof...@gmail.com wrote: You may also consider disabling key/row cache at all. 1mm rows * 400 bytes = 400MB of data, can easily be in fs cache, and you will access your hot keys with thousands of qps without hitting disk at all. Enabling compression can make situation even better. On Thu, May 31, 2012 at 12:01 PM, Gurpreet Singh gurpreet.si...@gmail.com wrote: Aaron, Thanks for your email. The test kinda resembles how the actual application will be. It is going to be a simple key-value store with 500 million keys per node. The traffic will be read heavy in steady state, and there will be some keys that will have a lot more traffic than others. The expected hot rows are estimated to be anywhere between 50 to 1 million keys. I have already populated this test system with 500 million keys, compacted it all to 1 file to check the size of the bloom filter and the index. This is how i am estimating my memory for 500 million keys. plz correct me if i am wrong or if i am missing any step. bloom filter: 1 gig index samples: Index file is 8.5 gig. I believe this index file is for all keys. Index interval is 128. Hence in RAM, this would be (8.5g / 128)*10 (factor for datastructure overhead) = 664 mb (lets say 1 gig) key cache size (3 million): 3 gigs memtable_total_space_mb : 2 gigs This totals 7 gig. my heap size is 8 gigs. Is there anything else that i am missing here? When i do top right now, it shows java as 96% memory, thats a concern because there is no write load. Should i be looking at any other number here? Off heap row cache: 500,000 - 750,000 ~ 3 and 5 gigs (avg row size = 250-500 bytes) My test system has 16 gigs RAM, production system will mostly have 32 gigs RAM and 12 spindles instead of 6 that i am testing with. I changed the underneath filesystem from xfs to ext2, and i am seeing better results, though not the best. The cfstats latency is down to 20 ms for 35 qps read load. row cache hit rate is 0.21, key cache = 0.75. Measuring from the client side, i am seeing roughly 10-15 ms per key, i would want even lesser though, any tips would greatly help. In production, i am hoping the row cache hit rate will be higher. The biggest thing that is affecting my system right now is the Invalid frame size of 0 error that cassandra server seems to be printing. Its causing read timeouts every minute or 2 minutes. I havent been able to figure out a way to fix this one. I see someone else also reported seeing this, but not sure where the problem is hector, cassandra or thrift. Thanks Gurpreet On Wed, May 30, 2012 at 4:38 PM, aaron morton aa...@thelastpickle.comwrote: 80 ms per request sounds high. I'm doing some guessing here, i am guessing memory usage is the problem.. * I assume you are not longer seeing excessive GC activity. * The key cache will not get used when you hit the row cache. I would disable the row cache if you have a random workload, which it looks like you do. * 500 million is a lot of keys to have on a single node. At the default index sample of every 128 keys it will have about 4 million samples, which is probably taking up a lot of memory. Is this testing a real world scenario or an abstract benchmark ? IMHO you will get more insight from testing something that resembles your application. 
Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 8:48 PM, Gurpreet Singh wrote: Hi Aaron, Here is the latest on this.. i switched to a node with 6 disks and running some read tests, and i am seeing something weird. setup: 1 node, cassandra 1.0.9, 8 cpu, 16 gig RAM, 6 7200 rpm SATA data disks striped 512 kb, commitlog mirrored. 1 keyspace with just 1 column family random partitioner total number of keys: 500 million (the keys are just longs from 1 to 500 million) avg key size: 8 bytes bloom filter size: 1 gig total disk usage: 70 gigs compacted 1 sstable mean compacted row size: 149 bytes heap size: 8 gigs keycache size: 2 million (takes around 2 gigs in RAM) rowcache size: 1 million (off-heap) memtable_total_space_mb : 2 gigs test: Trying to do 5 reads per second. Each read is a multigetslice query for just 1 key, 2 columns. observations: row cache hit rate: 0.4 key cache hit rate: 0.0 (this will increase later on as system moves to steady state) cfstats - 80 ms iostat (every 5 seconds): r/s : 400 %util: 20% (all disks are at equal utilization) await: 65-70 ms (for each disk) svctm : 2.11 ms (for each disk) r-kB/s - 35000 why this is weird is because.. 5 reads per second is causing a latency of 80 ms per request (according to cfstats). isnt this too high? 35 MB/s is being read from the disk. That is again very weird. This number
RE: 1.1 not removing commit log files?
So this happened to me again, but only when the cluster had a node down for a while. Then the commit logs started piling up past the limit I set in the config file, and filled the drive. After the node recovered and hints had replayed, the space was never reclaimed. A flush or drain did not reclaim the space or delete any log files either. Bryce Godfrey | Sr. Software Engineer | Azaleos Corporation | http://www.azaleos.com/ From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] Sent: Tuesday, May 22, 2012 1:10 PM To: user@cassandra.apache.org Subject: RE: 1.1 not removing commit log files? The nodes appear to be holding steady at the 8G that I set it to in the config file now. I'll keep an eye on them. From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, May 22, 2012 4:08 AM To: user@cassandra.apache.org Subject: Re: 1.1 not removing commit log files? 4096 is also the internal hard-coded default for commitlog_total_space_in_mb. If you are seeing more than 4GB of commit log files let us know. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/05/2012, at 6:35 AM, Bryce Godfrey wrote: Thanks, I'll give it a try. -Original Message- From: Alain RODRIGUEZ [mailto:arodr...@gmail.com] Sent: Monday, May 21, 2012 2:12 AM To: user@cassandra.apache.org Subject: Re: 1.1 not removing commit log files? commitlog_total_space_in_mb: 4096 By default this line is commented out in 1.0.x, if I remember correctly. I guess it is the same in 1.1. You really should uncomment this line or your commit logs will entirely fill up your disk, as happened to me a while ago. Alain 2012/5/21 Pieter Callewaert pieter.callewa...@be-mobile.be: Hi, In 1.1 the commitlog files are pre-allocated as files of 128MB. (https://issues.apache.org/jira/browse/CASSANDRA-3411) This should however not exceed your commitlog size in cassandra.yaml. commitlog_total_space_in_mb: 4096 Kind regards, Pieter Callewaert From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] Sent: Monday, 21 May 2012 9:52 To: user@cassandra.apache.org Subject: 1.1 not removing commit log files? The commit log drives on my nodes keep slowly filling up. I don't see any errors in my logs indicating any issues that I can map to this problem. Is this how 1.1 is supposed to work now? Previous versions seemed to keep this drive at a minimum as it flushed. /dev/mapper/mpathf 25G 21G 4.2G 83% /opt/cassandra/commitlog
Re: How can we use composite indexes and secondary indexes together
If you want to do arbitrary complex online / realtime queries look at DataStax Enterprise, https://github.com/tjake/Solandra, or straight Solr. Alternatively, denormalise the model to materialise the results when you insert, so your query is a straight lookup. Or do some client-side filtering / aggregation. If you want to do the queries offline, you can use Pig or Hive with Hadoop over Cassandra. The Apache Cassandra distro includes the Pig support, Hive support is coming (I think), and there are Hadoop interfaces. You can also look at DataStax Enterprise. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 11:07 PM, Nury Redjepow wrote: We want to use Cassandra to store complex data, but we can't figure out how to organize the indexes. Our table (column family) looks like this: Users = { RandomId int, Firstname varchar, Lastname varchar, Age int, Country int, ChildCount int } In our queries we have mandatory fields (Firstname, Lastname, Age) and extra search options (Country, ChildCount). How do we organize the indexes to make this kind of query fast? First I thought it would be natural to make a composite index on (Firstname, Lastname, Age) and add a separate secondary index on the remaining fields (Country and ChildCount). But I can't insert rows into the table after creating the secondary indexes, and I also can't query the table. I'm using Cassandra 1.1.0, and cqlsh with the --cql3 option. Any other suggestions to solve our problem (complex queries with mandatory and additional options) are welcome. The main point is, how can we join data in Cassandra? If I make a few index column families, do I need to intersect the values to get the rows that pass all the search criteria? Or should I use something based on Hadoop (Pig, Hive) to make such queries? Respectfully, Nury
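To make the denormalisation option concrete, one way it could look with Hector is sketched below. The UsersByName column family, the key layout and the idea of packing the optional fields into the column name are all just illustrative:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class MaterialisedIndexSketch {
        // On every user insert, also write the user id into a lookup row keyed
        // by the mandatory fields, so the search becomes a single row read.
        public static void index(Keyspace ks, String userId,
                                 String first, String last, int age, int country) {
            StringSerializer ss = StringSerializer.get();
            Mutator<String> m = HFactory.createMutator(ks, ss);
            String lookupKey = first + ":" + last + ":" + age;
            // The column name carries an optional field so it can be sliced on
            // read; the value is the row key back into the Users CF.
            m.addInsertion(lookupKey, "UsersByName",
                    HFactory.createStringColumn("country:" + country + ":" + userId, userId));
            m.execute();
        }
    }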
Re: Cassandra Data Archiving
I'm not sure of your needs, but the simplest thing to consider is snapshotting and copying the snapshots off the node. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 12:23 AM, Shubham Srivastava wrote: I need to archive my Cassandra data into another permanent storage. Two intents: 1. To shed the unused data from the live data. 2. To use the archived data for analytics, or as a potential source for a data warehouse. Any recommendations in terms of strategies or tools to use? Regards, Shubham Srivastava | Technical Lead - Technology Development +91 124 4910 548 | MakeMyTrip.com, 243 SP Infocity, Udyog Vihar Phase 1, Gurgaon, Haryana - 122 016, India
Re: About Composite range queries
If you hash 4 composite keys, let's say ('A','B','C'), ('A','D','C'), ('A','E','X'), ('A','R','X'), do you have only 4 hashes or do you have more? Four. If it's 4, how come you are able to range query, for example, between start_column=('A','D') and end_column=('A','E') and get the column ('A','D','C')? That's a slice query against columns; the column name is not hashed. Column names are sorted according to the comparator, which can be different from the raw byte order. A range query is against rows. Row keys are hashed (using the Random Partitioner) to create tokens, and rows are stored in token order. The composites act like chapter markers across the whole key set; there must be intermediate keys added? Not sure what you mean. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 12:52 AM, Cyril Auburtin wrote: But sorry, I don't understand. If you hash 4 composite keys, let's say ('A','B','C'), ('A','D','C'), ('A','E','X'), ('A','R','X'), do you have only 4 hashes or do you have more? If it's 4, how come you are able to range query, for example, between start_column=('A','D') and end_column=('A','E') and get the column ('A','D','C')? The composites act like chapter markers across the whole key set; there must be intermediate keys added? 2012/5/31 aaron morton aa...@thelastpickle.com It is hashed once. To the partitioner it's just some bytes. Other parts of the code care about its structure. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 7:00 PM, Cyril Auburtin wrote: Thanks for the answer. One more thing: a Composite key is not hashed only once, I guess? It's hashed once per part of the composite? So this means there are two or three times (or more) as many keys as for normal column keys, is that true? On 31 May 2012 02:59, aaron morton aa...@thelastpickle.com wrote: Composite Columns compare each part in turn, so the values are ordered as you've shown them. However the rows are not ordered according to key value. They are ordered using the random token generated by the partitioner, see http://wiki.apache.org/cassandra/FAQ#range_rp What is the real advantage compared to super column families? They are faster. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote: How is it done in Cassandra to be able to range query on a composite key? key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C) like get_range(key1, start_column=(A,), end_column=(A, C)); will return [ (A:B:C), (A:C:C) ] (in pycassa) I mean, does the composite implementation add much overhead to make it work? Does it need to add other column families to be able to range query between the composite's simple parts (first, second and third part of the composite)? What is the real advantage compared to super column families? key1 = A: (A,C), (B,C), (C,C), (D,C) , B: (A,C) Thanks
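For what it's worth, the pycassa slice earlier in the thread looks roughly like this in Hector, assuming a column family whose comparator is CompositeType(UTF8Type, UTF8Type, UTF8Type); the column family and row key names are placeholders:

    import me.prettyprint.cassandra.serializers.CompositeSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.ColumnSlice;
    import me.prettyprint.hector.api.beans.Composite;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class CompositeSliceSketch {
        // Slice the columns of one row between ('A','D') and ('A','E').
        // This is a column slice within a single row; only the row key itself
        // is hashed by the partitioner, and no extra keys are created.
        public static ColumnSlice<Composite, String> slice(Keyspace ks) {
            StringSerializer ss = StringSerializer.get();
            Composite start = new Composite();
            start.addComponent("A", ss);
            start.addComponent("D", ss);
            Composite finish = new Composite();
            finish.addComponent("A", ss);
            finish.addComponent("E", ss);
            SliceQuery<String, Composite, String> q = HFactory.createSliceQuery(
                    ks, ss, new CompositeSerializer(), ss);
            q.setColumnFamily("MyCF");
            q.setKey("key1");
            q.setRange(start, finish, false, 100);
            return q.execute().get();
        }
    }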
Re: nodetool move 0 gets stuck in moving state forever
Look in the logs for errors or warnings. Also let us know what version you are using. I'm guessing that node 2 still thought that node 1 was in the cluster when you did the move, which should(?) have errored. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 1:50 AM, Poziombka, Wade L wrote: Let me elaborate a bit: two node cluster, node1 has token 0, node2 has token 85070591730234615865843651857942052864. node1 goes down permanently. We do a nodetool move 0 on node2 and monitor with ring... it is in Moving state forever, it seems. From: Poziombka, Wade L Sent: Tuesday, May 29, 2012 4:29 PM To: user@cassandra.apache.org Subject: nodetool move 0 gets stuck in moving state forever If the node with token 0 dies and we just want it gone from the cluster we would do a nodetool move 0. Then we monitor using nodetool ring and it seems to be stuck on Moving forever. Any ideas?
Re: Invalid Counter Shard errors?
I suggest creating a ticket on https://issues.apache.org/jira/browse/CASSANDRA with the details. If it is an immediate concern, see if you can find someone in the #cassandra chat room http://cassandra.apache.org/ Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 3:20 AM, Charles Brophy wrote: Hi guys, We're running a three node cluster of cassandra 1.1 servers, originally 1.0.7, and immediately after the upgrade the error logs of all three servers began filling up with the following message: ERROR [ReplicateOnWriteStage:177] 2012-05-31 08:17:02,236 CounterContext.java (line 381) invalid counter shard detected; (3438afc0-7e71-11e1--da5a9d01e7f7, 3, 4) and (3438afc0-7e71-11e1--da5a9d01e7f7, 3, 7) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard ERROR [ValidationExecutor:20] 2012-05-31 08:17:01,570 CounterContext.java (line 381) invalid counter shard detected; (343cf580-7e71-11e1--ebc411012bff, 14, 27) and (343cf580-7e71-11e1--ebc411012bff, 14, 21) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard The counts change but the errors are constant. What is the best course of action? Google only turns up the source code for these errors. Thanks! Charles
Re: java.net.SocketTimeoutException while Trying to Drop a Collection
The default value for rpc_timeout is 10 seconds (rpc_timeout_in_ms: 10000). You want the socket timeout to be higher than the rpc_timeout, otherwise the client will give up before the server. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 3:26 AM, Christof Bornhoevd wrote: Thanks a lot Aaron for the very fast response! I have increased the CassandraThriftSocketTimeout from 5000 to 9000. Is this a reasonable setting? configurator.setCassandraThriftSocketTimeout(9000); Cheers, Christof 2012/5/31 aaron morton aa...@thelastpickle.com There are two types of timeouts. The thrift TimedOutException occurs when the coordinator times out waiting for the CL level nodes to respond; the error is transmitted back to the client and raised. The java.net.SocketTimeoutException you are seeing is a client side socket timeout waiting for the coordinator to respond. See the CassandraHostConfigurator.setCassandraThriftSocketTimeout() setting. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 11:44 AM, Christof Bornhoevd wrote: Hello, We are using Cassandra 1.0.8 with Hector 1.0-5 on both Windows and Linux. In our development/test environment we always recreate the schema in Cassandra (first dropping all ColumnFamilies then recreating them) and then seed the test data. We simply use cluster.dropColumnFamily(keyspace.getKeyspaceName(), collectionName); to drop ColumnFamilies. The client is using ThriftFramedTransport (configurator.setUseThriftFramedTransport(true);). Every so often we run into the following exception (with different ColumnFamilies): Caused by: me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:33) at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:279) at me.prettyprint.cassandra.service.AbstractCluster$7.execute(AbstractCluster.java:266) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:283) at me.prettyprint.cassandra.service.AbstractCluster.dropColumnFamily(AbstractCluster.java:261) at com.supervillains.plouton.cassandradatastore.CassandraDataStore.deleteCollection(CassandraDataStore.java:195) ... 57 more Is this problem related to https://issues.apache.org/jira/browse/CASSANDRA-3551 (which should have been fixed with Cassandra 1.0.6) or could we be doing something wrong here? Thanks in advance for any kind help! Chris
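For reference, a minimal Hector setup along these lines might look like the sketch below; the host, cluster name, and the 15-second value are illustrative assumptions rather than settings from the thread, the only requirement being that the client-side socket timeout exceeds the server's rpc_timeout_in_ms:

```java
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class TimeoutConfigExample {
    public static void main(String[] args) {
        // Hypothetical host and cluster name, for illustration only.
        CassandraHostConfigurator configurator = new CassandraHostConfigurator("127.0.0.1:9160");
        configurator.setUseThriftFramedTransport(true);
        // Client-side Thrift socket timeout in milliseconds. Keep it above the
        // server's rpc_timeout_in_ms so the coordinator can respond (or time out
        // with a TimedOutException) before the client abandons the socket.
        configurator.setCassandraThriftSocketTimeout(15000);

        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", configurator);
        // Schema operations such as cluster.dropColumnFamily(...) now use the
        // longer client-side timeout.
        System.out.println("Connected to cluster: " + cluster.getName());
    }
}
```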
Re: tokens and RF for multiple phases of deployment
The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in DC2 is almost 0%. Yeah, that's the way it will look. But all the other rows are not in the new node. Do I need to copy the data files from a node in DC1 to the new node? How did you add the node? (see http://www.datastax.com/docs/1.0/operations/cluster_management#adding-nodes-to-a-cluster) If in doubt, run nodetool repair on the new node. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 3:46 AM, Chong Zhang wrote: Thanks Aaron. I might use LOCAL_QUORUM to avoid the waiting on the ack from DC2. Another question: after I set up a new node with token +1 in a new DC, I updated a CF with RF {DC1:2, DC2:1}. When I update a column on one node in DC1, it's also updated in the new node in DC2. But all the other rows are not in the new node. Do I need to copy the data files from a node in DC1 to the new node? The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in DC2 is almost 0%. Address DC Rack Status State Load Owns Token 85070591730234615865843651857942052864 10.10.10.1 DC1 RAC1 Up Normal 313.99 MB 50.00% 0 10.10.10.3 DC2 RAC1 Up Normal 7.07 MB 0.00% 1 10.10.10.2 DC1 RAC1 Up Normal 288.91 MB 50.00% 85070591730234615865843651857942052864 Thanks, Chong On Thu, May 31, 2012 at 5:48 AM, aaron morton aa...@thelastpickle.com wrote: Could you provide some guidance on how to assign the tokens in these growing deployment phases? background http://www.datastax.com/docs/1.0/install/cluster_init#calculating-tokens-for-a-multi-data-center-cluster Start with tokens for a 4 node cluster. Add the next 4 between each of the existing ranges. Add 8 in the new DC with the same tokens as the first DC, each offset by +1. Also if we use the same RF (3) in both DCs, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? No. It will fail if there are not enough nodes available in the first DC. We'd like to keep both write and read on the same cluster. Writes go to all replicas. Using EACH_QUORUM means the client in the first DC will be waiting for the quorum from the second DC to ack the write. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 31/05/2012, at 3:20 AM, Chong Zhang wrote: Hi all, We are planning to deploy a small cluster with 4 nodes in one DC first, and will expand that to 8 nodes, then add another DC with 8 nodes for failover (not active-active), so all the traffic will go to the 1st cluster, and switch to the 2nd cluster if the whole 1st cluster is down or under maintenance. Could you provide some guidance on how to assign the tokens in these growing deployment phases? I looked at some docs but am not very clear on how to assign tokens for the fail-over case. Also if we use the same RF (3) in both DCs, and use EACH_QUORUM for write and LOCAL_QUORUM for read, can the read also reach the 2nd cluster? We'd like to keep both write and read on the same cluster. Thanks in advance, Chong
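As a rough sketch of the token arithmetic described above (evenly spaced RandomPartitioner tokens for the first DC, and the same tokens offset by 1 in the second DC), the snippet below assumes the example's 4-node layout; the node counts are illustrative, not a recommendation:

```java
import java.math.BigInteger;

public class TokenLayout {
    // The RandomPartitioner token space is [0, 2^127).
    private static final BigInteger RANGE = BigInteger.valueOf(2).pow(127);

    // Evenly spaced token for node i of nodeCount in one data centre.
    static BigInteger token(int i, int nodeCount) {
        return RANGE.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        int nodesPerDc = 4; // illustrative: the initial 4-node DC1
        for (int i = 0; i < nodesPerDc; i++) {
            BigInteger dc1 = token(i, nodesPerDc);
            BigInteger dc2 = dc1.add(BigInteger.ONE); // same token + 1 for the second DC
            System.out.println("DC1 node " + i + ": " + dc1 + "   DC2 node " + i + ": " + dc2);
        }
        // When DC1 grows from 4 to 8 nodes, the 4 new tokens bisect the existing
        // ranges, i.e. token(1, 8), token(3, 8), token(5, 8) and token(7, 8).
    }
}
```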
Re: newbie question: got error 'org.apache.thrift.transport.TTransportException'
Sounds like https://issues.apache.org/jira/browse/CASSANDRA-4219?attachmentOrder=desc Drop back to 1.0.10 and have a play. Good luck. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 6:38 AM, Chen, Simon wrote: Hi, I am new to Cassandra. I started a Cassandra instance (Cassandra.bat), played with it for a while, and created a keyspace Zodiac. When I killed the Cassandra instance and restarted it, the keyspace was gone, but when I tried to recreate it, I got an 'org.apache.thrift.transport.TTransportException' error. What have I done wrong here? Following are the screen shots: C:\cassandra-1.1.0>bin\cassandra-cli -host localhost -f C:\NoSqlProjects\dropZ.txt Starting Cassandra Client Connected to: ssc2Cluster on localhost/9160 Line 1 => Keyspace 'Zodiac' not found. C:\cassandra-1.1.0>bin\cassandra-cli -host localhost -f C:\NoSqlProjects\usageDB.txt Starting Cassandra Client Connected to: ssc2Cluster on localhost/9160 Line 1 => org.apache.thrift.transport.TTransportException Following is part of the server error message: INFO 11:09:56,761 Node localhost/127.0.0.1 state jump to normal INFO 11:09:56,761 Bootstrap/Replace/Move completed! Now serving reads. INFO 11:09:56,761 Will not load MX4J, mx4j-tools.jar is not in the classpath INFO 11:09:56,781 Binding thrift service to localhost/127.0.0.1:9160 INFO 11:09:56,781 Using TFastFramedTransport with a max frame size of 15728640 bytes. INFO 11:09:56,791 Using synchronous/threadpool thrift server on localhost/127.0.0.1 : 9160 INFO 11:09:56,791 Listening for thrift clients... INFO 11:20:06,044 Enqueuing flush of Memtable-schema_keyspaces@1062244145(184/230 serialized/live bytes, 4 ops) INFO 11:20:06,054 Writing Memtable-schema_keyspaces@1062244145(184/230 serialized/live bytes, 4 ops) INFO 11:20:06,074 Completed flushing c:\cassandra_data\data\system\schema_keyspaces\system-schema_keyspaces-hc-62-Data.db (240 bytes) ERROR 11:20:06,134 Exception in thread Thread[MigrationStage:1,5,main] java.lang.AssertionError at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441) at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339) at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269) at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ERROR 11:20:06,134 Error occurred during processing of message. java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError usageDB.txt: create keyspace Zodiac with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; use Zodiac; create column family ServiceUsage with comparator = UTF8Type and default_validation_class = UTF8Type and key_validation_class = LongType AND column_metadata = [ {column_name: 'TASK_ID', validation_class: IntegerType}, {column_name: 'USAGE_COUNT', validation_class: IntegerType}, {column_name: 'USAGE_TYPE', validation_class: UTF8Type} ];
Re: 1.1 not removing commit log files?
Could be this https://issues.apache.org/jira/browse/CASSANDRA-4201 But that talks about segments not being cleared at startup. It does not explain why they were allowed to get past the limit in the first place. Can you share some logs from the time the commit log got out of control? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/06/2012, at 9:34 AM, Bryce Godfrey wrote: So this happened to me again, but it was only when the cluster had a node down for a while. Then the commit logs started piling up past the limit I set in the config file, and filled the drive. After the node recovered and hints had replayed, the space was never reclaimed. A flush or drain did not reclaim the space or delete any log files either. Bryce Godfrey | Sr. Software Engineer | Azaleos Corporation From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] Sent: Tuesday, May 22, 2012 1:10 PM To: user@cassandra.apache.org Subject: RE: 1.1 not removing commit log files? The nodes appear to be holding steady at the 8G that I set it to in the config file now. I’ll keep an eye on them. From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, May 22, 2012 4:08 AM To: user@cassandra.apache.org Subject: Re: 1.1 not removing commit log files? 4096 is also the internal hard coded default for commitlog_total_space_in_mb. If you are seeing more than 4GB of commit log files, let us know. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/05/2012, at 6:35 AM, Bryce Godfrey wrote: Thanks, I'll give it a try. -Original Message- From: Alain RODRIGUEZ [mailto:arodr...@gmail.com] Sent: Monday, May 21, 2012 2:12 AM To: user@cassandra.apache.org Subject: Re: 1.1 not removing commit log files? commitlog_total_space_in_mb: 4096 By default this line is commented out in 1.0.x if I remember well. I guess it is the same in 1.1. You really should uncomment this line or your commit logs will entirely fill up your disk, as happened to me a while ago. Alain 2012/5/21 Pieter Callewaert pieter.callewa...@be-mobile.be: Hi, In 1.1 the commit log files are pre-allocated as 128MB segments. (https://issues.apache.org/jira/browse/CASSANDRA-3411) This should however not exceed your commit log size limit in cassandra.yaml. commitlog_total_space_in_mb: 4096 Kind regards, Pieter Callewaert From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] Sent: maandag 21 mei 2012 9:52 To: user@cassandra.apache.org Subject: 1.1 not removing commit log files? The commit log drives on my nodes keep slowly filling up. I don't see any errors in my logs that I can map to this issue. Is this how 1.1 is supposed to work now? Previous versions seemed to keep this drive at a minimum as it flushed. /dev/mapper/mpathf 25G 21G 4.2G 83% /opt/cassandra/commitlog
RE: Cassandra Data Archiving
Problem statement: We are keeping daily generated data (user generated content) in Cassandra, but our application only uses the most recent 15 days of data. So how can we archive data older than 15 days to reduce the load on the Cassandra ring? Note: we can't apply TTL, as this data may be needed in the future. From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Friday, June 01, 2012 6:57 AM To: user@cassandra.apache.org Subject: Re: Cassandra Data Archiving I'm not sure of your needs, but the simplest thing to consider is snapshotting and copying the snapshots off the node. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com
Re: Cassandra Data Archiving
On Fri, Jun 1, 2012 at 12:28 PM, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote: Problem statement: We are keeping daily generated data (user generated content) in Cassandra, but our application only uses the most recent 15 days of data. So how can we archive data older than 15 days to reduce the load on the Cassandra ring? Can you put the new data into a different column family? Note: we can't apply TTL, as this data may be needed in the future.
Re: Cassandra Data Archiving
I believe you are talking about HDD space consumed by user generated data which is no longer required after 15 days, or is only rarely required. The first option is TTL, which you don't want to use. The second, as Aaron pointed out, is snapshotting the data, but the data still exists in the cluster and the snapshot is only used for backup. I would think of using column family buckets: 15 days per bucket, 2 buckets a month. Create a new CF every 15th day with a timestamp marker, e.g. trip_offer_cf_[ts - ts%(86400*15)], and cache the CF name in the app for 15 days. After the 15th day the old CF bucket becomes read only and no writes go into it; snapshot that old bucket's data and delete the CF a few days later. This keeps the CF count fixed: current CF count = n, bucketed CF count = b*n. Use a separate cluster for analytics on the old data. /Samal On Fri, Jun 1, 2012 at 9:58 AM, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote: Problem statement: We are keeping daily generated data (user generated content) in Cassandra, but our application only uses the most recent 15 days of data. So how can we archive data older than 15 days to reduce the load on the Cassandra ring? Note: we can't apply TTL, as this data may be needed in the future.
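As a rough sketch of the bucket-naming arithmetic Samal describes, assuming the timestamp is Unix time in seconds and reusing the trip_offer_cf prefix from his example (the exact naming scheme is up to the application):

```java
public class CfBucket {
    private static final long BUCKET_SECONDS = 86400L * 15; // 15-day buckets

    // Returns the column family name for the bucket containing the given Unix timestamp,
    // i.e. trip_offer_cf_[ts - ts % (86400 * 15)].
    static String bucketCfName(long unixSeconds) {
        long bucketStart = unixSeconds - (unixSeconds % BUCKET_SECONDS);
        return "trip_offer_cf_" + bucketStart;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // Writes always go to the current bucket; the previous bucket becomes read only
        // once 'now' rolls over into a new 15-day window, at which point it can be
        // snapshotted and later dropped.
        System.out.println("Current bucket CF:  " + bucketCfName(now));
        System.out.println("Previous bucket CF: " + bucketCfName(now - BUCKET_SECONDS));
    }
}
```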