Re: Filter data on row key in Cassandra Hadoop's Random Partitioner
Thanks Hiller and Shamim. Let me share more details. I want to use Cassandra MR to calculate some KPIs on data that is stored in Cassandra continuously, so fetching the whole data set from Cassandra every time seems like an overhead to me. The row key I'm using is like (timestamp/6)_otherid; this CF contains references to the row keys of the actual data stored in another CF. So to calculate a KPI I will work on a particular minute, fetch the referenced data from the other CF, and process it. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Filter-data-on-row-key-in-Cassandra-Hadoop-s-Random-Partitioner-tp7584212p7584263.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
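The minute-bucket scheme described here can be sketched roughly as follows. This is a toy sketch, not the poster's actual code: the divisor is assumed to map a millisecond timestamp to a minute bucket, and the class and method names are invented.

```java
// Sketch of a time-bucketed index row key: (timestamp / bucket)_otherid.
// A KPI job for one minute then reads a single index row instead of
// scanning the whole column family. All names here are hypothetical.
public class BucketKeys {
    // Assumption: the divisor turns a millisecond timestamp into a minute bucket.
    static final long BUCKET_MILLIS = 60_000L;

    static String rowKey(long timestampMillis, String otherId) {
        return (timestampMillis / BUCKET_MILLIS) + "_" + otherId;
    }
}
```

Two events in the same minute land in the same bucket (e.g. timestamps 120000 and 150000 both map to bucket 2), so one `get` per minute bucket retrieves all references for that minute.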
Does a scrub remove deleted/expired columns?
I'm using 1.0.12 and I find that large sstables tend to get compacted infrequently. I've got data that gets deleted or expired frequently. Is it possible to use scrub to accelerate the clean up of expired/deleted data? -- Mike Smith Director Development, MailChannels
Best Java Driver for Cassandra?
There seem to be a number of good options listed ... FireBrand and Hector seem to have the most attractive sites, but that doesn't necessarily mean anything. :) Can anybody make a case for one of the drivers over another, especially in terms of which ones seem to be most used in major implementations? Thanks Steve
Re: Why Secondary indexes is so slowly by my test?
Until the secondary-index read-before-write fix is in a release and stabilized, you should follow Ed Anuff's blog and do your indexing yourself with composites. On Thursday, December 13, 2012, aaron morton aa...@thelastpickle.com wrote: The IndexClause for get_indexed_slices takes a start key. You can page the results of your secondary index query by making multiple calls with a sane count and including a start key. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/12/2012, at 6:34 PM, Chengying Fang cyf...@ngnsoft.com wrote: You are right, Dean. It's due to the heavy result set returned by the query, not the index itself. According to my test, if the result is fewer than 5000 rows, it's very quick. But how to limit the result? A row limit seems a good choice, but if I do so, some rows I want may be missed because the row order does not satisfy the query conditions. For example: CF User{I1,C1} with an index on I1, and query conditions I1=foo order by C1. If I1=foo is queried with limit 100, I can't get the correct result ordered by C1. Also, we cannot always make the row range satisfy the query conditions when querying. Maybe I should redesign the CF model to fix it. -- Original -- From: Hiller, Dean dean.hil...@nrel.gov Date: Wed, Dec 12, 2012 10:51 PM To: user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? You could always try PlayOrm's query capability on top of Cassandra ;) ... it works for us. Dean From: Chengying Fang cyf...@ngnsoft.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 11, 2012 8:22 PM To: user user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? Thanks to Low. We use composite columns as a substitute in single non-equality plus definite-equality queries.
And we will give up Cassandra because of its weak query ability and instability. Many times we found our data in confusion, without a definite cause, in our cluster. For example, with only two rows in one CF, row1-columnname1-columnvalue1 and row2-columnname2-columnvalue2, sometimes it becomes row1-columnname1-columnvalue2 and row2-columnname2-columnvalue1. Notice the wrong column values. -- Original -- From: Richard Low r...@acunu.com Date: Tue, Dec 11, 2012 07:44 PM To: user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? Hi, Secondary index lookups are more complicated than normal queries, so they will be slower. Items have to first be queried in the index, then retrieved from their actual location. Also, inserting into indexed CFs will be slower (but will get substantially faster in 1.2 due
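For reference, the "index it yourself with composites" idea Ed suggests can be sketched in plain Java. This is an in-memory stand-in for an index CF, with all names invented; real composite columns are byte-encoded comparator types rather than string-joined, but the shape of the query is the same: one slice over a wide index row.

```java
import java.util.*;

// Minimal in-memory sketch of a hand-rolled secondary index using
// composite-style column names: one wide index row per indexed value,
// with columns "value:rowKey" kept in sorted order so a lookup is a
// single ordered slice. Names are hypothetical, not a Cassandra API.
public class CompositeIndex {
    // Stand-in for the index CF: indexed value -> sorted column names.
    private final Map<String, NavigableSet<String>> index = new HashMap<>();

    void insert(String userRowKey, String indexedValue) {
        index.computeIfAbsent(indexedValue, v -> new TreeSet<>())
             .add(indexedValue + ":" + userRowKey);
    }

    // "WHERE I1 = value" becomes one slice over the index row.
    List<String> query(String indexedValue) {
        List<String> keys = new ArrayList<>();
        for (String col : index.getOrDefault(indexedValue, new TreeSet<>())) {
            keys.add(col.substring(col.indexOf(':') + 1));
        }
        return keys;
    }
}
```

Because the columns are stored sorted, paging is natural: each call can resume from the last column name returned, which is the same start-key trick Aaron describes for `get_indexed_slices`.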
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
It should be good stuff. Brian eats this stuff for lunch. On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote: FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Help on MMap of SSTables
This issue has to be looked at from a micro and a macro level. On the micro level the best approach is workload specific. On the macro level this mostly boils down to data and memory size. Compactions are going to churn the cache; this is unavoidable. IMHO solid state makes the micro optimization meaningless in the big picture. Not that we should not consider tweaking flags, but it is hard to believe anything like that is a game changer. On Monday, December 10, 2012, Rob Coli rc...@palominodb.com wrote: On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com wrote: So for memory mapped files, compaction can do a madvise SEQUENTIAL instead of the current DONTNEED flag after detecting appropriate OS versions. Will this help? AFAIK compaction does use memory mapped file access. The history: https://issues.apache.org/jira/browse/CASSANDRA-1470 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Best Java Driver for Cassandra?
Well, we'll talk a bit about this in my webinar later today: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I put together a quick decision matrix for all of the options based on production-readiness, potential, and momentum. I think the slides will be made available afterwards. I also have a laundry list here (written before I knew about FireBrand): http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com This information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email above and delete this email and any attachments and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.
Re: Why Secondary indexes is so slowly by my test?
Hi Edward, can you share the link to this blog? Alain 2012/12/13 Edward Capriolo edlinuxg...@gmail.com Ed Anuff's
Re: Why Secondary indexes is so slowly by my test?
Here is a good start: http://www.anuff.com/2011/02/indexing-in-cassandra.html On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Edward, can you share the link to this blog? Alain 2012/12/13 Edward Capriolo edlinuxg...@gmail.com Ed Anuff's
Re: Why Secondary indexes is so slowly by my test?
If anyone's interested in a little more background on the read-before-write fix that Ed mentioned, see: https://issues.apache.org/jira/browse/CASSANDRA-2897 On Thu, Dec 13, 2012 at 11:31 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Here is a good start: http://www.anuff.com/2011/02/indexing-in-cassandra.html -- Tyler Hobbs DataStax http://datastax.com/
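The read-before-write issue tracked in CASSANDRA-2897 boils down to this: updating an indexed column must also remove the stale index entry, which requires reading the old value first. A minimal in-memory sketch of that obligation (not Cassandra internals; all names are invented):

```java
import java.util.*;

// Why index maintenance implies read-before-write: when an indexed value
// is overwritten, the old index entry must be removed, and finding it
// requires reading the previous value. Toy model only.
public class IndexedStore {
    private final Map<String, String> data = new HashMap<>();        // rowKey -> value
    private final Map<String, Set<String>> index = new HashMap<>();  // value -> rowKeys

    void put(String rowKey, String value) {
        String old = data.put(rowKey, value);   // the "read before write"
        if (old != null) {
            Set<String> rows = index.get(old);
            rows.remove(rowKey);                // drop the stale index entry
            if (rows.isEmpty()) index.remove(old);
        }
        index.computeIfAbsent(value, v -> new HashSet<>()).add(rowKey);
    }

    Set<String> rowsWith(String value) {
        return index.getOrDefault(value, Collections.emptySet());
    }
}
```

Without the read, an overwrite would leave a dangling entry under the old value, which is exactly the consistency problem the JIRA ticket discusses.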
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
I tried to register and got the following page, and haven't received an email yet. I registered 10 minutes ago. "Thank you for registering to attend: Is My App a Good Fit for Apache Cassandra? Details about this webinar have also been sent to your email, including a link to the webinar's URL. Webinar Description: Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra, as he examines the types of applications that are suited to be built on top of Cassandra. Eric will talk about the key considerations for designing and deploying your application on Apache Cassandra." How come it's saying "Is My App a Good Fit for Apache Cassandra?", which was the previous webinar? Thanks. -Wei
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
Never mind, the email arrived after 15 minutes or so... -Wei
State of Cassandra and Java 7
Hey Guys, With Java 6 being EOL'd soon (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the status of Cassandra's Java 7 support? Anyone using it in production? Any outstanding *known* issues? -- Drew
Re: State of Cassandra and Java 7
Works just fine for us.
BulkOutputFormat error - org.apache.thrift.transport.TTransportException
Hi, I am a newbie to Cassandra. I was trying out a sample (word count) program using BulkOutputFormat and got stuck with an error. What I am trying to do is migrate all Hive tables (from a Hadoop cluster) to Cassandra column families. My MR program is configured to run on a Hadoop cluster v0.20.2 (cdh3u3) by pointing the job config params 'fs.default.name' and 'mapred.job.tracker' appropriately. The output is pointed at my local Cassandra v1.1.7. I have set the following params for writing to Cassandra: conf.set("cassandra.output.keyspace", "Customer"); conf.set("cassandra.output.columnfamily", "words"); conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner"); conf.set("cassandra.output.thrift.port", "9160"); // default conf.set("cassandra.output.thrift.address", "localhost"); conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "10"); But the program fails with the error below: 12/12/13 15:32:55 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing. Cassandra thrift address : localhost Cassandra thrift port : 9160 12/12/13 15:32:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/12/13 15:34:21 INFO input.FileInputFormat: Total input paths to process : 1 12/12/13 15:34:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 12/12/13 15:34:21 WARN snappy.LoadSnappy: Snappy native library not loaded 12/12/13 15:34:22 INFO mapred.JobClient: Running job: job_20121201_4622 12/12/13 15:34:23 INFO mapred.JobClient: map 0% reduce 0% 12/12/13 15:34:28 INFO mapred.JobClient: map 100% reduce 0% 12/12/13 15:34:37 INFO mapred.JobClient: map 100% reduce 33% 12/12/13 15:34:39 INFO mapred.JobClient: Task Id : attempt_20121201_4622_r_00_0, Status : FAILED java.lang.RuntimeException: Could not retrieve endpoint ranges: at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:328) at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:116) at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:111) at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:223) at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:208) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:573) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectE Please help me understand the problem. Regards Anand B The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. 
When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
Re: Multiple Data Center shows very uneven load
There is a limit on the size of the commit log and on how long hints are stored for. I'm not sure why your load was different; I think it was left-over hints and commit log, but it's not always easy to diagnose things via email. Hopefully nodetool drain, or deleting the data and starting again, will get you moving forwards. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 14/12/2012, at 12:50 AM, Sergey Olefir solf.li...@gmail.com wrote: I'll try nodetool drain, thanks. But more generally -- are you basically saying that I should not worry about these things? Data will not keep accumulating indefinitely in production, and it will not affect performance negatively (despite vast differences in node load)? Best regards, Sergey aaron morton wrote: Try nodetool drain. It will flush everything to disk and the commit log will be truncated. HH can be ignored. If you really want them gone they can be purged using the JMX interface, or you can stop the node and delete the sstables. Cheers - Aaron Morton On 13/12/2012, at 10:35 AM, Sergey Olefir solf.lists@ wrote: Nick Bailey-2 wrote: Dropping a keyspace causes a snapshot to be taken of the keyspace before it is removed from the schema, so it won't actually delete any data. You can manually delete the data from /var/lib/cassandra/<ks>/<cf[s]>/snapshots Indeed, it looks like the snapshot is on the file system. 
However, it looks like that is not the only thing by a long shot, i.e.:

cassa1-1:/var/log/cassandra# du -k /spool1/cassandra/data/1.1/
375372   /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots/1355222054452-marquisColumnFamily
375376   /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots
375380   /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily
375384   /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace
4        /spool1/cassandra/data/1.1/system/Versions
52       /spool1/cassandra/data/1.1/system/schema_columns
4        /spool1/cassandra/data/1.1/system/Schema
28       /spool1/cassandra/data/1.1/system/NodeIdInfo
4        /spool1/cassandra/data/1.1/system/Migrations
28       /spool1/cassandra/data/1.1/system/schema_keyspaces
28       /spool1/cassandra/data/1.1/system/schema_columnfamilies
786348   /spool1/cassandra/data/1.1/system/HintsColumnFamily
52       /spool1/cassandra/data/1.1/system/LocationInfo
4        /spool1/cassandra/data/1.1/system/IndexInfo
786556   /spool1/cassandra/data/1.1/system
1161944  /spool1/cassandra/data/1.1/

And there is also 700+MB in the commit log. Neither of these seemed to 'go away' on its own when idle, or even after running nodetool repair/cleanup and even dropping the keyspace. I suppose these hints and the commit log may be the reason behind the huge difference in load between nodes -- but why does it happen, and more importantly, is it harmful? Will it keep accumulating?
Re: Does a scrub remove deleted/expired columns?
Is it possible to use scrub to accelerate the clean up of expired/deleted data? No. Scrub and upgradesstables are used to re-write each file on disk. Scrub may remove some rows from a file because of corruption; however, upgradesstables will not. If you have long-lived rows and a mixed workload of writes and deletes, there are a couple of options. You can try levelled compaction: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction You can tune the default size-tiered compaction by increasing the min_compaction_threshold. This will increase the number of files that must exist in each size tier before it will be compacted. As a result, the speed at which rows move into the higher tiers will slow down. Note that having lots of files may have a negative impact on read performance. You can measure this by looking at the SSTables-per-read metric in the cfhistograms. Lastly, you can run a user-defined or major compaction. User-defined compaction is available via JMX and allows you to compact any file you want. Manual / major compaction is available via nodetool. We usually discourage its use as it will create one big file that will not get compacted for a while. For background: the tombstones / expired columns for a row are only purged from the database when all fragments of the row are in the files being compacted. So if you have an old row that is spread out over many files, it may not get purged. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com
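Aaron's purge rule can be sketched as a simple predicate: a row's tombstones are droppable only if every SSTable holding a fragment of that row participates in the compaction. A toy model with sets of row keys standing in for SSTables (all names invented; real Cassandra tracks this with more machinery, e.g. bloom filters and timestamps):

```java
import java.util.*;

// Toy model of the tombstone purge rule: a tombstone for rowKey may be
// dropped during a compaction only if no SSTable *outside* the compacting
// set also holds a fragment of that row.
public class PurgeRule {
    static boolean canPurge(String rowKey,
                            List<Set<String>> compacting,
                            List<Set<String>> allSSTables) {
        for (Set<String> sstable : allSSTables) {
            // Any fragment living outside the compacting set blocks the purge.
            if (sstable.contains(rowKey) && !compacting.contains(sstable)) {
                return false;
            }
        }
        return true;
    }
}
```

This is why a long-lived row that keeps accumulating fragments across many size tiers can hold onto deleted columns for a long time under size-tiered compaction, and why levelled compaction (which bounds how many files a row can span) helps.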
Re: BulkOutputFormat error - org.apache.thrift.transport.TTransportException
Looks like it cannot connect to the server: conf.set(cassandra.output.thrift.address, localhost); Is this the same address as the rpc_address in the cassandra config? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com
Re: State of Cassandra and Java 7
On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian d...@venarc.com wrote: With Java 6 being EOL'd soon (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the status of Cassandra's Java 7 support? Anyone using it in production? Any outstanding *known* issues? I'd love to see an official statement from the project, due to the sort of EOL issues you're referring to. Unfortunately, previous requests on this list for such a statement have gone unanswered. The unofficial response is that various people run Java 7 in production and it seems to work. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Why Secondary indexes is so slowly by my test?
I did miss this important article about indexing, which discusses the central concerns. In fact, I have used composite columns to resolve my problem. In some contexts the data model itself can serve as an 'alternate index', but it's complicated and can create new problems: data redundancy and maintenance. Most importantly, which items I can query is decided at design time; that is, no design, no query, even though all the data is there. Thanks to all. -- Original -- From: Edward Capriolo edlinuxg...@gmail.com Date: Fri, Dec 14, 2012 01:31 AM To: user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? Here is a good start: http://www.anuff.com/2011/02/indexing-in-cassandra.html
ETL Tools to transfer data from Cassandra into other relational databases
We will use Cassandra as logging storage in one of our web applications. The application only inserts rows into Cassandra and never updates or deletes any rows. The CF is expected to grow by about 0.5 million rows per day. We need to transfer the data in Cassandra to another relational database daily. Due to the large size of the CF, instead of truncating the relational table and reloading all rows into it each time, we plan to run a job that selects the delta rows since the last run and inserts them into the relational database. We know we can use Java, Pig, or Hive to extract the delta rows to a flat file and load the data into the target relational table. We are particularly interested in a process that can extract delta rows without scanning the entire CF. Has anyone used any other ETL tools to do this kind of delta extraction from Cassandra? We appreciate any comments and experience. Thanks, Chin
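One way to get delta extraction without a full CF scan, assuming the application can choose its row keys, is to bucket log rows by day and have the ETL job enumerate only the buckets created since its last run. This is a hypothetical sketch of the bookkeeping, not an ETL tool; all names are invented:

```java
import java.util.*;

// Sketch of delta extraction via day-bucketed row keys: if log rows are
// written under a key prefixed with their day bucket, the nightly job only
// needs to read the buckets added since its last run, never the whole CF.
public class DeltaBuckets {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    static long dayBucket(long timestampMillis) {
        return timestampMillis / DAY_MILLIS;
    }

    // Buckets the ETL job must read, given the time it last ran.
    static List<Long> bucketsSince(long lastRunMillis, long nowMillis) {
        List<Long> buckets = new ArrayList<>();
        for (long d = dayBucket(lastRunMillis); d <= dayBucket(nowMillis); d++) {
            buckets.add(d);
        }
        return buckets;
    }
}
```

The job persists its last-run timestamp, computes the bucket range on the next run, and issues one range/slice read per bucket, which keeps the daily extract proportional to the delta rather than to the full table.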
Re: Does a scrub remove deleted/expired columns?
Thanks for the great explanation. I'd just like some clarification on the last point. Is it the case that if I constantly add new columns to a row, while periodically trimming the row by deleting the oldest columns, the deleted columns won't get cleaned up until all fragments of the row exist in a single sstable and that sstable undergoes a compaction? If my understanding is correct, do you know if 1.2 will enable cleanup of columns in rows that have scattered fragments? Or should I take a different approach? -- Mike Smith Director Development, MailChannels
Re: ETL Tools to transfer data from Cassandra into other relational databases
Why would you use Cassandra as the primary store for logging information? Have you considered Kafka? You could, of course, then fan out the logs to both Cassandra (on a near real time basis) and, on a daily basis (if you wish), extract the deltas from Kafka into an RDBMS, with no Pig/Hive etc. Regards Milind

On Thu, Dec 13, 2012 at 7:19 PM, cko2...@gmail.com cko2...@gmail.com wrote: We will use Cassandra as logging storage in one of our web applications. The application only inserts rows into Cassandra but never updates or deletes any rows. The CF is expected to grow by about 0.5 million rows per day. We need to transfer the data in Cassandra to another relational database daily. Due to the large size of the CF, instead of truncating the relational table and reloading all rows into it each time, we plan to run a job to select the delta rows since the last run and insert them into the relational database. We know we can use Java, Pig or Hive to extract the delta rows to a flat file and load the data into the target relational table. We are particularly interested in a process that can extract delta rows without scanning the entire CF. Has anyone used any other ETL tools to do this kind of delta extraction from Cassandra? We appreciate any comments and experience. Thanks, Chin
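One schema-level way to get deltas without a full CF scan is to bucket writes by time: alongside each logging row, write its key into an index row whose row key is the time bucket (hour or day), then have the ETL job read only the bucket rows created since its last run. A minimal sketch of the bucket-key arithmetic in Python; the key format and the 60-minute bucket size are illustrative assumptions, not anything from this thread:

```python
from datetime import datetime, timedelta, timezone

def bucket_key(ts, bucket_minutes=60):
    """Row key of the time bucket containing ts, e.g. '2012-12-13T19:00'."""
    epoch_minutes = int(ts.timestamp() // 60)
    start = (epoch_minutes // bucket_minutes) * bucket_minutes
    return datetime.fromtimestamp(start * 60, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")

def delta_buckets(last_run, now, bucket_minutes=60):
    """All bucket row keys the ETL job must read since its last run."""
    keys = []
    t = last_run
    while t <= now:
        k = bucket_key(t, bucket_minutes)
        if not keys or keys[-1] != k:
            keys.append(k)
        t += timedelta(minutes=bucket_minutes)
    # Make sure the bucket containing `now` itself is included.
    k = bucket_key(now, bucket_minutes)
    if not keys or keys[-1] != k:
        keys.append(k)
    return keys
```

The daily job then issues one get per bucket key (a handful of rows) instead of a range scan over the whole CF, which also sidesteps the RandomPartitioner's lack of ordered range queries.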
RE: BulkOutputFormat error - org.apache.thrift.transport.TTransportException
Aaron, Both the rpc_address in the cassandra.yaml file and the job configuration are the same (localhost). I will try connecting to a different Cassandra cluster and test it again.

-Original Message- From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Thursday, December 13, 2012 9:03 PM To: user@cassandra.apache.org Subject: Re: BulkOutputFormat error - org.apache.thrift.transport.TTransportException

Looks like it cannot connect to the server: conf.set("cassandra.output.thrift.address", "localhost"); Is this the same address as the rpc_address in the cassandra config? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 14/12/2012, at 9:57 AM, anand_balara...@homedepot.com wrote: Hi, I am a newbie to Cassandra and was trying out a sample (word count) program using BulkOutputFormat when I got stuck with an error. What I am trying to do is migrate all Hive tables (from a Hadoop cluster) to Cassandra column families. My MR program is configured to run on Hadoop cluster v0.20.2 (cdh3u3) by pointing the job config params 'fs.default.name' and 'mapred.job.tracker' appropriately. The output is pointed to my local Cassandra v1.1.7. I have set the following params for writing to Cassandra:

conf.set("cassandra.output.keyspace", "Customer");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port", "9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "10");

But the program fails with the below error:

12/12/13 15:32:55 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Cassandra thrift address : localhost
Cassandra thrift port : 9160
12/12/13 15:32:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/12/13 15:34:21 INFO input.FileInputFormat: Total input paths to process : 1
12/12/13 15:34:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/13 15:34:21 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/13 15:34:22 INFO mapred.JobClient: Running job: job_20121201_4622
12/12/13 15:34:23 INFO mapred.JobClient: map 0% reduce 0%
12/12/13 15:34:28 INFO mapred.JobClient: map 100% reduce 0%
12/12/13 15:34:37 INFO mapred.JobClient: map 100% reduce 33%
12/12/13 15:34:39 INFO mapred.JobClient: Task Id : attempt_20121201_4622_r_00_0, Status : FAILED
java.lang.RuntimeException: Could not retrieve endpoint ranges:
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:328)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:116)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:111)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:223)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:208)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:573)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectE

Please help me understand the problem. Regards Anand B
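When BulkRecordWriter throws "Could not retrieve endpoint ranges" wrapped in a TTransportException, the first thing worth ruling out is plain TCP reachability of the configured thrift address and port; note the stack trace shows the failure happens in the reduce task, so it is the Hadoop worker node, not the submitting machine, that must be able to reach "localhost:9160". A small diagnostic sketch, independent of Cassandra itself (the function name and defaults are my own, not part of the Cassandra Hadoop API):

```python
import socket

def thrift_reachable(host, port, timeout=2.0):
    """Check whether a TCP connection to host:port can be opened, which is
    the first thing the bulk loader needs before it can fetch endpoint
    ranges. Returns False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this on each task-tracker node against the address the job config carries (here, whatever was passed as cassandra.output.thrift.address) quickly distinguishes "Cassandra bound to a different interface" from a genuine bulk-loading bug; if it returns False on the workers, set the job's thrift address to an rpc_address the workers can actually route to, rather than localhost.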