Re: Filter data on row key in Cassandra Hadoop's Random Partitioner

2012-12-13 Thread Ayush V.
Thanks Hiller and Shamim. 

Let me share more details. I want to use Cassandra MR to calculate some
KPIs on data that is continuously stored in Cassandra. Here, fetching the
whole data set from Cassandra every time seems like an overhead to me.

The row key I'm using is like (timestamp/6)_otherid; this CF contains
references to the row keys of the actual data stored in another CF. So to
calculate a KPI I will work on a particular minute, fetch the data from the
other CF, and process it.
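
A minimal sketch of that bucketing scheme in Java is below; the divisor 60000
(milliseconds to minutes) is an assumption, since the divisor above appears
truncated in the archive, and the class and id names are made up:

// Hypothetical sketch of the minute-bucket index key described above.
public final class MinuteBucketKey {

    static String indexRowKey(long timestampMillis, String otherId) {
        long minuteBucket = timestampMillis / 60000L; // one index row per minute (assumed divisor)
        return minuteBucket + "_" + otherId;
    }

    public static void main(String[] args) {
        // A KPI job computes the bucket for the minute it is processing, reads
        // that single index row, and follows the referenced keys into the data CF.
        System.out.println(indexRowKey(System.currentTimeMillis(), "otherid"));
    }
}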





Does a scrub remove deleted/expired columns?

2012-12-13 Thread Mike Smith
I'm using 1.0.12 and I find that large sstables tend to get compacted
infrequently. I've got data that gets deleted or expired frequently. Is it
possible to use scrub to accelerate the clean up of expired/deleted data?

-- 
Mike Smith
Director Development, MailChannels


Best Java Driver for Cassandra?

2012-12-13 Thread Stephen.M.Thompson
There seem to be a number of good options listed ... FireBrand and Hector seem 
to have the most attractive sites, but that doesn't necessarily mean anything.  
:)  Can anybody make a case for one of the drivers over another, especially in 
terms of which ones seem to be most used in major implementations?

Thanks
Steve


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Edward Capriolo
Until the secondary-index read-before-write fix is in a release and
stabilized, you should follow Ed Anuff's blog and do your indexing yourself
with composites.
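
As a rough illustration, a minimal sketch of such a hand-rolled index using
Hector (a common Java client of this era) follows. The cluster, keyspace, CF,
and data names are hypothetical, and it assumes an index CF whose comparator
is CompositeType(UTF8Type, UTF8Type):

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Sketch: maintain your own index row, one composite column per
// (indexed value, target row key) pair.
public class ManualCompositeIndex {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("Demo", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        Composite entry = new Composite();
        entry.addComponent("springfield", StringSerializer.get()); // indexed value
        entry.addComponent("user-123", StringSerializer.get());    // target row key

        mutator.addInsertion("users_by_city", "Indexes",
                HFactory.createColumn(entry, "", new CompositeSerializer(), StringSerializer.get()));
        mutator.execute();
        // An equality query then slices the "users_by_city" row for columns whose
        // first component equals the sought value, already sorted by the comparator.
    }
}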

On Thursday, December 13, 2012, aaron morton aa...@thelastpickle.com
wrote:
 The IndexClause for the get_indexed_slices takes a start key. You can
page the results from your secondary index query by making multiple calls
with a sane count and including a start key.
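
A sketch of that paging pattern against the Thrift API of the 1.x line is
below; the keyspace, CF, column names, and page size are illustrative only,
and since the start key is inclusive, callers should skip the first row of
every page after the first:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;

// Sketch: page a secondary-index query by carrying the last key of each
// batch forward as the next start key.
public class IndexedSlicePager {
    static void pageIndexQuery(Cassandra.Client client) throws Exception {
        IndexExpression expression = new IndexExpression(
                ByteBufferUtil.bytes("I1"), IndexOperator.EQ, ByteBufferUtil.bytes("foo"));
        SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(
                ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 100));

        ByteBuffer startKey = ByteBufferUtil.EMPTY_BYTE_BUFFER;
        int pageSize = 1000; // the "sane count"
        while (true) {
            IndexClause clause = new IndexClause(Arrays.asList(expression), startKey, pageSize);
            List<KeySlice> page = client.get_indexed_slices(
                    new ColumnParent("User"), clause, predicate, ConsistencyLevel.ONE);
            // process the page here (skipping the first row on pages after the first)
            if (page.size() < pageSize) {
                break; // last page
            }
            startKey = page.get(page.size() - 1).key; // resume from the last key seen
        }
    }
}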
 Cheers
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 @aaronmorton
 http://www.thelastpickle.com
 On 13/12/2012, at 6:34 PM, Chengying Fang cyf...@ngnsoft.com wrote:

 You are right, Dean. It's due to the heavy result set returned by the
query, not the index itself. According to my test, if the result is fewer
than 5000 rows, it's very quick. But how to limit the result? A row limit
seems a good choice, but if I do so, some rows I want may be missed because
the row order does not fulfill the query conditions.
 For example: CF User{I1,C1} with index I1. Query conditions: I1=foo, order
by C1. If I1=foo return 1 limit 100, I can't get the right result for
C1. Also, we cannot always set the row range to fulfill the query conditions
when querying. Maybe I should redesign the CF model to fix it.

 -- Original --
 From:  Hiller, Deandean.hil...@nrel.gov;
 Date:  Wed, Dec 12, 2012 10:51 PM
 To:  user@cassandra.apache.orguser@cassandra.apache.org;
 Subject:  Re: Why Secondary indexes is so slowly by my test?

 You could always try PlayOrm's query capability on top of Cassandra
;) … it works for us.

 Dean

 From: Chengying Fang cyf...@ngnsoft.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, December 11, 2012 8:22 PM
 To: user user@cassandra.apache.org
 Subject: Re: Why Secondary indexes is so slowly by my test?
 Subject: Re: Why Secondary indexes is so slowly by my test?

 Thanks to Low. We use composite columns to substitute for it in single
inequality and definite equality queries. And we will give up Cassandra
because of its weak query ability and instability. Many times we found our
data in confusion without definite cause in our cluster. For example, with
only two rows in one CF,
row1-columnname1-columnvalue1, row2-columnname2-columnvalue2, sometimes
it becomes
row1-columnname1-columnvalue2, row2-columnname2-columnvalue1. Notice the
wrong column values.


 -- Original --
 From: Richard Low r...@acunu.com
 Date: Tue, Dec 11, 2012 07:44 PM
 To: user user@cassandra.apache.org
 Subject:  Re: Why Secondary indexes is so slowly by my test?

 Hi,

 Secondary index lookups are more complicated than normal queries, so they
will be slower. Items have to first be queried in the index, then retrieved
from their actual location. Also, inserting into indexed CFs will be slower
(but will get substantially faster in 1.2 due


Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Edward Capriolo
It should be good stuff. Brian eats this stuff for lunch.

On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu
wrote:
 FWIW --
 I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series:

http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

 I hope to make CQL part of the presentation and show how it integrates
 with the Java APIs.
 If you are interested, drop in.

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42



Re: Help on MMap of SSTables

2012-12-13 Thread Edward Capriolo
This issue has to be looked at from a micro and a macro level. On the micro
level the best approach is workload specific. On the macro level this mostly
boils down to data and memory size.

Compactions are going to churn the cache; this is unavoidable. IMHO solid
state makes the micro optimization meaningless in the big picture. Not that
we should not consider tweaking flags, but it is hard to believe anything
like that is a game changer.

On Monday, December 10, 2012, Rob Coli rc...@palominodb.com wrote:
 On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com
wrote:
 So for memory mapped files, compaction can do a madvise SEQUENTIAL instead
 of the current DONTNEED flag after detecting appropriate OS versions. Will
 this help?


 AFAIK Compaction does use memory mapped file access.

 The history :

 https://issues.apache.org/jira/browse/CASSANDRA-1470

 =Rob

 --
 =Robert Coli
 AIMGTALK - rc...@palominodb.com
 YAHOO - rcoli.palominob
 SKYPE - rcoli_palominodb



Re: Best Java Driver for Cassandra?

2012-12-13 Thread Brian O'Neill

Well, we'll talk a bit about this in my webinar later today…
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I put together a quick decision matrix for all of the options based on
production-readiness, potential and momentum.  I think the slides will be
made available afterwards.

I also have a laundry list here: (written before I knew about Firebrand)
http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com





Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Alain RODRIGUEZ
Hi Edward, can you share the link to this blog ?

Alain

2012/12/13 Edward Capriolo edlinuxg...@gmail.com

 Ed Anuff's


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Edward Capriolo
Here is a good start.

http://www.anuff.com/2011/02/indexing-in-cassandra.html

On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Edward, can you share the link to this blog ?

 Alain

 2012/12/13 Edward Capriolo edlinuxg...@gmail.com

 Ed Anuff's





Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Tyler Hobbs
If anyone's interested in a little more background on the read-before-write
fix that Ed mentioned, see:
https://issues.apache.org/jira/browse/CASSANDRA-2897


On Thu, Dec 13, 2012 at 11:31 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Here is a good start.

 http://www.anuff.com/2011/02/indexing-in-cassandra.html

 On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Edward, can you share the link to this blog ?

 Alain

 2012/12/13 Edward Capriolo edlinuxg...@gmail.com

 Ed Anuff's






-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu
I tried to register and got the following page, but haven't received the
email yet. I registered 10 minutes ago.

Thank you for registering to attend:

Is My App a Good Fit for Apache Cassandra?

Details about this webinar have also been sent to your email, including a link 
to the webinar's URL.


Webinar Description:

Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra 
as he examines the types of applications that are suited to be built on 
top of Cassandra. Eric will talk about the key considerations for 
designing and deploying your application on Apache Cassandra. 

How come it's saying Is My App a Good Fit for Apache Cassandra?, which was the
previous webinar?

Thanks.
-Wei




Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu
Never mind, the email arrived after 15 minutes or so...




State of Cassandra and Java 7

2012-12-13 Thread Drew Kutcharian
Hey Guys,

With Java 6 being EOL-ed soon
(https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the 
status of Cassandra's Java 7 support? Anyone using it in production? Any 
outstanding *known* issues? 

-- Drew



Re: State of Cassandra and Java 7

2012-12-13 Thread Michael Kjellman
Works just fine for us.

On 12/13/12 11:43 AM, Drew Kutcharian d...@venarc.com wrote:

Hey Guys,

With Java 6 being EOL-ed soon
(https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's
the status of Cassandra's Java 7 support? Anyone using it in production?
Any outstanding *known* issues?

-- Drew








BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread ANAND_BALARAMAN
Hi

I am a newbie to Cassandra. I was trying out a sample (word count) code on
BulkOutputFormat and got stuck with an error.

What I am trying to do is - migrate all Hive tables (from Hadoop cluster) to 
Cassandra column families.
My MR program is configured to run on Hadoop cluster v 0.20.2 (cdh3u3) by 
pointing job config params 'fs.default.name' and 'mapred.job.tracker' 
appropriately.
The output is pointed to my local Cassandra v1.1.7.
Have set the following params for writing to Cassandra:
conf.set("cassandra.output.keyspace", "Customer");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class",
        "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port", "9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "10");
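
For reference, the driver wiring around these settings might look like the
sketch below; the class name is hypothetical, and the key/value classes are
what BulkOutputFormat in 1.1 expects from the reducer:

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver skeleton; assumes conf already carries the
// cassandra.output.* settings shown above.
public class WordCountBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // ... the conf.set("cassandra.output.*", ...) calls from above go here ...

        Job job = new Job(conf, "wordcount-bulk-load");
        job.setJarByClass(WordCountBulkLoad.class);
        job.setOutputFormatClass(BulkOutputFormat.class);
        // BulkOutputFormat streams sstables into the ring; reducers must emit
        // ByteBuffer row keys and List<Mutation> values.
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);
        // Note: reduce tasks run on the Hadoop nodes, so the configured thrift
        // address must be reachable from those nodes; "localhost" usually is not.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}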

But the program fails with the below error:
12/12/13 15:32:55 INFO security.UserGroupInformation: JAAS Configuration 
already set up for Hadoop, not re-installing.
Cassandra thrift address   :  localhost
Cassandra thrift port  :  9160
12/12/13 15:32:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
12/12/13 15:34:21 INFO input.FileInputFormat: Total input paths to process : 1
12/12/13 15:34:21 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
12/12/13 15:34:21 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/13 15:34:22 INFO mapred.JobClient: Running job: job_20121201_4622
12/12/13 15:34:23 INFO mapred.JobClient:  map 0% reduce 0%
12/12/13 15:34:28 INFO mapred.JobClient:  map 100% reduce 0%
12/12/13 15:34:37 INFO mapred.JobClient:  map 100% reduce 33%
12/12/13 15:34:39 INFO mapred.JobClient: Task Id : 
attempt_20121201_4622_r_00_0, Status : FAILED
java.lang.RuntimeException: Could not retrieve endpoint ranges:
   at 
org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:328)
   at 
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:116)
   at 
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:111)
   at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:223)
   at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:208)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:573)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
   at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectE

Please help me understand the problem.

Regards
Anand B





Re: Multiple Data Center shows very uneven load

2012-12-13 Thread aaron morton
There is a limit on the size of the commit log and on how long hints are stored 
for. 

I'm not sure why your load was different; I think it was leftover hints and 
commit log. But it's not always easy to diagnose things via email. 

Hopefully nodetool drain, or deleting the test system and starting again, will 
get you moving forwards again.

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 12:50 AM, Sergey Olefir solf.li...@gmail.com wrote:

 I'll try nodetool drain, thanks.
 
 But more generally -- are you basically saying that I should not worry about
 these things? Data will not keep accumulating indefinitely in production and
 it'll not affect performance negatively (despite vast differences in node
 load)?
 
 Best regards,
 Sergey
 
 
 aaron morton wrote
 try nodetool drain. It will flush everything to disk and the commit log
 will be truncated.
 
 HH can be ignored. If you really want them gone they can be purged using
 the JMX interface, or you can stop the node and delete the sstables. 
 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 13/12/2012, at 10:35 AM, Sergey Olefir solf.li...@gmail.com wrote:
 
 Nick Bailey-2 wrote
 Dropping a keyspace causes a snapshot to be taken of the keyspace before it
 is removed from the schema. So it won't actually delete any data. You can
 manually delete the data from /var/lib/cassandra/<ks>/<cf[s]>/snapshots
 
 Indeed, it looks like the snapshot is on the file system. However it looks
 like it is not the only thing by a long shot, i.e.:
 cassa1-1:/var/log/cassandra# du -k /spool1/cassandra/data/1.1/
 375372 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots/1355222054452-marquisColumnFamily
 375376 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots
 375380 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily
 375384  /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace
 4   /spool1/cassandra/data/1.1/system/Versions
 52  /spool1/cassandra/data/1.1/system/schema_columns
 4   /spool1/cassandra/data/1.1/system/Schema
 28  /spool1/cassandra/data/1.1/system/NodeIdInfo
 4   /spool1/cassandra/data/1.1/system/Migrations
 28  /spool1/cassandra/data/1.1/system/schema_keyspaces
 28  /spool1/cassandra/data/1.1/system/schema_columnfamilies
 786348  /spool1/cassandra/data/1.1/system/HintsColumnFamily
 52  /spool1/cassandra/data/1.1/system/LocationInfo
 4   /spool1/cassandra/data/1.1/system/IndexInfo
 786556  /spool1/cassandra/data/1.1/system
 1161944 /spool1/cassandra/data/1.1/
 
 
 And also 700+MB in the commit log. Neither of which seemed to 'go away' on
 its own when idle, or even after running nodetool repair/cleanup and even
 dropping the keyspace.
 
 I suppose these hints and commit log may be the reason behind the huge
 difference in load on nodes -- but why does it happen and, more importantly,
 is it harmful? Will it keep accumulating?
 
 
 
 
 
 
 
 



Re: Does a scrub remove deleted/expired columns?

2012-12-13 Thread aaron morton
  Is it possible to use scrub to accelerate the clean up of expired/deleted 
 data?
No.
Scrub, and upgradesstables, are used to re-write each file on disk. Scrub may 
remove some rows from a file because of corruption, however upgradesstables 
will not. 

If you have long lived rows and a mixed work load of writes and deletes there 
are a couple of options. 

You can try levelled compaction 
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

You can tune the default size-tiered compaction by increasing the 
min_compaction_threshold. This will increase the number of files that must 
exist in each size tier before that tier is compacted. As a result, the speed 
at which rows move into the higher tiers will slow down.

Note that having lots of files may have a negative impact on read performance. 
You can measure this by looking at the SSTables-per-read metric in the 
cfhistograms output.

Lastly, you can run a user-defined or major compaction. User-defined compaction 
is available via JMX and allows you to compact any file you want. Manual / 
major compaction is available via nodetool. We usually discourage its use, as 
it will create one big file that will not get compacted for a while.
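
As a concrete illustration, a user-defined compaction can be triggered from a
plain JMX client along these lines. The MBean name and the (keyspace, file
list) operation signature are from the 1.0/1.1 line, and the host, port, and
sstable file name are made up, so verify all of them against your version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: ask the CompactionManager MBean to compact one specific sstable.
public class UserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // Illustrative keyspace and Data.db file name.
            mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                    new Object[] { "Keyspace1", "Keyspace1-cf-hd-42-Data.db" },
                    new String[] { String.class.getName(), String.class.getName() });
        } finally {
            connector.close();
        }
    }
}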


For background, the tombstones / expired columns for a row are only purged from 
the database when all fragments of the row are in the files being compacted. So 
if you have an old row that is spread out over many files, it may not get 
purged.

Hope that helps. 



-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 3:01 AM, Mike Smith m...@mailchannels.com wrote:

 I'm using 1.0.12 and I find that large sstables tend to get compacted 
 infrequently. I've got data that gets deleted or expired frequently. Is it 
 possible to use scrub to accelerate the clean up of expired/deleted data?
 
 -- 
 Mike Smith
 Director Development, MailChannels
 



Re: BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread aaron morton
Looks like it cannot connect to the server

conf.set("cassandra.output.thrift.address", "localhost");
Is this the same address as the rpc_address in the cassandra config?

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com




Re: State of Cassandra and Java 7

2012-12-13 Thread Rob Coli
On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian d...@venarc.com wrote:
 With Java 6 being EOL-ed soon 
 (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the 
 status of Cassandra's Java 7 support? Anyone using it in production? Any 
 outstanding *known* issues?

I'd love to see an official statement from the project, due to the
sort of EOL issues you're referring to. Unfortunately previous
requests on this list for such a statement have gone unanswered.

The non-official response is that various people run in production
with Java 7 and it seems to work. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Chengying Fang
I did miss this important article about indexing, which discusses the central 
concerns. In fact, I have used composite columns to resolve my problem. In some 
contexts the data model can serve as an 'alternate index', but it's complicated 
and can create new problems: data redundancy and maintenance. Most importantly, 
what items I can query is decided at design time; that is, No Design, No Query, 
even if all the data is there. Thanks to all.
 
-- Original --
From: Edward Capriolo edlinuxg...@gmail.com
Date: Fri, Dec 14, 2012 01:31 AM
To: user user@cassandra.apache.org

Subject:  Re: Why Secondary indexes is so slowly by my test?

 
Here is a good start.



http://www.anuff.com/2011/02/indexing-in-cassandra.html

On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Hi Edward, can you share the link to this blog ?


 Alain

2012/12/13 Edward Capriolo edlinuxg...@gmail.com
 Ed Anuff's

ETL Tools to transfer data from Cassandra into other relational databases

2012-12-13 Thread cko2...@gmail.com
We will use Cassandra as logging storage in one of our web applications. The 
application only inserts rows into Cassandra and never updates or deletes any 
rows. The CF is expected to grow by about 0.5 million rows per day.
 
We need to transfer the data in Cassandra to another relational database daily. 
Due to the large size of the CF, instead of truncating the relational table and 
reloading all rows into it each time, we plan to run a job to select the 
delta rows since the last run and insert them into the relational database.
 
We know we can use Java, Pig or Hive to extract the delta rows to a flat file 
and load the data into the target relational table. We are particularly 
interested in a process that can extract delta rows without scanning the entire 
CF.
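
One common pattern that avoids a full scan is to bucket writes by day in a
separate index row; a minimal sketch of such a key scheme is below (all names
are illustrative, not from the original post):

import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical day-bucket scheme: every insert also records the log row's key
// under a per-day index row, so the nightly ETL reads one bucket row instead
// of scanning the whole CF.
public final class DailyDeltaBuckets {
    private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyy-MM-dd");

    static String deltaBucketRowKey(Date writtenAt) {
        return "log_delta_" + DAY.format(writtenAt); // e.g. "log_delta_2012-12-13"
    }

    public static void main(String[] args) {
        // The daily job reads yesterday's bucket row, follows the referenced keys
        // into the log CF, and inserts those rows into the relational database.
        System.out.println(deltaBucketRowKey(new Date()));
    }
}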
 
Has anyone used any other ETL tools to do this kind of delta extraction from 
Cassandra? We appreciate any comments and experience.
 
Thanks,
Chin


Re: Does a scrub remove deleted/expired columns?

2012-12-13 Thread Mike Smith
Thanks for the great explanation.

I'd just like some clarification on the last point. Is it the case that if
I constantly add new columns to a row, while periodically trimming the row
by by deleting the oldest columns, the deleted columns won't get cleaned up
until all fragments of the row exist in a single sstable and that sstable
undergoes a compaction?

If my understanding is correct, do you know if 1.2 will enable cleanup of
columns in rows that have scattered fragments? Or, should I take a
different approach?







-- 
Mike Smith
Director Development, MailChannels


Re: ETL Tools to transfer data from Cassandra into other relational databases

2012-12-13 Thread Milind Parikh
Why would you use Cassandra as the primary store of logging information? Have
you considered Kafka?

You could, of course, then fan out the logs to both Cassandra (on a near
real-time basis) and, on a daily basis if you wish, extract the deltas from
Kafka into an RDBMS, with no Pig/Hive etc.
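
That fan-out is essentially a dual write; a library-agnostic Java sketch of
the shape is below, where both sinks are placeholders standing in for a real
Kafka producer and Cassandra client:

import java.util.Arrays;
import java.util.List;

// Sketch: one append call writes the event to every downstream sink.
public class LogFanOut {
    interface LogSink {
        void append(String event);
    }

    private final List<LogSink> sinks;

    LogFanOut(LogSink... sinks) {
        this.sinks = Arrays.asList(sinks);
    }

    void append(String event) {
        for (LogSink sink : sinks) {
            sink.append(event); // near-real-time write to each store
        }
    }

    public static void main(String[] args) {
        LogSink kafka = new LogSink() { // stand-in for a Kafka producer
            public void append(String event) { System.out.println("kafka <- " + event); }
        };
        LogSink cassandra = new LogSink() { // stand-in for a Cassandra client
            public void append(String event) { System.out.println("cassandra <- " + event); }
        };
        new LogFanOut(kafka, cassandra).append("user=42 action=login");
    }
}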


Regards
Milind








RE: BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread ANAND_BALARAMAN
Aaron
Both the rpc_address in the cassandra.yaml file and the job configuration are 
the same (localhost).
I will try connecting to a different Cassandra cluster and test it again.
