Re: How to prevent writing to a Keyspace?
Create a different user and assign roles and privileges: create a user such as guest and grant SELECT only to that user. That way the user cannot modify data in the specific keyspace or column family. http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/grant_r.html -Vivek

On Mon, Jul 21, 2014 at 7:57 AM, Lu, Boying boying...@emc.com wrote: Thanks a lot ☺ But I think authorization and authentication alone do little to help here. Once we allow a user to read the keyspace, how can we prevent him from writing to the DB without Cassandra's help? Is there any way to make a keyspace 'read-only' in Cassandra, e.g. by setting some specific strategy? Boying

From: Vivek Mishra [mailto:mishra.v...@gmail.com] Sent: 2014-07-17 18:35 To: user@cassandra.apache.org Subject: Re: How to prevent writing to a Keyspace? Think about managing it via authorization and authentication support.

On Thu, Jul 17, 2014 at 4:00 PM, Lu, Boying boying...@emc.com wrote: Hi, All, I need to make a Cassandra keyspace read-only. Does anyone know how to do that? Thanks, Boying
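Vivek's suggestion, sketched in CQL. This assumes `authenticator: PasswordAuthenticator` and `authorizer: CassandraAuthorizer` are enabled in cassandra.yaml; the user, password, and keyspace names are placeholders:

```cql
-- Placeholder names: 'guest', 'my_keyspace'. Requires password
-- authentication and the CassandraAuthorizer to be enabled first.
CREATE USER guest WITH PASSWORD 'guest_password' NOSUPERUSER;
GRANT SELECT ON KEYSPACE my_keyspace TO guest;
```

Applications that should only read then connect as guest; any INSERT, UPDATE, DELETE, or TRUNCATE from that session is rejected with an unauthorized error.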
Re: Which way to Cassandraville?
Having said that, what Java clients should I be looking at? Are there any reasonably mature POJO mapping techs for Cassandra, analogous to Hibernate? The Java Driver offers a basic object mapper in its mapper module. If you are looking for something more full-featured, have a look at http://doanduyhai.github.io/Achilles/
Re: ghost table is breaking compactions and won't go away… even during a drop.
In my experience, an SSTable FileNotFoundException (caused not only by recreating a table but also by other operations, or even bugs) cannot be solved by any nodetool command. However, restarting the node more than once can make this exception disappear. I don't know the reason, but it does work... Thanks, Philo Yang

2014-07-17 10:32 GMT+08:00 Kevin Burton bur...@spinn3r.com: you rock… glad it's fixed in 2.1… :)

On Wed, Jul 16, 2014 at 7:05 PM, graham sanderson gra...@vast.com wrote: Known issue: deleting and recreating a CF with the same name, fixed in 2.1 (manifests in lots of ways): https://issues.apache.org/jira/browse/CASSANDRA-5202

On Jul 16, 2014, at 8:53 PM, Kevin Burton bur...@spinn3r.com wrote: Looks like a restart of Cassandra and a nodetool compact fixed this…

On Wed, Jul 16, 2014 at 6:45 PM, Kevin Burton bur...@spinn3r.com wrote: This is really troubling… I have a ghost table. I dropped it, but it's not going away. (Cassandra 2.0.8, btw.) I ran a 'drop table' on it, and a 'describe tables' then shows that it's not there. However, when I recreated it with a new schema, all operations on it failed. Looking at why… it seems that Cassandra had some old SSTables that I imagine are no longer being used but are now in an inconsistent state? This is popping up in the system.log:

Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /d0/cassandra/data/blogindex/content_idx_source_hashcode/blogindex-content_idx_source_hashcode-jb-1447-Data.db (No such file or directory)

So I think what happened is that the original drop table failed and left things in an inconsistent state. I tried a nodetool repair and a nodetool compact… those fail with the same java.io.FileNotFoundException. I moved the directories out of the way; same failure. Any advice on resolving this?
-- Founder/CEO Spinn3r.com http://spinn3r.com/ Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
On Sat, Jul 19, 2014 at 7:35 PM, Karl Rieb karl.r...@gmail.com wrote: Can now be followed at: https://issues.apache.org/jira/browse/CASSANDRA-7576. Nice work! Finally we have a proper solution to this issue, so well done to you.
RE: How to prevent writing to a Keyspace?
I see. Thanks a lot ☺

From: Vivek Mishra [mailto:mishra.v...@gmail.com] Sent: 2014-07-21 14:16 To: user@cassandra.apache.org Subject: Re: How to prevent writing to a Keyspace?

Create a different user and assign roles and privileges: create a user such as guest and grant SELECT only to that user. That way the user cannot modify data in the specific keyspace or column family. http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/grant_r.html -Vivek
Re: horizontal query scaling issues follow on
Hello,

Here is the documentation for cfhistograms, which reports in microseconds: http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFhisto.html

Your question about setting timeouts is subjective, but you have set your timeout limits to 4 mins, which seems excessive. The default timeout values should be appropriate for a well-sized and well-operating cluster. Increasing timeouts to achieve stability isn't a recommended practice. Your VMs are undersized, so it is recommended that you reduce your workload or add nodes until stability is achieved.

The goal of your exercise is to prove out linear scalability, correct? Then it is recommended to find the load your small nodes/cluster can handle without increasing timeout values, i.e. a load at which your cluster remains stable. Once you have found the sweet spot for load on your cluster, increase load by X% while increasing cluster size by X%. Do this for a few iterations so you can see that the processing capability of your cluster increases proportionally, and linearly, with the amount of load you put on it. Note that with small VMs you will not get production-like performance from individual nodes.

Also, what type of storage do you have under the VMs? It's not recommended to use shared storage. Shared storage will, more than likely, not allow you to achieve linear scalability, because your hardware will not be scaling linearly all the way through the stack.

Hope this helps.

Jonathan

On Sun, Jul 20, 2014 at 9:12 PM, Diane Griffith dfgriff...@gmail.com wrote: I am running tests again across different numbers of client threads and numbers of nodes, but this time I tweaked some of the timeouts configured for the nodes in the cluster. I was able to get better performance on the nodes at 10 client threads by upping 4 timeout values in cassandra.yaml to 24:

- read_request_timeout_in_ms
- range_request_timeout_in_ms
- write_request_timeout_in_ms
- request_timeout_in_ms

I did this because of my interpretation of the cfhistograms output on one of the nodes. So 3 questions come to mind:

1. Did I interpret the histogram information correctly in the Cassandra 2.0.6 nodetool output? That is, in the 2-column read latency output, the offset or left column is the time in milliseconds and the right column is the number of requests that fell into that bucket range?
2. Was it reasonable for me to boost those 4 timeouts, and just those?
3. What are reasonable timeout values for smaller VM sizes (i.e. 8GB RAM, 4 CPUs)?

If anyone has any insight it would be appreciated.

Thanks, Diane

On Fri, Jul 18, 2014 at 2:23 PM, Tyler Hobbs ty...@datastax.com wrote: On Fri, Jul 18, 2014 at 8:01 AM, Diane Griffith dfgriff...@gmail.com wrote: Partition Size (bytes) 1109 bytes: 1800 Cell Count per Partition 8 cells: 1800 meaning I can't glean anything about how it partitioned, or whether it broke a key across partitions, from this, right? Does it mean, for the 1800 unique keys, that each has 8 cells? Yes, your interpretation is correct. Each of your 1800 partitions has 8 cells (taking up 1109 bytes). -- Tyler Hobbs DataStax http://datastax.com/

-- Jonathan Lacefield Solutions Architect, DataStax (404) 822 3487 http://www.linkedin.com/in/jlacefield http://www.datastax.com/cassandrasummit14
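On question 1: per the documentation linked above, the left column of cfhistograms is the latency bucket offset in microseconds (not milliseconds), and the right column is the number of requests that fell into that bucket. A small sketch of how to derive a percentile from such (offset, count) pairs; the bucket data here is made up for illustration:

```python
def histogram_percentile(buckets, pct):
    """buckets: list of (offset_us, count) pairs as printed by nodetool
    cfhistograms. Returns the smallest bucket offset at which the
    cumulative count reaches `pct` percent of all requests."""
    total = sum(count for _, count in buckets)
    threshold = total * pct / 100.0
    running = 0
    for offset_us, count in buckets:
        running += count
        if running >= threshold:
            return offset_us
    return buckets[-1][0]

# Hypothetical read-latency buckets: (offset in microseconds, request count)
read_latency = [(103, 120), (124, 340), (149, 510), (179, 200), (215, 30)]
p99 = histogram_percentile(read_latency, 99)  # 99th percentile bucket: 215 us
```

A p99 in the hundreds of microseconds would be nowhere near the default 5000-10000 ms timeouts, which is why raising timeouts to multi-minute values is a sign of a different problem rather than a tuning fix.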
Re: TTransportException (java.net.SocketException: Broken pipe)
I have not seen the issue after changing the commit log segment size to 1024MB. tpstats output:

Pool Name                  Active   Pending   Completed   Blocked  All time blocked
ReadStage                       0         0           0         0                 0
RequestResponseStage            0         0           0         0                 0
MutationStage                  32        40     2526143         0                 0
ReadRepairStage                 0         0           0         0                 0
ReplicateOnWriteStage           0         0           0         0                 0
GossipStage                     0         0           0         0                 0
AntiEntropyStage                0         0           0         0                 0
MigrationStage                  0         0           3         0                 0
MemoryMeter                     0         0       24752         0                 0
MemtablePostFlusher             1        19       12939         0                 0
FlushWriter                     6        10       12442         1              2940
MiscStage                       0         0           0         0                 0
PendingRangeCalculator          0         0           1         0                 0
commitlog_archiver              0         0           0         0                 0
InternalResponseStage           0         0           0         0                 0
HintedHandoff                   0         0           0         0                 0

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR              0
PAGED_RANGE              0
BINARY                   0
READ                     0
MUTATION                 0
_TRACE                   0
REQUEST_RESPONSE         0
COUNTER_MUTATION         0

On Saturday, 19 July 2014 1:32 AM, Robert Coli rc...@eventbrite.com wrote: On Mon, Jul 7, 2014 at 9:30 PM, Bhaskar Singhal bhaskarsing...@yahoo.com wrote: I am using Cassandra 2.0.7 (with default settings and a 16GB heap on a quad-core Ubuntu server with 32GB RAM). 16GB of heap will lead to significant GC pauses, and probably will not improve total performance versus an 8GB heap. I continue to maintain that your problem is that you are writing faster than you can flush. Paste the output of nodetool tpstats? =Rob
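The change the poster describes appears to correspond to this cassandra.yaml setting (the 2.0 default is 32 MB; the value below is what the poster reports using, not a recommendation):

```yaml
# cassandra.yaml
# Default: 32. Raising it to 1024 only packs the same commitlog volume
# into fewer, larger segment files; it does not reduce the volume itself.
commitlog_segment_size_in_mb: 1024
```

As the reply downthread notes, this masks rather than fixes a cluster that accepts writes faster than it can flush.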
Re: estimated row count for a pk range
Thank you for the reply; I was hoping for something with a bit less overhead than the first solution; the second is not really an option for me.

On Monday, 21 July 2014, DuyHai Doan doanduy...@gmail.com wrote: 1) Use a separate counter to count the number of entries in each column family, but it will require you to manage the counting manually. 2) SELECT DISTINCT partitionKey FROM ... Normally this query is optimized and is much faster than a SELECT *. However, if you have a very big number of distinct partitions it can be slow.

On Sun, Jul 20, 2014 at 6:48 PM, tommaso barbugli tbarbu...@gmail.com wrote: Hello, Lately I collapsed several (around 1k) column families into a bunch (100) of column families. To keep the data separated I have added an extra column (family) which is part of the PK. While the previous approach allowed me to always have a clear picture of every column family's size, now I have no option other than to select all the rows and make some estimation to guess the overall size used by one of the grouped datasets in these CFs, e.g. SELECT * FROM cf_shard1 WHERE family = '1'; Of course this does not work really well when cf_shard1 has some data in it; is there some way, perhaps, to get an estimated count of the rows matching this query? Thanks, Tommaso

-- sent from iphone (sorry for the typos)
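The two suggestions, sketched in CQL. Table and column names are hypothetical; SELECT DISTINCT only works over the full set of partition key columns, and the client still counts the returned keys itself:

```cql
-- Option 2: walk only partition keys (much cheaper than SELECT *),
-- then count client-side. Assumes 'family' and 'id' together form
-- the partition key of cf_shard1.
SELECT DISTINCT family, id FROM cf_shard1;

-- Option 1: a manually maintained counter, bumped on every insert.
CREATE TABLE family_counts (
    family text PRIMARY KEY,
    rows   counter
);
UPDATE family_counts SET rows = rows + 1 WHERE family = '1';
SELECT rows FROM family_counts WHERE family = '1';
```

The counter table gives O(1) reads but the count drifts if inserts are retried or rows are deleted without decrementing, which is the manual-management cost DuyHai mentions.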
map reduce for Cassandra
Hi,

I need to execute a map/reduce job to identify data stored in Cassandra before indexing this data in Elasticsearch. I have already used ColumnFamilyInputFormat (before starting to use CQL) to write Hadoop jobs to do that, but I used to have a lot of trouble with tuning, as Hadoop depends on how map tasks are split in order to successfully execute things in parallel, for IO-bound processes. First question: am I the only one having problems with that? Is anyone else using Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw the new version of Spark will have Cassandra support, but using CqlPagingInputFormat, from Hadoop. I tried to use Hive with Cassandra community edition, but it seems it only works with Cassandra Enterprise and doesn't do more than FB Presto (http://prestodb.io/), which we have been using to read from Cassandra, and so far it has been great for SQL-like queries. For custom map/reduce jobs, however, it is not enough. Does anyone know some other tool that performs MR on Cassandra? My impression is that most tools were created to work on top of HDFS, and reading from a NoSQL DB is some kind of workaround.

Third question is about how these tools work. Most of them write mapped data to an intermediate storage, then the data is shuffled and sorted, then it is reduced. Even when using CqlPagingInputFormat, if you are using Hadoop it will write files to HDFS after the mapping phase, shuffle and sort this data, and then reduce it. I wonder if a tool supporting Cassandra out of the box wouldn't be smarter: is it faster to write all your data to a file and then sort it, or to batch-insert the data so it is indexed as it arrives, as happens when you store data in a Cassandra CF? I didn't do the calculations to check the complexity of each one, and one should consider that such an index in Cassandra would be really large, as the maximum index size will always depend on the maximum capacity of a single host; but my guess is that a map/reduce tool written specifically for Cassandra, from the beginning, could perform much better than a tool written for HDFS and adapted. I hear people saying Map/Reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really make sense? Should we expect a result like this?

Final question: do you think writing a new M/R tool as described would be reinventing the wheel? Or does it make sense?

Thanks in advance. Any opinions on this subject will be much appreciated.

Best regards, Marcelo Valle.
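The complexity comparison in the third question can be made concrete with a toy sketch (not Cassandra code, just the two strategies being contrasted): inserting each item into a sorted structure as it arrives versus writing everything unsorted and sorting once at the end. Both end up O(N log N) in comparisons:

```python
import bisect
import random

def insert_sorted(items):
    """Insert items one at a time into a sorted list, roughly the
    'index on write' strategy: each insert does an O(log n) binary
    search. (A Python list then pays an O(n) shift; a real memtable
    uses a skip list, so its insert stays O(log n).)"""
    table = []
    for item in items:
        bisect.insort(table, item)
    return table

def sort_at_end(items):
    """Write everything unsorted and sort once, roughly what the
    Hadoop shuffle/sort phase does: a single O(N log N) sort."""
    return sorted(items)

random.seed(7)
data = [random.randrange(10**6) for _ in range(1000)]
assert insert_sorted(data) == sort_at_end(data)  # same final order
```

The asymptotics match, so the practical difference is constant factors and IO: sort-at-end streams sequentially through intermediate files, while index-on-write pays per-item structure maintenance but hands the reducer already-sorted data. That trade-off, not big-O, is what the quoted 30% figure would be measuring.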
Re: map reduce for Cassandra
Hey Marcelo,

You should check out Spark. It intelligently deals with a lot of the issues you're mentioning. Al Tobey did a walkthrough of how to set up the OSS side of things here: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html

It'll be less work than writing a M/R framework from scratch :)

Jon

-- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: map reduce for Cassandra
Hi Jonathan,

Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra will be supported just in the next version, but I would like to be wrong...

Best regards, Marcelo Valle.
Re: map reduce for Cassandra
I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend on getting deep into it.

On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Jonathan, Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra will be supported just in the next version, but I would like to be wrong... Best regards, Marcelo Valle.

-- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: map reduce for Cassandra
Jonathan,

From what I have read in the docs, the Python API still has some limitations: it is not yet possible to use arbitrary Hadoop binary input formats. The Python example for Cassandra is only in the master branch: https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py

I may be lacking knowledge of Spark, but if I understood correctly, access to Cassandra data is still made through CqlPagingInputFormat, from the Hadoop integration. Here is where I ask: even if Spark supports Cassandra, will it be fast enough?

My understanding (please someone correct me if I am wrong) is that when you insert N items into a Cassandra CF, you execute N binary searches to insert each item already indexed by a key. When you read the data, it's already sorted. So you take O(N * log(N)) (binary-search complexity) to insert all the data already sorted. However, by using a fast sort algorithm, you also take O(N * log(N)) to sort the data after it was inserted, but then using more IO.

If I write a job in Spark / Java with Cassandra, how will the mapped data be stored and sorted? Will it be stored in Cassandra too? Will Spark run a sort after the mapping?

Best regards, Marcelo.

2014-07-21 14:06 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend on getting deep into it.
Authentication exception
I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this?
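One commonly cited cause, sketched in CQL (assumption: the login in question uses the default cassandra superuser, whose auth reads are performed at QUORUM across system_auth replicas, so a single briefly slow node can time the login out even though everything reports healthy):

```cql
-- Create and use a different user so logins take the cheaper
-- (non-QUORUM) auth read path. Name and password are placeholders.
CREATE USER opsadmin WITH PASSWORD 'choose_a_password' SUPERUSER;

-- If system_auth replication was changed, also run a repair on it
-- afterwards:  nodetool repair system_auth
```

This is a sketch of one hypothesis, not a diagnosis; tracing the timed-out read on a node would confirm which consistency level the auth query is using.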
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 10:54 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: My understanding (please some correct me if I am wrong) is that when you insert N items in a Cassandra CF, you are executing N binary searches to insert the item already indexed by a key. When you read the data, it's already sorted. So you take O(N * log(N)) (binary search complexity to insert all data already sorted. You're wrong, unless you're talking about insertion into a memtable, which you probably aren't and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable datafiles, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make any sense in light of this? =Rob
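Rob's point about immutable datafiles can be sketched as a minimal last-write-wins merge of row fragments at read time. The column names and timestamps below are made up, and real reconciliation also handles tombstones, TTLs, and counters:

```python
def merge_row(fragments):
    """fragments: list of dicts mapping column name -> (timestamp, value),
    one dict per SSTable/memtable that holds a piece of the row. The read
    path reconciles them cell by cell, keeping the highest timestamp."""
    merged = {}
    for frag in fragments:
        for col, (ts, val) in frag.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

# Two hypothetical fragments of the same row, from two SSTables:
older = {"name": (100, "alice"), "city": (100, "sf")}
newer = {"city": (200, "nyc")}
row = merge_row([older, newer])
# row == {"name": "alice", "city": "nyc"}
```

This is why the "N binary searches into one big sorted index" mental model doesn't hold: writes are appended to a memtable and flushed to new immutable files, and the sorting/merging cost is paid partly at flush, partly at compaction, and partly at read time.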
Re: TTransportException (java.net.SocketException: Broken pipe)
On Mon, Jul 21, 2014 at 8:07 AM, Bhaskar Singhal bhaskarsing...@yahoo.com wrote: I have not seen the issue after changing the commit log segment size to 1024MB.

Yes... your insanely over-huge commitlog will be contained in fewer files if you increase the size of the segments. That will not make it any less of an insanely over-huge commitlog, which indicates systemic failure in your application's use of Cassandra. Congratulations on masking your actual issue with your configuration change.

Pool Name       Active   Pending   Completed   Blocked  All time blocked
FlushWriter          6        10       12442         1              2940

1/4 of flush attempts blocked waiting for resources, and you have 6 active flushes and 10 pending, because YOU'RE WRITING TOO FAST.

As a meta aside, I am unlikely to respond to further questions of yours which do not engage with what I have now told you three or four times: YOU'RE WRITING TOO FAST.

=Rob
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
On Mon, Jul 21, 2014 at 1:58 AM, Ben Hood 0x6e6...@gmail.com wrote: On Sat, Jul 19, 2014 at 7:35 PM, Karl Rieb karl.r...@gmail.com wrote: Can now be followed at: https://issues.apache.org/jira/browse/CASSANDRA-7576. Nice work! Finally we have a proper solution to this issue, so well done to you. For reference, I consider this issue of sufficient severity to recommend against upgrading to any version of 2.0 before 2.0.10, unless you are certain you have no such schema. I'm pretty sure reversed comparator timestamps are a common type of schema, given that there are blog posts recommending their use, so I struggle to understand how this was not detected by unit tests. Does your fix add unit tests which would catch this case on upgrade? =Rob
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
I did not include unit tests in my patch. I think many people did not run into this issue because many Cassandra clients handle DateType, when encountered, as a CUSTOM type. -Karl On Jul 21, 2014, at 8:26 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Jul 21, 2014 at 1:58 AM, Ben Hood 0x6e6...@gmail.com wrote: On Sat, Jul 19, 2014 at 7:35 PM, Karl Rieb karl.r...@gmail.com wrote: Can now be followed at: https://issues.apache.org/jira/browse/CASSANDRA-7576. Nice work! Finally we have a proper solution to this issue, so well done to you. For reference, I consider this issue of sufficient severity to recommend against upgrading to any version of 2.0 before 2.0.10, unless you are certain you have no such schema. I'm pretty sure reversed comparator timestamps are a common type of schema, given that there are blog posts recommending their use, so I struggle to understand how this was not detected by unit tests. Does your fix add unit tests which would catch this case on upgrade? =Rob
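Karl's point about clients masking the bug comes down to a lenient decoding pattern: a driver that maps any unknown protocol type ID to a generic CUSTOM type keeps working even when the server advertises a wrong ID. A minimal sketch of that fallback (the IDs follow the native-protocol option codes, but treat them as illustrative):

```python
# Subset of native-protocol option IDs (0x0000 = CUSTOM, which carries a class name).
KNOWN_TYPE_IDS = {
    0x0000: "custom",
    0x000B: "timestamp",
    0x000D: "varchar",
}

def decode_type(type_id):
    """Fall back to 'custom' for any unrecognized ID -- the lenient behavior
    that hid the mis-assigned DateType ID from many clients."""
    return KNOWN_TYPE_IDS.get(type_id, "custom")

print(decode_type(0x000B))  # timestamp
print(decode_type(0x7FFF))  # custom (unknown ID handled gracefully)
```

A strict client that raised on an unknown ID would have surfaced the regression immediately, which is why only some users hit it.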
Re: map reduce for Cassandra
Hi Robert, First of all, thanks for answering. 2014-07-21 20:18 GMT-03:00 Robert Coli rc...@eventbrite.com: You're wrong, unless you're talking about insertion into a memtable, which you probably aren't and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable datafiles, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make any sense in light of this? Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert in Cassandra against the time of sorting in Hadoop. To make it more clear: Hadoop has its own partitioner, which is used after the map phase. The map output is written locally on each Hadoop node, then it's shuffled from one node to the other (see slide 17 in this presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In other words, you may read Cassandra data in Hadoop, but the intermediate results are still stored in HDFS. Instead of using the Hadoop partitioner, I would like to store the intermediate results in a Cassandra CF, so the map output would go directly to an intermediate column family via batch inserts, instead of being written to a local disk first and then shuffled to the right node. Therefore, the mapper would write its output the same way all data enters Cassandra: first to a memtable, then flushed to an sstable, then read during the reduce phase. Shouldn't it be faster than storing intermediate results in HDFS? Best regards, Marcelo.
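Marcelo's cost comparison (binary-search insert into a sorted memtable vs. an explicit sort phase) can be sketched like this; it is a toy model only, nothing like Cassandra's actual concurrent skip-list memtable:

```python
import bisect
import random

# Toy "memtable": keep writes sorted as they arrive, so a flush just
# streams the structure out in order. Each insert costs O(log N)
# comparisons (the Python list shift is O(N), though; a skip list avoids that).
memtable = []
for _ in range(1000):
    bisect.insort(memtable, random.randrange(10**6))

# Flush order is already sorted -- no separate sort phase needed.
assert memtable == sorted(memtable)
```

N such inserts cost O(N log N) comparisons in total, the same asymptotic bound as sorting the batch up front; the practical difference is in constants, I/O, and where the work happens, which is exactly the trade-off being debated in this thread.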
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 5:45 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert in Cassandra against the time of sorting in Hadoop. I was confused, because unless you are using the new in-memory column families, which I believe are only available in DSE, there is no way to ensure that any given row stays in a memtable. Very rarely is there a view of the function of a memtable that only cares about its properties and not the closely related properties of SSTables. However, yours is one of them; I see now why your question makes sense: you only care about the memtable for how quickly it sorts. But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra? I'm certainly not an expert in this area of Cassandra... but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets... are you planning to use DSE? =Rob
Re: map reduce for Cassandra
Hi, But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra? Actually, it's not a reason to use Cassandra. I already use Cassandra and I need to map reduce data from it. I am trying to see a reason to use the conventional M/R tools or to build a tool specific to Cassandra. but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets... Indeed, but so far I am seeing it as the best option. If storing these intermediate files in HDFS is better, then I agree there is no reason to consider Cassandra for it. are you planning to use DSE? Our company will probably hire DSE support when it reaches some size, but DSE as a product doesn't seem interesting for our case so far. The only tool that would help me at this moment would be Hive, but honestly I didn't like the way DSE supports Hive and I don't want to use a solution not available to DSC (see http://stackoverflow.com/questions/23959169/problems-using-hive-cassandra-community for details). []s 2014-07-21 22:09 GMT-03:00 Robert Coli rc...@eventbrite.com: On Mon, Jul 21, 2014 at 5:45 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert in Cassandra against the time of sorting in Hadoop. I was confused, because unless you are using the new in-memory column families, which I believe are only available in DSE, there is no way to ensure that any given row stays in a memtable. Very rarely is there a view of the function of a memtable that only cares about its properties and not the closely related properties of SSTables. 
However, yours is one of them; I see now why your question makes sense: you only care about the memtable for how quickly it sorts. But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra? I'm certainly not an expert in this area of Cassandra... but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets... are you planning to use DSE? =Rob
Re: horizontal query scaling issues follow on
So I appreciate all the help so far. Upfront, it is possible the schema and data query pattern could be contributing to the problem. The schema was born out of certain design requirements. If it proves to be part of what makes the scalability crumble, then I hope it will help shape the design requirements. Anyway, the premise of the question was my struggle where scalability metrics fell apart going from 2 nodes to 4 nodes for the current schema and query access pattern being modeled:

- 1 node was producing acceptable response times, which seemed to be the consensus
- 2 nodes showed marked improvement to the response times for the query scenario being modeled, which was welcome news
- 4 nodes showed a decrease in performance, and it was not clear why going from 2 to 4 nodes triggered the decrease

Two more items also contributed to the question:

- cassandra-env.sh, where the comments for HEAP_NEWSIZE state that it assumes a modern 8-core machine for pause times
- a wiki article I had found (and am trying to relocate) where a person set up very small nodes for the developers on that team and talked through all the parameters that had to be changed from the defaults to get good throughput; it sort of implied the defaults maybe were based on a certain-sized VM

That was the main driver for those questions. I agree it does not seem correct to boost the values, let alone so high, to minimize impact in some respects (i.e. not trigger the reads to time out and start over given the retry policy). So the question really was: are the defaults sized with the assumption of a certain minimal VM size (i.e. the comment in cassandra-env.sh)? Does that explain where I am coming from better? My question, despite being naive and ignoring other impacts, still stands: is there a minimal VM size that is more of the sweet spot for Cassandra and the defaults? I get the point that a column family schema, as it relates to the desired queries, can and does impact that answer. 
I guess what bothered me was it didn't impact that answer going from 1 node to 2 nodes, but started showing up going from 2 nodes to 4 nodes. I'm building whatever facts I can to support whether the schema and query pattern scales or not. If it does not, then I am trying to pull information from metrics output by nodetool or log statements in the Cassandra log files to support a case to change the design requirements. Thanks, Diane On Mon, Jul 21, 2014 at 8:15 PM, Robert Coli rc...@eventbrite.com wrote: On Sun, Jul 20, 2014 at 6:12 PM, Diane Griffith dfgriff...@gmail.com wrote: I am running tests again across different numbers of client threads and numbers of nodes, but this time I tweaked some of the timeouts configured for the nodes in the cluster. I was able to get better performance on the nodes at 10 client threads by upping 4 timeout values in cassandra.yaml to 24: If you have to tune these timeout values, you have probably modeled data in such a way that each of your requests is quite large or quite slow. This is usually, but not always, an indicator that you are Doing It Wrong. Massively multithreaded things don't generally like their threads to be long-lived, for what should hopefully be obvious reasons. I did this because of my interpretation of the cfhistograms output on one of the nodes. Could you be more specific? So 3 questions come to mind: 1. Did I interpret the histogram information correctly in the Cassandra 2.0.6 nodetool output? That is, in the 2-column read latency output, the offset or left column is the time in milliseconds and the right column is the number of requests that fell into that bucket range. 2. Was it reasonable for me to boost those 4 timeouts and just those? Not really. In 5 years of operating Cassandra, I've never had a problem whose solution was to increase these timeouts from their default. 3. What are reasonable timeout values for smaller VM sizes (i.e. 8GB RAM, 4 CPUs)? As above, I question the premise of this question. =Rob
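To make the cfhistograms question concrete: each latency row is an (offset, count) bucket pair, and a percentile can be estimated by walking the cumulative counts. A rough sketch (the bucket values below are made up, and you should check your version's docs for whether the latency offsets are micro- or milliseconds):

```python
def approx_percentile(buckets, pct):
    """Estimate a latency percentile from (offset, count) rows as printed
    by `nodetool cfhistograms`; offset is the bucket's upper bound."""
    total = sum(count for _, count in buckets)
    threshold = total * pct
    running = 0
    for offset, count in buckets:
        running += count
        if running >= threshold:
            return offset
    return buckets[-1][0]

# Hypothetical read-latency rows: (offset, number of reads in that bucket).
rows = [(103, 5000), (124, 3000), (179, 1500), (1597, 400), (24601, 100)]
print(approx_percentile(rows, 0.99))  # 1597
```

Looking at a high percentile this way, rather than the raw bucket list, is usually a better basis for deciding whether a timeout is being hit by a few slow outliers or by the bulk of requests.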
Re: Authentication exception
Could you perhaps check your NTP? On Tue, Jul 22, 2014 at 3:35 AM, Jeremy Jongsma jer...@barchart.com wrote: I routinely get this exception from cqlsh on one of my clusters: cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.') The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (i.e. 30 minutes), it will let me in again. What could be the cause of this?