RE: Two problems with Cassandra

2015-02-17 Thread SEAN_R_DURITY
Full table scans are not the best use case for Cassandra. Without some kind of 
pagination, the node taking the request (the coordinator node) will try to 
assemble the data from all nodes to return to the client. With a dataset of any 
decent size, that will overwhelm the single coordinator.

Pagination is supported in newer versions of Cassandra (2.0.x+, I think) and 
some drivers; a sketch of paged iteration follows below. There is other 
discussion on the list about the best ways to split your workload and do some 
parallel processing. Something I haven't seen mentioned recently (but probably 
discussed before I joined the list) is setting up a separate analytics DC. 
There you could integrate with Hadoop or Spark, or just size your nodes 
differently to handle an analytics-type workload.
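
For illustration, a minimal sketch of paged iteration with the DataStax Python 
driver (automatic paging needs driver 2.0+ against Cassandra 2.0+); the 
keyspace, table, and column names here are invented, not from this thread:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # hypothetical keyspace

# fetch_size caps the rows per page, so neither the coordinator nor the
# client has to materialize the whole table in one response.
scan = SimpleStatement('SELECT id, body FROM docs', fetch_size=1000)
for row in session.execute(scan):  # the driver fetches pages transparently
    print(row.id)                  # stand-in for real per-row work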

We have found that it is better to use a list of known keys and pull back rows 
(aka partitions) individually for any table-scan-type operation; a sketch of 
that follows below as well. However, we are usually able to generate the list 
of keys outside of Cassandra…
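
A hedged sketch of that known-keys pattern with the same driver (the keys 
file, table, and columns are made up for illustration):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # hypothetical keyspace

select = session.prepare('SELECT id, body FROM docs WHERE id = ?')

# The key list is generated outside Cassandra, e.g. from a flat file.
with open('keys.txt') as f:
    keys = [int(line) for line in f]

# Pull partitions individually, keeping a bounded window of async
# requests in flight so client memory stays flat.
futures = []
for key in keys:
    futures.append(session.execute_async(select, (key,)))
    if len(futures) >= 128:
        for future in futures:
            rows = future.result()  # stand-in for real per-partition work
        futures = []
for future in futures:              # drain the final partial window
    rows = future.result()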


Sean Durity – Cassandra Admin, Home Depot

From: Pavel Velikhov [mailto:pavel.velik...@gmail.com]
Sent: Thursday, February 12, 2015 4:23 AM
To: user@cassandra.apache.org
Subject: Re: Two problems with Cassandra


On Feb 12, 2015, at 12:37 AM, Robert Coli <rc...@eventbrite.com> wrote:

On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
  2. While trying to update the full dataset with a simple transformation 
(again via the Python driver), both single-node and clustered Cassandra run 
out of memory no matter what settings I try, even if I put a lot of sleeps 
into the mix. However, simpler transformations (updating just one column, 
especially when there is a lot of processing overhead) work just fine.

What does "a simple transformation" mean here? Assuming a reasonably sized 
heap, OOM sounds like you're trying to update a large number of large 
partitions in a single operation.

In general, in Cassandra, you're best off interacting with a single or small 
number of partitions in any given interaction.

=Rob


Hi Robert!

  A simple transformation changes just a single column value (though I usually 
do it for the whole dataset).
  But when I was running out of memory, I was reading in 5 columns and updating 
3. Some of them could be big, but I need to check and rerun this case.
  (I worked around this by dumping to files and then scanning the files and 
updating the database, but this stinks!)

  I don’t quite understand the fundamentals of Cassandra here: if I’m doing 
just one scan, fetching a reasonable number of columns, and updating at the 
same time, what’s happening? Why does it eat up so much memory and die?
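
For reference, a likely factor in the OOM described above: without paging, the 
coordinator assembles the entire scan result and the client driver buffers it 
as well, so a whole-table read-plus-update grows memory until something dies. 
A minimal sketch of a paged read-update loop, with invented schema names and a 
hypothetical transform() function:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # hypothetical keyspace

update = session.prepare('UPDATE docs SET body = ? WHERE id = ?')

def transform(text):  # hypothetical per-row transformation
    return text.lower()

# Only fetch_size rows are held in memory at a time, and each updated
# row is written back as its own small statement, not one giant batch.
scan = SimpleStatement('SELECT id, body FROM docs', fetch_size=500)
for row in session.execute(scan):
    session.execute(update, (transform(row.body), row.id))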





Re: Two problems with Cassandra

2015-02-12 Thread Pavel Velikhov

 On Feb 12, 2015, at 12:37 AM, Robert Coli <rc...@eventbrite.com> wrote:
 
 On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
   2. While trying to update the full dataset with a simple transformation 
 (again via the Python driver), both single-node and clustered Cassandra run 
 out of memory no matter what settings I try, even if I put a lot of sleeps 
 into the mix. However, simpler transformations (updating just one column, 
 especially when there is a lot of processing overhead) work just fine.
 
 What does "a simple transformation" mean here? Assuming a reasonably sized 
 heap, OOM sounds like you're trying to update a large number of large 
 partitions in a single operation.
 
 In general, in Cassandra, you're best off interacting with a single or small 
 number of partitions in any given interaction.
 
 =Rob
 

Hi Robert!

  A simple transformation changes just a single column value (though I usually 
do it for the whole dataset).
  But when I was running out of memory, I was reading in 5 columns and updating 
3. Some of them could be big, but I need to check and rerun this case.
  (I worked around this by dumping to files and then scanning the files and 
updating the database, but this stinks!)

  I don’t quite understand the fundamentals of Cassandra here: if I’m doing 
just one scan, fetching a reasonable number of columns, and updating at the 
same time, what’s happening? Why does it eat up so much memory and die?

Re: Two problems with Cassandra

2015-02-11 Thread Robert Coli
On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:

   2. While trying to update the full dataset with a simple transformation
 (again via the Python driver), both single-node and clustered Cassandra run
 out of memory no matter what settings I try, even if I put a lot of sleeps
 into the mix. However, simpler transformations (updating just one column,
 especially when there is a lot of processing overhead) work just fine.


What does "a simple transformation" mean here? Assuming a reasonably sized
heap, OOM sounds like you're trying to update a large number of large
partitions in a single operation.

In general, in Cassandra, you're best off interacting with a single or
small number of partitions in any given interaction.

=Rob


Re: Two problems with Cassandra

2015-02-11 Thread Pavel Velikhov
Hi Carlos,

  I tried on a single node and a 4-node cluster. On the 4-node cluster I set up 
the tables with replication factor = 2.
I usually iterate over a subset, but it can be about 40% right now. Some of my 
column values could be quite big… I remember I was exporting to CSV and had to 
change the default CSV max column length.

If I just update, there are no problems; it's reading and updating together 
that kills everything (could it have something to do with the driver?)

I’m using the 2.0.8 release right now.

I was trying to tweak memory sizes. If I give Cassandra too much memory (8 or 
16 GB) it dies much faster, due to GC not being able to keep up. But it 
consistently dies on a specific row in the single-instance case…

Is this enough info to point me somewhere?

Thank you,
Pavel

 On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote:
 
 Hello Pavel,
 
 What is the size of the cluster (# of nodes)? And do you need to iterate over 
 the full 1TB every time you do the update, or just parts of it?
 
 IMO there is too little information to make any kind of assessment of the 
 problem you are having.
 
 I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same 
 problem. 
 
 Regards,
 
 Carlos Juzarte Rolo
 Cassandra Consultant
  
 Pythian - Love your data
 
 rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
 Tel: 1649
 www.pythian.com http://www.pythian.com/
 On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
 Hi,
 
   I’m using Cassandra to store NLP data; the dataset is not that huge (about 
 1TB), but I need to iterate over it quite frequently, updating the full 
 dataset (each record, but not necessarily every column).
 
   I’ve run into two problems (I’m using the latest Cassandra):
 
   1. I was trying to copy from one Cassandra cluster to another via a Python 
 driver; however, the driver confused the two instances.
   2. While trying to update the full dataset with a simple transformation 
 (again via the Python driver), both single-node and clustered Cassandra run 
 out of memory no matter what settings I try, even if I put a lot of sleeps 
 into the mix. However, simpler transformations (updating just one column, 
 especially when there is a lot of processing overhead) work just fine.
 
 I’m really concerned about #2, since we’re moving all heavy processing to a 
 Spark cluster and will expand it, and I would expect much heavier traffic 
 to/from Cassandra. Any hints, war stories, etc. are very much appreciated!
 
 Thank you,
 Pavel Velikhov
 
 



Re: Two problems with Cassandra

2015-02-11 Thread Carlos Rolo
Hello Pavel,

What is the size of the cluster (# of nodes)? And do you need to iterate over
the full 1TB every time you do the update, or just parts of it?

IMO there is too little information to make any kind of assessment of the
problem you are having.

I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same
problem.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:

 Hi,

   I’m using Cassandra to store NLP data; the dataset is not that huge
 (about 1TB), but I need to iterate over it quite frequently, updating the
 full dataset (each record, but not necessarily every column).

   I’ve run into two problems (I’m using the latest Cassandra):

   1. I was trying to copy from one Cassandra cluster to another via a
 Python driver; however, the driver confused the two instances.
   2. While trying to update the full dataset with a simple transformation
 (again via the Python driver), both single-node and clustered Cassandra run
 out of memory no matter what settings I try, even if I put a lot of sleeps
 into the mix. However, simpler transformations (updating just one column,
 especially when there is a lot of processing overhead) work just fine.

 I’m really concerned about #2, since we’re moving all heavy processing to
 a Spark cluster and will expand it, and I would expect much heavier traffic
 to/from Cassandra. Any hints, war stories, etc. are very much appreciated!

 Thank you,
 Pavel Velikhov



Re: Two problems with Cassandra

2015-02-11 Thread Carlos Rolo
Updates should not be a problem because no read is done, so there is no need
to pull the data out.
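
A tiny sketch of why that is: a CQL UPDATE is a blind write, so Cassandra
appends the new cell without reading the old one (the names below are
invented for illustration):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # hypothetical keyspace

# No read-before-write happens here; only the new cell value is written.
update = session.prepare('UPDATE docs SET status = ? WHERE id = ?')
session.execute(update, ('processed', 42))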

Is that row bigger than your memory capacity (or heap size)? For dealing
with large heaps you can refer to this ticket: CASSANDRA-8150. It provides
some nice tips.

It would be good if someone else could share their experience.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Wed, Feb 11, 2015 at 12:05 PM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:

 Hi Carlos,

  I tried on a single node and a 4-node cluster. On the 4-node cluster I
 set up the tables with replication factor = 2.
 I usually iterate over a subset, but it can be about 40% right now. Some
 of my column values could be quite big… I remember I was exporting to CSV
 and had to change the default CSV max column length.

 If I just update, there are no problems; it's reading and updating together
 that kills everything (could it have something to do with the driver?)

 I’m using the 2.0.8 release right now.

 I was trying to tweak memory sizes. If I give Cassandra too much memory
 (8 or 16 GB) it dies much faster, due to GC not being able to keep up. But
 it consistently dies on a specific row in the single-instance case…

 Is this enough info to point me somewhere?

 Thank you,
 Pavel

 On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote:

 Hello Pavel,

 What is the size of the cluster (# of nodes)? And do you need to iterate over
 the full 1TB every time you do the update, or just parts of it?

 IMO there is too little information to make any kind of assessment of the
 problem you are having.

 I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same
 problem.

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
 Tel: 1649
 www.pythian.com

 On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:

 Hi,

   I’m using Cassandra to store NLP data; the dataset is not that huge
 (about 1TB), but I need to iterate over it quite frequently, updating the
 full dataset (each record, but not necessarily every column).

   I’ve run into two problems (I’m using the latest Cassandra):

   1. I was trying to copy from one Cassandra cluster to another via a
 Python driver; however, the driver confused the two instances.
   2. While trying to update the full dataset with a simple transformation
 (again via the Python driver), both single-node and clustered Cassandra run
 out of memory no matter what settings I try, even if I put a lot of sleeps
 into the mix. However, simpler transformations (updating just one column,
 especially when there is a lot of processing overhead) work just fine.

 I’m really concerned about #2, since we’re moving all heavy processing to
 a Spark cluster and will expand it, and I would expect much heavier traffic
 to/from Cassandra. Any hints, war stories, etc. are very much appreciated!

 Thank you,
 Pavel Velikhov


