Re: Spark Cassandra Python Connector

2016-06-20 Thread Jonathan Haddad
I wouldn't recommend the TargetHolding lib.  It's only useful for working
with RDDs, which are a terrible idea in Python, as the perf will make you
cry with any reasonably sized dataset.

The DataStax Spark Cassandra connector works with Python + DataFrames
without the crazy overhead of RDDs.  Docs for working with Python are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md

I did a talk at the Cassandra Summit on this; the slides are here:
http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark
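
For reference, a minimal PySpark sketch of reading a Cassandra table as a
DataFrame through the connector. It assumes Spark 2.x with the connector
package on the classpath; the keyspace, table, column, and host names are
hypothetical.

from pyspark.sql import SparkSession

# Assumes the connector is supplied externally, e.g. via
# --packages com.datastax.spark:spark-cassandra-connector_2.11:<version>
spark = (SparkSession.builder
         .appName("cassandra-dataframe-example")
         .config("spark.cassandra.connection.host", "127.0.0.1")  # assumption
         .getOrCreate())

# Read straight into a DataFrame - no RDDs involved.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")  # hypothetical names
      .load())

# Simple filters can be pushed down to Cassandra by the connector where possible.
df.filter(df.some_column > 10).show()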

On Mon, Jun 20, 2016 at 3:14 PM Dennis Lovely  wrote:

> https://github.com/TargetHolding/pyspark-cassandra
>
> On Mon, Jun 20, 2016 at 1:47 PM, Joaquin Alzola  wrote:
>
>> Hi List
>>
>> Is there a Spark Cassandra connector in python? Of course there is the
>> one for scala ...
>>
>> BR
>>
>> Joaquin
>> This email is confidential and may be subject to privilege. If you are
>> not the intended recipient, please do not copy or disclose its content but
>> contact the sender immediately upon receipt.
>>
>
>


Re: Incremental repairs in 3.0

2016-06-20 Thread Bryan Cheng
Sorry, meant to say "therefore manual migration procedure should be
UNnecessary"

On Mon, Jun 20, 2016 at 3:21 PM, Bryan Cheng  wrote:

> I don't use 3.x so hopefully someone with operational experience can chime
> in, however my understanding is: 1) Incremental repairs should be the
> default in the 3.x release branch and 2) sstable repairedAt is now properly
> set in all sstables as of 2.2.x for standard repairs and therefore manual
> migration procedure should be necessary. It may still be a good idea to
> manually migrate if you have a sizable amount of data and are using LCS as
> anticompaction is rather painful.
>
> On Sun, Jun 19, 2016 at 6:37 AM, Vlad  wrote:
>
>> Hi,
>>
>> assuming I have a new, empty Cassandra cluster, how should I start using
>> incremental repairs? Is incremental repair the default now (as I don't see
>> an *-inc* option in nodetool) and nothing is needed to use it, or should we
>> perform the migration procedure anyway? And what happens to new column
>> families?
>>
>> Regards.
>>
>
>


Re: Incremental repairs in 3.0

2016-06-20 Thread Bryan Cheng
I don't use 3.x so hopefully someone with operational experience can chime
in, however my understanding is: 1) Incremental repairs should be the
default in the 3.x release branch and 2) sstable repairedAt is now properly
set in all sstables as of 2.2.x for standard repairs and therefore manual
migration procedure should be necessary. It may still be a good idea to
manually migrate if you have a sizable amount of data and are using LCS as
anticompaction is rather painful.

On Sun, Jun 19, 2016 at 6:37 AM, Vlad  wrote:

> Hi,
>
> assuming I have a new, empty Cassandra cluster, how should I start using
> incremental repairs? Is incremental repair the default now (as I don't see
> an *-inc* option in nodetool) and nothing is needed to use it, or should we
> perform the migration procedure anyway? And what happens to new column
> families?
>
> Regards.
>


Re: Spark Cassandra Python Connector

2016-06-20 Thread Dennis Lovely
https://github.com/TargetHolding/pyspark-cassandra

On Mon, Jun 20, 2016 at 1:47 PM, Joaquin Alzola 
wrote:

> Hi List
>
> Is there a Spark Cassandra connector in python? Of course there is the one
> for scala ...
>
> BR
>
> Joaquin
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Spark Cassandra Python Connector

2016-06-20 Thread Joaquin Alzola
Hi List

Is there a Spark Cassandra connector in python? Of course there is the one for 
scala ...

BR

Joaquin
This email is confidential and may be subject to privilege. If you are not the 
intended recipient, please do not copy or disclose its content but contact the 
sender immediately upon receipt.


Re: High Heap Memory usage during nodetool repair in Cassandra 3.0.3

2016-06-20 Thread Atul Saroha
We have tried this with 3.5, and there too heap usage was optimized as in
3.7. However, we had to roll back from 3.5 to 3.0.3 due to CASSANDRA-11513.

-
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Mon, Jun 20, 2016 at 10:00 PM, Paulo Motta 
wrote:

> You could also be hitting CASSANDRA-11739, which was fixed in 3.0.7 and
> could potentially cause OOMs for long-running repairs.
>
>
> 2016-06-20 13:26 GMT-03:00 Robert Stupp :
>
>> One possibility might be CASSANDRA-11206 (Support large partitions on the
>> 3.0 sstable format), which reduces heap usage for other operations (like
>> repair, compactions) as well.
>> You can verify that by setting column_index_cache_size_in_kb in c.yaml to
>> a really high value like 1000 - if you see the same behaviour in 3.7
>> with that setting, there’s not much you can do except upgrading to 3.7 as
>> that change went into 3.6 and not into 3.0.x.
>>
>> —
>> Robert Stupp
>> @snazy
>>
>> On 20 Jun 2016, at 18:13, Bhuvan Rawal  wrote:
>>
>> Hi All,
>>
>> We are running Cassandra 3.0.3 in production with a max heap size of 8GB.
>> There has been a consistent issue with nodetool repair for a while; we have
>> tried issuing it with multiple options (--pr, --local as well), and sometimes
>> a node went down with an Out of Memory error, while at times nodes stopped
>> accepting any connections, even JMX nodetool commands.
>>
>> On trying with the same data on 3.7, repair ran successfully without
>> encountering any of the above-mentioned issues. I then tried increasing the
>> heap to 16GB on 3.0.3 and repair ran successfully.
>>
>> I then analyzed memory usage during nodetool repair for 3.0.3 (16GB heap)
>> vs 3.7 (8GB heap): 3.0.3 occupied 11-14 GB at all times, whereas 3.7 stayed
>> between 1-4.5 GB while repair ran. Both ran a full repair on the same
>> dataset with the same unrepaired data.
>>
>> We would like to know if this is a known bug that was fixed post 3.0.3, and
>> whether there is a way we can run repair on 3.0.3 without increasing the
>> heap size, as 8GB works for us for all other activities.
>>
>> PFA the visualvm snapshots.
>>
>> 
>> 3.0.3 VisualVM snapshot: consistent heap usage of greater than 12 GB.
>>
>> 3.7 VisualVM snapshot: 8GB max heap, with max heap usage of up to about 5 GB.
>>
>> Thanks & Regards,
>> Bhuvan Rawal
>>
>>
>> PS: In case the snapshots are not visible, they can be viewed at the
>> following links:
>> 3.0.3:
>> https://s31.postimg.org/4e7ifsjaz/Screenshot_from_2016_06_20_21_06_09.png
>> 3.7:
>> https://s31.postimg.org/xak32s9m3/Screenshot_from_2016_06_20_21_05_57.png
>>
>>
>>
>


Counter update write timeouts with Datastax Driver/Native protocol, not with Astyanax/Thrift

2016-06-20 Thread Steven Levitt
I've posted the following to the Datastax Java Driver user forum, but no
one has responded, so I thought I'd try here, too.

We have a service that writes to a few legacy (pre-CQL) counter column
families in a Cassandra 2.1.11 cluster. We've been trying to migrate this
service from Astyanax to the Datastax Java Driver (version 2.1.10.1). We've
been testing the new version in a "shadow" deployment in a production
environment, using the same Cassandra cluster as the production version,
but writing to a testing-only keyspace.

Occasionally, unlogged batches of counter updates in the same partition
will fail with the following error from the coordinator:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra
timeout during write query at consistency ONE (1 replica were required but
only 0 acknowledged the write)

We've only observed these errors in the service version that uses the
Datastax Driver, not the version that uses Astyanax.

These batches are written with CL=LOCAL_QUORUM; the CL in the error message
doesn't match. This resembles the symptoms of the issue described in
CASSANDRA-10041 ("timeout during write query at consistency ONE" when
updating counter at consistency QUORUM and 2 of 3 nodes alive).

In that issue, the error occurs when a node is abruptly terminated.
However, we've also seen the error occur when all Cassandra nodes appeared
to be healthy.

There are a few possible explanations for why the errors only occur with
the Datastax driver, but I'm not sure which is correct:
a) There is a problem with how we're using the Datastax Driver to compose
batches of counter updates.
b) There is a difference between the implementation of counter updates in
the native protocol and the Thrift protocol, such that the error is reported
to native clients but not to Thrift clients.
c) There is a difference between the keyspace/column family definitions of
the production and testing keyspaces.
d) The Astyanax/Thrift version is getting the error but is ignoring it for
some reason.

I doubt (c) is the reason; we've made an effort to ensure that the keyspace
and CF configurations are the same. Also, (d) seems unlikely because we've
seen other errors (such as unavailable exceptions) reported correctly. So,
I'm betting that either (a) or (b) is the reason.
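
Regarding (a), for concreteness, a minimal sketch of composing such an
unlogged counter batch at LOCAL_QUORUM. It is shown with the DataStax Python
driver purely for illustration (the service in question uses the Java driver),
and the keyspace, table, and column names are hypothetical.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])            # assumed contact point
session = cluster.connect("test_keyspace")  # hypothetical keyspace

increment = session.prepare(
    "UPDATE page_views SET views = views + ? WHERE page_id = ? AND day = ?")

# Counter updates must go into a COUNTER batch (unlogged by nature); every
# statement below targets the same partition (same page_id).
batch = BatchStatement(batch_type=BatchType.COUNTER,
                       consistency_level=ConsistencyLevel.LOCAL_QUORUM)
batch.add(increment, (1, "home", "2016-06-20"))
batch.add(increment, (3, "home", "2016-06-21"))
session.execute(batch)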

Would someone please suggest which of these explanations is likely to be
correct, and what we might do to avoid the problem?

-- 
- Steven


Re: High Heap Memory usage during nodetool repair in Cassandra 3.0.3

2016-06-20 Thread Paulo Motta
You could also be hitting CASSANDRA-11739, which was fixed in 3.0.7 and
could potentially cause OOMs for long-running repairs.

2016-06-20 13:26 GMT-03:00 Robert Stupp :

> One possibility might be CASSANDRA-11206 (Support large partitions on the
> 3.0 sstable format), which reduces heap usage for other operations (like
> repair, compactions) as well.
> You can verify that by setting column_index_cache_size_in_kb in c.yaml to
> a really high value like 1000 - if you see the same behaviour in 3.7
> with that setting, there’s not much you can do except upgrading to 3.7 as
> that change went into 3.6 and not into 3.0.x.
>
> —
> Robert Stupp
> @snazy
>
> On 20 Jun 2016, at 18:13, Bhuvan Rawal  wrote:
>
> Hi All,
>
> We are running Cassandra 3.0.3 in production with a max heap size of 8GB.
> There has been a consistent issue with nodetool repair for a while; we have
> tried issuing it with multiple options (--pr, --local as well), and sometimes
> a node went down with an Out of Memory error, while at times nodes stopped
> accepting any connections, even JMX nodetool commands.
>
> On trying with the same data on 3.7, repair ran successfully without
> encountering any of the above-mentioned issues. I then tried increasing the
> heap to 16GB on 3.0.3 and repair ran successfully.
>
> I then analyzed memory usage during nodetool repair for 3.0.3 (16GB heap)
> vs 3.7 (8GB heap): 3.0.3 occupied 11-14 GB at all times, whereas 3.7 stayed
> between 1-4.5 GB while repair ran. Both ran a full repair on the same
> dataset with the same unrepaired data.
>
> We would like to know if this is a known bug that was fixed post 3.0.3, and
> whether there is a way we can run repair on 3.0.3 without increasing the
> heap size, as 8GB works for us for all other activities.
>
> PFA the visualvm snapshots.
>
> 
> 3.0.3 VisualVM snapshot: consistent heap usage of greater than 12 GB.
>
> 3.7 VisualVM snapshot: 8GB max heap, with max heap usage of up to about 5 GB.
>
> Thanks & Regards,
> Bhuvan Rawal
>
>
> PS: In case the snapshots are not visible, they can be viewed at the
> following links:
> 3.0.3:
> https://s31.postimg.org/4e7ifsjaz/Screenshot_from_2016_06_20_21_06_09.png
> 3.7:
> https://s31.postimg.org/xak32s9m3/Screenshot_from_2016_06_20_21_05_57.png
>
>
>


Re: High Heap Memory usage during nodetool repair in Cassandra 3.0.3

2016-06-20 Thread Robert Stupp
One possibility might be CASSANDRA-11206 (Support large partitions on the 3.0 
sstable format), which reduces heap usage for other operations (like repair, 
compactions) as well.
You can verify that by setting column_index_cache_size_in_kb in c.yaml to a 
really high value like 1000 - if you see the same behaviour in 3.7 with 
that setting, there’s not much you can do except upgrading to 3.7 as that 
change went into 3.6 and not into 3.0.x.

—
Robert Stupp
@snazy

> On 20 Jun 2016, at 18:13, Bhuvan Rawal  wrote:
> 
> Hi All,
> 
> We are running Cassandra 3.0.3 in production with a max heap size of 8GB.
> There has been a consistent issue with nodetool repair for a while; we have
> tried issuing it with multiple options (--pr, --local as well), and sometimes
> a node went down with an Out of Memory error, while at times nodes stopped
> accepting any connections, even JMX nodetool commands.
>
> On trying with the same data on 3.7, repair ran successfully without
> encountering any of the above-mentioned issues. I then tried increasing the
> heap to 16GB on 3.0.3 and repair ran successfully.
>
> I then analyzed memory usage during nodetool repair for 3.0.3 (16GB heap)
> vs 3.7 (8GB heap): 3.0.3 occupied 11-14 GB at all times, whereas 3.7 stayed
> between 1-4.5 GB while repair ran. Both ran a full repair on the same
> dataset with the same unrepaired data.
>
> We would like to know if this is a known bug that was fixed post 3.0.3, and
> whether there is a way we can run repair on 3.0.3 without increasing the
> heap size, as 8GB works for us for all other activities.
> 
> PFA the visualvm snapshots.
> 
> 
> 3.0.3 VisualVM snapshot: consistent heap usage of greater than 12 GB.
>
> 3.7 VisualVM snapshot: 8GB max heap, with max heap usage of up to about 5 GB.
> 
> Thanks & Regards,
> Bhuvan Rawal
> 
> 
> PS: In case the snapshots are not visible, they can be viewed at the
> following links:
> 3.0.3: 
> https://s31.postimg.org/4e7ifsjaz/Screenshot_from_2016_06_20_21_06_09.png 
> 
> 3.7: 
> https://s31.postimg.org/xak32s9m3/Screenshot_from_2016_06_20_21_05_57.png 
> 


Estimating partition size for C*2.X and C*3.X and Time Series Data Modelling.

2016-06-20 Thread G P
Hello,

I'm currently enrolled in a master's degree and my thesis project involves the
usage of Big Data tools in the context of Smart Grid applications. I explored
several storage solutions and found Cassandra to be a good fit for my problem.
The data is mostly time series data coming in from multiple PLCs, currently
captured and stored by proprietary SCADA software connected to an MSSQL server.
Reading into the C* storage engine and how time series should be modelled, it
is inevitable that I have to use some sort of time bucketing to split the data
into multiple partitions.
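
For example, a weekly bucket could be derived from each sample's timestamp and
used as part of a composite partition key such as ((BuildingAnalyzer, week),
Time). This is only a minimal sketch; the function and column names are
illustrative.

from datetime import datetime, timezone

def week_bucket(ts: datetime) -> str:
    # ISO year + ISO week number, e.g. "2016-W25", so rows from the same
    # analyzer and the same week land in the same partition.
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"

print(week_bucket(datetime(2016, 6, 20, tzinfo=timezone.utc)))  # "2016-W25"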

Here is the issue: in the MSSQL server, each PLC has very wide tables (5 at the
moment for one building), with around 36 columns of data being collected every
10 seconds. Data is queried up to 15 columns at a time, with time ranges varying
between one hour and a whole month. A simple one-to-one mapping of the MSSQL
tables to C* is not recommended due to the way C* 2.X stores its data.

I took the DS220: Data Modelling course, which showcases two formulas for
estimating a partition size based on the table design.

[Two formula images, not reproduced here: the number of values per partition
(Nv) and the estimated partition size on disk (Ps).]
Note: This Ps formula does not account for column name length, TTLs, counter 
columns, and additional overhead.
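
In text form, the two formulas are approximately the following (reconstructed
from the DS220 material, so they may not match the slides exactly; Nr = rows
per partition, Nc = total columns, Npk = primary key columns, Ns = static
columns):

Nv = Nr × (Nc − Npk − Ns) + Ns

Ps ≈ sum(sizeOf(partition key columns)) + sum(sizeOf(static columns))
     + Nr × (sum(sizeOf(clustering columns)) + sum(sizeOf(regular columns)))
     + 8 × Nv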

If my calculations are correct, with a table such as the one below and a time
resolution of 10 seconds, the Ps (partition size) would be shy of 10 MB (the
value often recommended) if I partitioned it weekly.

CREATE TABLE TEST (
BuildingAnalyzer text,
Time timestamp,
P1 double,
P2 double,
P3 double,
Acte1 int,
Acte2 int,
Acte3 int,
PRIMARY KEY (BuildingAnalyzer, Time)
)
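
As a rough cross-check, a minimal Python sketch of the estimate for this TEST
table with weekly buckets, following the reconstructed formula above. It
ignores column-name lengths and other per-cell overhead and assumes an average
partition-key length of 20 bytes, so the result is only an order-of-magnitude
figure.

ROWS_PER_WEEK = 7 * 24 * 3600 // 10      # one row every 10 seconds

PARTITION_KEY_BYTES = 20                 # assumed average BuildingAnalyzer size
CLUSTERING_BYTES = 8                     # Time (timestamp)
REGULAR_BYTES = 3 * 8 + 3 * 4            # P1..P3 double, Acte1..Acte3 int

def estimate_partition_size(n_rows: int) -> int:
    n_regular = 6                        # regular (non-PK, non-static) columns
    n_values = n_rows * n_regular        # Nv = Nr x (Nc - Npk - Ns), Ns = 0
    data = PARTITION_KEY_BYTES + n_rows * (CLUSTERING_BYTES + REGULAR_BYTES)
    overhead = 8 * n_values              # 8-byte write timestamp per value
    return data + overhead

# Roughly 5.6e6 bytes per weekly partition under these assumptions; real
# partitions will be larger once per-cell storage overhead is included.
print(estimate_partition_size(ROWS_PER_WEEK))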

However, as of C* 3.0, a major refactor of the storage engine brought
efficiency gains in storage costs. From what I could gather in [1], clustering
columns and column names are no longer repeated for each value in a record and,
among other things, the timestamps for conflict resolution (the 8 × Nv of the
2nd formula) can be stored only once per record if they have the same value and
are encoded as varints.

I also read [2], which explains the storage format in intricate detail, but it
adds too much complexity for a simple estimation formula.

Is there any way to estimate the partition size of a table with formulas
similar to the ones above?
Should I just model my tables similarly to what is done for metric collection
(a table with "parametername" and "value" columns)?


[1] http://www.datastax.com/2015/12/storage-engine-30
[2] http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

Sorry for the long wall of text,
Best regards,
Gil Pinheiro.