Re: A difficult data model with C*

2016-11-08 Thread ben ben
Hi Vladimir Yudovin,


Thank you very much for your detailed explanation. Maybe I didn't describe 
the requirement clearly. The use cases are:

1. A user logs in to our app.

2. Show the ten movies the user watched most recently, within the last 30 days.

3. The user can click any one of the ten movies and continue watching from the 
last position he/she reached. BTW, a movie can be watched several times by a user, 
and it is the last position that is needed.
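
For illustration, a minimal CQL sketch of one possible model for this access pattern
(the table names and the 30-day TTL are assumptions, not part of the original design):
the per-(user, video) position is kept in one table, and a second table clustered by
last_time DESC serves the "recent" list, with the application de-duplicating video_ids
and keeping the first ten:

CREATE TABLE last_position (
    user_name text,
    video_id text,
    position int,
    last_time timestamp,
    PRIMARY KEY (user_name, video_id)
);

CREATE TABLE recently_watched (
    user_name text,
    last_time timestamp,
    video_id text,
    PRIMARY KEY (user_name, last_time, video_id)
) WITH CLUSTERING ORDER BY (last_time DESC, video_id ASC);

-- on every progress update (toTimestamp(now()) requires Cassandra 2.2+):
UPDATE last_position SET position = 1234, last_time = toTimestamp(now())
    WHERE user_name = 'ben' AND video_id = 'great video';
INSERT INTO recently_watched (user_name, last_time, video_id)
    VALUES ('ben', toTimestamp(now()), 'great video') USING TTL 2592000;  -- 30 days

-- on login: newest first; the client keeps the first ten distinct video_ids
SELECT video_id, last_time FROM recently_watched WHERE user_name = 'ben' LIMIT 50;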


BRs,

BEN


From: Vladimir Yudovin 
Sent: 8 November 2016 22:35:48
To: user
Subject: Re: A difficult data model with C*

Hi Ben,

if you need only a very limited number of positions (ten, as you said), maybe you can 
store them in a LIST of a UDT? Or just as a JSON string?
So you'll have one row per user-video pair.

It can be something like this:

CREATE TYPE play (position int, last_time timestamp);
CREATE TABLE recent (user_name text, video_id text, review LIST<FROZEN<play>>, 
PRIMARY KEY (user_name, video_id));

UPDATE recent SET review = review + [{position: 1234, last_time: 12345}] WHERE user_name='some user' 
AND video_id='great video';
UPDATE recent SET review = review + [{position: 1234, last_time: 123456}] WHERE user_name='some user' 
AND video_id='great video';
UPDATE recent SET review = review + [{position: 1234, last_time: 1234567}] WHERE user_name='some 
user' AND video_id='great video';

You can delete the oldest entry by index:
DELETE review[0] FROM recent WHERE user_name='some user' AND video_id='great 
video';

or by value, if you know the oldest entry:

UPDATE recent SET review = review - [{position: 1234, last_time: 12345}] WHERE user_name='some user' 
AND video_id='great video';
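
Reading the stored positions back is then a single-partition query, e.g. (sketch):

SELECT review FROM recent WHERE user_name='some user' AND video_id='great video';
SELECT video_id, review FROM recent WHERE user_name='some user';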

Best regards, Vladimir Yudovin,
Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.


 On Mon, 07 Nov 2016 21:54:08 -0500ben ben  wrote 




Hi guys,

We are maintaining a system for an on-line video service. ALL users' viewing 
records of every movie are stored in C*. So she/he can continue to enjoy the 
movie from the last point next time. The table is designed as below:
CREATE TABLE recent (
user_name text,
vedio_id text,
position int,
last_time timestamp,
PRIMARY KEY (user_name, vedio_id)
)

It worked well before. However, the records increase every day and the last ten 
items would be adequate for the business. The current model uses vedio_id as the 
clustering key to keep one row per movie, but as you know, the business prefers to 
order by last_time desc. If we use last_time as the clustering key, there will be 
many records for a single movie while only the most recent one is actually desired. 
So how should we model that? Do you have any suggestions?
Thanks!


BRs,
BEN




Re: failure node rejoin

2016-11-08 Thread Ben Slater
There have been a few commit log bugs around in the last couple of months,
so perhaps you’ve hit something that was fixed recently. It would be
interesting to know whether the problem still occurs in 2.2.8.

I suspect what is happening is that when you do your initial read (without
flush) to check the number of rows, the data is in memtables and
theoretically the commitlogs but not sstables. With the forced stop the
memtables are lost and Cassandra should read the commitlog from disk at
startup to reconstruct the memtables. However, it looks like that didn’t
happen for some (bad) reason.

Good news that 3.0.9 fixes the problem so up to you if you want to
investigate further and see if you can narrow it down to file a JIRA
(although the first step of that would be trying 2.2.9 to make sure it’s
not already fixed there).

Cheers
Ben

On Wed, 9 Nov 2016 at 12:56 Yuji Ito  wrote:

> I tried C* 3.0.9 instead of 2.2.
> The data loss problem hasn't happened so far (without `nodetool flush`).
>
> Thanks
>
> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito  wrote:
>
> Thanks Ben,
>
> When I added `nodetool flush` on all nodes after step 2, the problem
> didn't happen.
> Did replay from old commit logs delete rows?
>
> Perhaps, the flush operation just detected that some nodes were down in
> step 2 (just after truncating tables).
> (Insertion and check in step 2 would succeed if one node was down because
> the consistency level was SERIAL.
> If the flush failed on more than one node, the test would retry step 2.)
> However, if so, the problem would happen without deleting Cassandra data.
>
> Regards,
> yuji
>
>
> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater 
> wrote:
>
> Definitely sounds to me like something is not working as expected but I
> don’t really have any idea what would cause that (other than the fairly
> extreme failure scenario). A couple of things I can think of to try to
> narrow it down:
> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
> data is written to sstables rather than relying on commit logs
> 2) Run the test with consistency level quorum rather than serial
> (shouldn’t be any different but quorum is more widely used so maybe there
> is a bug that’s specific to serial)
>
> Cheers
> Ben
>
> On Mon, 24 Oct 2016 at 10:29 Yuji Ito  wrote:
>
> Hi Ben,
>
> The test without killing nodes has been working well without data loss.
> I've repeated my test about 200 times after removing data and
> rebuild/repair.
>
> Regards,
>
>
> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito  wrote:
>
> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater 
> wrote:
>
> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> If so, that would be very unexpected. If you run your tests without
> killing nodes do you get the expected (1,000) rows?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 17:00 Yuji Ito  wrote:
>
> > Are you certain your tests don’t generate any overlapping inserts (by
> PK)?
>
> Yes. The operation 2) also checks the number of rows just after all
> insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater 
> wrote:
>
> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
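
A quick cqlsh sketch of that upsert behaviour, using a throwaway table:

CREATE TABLE t (pk int PRIMARY KEY, val int);
INSERT INTO t (pk, val) VALUES (1, 10);
INSERT INTO t (pk, val) VALUES (1, 20);  -- same primary key: overwrites, no new row
SELECT count(*) FROM t;                  -- returns 1, not 2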
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito  wrote:
>
> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
> the mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater 
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is 

Re: failure node rejoin

2016-11-08 Thread Yuji Ito
I tried C* 3.0.9 instead of 2.2.
The data loss problem hasn't happened so far (without `nodetool flush`).

Thanks

On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito  wrote:

> Thanks Ben,
>
> When I added `nodetool flush` on all nodes after step 2, the problem
> didn't happen.
> Did replay from old commit logs delete rows?
>
> Perhaps, the flush operation just detected that some nodes were down in
> step 2 (just after truncating tables).
> (Insertion and check in step 2 would succeed if one node was down because
> the consistency level was SERIAL.
> If the flush failed on more than one node, the test would retry step 2.)
> However, if so, the problem would happen without deleting Cassandra data.
>
> Regards,
> yuji
>
>
> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater 
> wrote:
>
>> Definitely sounds to me like something is not working as expected but I
>> don’t really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>> data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level quorum rather than serial
>> (shouldn’t be any different but quorum is more widely used so maybe there
>> is a bug that’s specific to serial)
>>
>> Cheers
>> Ben
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito  wrote:
>>
>>> Hi Ben,
>>>
>>> The test without killing nodes has been working well without data loss.
>>> I've repeated my test about 200 times after removing data and
>>> rebuild/repair.
>>>
>>> Regards,
>>>
>>>
>>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito  wrote:
>>>
>>> > Just to confirm, are you saying:
>>> > a) after operation 2, you select all and get 1000 rows
>>> > b) after operation 3 (which only does updates and read) you select and
>>> only get 953 rows?
>>>
>>> That's right!
>>>
>>> I've started the test without killing nodes.
>>> I'll report the result to you next Monday.
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater 
>>> wrote:
>>>
>>> Just to confirm, are you saying:
>>> a) after operation 2, you select all and get 1000 rows
>>> b) after operation 3 (which only does updates and read) you select and
>>> only get 953 rows?
>>>
>>> If so, that would be very unexpected. If you run your tests without
>>> killing nodes do you get the expected (1,000) rows?
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito  wrote:
>>>
>>> > Are you certain your tests don’t generate any overlapping inserts (by
>>> PK)?
>>>
>>> Yes. The operation 2) also checks the number of rows just after all
>>> insertions.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater 
>>> wrote:
>>>
>>> OK. Are you certain your tests don’t generate any overlapping inserts
>>> (by PK)? Cassandra basically treats any inserts with the same primary key
>>> as updates (so 1000 insert operations may not necessarily result in 1000
>>> rows in the DB).
>>>
>>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito  wrote:
>>>
>>> thanks Ben,
>>>
>>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at that end of operation (2) or
>>> after operation (3)?
>>>
>>> after operation 3), at operation 4) which reads all rows by cqlsh with
>>> CL.SERIAL
>>>
>>> > 2) What replication factor and replication strategy is used by the
>>> test keyspace? What consistency level is used by your operations?
>>>
>>> - create keyspace testkeyspace WITH REPLICATION =
>>> {'class':'SimpleStrategy','replication_factor':3};
>>> - consistency level is SERIAL
>>>
>>>
>>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater >> > wrote:
>>>
>>>
>>> A couple of questions:
>>> 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at that end of operation (2) or
>>> after operation (3)?
>>> 2) What replication factor and replication strategy is used by the test
>>> keyspace? What consistency level is used by your operations?
>>>
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito  wrote:
>>>
>>> Thanks Ben,
>>>
>>> I tried to run a rebuild and repair after the failure node rejoined the
>>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>>> The failure node could rejoined and I could read all rows successfully.
>>> (Sometimes a repair failed because the node cannot access other node. If
>>> it failed, I retried a repair)
>>>
>>> But some rows were lost after my destructive test repeated (after about
>>> 5-6 hours).
>>> After the test inserted 1000 rows, there were only 953 rows at the end
>>> of the test.
>>>
>>> My destructive test:
>>> - each C* node is killed & 

Re: How to confirm TWCS is fully in-place

2016-11-08 Thread Oskar Kjellin
Hi,

You could manually trigger it with nodetool compact. 

/Oskar 

> On 8 nov. 2016, at 21:47, Lahiru Gamathige  wrote:
> 
> Hi Users,
> 
> I am thinking of migrating our timeseries tables to use TWCS. I am using JMX 
> to set the new compaction and one node at a time and I am not sure how to 
> confirm that after the flush all the compaction is done in each node. I tried 
> this in a small cluster but after setting the compaction I didn't see any 
> compaction triggering  and ran nodetool flush and still didn't see a 
> compaction triggering.
> 
> Now I am about to do the same thing in our staging cluster, so curious how do 
> I confirm compaction ran in each node before I change the table schema 
> because I am worried it will start the compaction in all the nodes at the 
> same time.
> 
> Lahiru


How to confirm TWCS is fully in-place

2016-11-08 Thread Lahiru Gamathige
Hi Users,

I am thinking of migrating our timeseries tables to use TWCS. I am using
JMX to set the new compaction strategy one node at a time, and I am not sure how
to confirm that, after the flush, all the compaction is done on each node. I
tried this in a small cluster, but after setting the compaction I didn't see
any compaction triggering, and I ran nodetool flush and still didn't see a
compaction triggering.

Now I am about to do the same thing in our staging cluster, so I am curious how
I can confirm that compaction ran on each node before I change the table schema,
because I am worried it will start the compaction in all the nodes at the
same time.

Lahiru
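
For reference, one way to watch per-node progress before making the change cluster-wide
is nodetool compactionstats / nodetool compactionhistory on the node whose strategy was
changed via JMX. The eventual schema change itself is a plain ALTER TABLE; a sketch,
assuming a Cassandra version that ships TWCS (older versions need the fully-qualified
class name of the externally built strategy) and hypothetical keyspace/table names and
window settings:

ALTER TABLE mykeyspace.timeseries
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    };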


Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Alain Rastoul

On 11/08/2016 08:52 PM, Alain Rastoul wrote:

For example, if you had to track the position of a lot of objects, instead of
updating the object records, each second you could insert a new event with:
(object: object_id, event_type: position_move, position: x, y).



and add a timestamp of course,
and possibly TTL the data, with a descending clustering sort order
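
A minimal sketch of such an event table (the names and the one-week TTL are assumptions),
with the newest event first and old events expiring on their own:

CREATE TABLE object_events (
    object_id  text,
    event_time timestamp,
    event_type text,
    x double,
    y double,
    PRIMARY KEY (object_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND default_time_to_live = 604800;  -- keep roughly one week of events

INSERT INTO object_events (object_id, event_time, event_type, x, y)
VALUES ('object-42', toTimestamp(now()), 'position_move', 12.5, 7.3);

-- latest known position = first row of the partition
SELECT x, y, event_time FROM object_events WHERE object_id = 'object-42' LIMIT 1;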


--
best,
Alain


Re: Slow performance after upgrading from 2.0.9 to 2.1.11

2016-11-08 Thread Dikang Gu
Michael, thanks for the info. It sounds to me like a very serious performance
regression. :(

On Tue, Nov 8, 2016 at 11:39 AM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

> Yes, we hit this as well. We have an internal patch that I wrote to mostly
> revert the behavior back to ByteBuffers with as small an amount of code change
> as possible. Performance of our build is now even with 2.0.x and we've also
> forward ported it to 3.x (although the 3.x patch was even more complicated
> due to Bounds, RangeTombstoneBound, ClusteringPrefix which actually
> increases the number of allocations to somewhere between 11 and 13
> depending on how I count it per indexed block -- making it even worse than
> what you're observing in 2.1).
>
> We haven't upstreamed it as 2.1 is obviously not taking any changes at
> this point and the longer term solution is https://issues.apache.org/
> jira/browse/CASSANDRA-9754 (which also includes the changes to go back to
> ByteBuffers and remove as much of the Composites from the storage engine as
> possible.) Also, the solution is a bit of a hack -- although it was a
> blocker from us deploying 2.1 -- so i'm not sure how "hacky" it is if it
> works..
>
> best,
> kjellman
>
>
> On Nov 8, 2016, at 11:31 AM, Dikang Gu wrote:
>
> This is very expensive:
>
> "MessagingService-Incoming-/2401:db00:21:1029:face:0:9:0" prio=10
> tid=0x7f2fd57e1800 nid=0x1cc510 runnable [0x7f2b971b]
>java.lang.Thread.State: RUNNABLE
> at org.apache.cassandra.db.marshal.IntegerType.compare(
> IntegerType.java:29)
> at org.apache.cassandra.db.composites.AbstractSimpleCellNameType.
> compare(AbstractSimpleCellNameType.java:98)
> at org.apache.cassandra.db.composites.AbstractSimpleCellNameType.
> compare(AbstractSimpleCellNameType.java:31)
> at java.util.TreeMap.put(TreeMap.java:545)
> at java.util.TreeSet.add(TreeSet.java:255)
> at org.apache.cassandra.db.filter.NamesQueryFilter$
> Serializer.deserialize(NamesQueryFilter.java:254)
> at org.apache.cassandra.db.filter.NamesQueryFilter$
> Serializer.deserialize(NamesQueryFilter.java:228)
> at org.apache.cassandra.db.SliceByNamesReadCommandSeriali
> zer.deserialize(SliceByNamesReadCommand.java:104)
> at org.apache.cassandra.db.ReadCommandSerializer.
> deserialize(ReadCommand.java:156)
> at org.apache.cassandra.db.ReadCommandSerializer.
> deserialize(ReadCommand.java:132)
> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(
> IncomingTcpConnection.java:195)
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(
> IncomingTcpConnection.java:172)
> at org.apache.cassandra.net.IncomingTcpConnection.run(
> IncomingTcpConnection.java:88)
>
>
> Checked the git history, it comes from this jira:
> https://issues.apache.org/jira/browse/CASSANDRA-5417
>
> Any thoughts?
> ​
>
> On Fri, Oct 28, 2016 at 10:32 AM, Paulo Motta wrote:
> Haven't seen this before, but perhaps it's related to CASSANDRA-10433?
> This is just a wild guess as it's in a related codepath, but maybe worth
> trying out the patch available to see if it helps anything...
>
> 2016-10-28 15:03 GMT-02:00 Dikang Gu:
> We are seeing huge cpu regression when upgrading one of our 2.0.16 cluster
> to 2.1.14 as well. The 2.1.14 node is not able to handle the same amount of
> read traffic as the 2.0.16 node, actually, it's less than 50%.
>
> And in the perf results, the first line could go as high as 50%, as we
> turn up the read traffic, which never appeared in 2.0.16.
>
> Any thoughts?
> Thanks
>
>
> Samples: 952K of event 'cycles', Event count (approx.): 229681774560
> Overhead  Shared Object      Symbol
>    6.52%  perf-196410.map    [.] Lorg/apache/cassandra/db/marshal/IntegerType;.compare in Lorg/apache/cassandra/db/composites/AbstractSimpleCellNameType;.compare
>    4.84%  libzip.so          [.] adler32
>    2.88%  perf-196410.map    [.] Ljava/nio/HeapByteBuffer;.get in Lorg/apache/cassandra/db/marshal/IntegerType;.compare
>    2.39%  perf-196410.map    [.] Ljava/nio/Buffer;.checkIndex in Lorg/apache/cassandra/db/marshal/IntegerType;.findMostSignificantByte
>    2.03%  perf-196410.map    [.] Ljava/math/BigInteger;.compareTo in Lorg/apache/cassandra/db/DecoratedKey;.compareTo
>    1.65%  perf-196410.map    [.] vtable chunks
>    1.44%  perf-196410.map    [.] Lorg/apache/cassandra/db/DecoratedKey;.compareTo in Ljava/util/concurrent/ConcurrentSkipListMap;.findNode
>    1.02%  perf-196410.map    [.]
> 

Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Alain Rastoul

On 11/08/2016 11:05 AM, DuyHai Doan wrote:

Are you sure Cassandra is a good fit for this kind of heavy update &
delete scenario ?


+1
this sounds like a relational-thinking scenario... (no offense, I like 
relational systems),
as if you want to maintain the state of a lot of entities with updates & 
deletes, and you have a lot of state changes for your entities.


Maybe an event-store/DDD approach would be a better model for that?

You could have an aggregate for each entity (i.e. a record) you have in 
your system and insert a new event record on each update of this aggregate.


For example, if you had to track the position of a lot of objects, instead of 
updating the object records, each second you could insert a new event with: 
(object: object_id, event_type: position_move, position: x, y).


Just a suggestion.

--
best,
Alain


Re: Are Cassandra writes are faster than reads?

2016-11-08 Thread Ben Bromhead
Awesome! For a full explanation of what you are seeing (we call it micro
batching) check out Adam Zegelin's talk on it:
https://www.youtube.com/watch?v=wF3Ec1rdWgc

On Tue, 8 Nov 2016 at 02:21 Rajesh Radhakrishnan <
rajesh.radhakrish...@phe.gov.uk> wrote:

>
> Hi,
>
> Just found that reducing the batch size below 20 also increases the
> writing speed and reduction in memory usage(especially for Python driver).
>
> Kind regards,
> Rajesh R
>
> --
> *From:* Ben Bromhead [b...@instaclustr.com]
> *Sent:* 07 November 2016 05:44
> *To:* user@cassandra.apache.org
> *Subject:* Re: Are Cassandra writes are faster than reads?
>
> They can be and it depends on your compaction strategy :)
>
> On Sun, 6 Nov 2016 at 21:24 Ali Akhtar  >
> wrote:
>
> tl;dr? I just want to know if updates are bad for performance, and if so,
> for how long.
>
> On Mon, Nov 7, 2016 at 10:23 AM, Ben Bromhead  
> > wrote:
>
> Check out https://wiki.apache.org/cassandra/WritePathForUsers
> 
>  for
> the full gory details.
>
> On Sun, 6 Nov 2016 at 21:09 Ali Akhtar  >
> wrote:
>
> How long does it take for updates to get merged / compacted into the main
> data file?
>
> On Mon, Nov 7, 2016 at 5:31 AM, Ben Bromhead  
> > wrote:
>
> To add some flavor as to how the commitlog implementation is so quick.
>
> It only flushes to disk every 10s by default. So writes are effectively
> done to memory and then to disk asynchronously later on. This is generally
> accepted to be OK, as the write is also going to other nodes.
>
> You can of course change this behavior to flush on each write or to skip
> the commitlog altogether (danger!). This however will change how "safe"
> things are from a durability perspective.
>
> On Sun, Nov 6, 2016, 12:51 Jeff Jirsa  >
> wrote:
>
> Cassandra writes are particularly fast, for a few reasons:
>
>
>
> 1)   Most writes go to a commitlog (append-only file, written
> linearly, so particularly fast in terms of disk operations) and then pushed
> to the memTable. Memtable is flushed in batches to the permanent data
> files, so it buffers many mutations and then does a sequential write to
> persist that data to disk.
>
> 2)   Reads may have to merge data from many data tables on disk.
> Because the writes (described very briefly in step 1) write to immutable
> files, updates/deletes have to be merged on read – this is extra effort for
> the read path.
>
>
>
> If you don’t do much in terms of overwrites/deletes, and your partitions
> are particularly small, and your data fits in RAM (probably mmap/page cache
> of data files, unless you’re using the row cache), reads may be very fast
> for you. Certainly individual reads on low-merge workloads can be < 0.1ms.
>
>
>
> -  Jeff
>
>
>
> *From: *Vikas Jaiman
> *Reply-To: *"user@cassandra.apache.org"
> *Date: *Sunday, November 6, 2016 at 12:42 PM
> *To: *"user@cassandra.apache.org"
> *Subject: *Are Cassandra writes are faster than reads?
>
>
>
> Hi all,
>
>
>
> Are Cassandra writes are faster than reads ?? If yes, why is this so? I am
> using consistency 1 and data is in memory.
>
>
>
> Vikas
>
> --
> Ben Bromhead
> CTO | Instaclustr
> 
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>
>
> --
> Ben Bromhead
> CTO | Instaclustr
> 

Re: A difficult data model with C*

2016-11-08 Thread Vladimir Yudovin
Hi Ben,



if you need only a very limited number of positions (ten, as you said), maybe you can 
store them in a LIST of a UDT? Or just as a JSON string?

So you'll have one row per user-video pair.



It can be something like this:



CREATE TYPE play (position int, last_time timestamp);

CREATE TABLE recent (user_name text, video_id text, review 
LIST<FROZEN<play>>, PRIMARY KEY (user_name, video_id));



UPDATE recent SET review = review + [{position: 1234, last_time: 12345}] WHERE user_name='some user' 
AND video_id='great video';

UPDATE recent SET review = review + [{position: 1234, last_time: 123456}] WHERE user_name='some user' 
AND video_id='great video';

UPDATE recent SET review = review + [{position: 1234, last_time: 1234567}] WHERE user_name='some 
user' AND video_id='great video';



You can delete the oldest entry by index:

DELETE review[0] FROM recent WHERE user_name='some user' AND video_id='great 
video';



or by value, if you know the oldest entry:



UPDATE recent SET review = review - [{position: 1234, last_time: 12345}] WHERE user_name='some user' 
AND video_id='great video';



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Mon, 07 Nov 2016 21:54:08 -0500ben ben diamond@outlook.com 
wrote 






Hi guys,

 

 We are maintaining a system for an on-line video service. ALL users' viewing 
records of every movie are stored in C*. So she/he can continue to enjoy the 
movie from the last point next time. The table is designed as below:

 CREATE TABLE recent (

 user_name text,

 vedio_id text,

 position int,

 last_time timestamp,

 PRIMARY KEY (user_name, vedio_id)

 )

 

 It worked well before. However, the records increase every day and the last 
ten items would be adequate for the business. The current model uses vedio_id as 
the clustering key to keep one row per movie, but as you know, the business prefers to 
order by last_time desc. If we use last_time as the clustering key, there will be 
many records for a single movie while only the most recent one is actually desired. So how 
should we model that? Do you have any suggestions? 

 Thanks!

 

 

 BRs,

 BEN












Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Vladimir Yudovin
Yes, as the doc says, "Expired data is marked with a tombstone", but you save 
the communication with the host and the processing of the DELETE operation.





Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Tue, 08 Nov 2016 09:32:16 -0500Ali Akhtar ali.rac...@gmail.com 
wrote 




Does TTL also cause tombstones?



On Tue, Nov 8, 2016 at 6:57 PM, Vladimir Yudovin vla...@winguzone.com 
wrote:








The deletes will be done at a scheduled time, probably at the end of the 
day, each day.





Probably you can use TTL? 
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Tue, 08 Nov 2016 05:04:12 -0500Ali Akhtar ali.rac...@gmail.com 
wrote 




I have a use case where a lot of updates and deletes to a table will be 
necessary.



The deletes will be done at a scheduled time, probably at the end of the day, 
each day.



Updates will be done throughout the day, as new data comes in.



Are there any guidelines on improving cassandra's performance for this use 
case? Any caveats to be aware of? Any tips, like running nodetool repair every 
X days?




Thanks.
















Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Hannu Kröger
Also when they are being read before compaction:
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html 


Hannu

> On 8 Nov 2016, at 16.36, DuyHai Doan  wrote:
> 
> "Does TTL also cause tombstones?" --> Yes, after the TTL expires, at the next 
> compaction the TTLed column is replaced by a tombstone, as per my 
> understanding
> 
> On Tue, Nov 8, 2016 at 3:32 PM, Ali Akhtar  > wrote:
> Does TTL also cause tombstones?
> 
> On Tue, Nov 8, 2016 at 6:57 PM, Vladimir Yudovin  > wrote:
> >The deletes will be done at a scheduled time, probably at the end of the 
> >day, each day.
> 
> Probably you can use TTL? 
> http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html 
> 
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra
> Launch your cluster in minutes.
> 
> 
>  On Tue, 08 Nov 2016 05:04:12 -0500Ali Akhtar  > wrote 
> 
> I have a use case where a lot of updates and deletes to a table will be 
> necessary.
> 
> The deletes will be done at a scheduled time, probably at the end of the day, 
> each day.
> 
> Updates will be done throughout the day, as new data comes in.
> 
> Are there any guidelines on improving cassandra's performance for this use 
> case? Any caveats to be aware of? Any tips, like running nodetool repair 
> every X days?
> 
> Thanks.
> 
> 
> 





Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread DuyHai Doan
"Does TTL also cause tombstones?" --> Yes, after the TTL expires, at the
next compaction the TTLed column is replaced by a tombstone, as per my
understanding

On Tue, Nov 8, 2016 at 3:32 PM, Ali Akhtar  wrote:

> Does TTL also cause tombstones?
>
> On Tue, Nov 8, 2016 at 6:57 PM, Vladimir Yudovin 
> wrote:
>
>> >The deletes will be done at a scheduled time, probably at the end of the
>> day, each day.
>>
>> Probably you can use TTL? http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
>>
>> Best regards, Vladimir Yudovin,
>>
>> *Winguzone - Hosted Cloud Cassandra*
>> *Launch your cluster in minutes.*
>>
>>
>>  On Tue, 08 Nov 2016 05:04:12 -0500*Ali Akhtar > >* wrote 
>>
>> I have a use case where a lot of updates and deletes to a table will be
>> necessary.
>>
>> The deletes will be done at a scheduled time, probably at the end of the
>> day, each day.
>>
>> Updates will be done throughout the day, as new data comes in.
>>
>> Are there any guidelines on improving cassandra's performance for this
>> use case? Any caveats to be aware of? Any tips, like running nodetool
>> repair every X days?
>>
>> Thanks.
>>
>>
>>
>


Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Ali Akhtar
Does TTL also cause tombstones?

On Tue, Nov 8, 2016 at 6:57 PM, Vladimir Yudovin 
wrote:

> >The deletes will be done at a scheduled time, probably at the end of the
> day, each day.
>
> Probably you can use TTL? http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
>
> Best regards, Vladimir Yudovin,
>
> *Winguzone - Hosted Cloud Cassandra*
> *Launch your cluster in minutes.*
>
>
>  On Tue, 08 Nov 2016 05:04:12 -0500*Ali Akhtar  >* wrote 
>
> I have a use case where a lot of updates and deletes to a table will be
> necessary.
>
> The deletes will be done at a scheduled time, probably at the end of the
> day, each day.
>
> Updates will be done throughout the day, as new data comes in.
>
> Are there any guidelines on improving cassandra's performance for this use
> case? Any caveats to be aware of? Any tips, like running nodetool repair
> every X days?
>
> Thanks.
>
>
>


Re: Designing a table in cassandra

2016-11-08 Thread Vladimir Yudovin
Hi Sathish,



probably I didn't catch your requirements exactly, but why not create a single 
table for all devices and represent each device as a row, storing both the user and 
the network configuration per device? You can use a MAP for a flexible storage model.



If you have thousands of devices, creating a separate table for each device can be 
quite a heavy solution.
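
A minimal sketch of that single-table approach (the column names are assumptions), with
one row per device and separate maps for the user-set and the network-reported values:

CREATE TABLE devices (
    device_name    text PRIMARY KEY,
    user_config    map<text, text>,
    network_config map<text, text>
);

-- the user configures a field
UPDATE devices SET user_config['vlan'] = '100' WHERE device_name = 'dev-01';

-- the network later reports its value for the same field
UPDATE devices SET network_config['vlan'] = '200' WHERE device_name = 'dev-01';

-- compare the two maps client-side to find discrepancies
SELECT user_config, network_config FROM devices WHERE device_name = 'dev-01';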



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Sun, 06 Nov 2016 19:23:20 -0500sat sathish.al...@gmail.com 
wrote 




Hi,



We are new to Cassandra. For our POC, we tried creating tables and inserting 
data as JSON, and all of this went fine. Now we are trying to implement one of the 
application scenarios, and I am having difficulty coming up with the best 
approach.



Scenario:

We have a Device POJO which has some attributes/fields that are read/written by 
users as well as by the network, and some attributes/fields that only the network can modify. 
When users need to configure a device they will create an instance of the Device POJO and 
set/configure the applicable fields; however, the network can later update those attributes. 
We want to know the discrepancy between the values configured by users and the 
values updated by the network. Hence we have thought of 3 different approaches:



1) Create multiple tables for the same Device like Device_Users and 
Device_Network so that we can see the difference.



2) Create a different keyspace, since multiple objects like Device can have the same 
requirement



3) Create one "Device" table and insert one row for user configuration and 
another row for network update. We will create this table with multiple primary 
key (device_name, updated_by)



Please let us know which is the best option (with their pros and cons if 
possible) among these 3, and also let us know if there are other options.



Thanks and Regards

A.SathishKumar 









Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Vladimir Yudovin
The deletes will be done at a scheduled time, probably at the end of the 
day, each day.



Probably you can use TTL? 
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
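
For illustration (the keyspace/table names and the one-week value are assumptions),
TTL can be set per write or as a table default:

-- per-write TTL: the row expires 7 days after the insert
INSERT INTO mykeyspace.events (id, payload)
VALUES (uuid(), 'some data') USING TTL 604800;

-- or a table-wide default
ALTER TABLE mykeyspace.events WITH default_time_to_live = 604800;

As noted elsewhere in this thread, expired cells still become tombstones at compaction
time, but you save the explicit DELETE round trips.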



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Tue, 08 Nov 2016 05:04:12 -0500Ali Akhtar ali.rac...@gmail.com 
wrote 




I have a use case where a lot of updates and deletes to a table will be 
necessary.



The deletes will be done at a scheduled time, probably at the end of the day, 
each day.



Updates will be done throughout the day, as new data comes in.



Are there any guidelines on improving cassandra's performance for this use 
case? Any caveats to be aware of? Any tips, like running nodetool repair every 
X days?




Thanks.









Re: store individual inventory items in a table, how to assign them correctly

2016-11-08 Thread Vladimir Yudovin
Hi,



can you elaborate a little on your data model?

Would you like to create 100 rows for each product and then remove one row and 
assign that row to a customer?



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Mon, 07 Nov 2016 14:51:56 -0500S Ahmed sahmed1...@gmail.com 
wrote 




Say I have 100 products in inventory, instead of having a counter I want to 
create 100 rows per inventory item.



When someone purchases a product, how can I correctly assign that customer a 
product from inventory without having any race conditions etc?



Thanks.









Re: operation and maintenance tools

2016-11-08 Thread Vladimir Yudovin
For memory usage you can use a small command line tool: 
https://github.com/patric-r/jvmtop

Also, there are a number of GUI tools that connect to the JMX port, like jvisualvm.



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Mon, 07 Nov 2016 22:25:47 -0500wxn...@zjqunshuo.com wrote 




Hi All,



I need to do maintenance work for a C* cluster with about 10 nodes. Please 
recommend the C* operation and maintenance tools you are using.

I also noticed my C* daemon using a large amount of memory while doing nothing. Is there 
any convenient tool to deeply analyze the C* node's memory?



Cheers,

Simon








Re: store individual inventory items in a table, how to assign them correctly

2016-11-08 Thread Carlos Alonso
Bear in mind that LWT can, under certain circumstances, fail too. See
Chris Batey's amazing talk about it from the Cassandra Summit:
https://www.youtube.com/watch?v=wcxQM3ZN20c

Carlos Alonso | Software Engineer | @calonso 

On 7 November 2016 at 22:22, Justin Cameron  wrote:

> You can use lightweight transactions to achieve this.
>
> Example:
> UPDATE item SET customer = 'Joe' WHERE item_id = 2 IF customer = null;
>
> Keep in mind that lightweight transactions have performance tradeoffs (
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
> )
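
A small sketch of the claim flow (the schema is hypothetical); the result set of a
conditional update contains an [applied] column, so only one of two concurrent buyers
will see [applied] = True:

CREATE TABLE item (
    item_id  int PRIMARY KEY,
    product  text,
    customer text
);

INSERT INTO item (item_id, product) VALUES (2, 'widget');

UPDATE item SET customer = 'Joe' WHERE item_id = 2 IF customer = null;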
>
>
> On Mon, 7 Nov 2016 at 11:52 S Ahmed  wrote:
>
>> Say I have 100 products in inventory, instead of having a counter I want
>> to create 100 rows per inventory item.
>>
>> When someone purchases a product, how can I correctly assign that
>> customer a product from inventory without having any race conditions etc?
>>
>> Thanks.
>>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
> This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
> Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>


RE: Are Cassandra writes are faster than reads?

2016-11-08 Thread Rajesh Radhakrishnan

Hi,

Just found that reducing the batch size below 20 also increases the writing 
speed and reduces memory usage (especially for the Python driver).

Kind regards,
Rajesh R


From: Ben Bromhead [b...@instaclustr.com]
Sent: 07 November 2016 05:44
To: user@cassandra.apache.org
Subject: Re: Are Cassandra writes are faster than reads?

They can be and it depends on your compaction strategy :)

On Sun, 6 Nov 2016 at 21:24 Ali Akhtar 
>
 wrote:
tl;dr? I just want to know if updates are bad for performance, and if so, for 
how long.

On Mon, Nov 7, 2016 at 10:23 AM, Ben Bromhead 
>
 wrote:
Check out 
https://wiki.apache.org/cassandra/WritePathForUsers
 for the full gory details.

On Sun, 6 Nov 2016 at 21:09 Ali Akhtar 
>
 wrote:
How long does it take for updates to get merged / compacted into the main data 
file?

On Mon, Nov 7, 2016 at 5:31 AM, Ben Bromhead 
>
 wrote:
To add some flavor as to how the commitlog implementation is so quick.

It only flushes to disk every 10s by default. So writes are effectively done to 
memory and then to disk asynchronously later on. This is generally accepted to 
be OK, as the write is also going to other nodes.

You can of course change this behavior to flush on each write or to skip the 
commitlog altogether (danger!). This however will change how "safe" things are 
from a durability perspective.

On Sun, Nov 6, 2016, 12:51 Jeff Jirsa 
>
 wrote:

Cassandra writes are particularly fast, for a few reasons:



1)   Most writes go to a commitlog (append-only file, written linearly, so 
particularly fast in terms of disk operations) and then pushed to the memTable. 
Memtable is flushed in batches to the permanent data files, so it buffers many 
mutations and then does a sequential write to persist that data to disk.

2)   Reads may have to merge data from many data tables on disk. Because 
the writes (described very briefly in step 1) write to immutable files, 
updates/deletes have to be merged on read – this is extra effort for the read 
path.



If you don’t do much in terms of overwrites/deletes, and your partitions are 
particularly small, and your data fits in RAM (probably mmap/page cache of data 
files, unless you’re using the row cache), reads may be very fast for you. 
Certainly individual reads on low-merge workloads can be < 0.1ms.



-  Jeff



From: Vikas Jaiman
Reply-To: "user@cassandra.apache.org"
Date: Sunday, November 6, 2016 at 12:42 PM
To: "user@cassandra.apache.org"
Subject: Are Cassandra writes are faster than reads?



Hi all,



Are Cassandra writes are faster than reads ?? If yes, why is this so? I am 
using consistency 1 and data is in memory.



Vikas

--
Ben Bromhead
CTO | 
Instaclustr
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

--
Ben Bromhead
CTO | 
Instaclustr
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

--
Ben Bromhead
CTO | 
Instaclustr
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


RE: Cassandra Python Driver : execute_async consumes lots of memory?

2016-11-08 Thread Rajesh Radhakrishnan
Hi Lahiru,

Great! You know what, reducing the batch size from 50 to 20 solved my issue.

Thank you very much. Good job, man! The memory issue is solved.

Next I will try using Spark to speed it up.


Kind regards,
Rajesh Radhakrishnan


From: Lahiru Gamathige [lah...@highfive.com]
Sent: 07 November 2016 17:10
To: user@cassandra.apache.org
Subject: Re: Cassandra Python Driver : execute_async consumes lots of memory?

Hi Rajesh,

By looking at your code I see that the memory would definitely grow because you 
write big batches asynchronously; you will end up with a large number of batch statements 
and they all end up slowing things down. We recently migrated some data to C* and what 
we did was create a data stream, write in batches, and use a library 
which is sensitive to the back-pressure of the stream. In your implementation 
there is no back-pressure to control it. We migrated data pretty fast by 
keeping the CPU at 100% constantly and achieved the highest performance (we used Scala 
with akka-streams and phantom-websudo).

I would consider using some streaming API to implement this. When you do 
batching, make sure you don't exceed the max batch size, or things will slow 
down anyway.

Lahiru

On Mon, Nov 7, 2016 at 8:51 AM, Rajesh Radhakrishnan 
>
 wrote:
Hi

We are trying to inject millions to data into a table by executing Batches of 
PreparedStatments.

We found that when we use 'session.execute(batch)', it writes more of the data but is very, 
very slow.
However, if we use 'session.execute_async(batch)' then it is relatively fast, but 
when it reaches a certain limit it fills up the memory (of the Python process).

Our implementation:
Cassandra 3.7.0 cluster  ring with 3 nodes (RedHat, 150GB Disk, 8GB of RAM each)

Python 2.7.12

Anybody know how to reduce the memory use of Cassandra-python driver API 
specifically for execute_async? Thank you!



===CODE ==
  sqlQuery = "INSERT INTO tableV  (id, sample_name, pos, ref_base, 
var_base) values (?,?,?,?,?)"
   random_numbers_for_strains = random.sample(xrange(1,300), 200)
random_numbers = random.sample(xrange(1,200), 20)

totalCounter  = 0
c = 0
time_init = time.time()
for random_number_strain in random_numbers_for_strains:

sample_name = None
sample_name = 'sample'+str(random_number_strain)

cassandraCluster = CassandraCluster.CassandraCluster()
cluster = cassandraCluster.create_cluster_with_protocol2()
session = cluster.connect();
#session.default_timeout = 1800
session.set_keyspace(self.KEYSPACE_NAME)

preparedStatement = session.prepare(sqlQuery)

counter = 0
c = c + 1

for random_number in random_numbers:

totalCounter += 1
if counter == 0 :
batch = BatchStatement()

counter += 1
if totalCounter % 1 == 0 :
print "Total Count "+ str(totalCounter)

batch.add(preparedStatement.bind([ uuid.uuid1(), sample_name, 
random_number, random.choice('GT'), random.choice('AC')]))
if counter % 50 == 0:
session.execute_async(batch)
#session.execute(batch)
batch = None
del batch
counter = 0

time.sleep(2);
session.cluster.shutdown()
random_number= None
del random_number
preparedStatement = None
session = None
del session
cluster = None
del cluster
cassandraCluster = None
del cassandraCluster
gc.collect()

===CODE ==



Kind regards,
Rajesh Radhakrishnan



Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Ali Akhtar
Yes, because there will also be a lot of inserts, and the linear
scalability that c* offers is required.

But the inserts aren't static, and the data that comes in will need to be
updated in response to user events.

Data which hasn't been touched for over a week has to be deleted.
(Sensitive data, so it's better to delete it when it's out of date rather than store
it).

Couldn't really do the weekly tables without massively complicating my
report generation, as the entire dataset needs to be queried for generating
certain reports.

So my question is really about how to get the best out of c* in this sort
of scenario.

On Tue, Nov 8, 2016 at 3:05 PM, DuyHai Doan  wrote:

> Are you sure Cassandra is a good fit for this kind of heavy update &
> delete scenario ?
>
> Otherwise, you can always use several tables (one table/day, rotating
> through 7 days for a week) and do a truncate of the table at the end of the
> day.
>
> On Tue, Nov 8, 2016 at 11:04 AM, Ali Akhtar  wrote:
>
>> I have a use case where a lot of updates and deletes to a table will be
>> necessary.
>>
>> The deletes will be done at a scheduled time, probably at the end of the
>> day, each day.
>>
>> Updates will be done throughout the day, as new data comes in.
>>
>> Are there any guidelines on improving cassandra's performance for this
>> use case? Any caveats to be aware of? Any tips, like running nodetool
>> repair every X days?
>>
>> Thanks.
>>
>
>


Re: Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread DuyHai Doan
Are you sure Cassandra is a good fit for this kind of heavy update & delete
scenario ?

Otherwise, you can always use several tables (one table/day, rotating
through 7 days for a week) and do a truncate of the table at the end of the
day.

On Tue, Nov 8, 2016 at 11:04 AM, Ali Akhtar  wrote:

> I have a use case where a lot of updates and deletes to a table will be
> necessary.
>
> The deletes will be done at a scheduled time, probably at the end of the
> day, each day.
>
> Updates will be done throughout the day, as new data comes in.
>
> Are there any guidelines on improving cassandra's performance for this use
> case? Any caveats to be aware of? Any tips, like running nodetool repair
> every X days?
>
> Thanks.
>


Improving performance where a lot of updates and deletes are required?

2016-11-08 Thread Ali Akhtar
I have a use case where a lot of updates and deletes to a table will be
necessary.

The deletes will be done at a scheduled time, probably at the end of the
day, each day.

Updates will be done throughout the day, as new data comes in.

Are there any guidelines on improving cassandra's performance for this use
case? Any caveats to be aware of? Any tips, like running nodetool repair
every X days?

Thanks.


RE: Are Cassandra writes are faster than reads?

2016-11-08 Thread Rajesh Radhakrishnan
Hi,

In my case writing is slower using the Python driver, with batch execution and 
prepared statements.
I am looking at different ways to speed it up, as I am trying to write 100 * 
200 million records.

Cheers
Rajesh R

From: Vikas Jaiman [er.vikasjai...@gmail.com]
Sent: 07 November 2016 10:43
To: user@cassandra.apache.org
Subject: Re: Are Cassandra writes are faster than reads?

Thanks Jeff and Ben for the info.

On Mon, Nov 7, 2016 at 6:44 AM, Ben Bromhead 
>
 wrote:
They can be and it depends on your compaction strategy :)

On Sun, 6 Nov 2016 at 21:24 Ali Akhtar 
>
 wrote:
tl;dr? I just want to know if updates are bad for performance, and if so, for 
how long.

On Mon, Nov 7, 2016 at 10:23 AM, Ben Bromhead 
>
 wrote:
Check out 
https://wiki.apache.org/cassandra/WritePathForUsers
 for the full gory details.

On Sun, 6 Nov 2016 at 21:09 Ali Akhtar 
>
 wrote:
How long does it take for updates to get merged / compacted into the main data 
file?

On Mon, Nov 7, 2016 at 5:31 AM, Ben Bromhead 
>
 wrote:
To add some flavor as to how the commitlog implementation is so quick.

It only flushes to disk every 10s by default. So writes are effectively done to 
memory and then to disk asynchronously later on. This is generally accepted to 
be OK, as the write is also going to other nodes.

You can of course change this behavior to flush on each write or to skip the 
commitlog altogether (danger!). This however will change how "safe" things are 
from a durability perspective.

On Sun, Nov 6, 2016, 12:51 Jeff Jirsa 
>
 wrote:

Cassandra writes are particularly fast, for a few reasons:



1)   Most writes go to a commitlog (append-only file, written linearly, so 
particularly fast in terms of disk operations) and then pushed to the memTable. 
Memtable is flushed in batches to the permanent data files, so it buffers many 
mutations and then does a sequential write to persist that data to disk.

2)   Reads may have to merge data from many data tables on disk. Because 
the writes (described very briefly in step 1) write to immutable files, 
updates/deletes have to be merged on read – this is extra effort for the read 
path.



If you don’t do much in terms of overwrites/deletes, and your partitions are 
particularly small, and your data fits in RAM (probably mmap/page cache of data 
files, unless you’re using the row cache), reads may be very fast for you. 
Certainly individual reads on low-merge workloads can be < 0.1ms.



-  Jeff



From: Vikas Jaiman
Reply-To: "user@cassandra.apache.org"
Date: Sunday, November 6, 2016 at 12:42 PM
To: "user@cassandra.apache.org"
Subject: Are Cassandra writes are faster than reads?



Hi all,



Are Cassandra writes are faster than reads ?? If yes, why is this so? I am 
using consistency 1 and data is in memory.



Vikas

--
Ben Bromhead
CTO | 
Instaclustr
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

--
Ben Bromhead
CTO | 
Instaclustr
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

--
Ben Bromhead
CTO | 

RE: Cassandra Python Driver : execute_async consumes lots of memory?

2016-11-08 Thread Rajesh Radhakrishnan
Hi Lahiru,

Thank you for the reply. I will try reducing the batch size to 20 and see how 
much memory usage I can reduce.

I might try Spark streaming too. Cheers!


Kind regards,
Rajesh R


From: Lahiru Gamathige [lah...@highfive.com]
Sent: 07 November 2016 17:10
To: user@cassandra.apache.org
Subject: Re: Cassandra Python Driver : execute_async consumes lots of memory?

Hi Rajesh,

By looking at your code I see that the memory would definitely grow because you 
write big batches asynchronously; you will end up with a large number of batch statements 
and they all end up slowing things down. We recently migrated some data to C* and what 
we did was create a data stream, write in batches, and use a library 
which is sensitive to the back-pressure of the stream. In your implementation 
there is no back-pressure to control it. We migrated data pretty fast by 
keeping the CPU at 100% constantly and achieved the highest performance (we used Scala 
with akka-streams and phantom-websudo).

I would consider using some streaming API to implement this. When you do 
batching, make sure you don't exceed the max batch size, or things will slow 
down anyway.

Lahiru

On Mon, Nov 7, 2016 at 8:51 AM, Rajesh Radhakrishnan 
>
 wrote:
Hi

We are trying to inject millions to data into a table by executing Batches of 
PreparedStatments.

We found that when we use 'session.execute(batch)', it writes more of the data but is very, 
very slow.
However, if we use 'session.execute_async(batch)' then it is relatively fast, but 
when it reaches a certain limit it fills up the memory (of the Python process).

Our implementation:
Cassandra 3.7.0 cluster  ring with 3 nodes (RedHat, 150GB Disk, 8GB of RAM each)

Python 2.7.12

Anybody know how to reduce the memory use of Cassandra-python driver API 
specifically for execute_async? Thank you!



===CODE ==
  sqlQuery = "INSERT INTO tableV  (id, sample_name, pos, ref_base, 
var_base) values (?,?,?,?,?)"
   random_numbers_for_strains = random.sample(xrange(1,300), 200)
random_numbers = random.sample(xrange(1,200), 20)

totalCounter  = 0
c = 0
time_init = time.time()
for random_number_strain in random_numbers_for_strains:

sample_name = None
sample_name = 'sample'+str(random_number_strain)

cassandraCluster = CassandraCluster.CassandraCluster()
cluster = cassandraCluster.create_cluster_with_protocol2()
session = cluster.connect();
#session.default_timeout = 1800
session.set_keyspace(self.KEYSPACE_NAME)

preparedStatement = session.prepare(sqlQuery)

counter = 0
c = c + 1

for random_number in random_numbers:

totalCounter += 1
if counter == 0 :
batch = BatchStatement()

counter += 1
if totalCounter % 1 == 0 :
print "Total Count "+ str(totalCounter)

batch.add(preparedStatement.bind([ uuid.uuid1(), sample_name, 
random_number, random.choice('GT'), random.choice('AC')]))
if counter % 50 == 0:
session.execute_async(batch)
#session.execute(batch)
batch = None
del batch
counter = 0

time.sleep(2);
session.cluster.shutdown()
random_number= None
del random_number
preparedStatement = None
session = None
del session
cluster = None
del cluster
cassandraCluster = None
del cassandraCluster
gc.collect()

===CODE ==



Kind regards,
Rajesh Radhakrishnan

