Re: Insert with both TTL and timestamp behavior

2016-12-30 Thread Jeff Jirsa
Your last sentence is correct - TWCS and DTCS add meaning (a date/timestamp) to 
the long writetime that the rest of Cassandra ignores. If you're trying to 
backload data, you'll need to calculate the TTL yourself per write, just like you 
calculate the writetime.

The TTL behavior doesn't consider the client-provided writetime at all; it's 
based on a delta from system time at the time of the write.
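
As a rough sketch of what that means per insert (keyspace, table, and numbers 
below are made up, not from anyone's real schema):

-- Record's own created-on time: 2016-06-15 00:00:00 UTC; desired retention:
-- 365 days measured from that created-on time, not from the insert.
-- Because TTL counts down from the moment of the insert, the client has to
-- subtract the age the record already has:
--   ttl_seconds = 365*86400 - (now - created_on), skipping the row if <= 0
-- (numbers below assume the import runs around 2016-12-30)
INSERT INTO my_ks.events (id, payload)
VALUES (42, 'historical row')
USING TIMESTAMP 1465948800000000   -- created-on time in microseconds since epoch
  AND TTL 14428800;                -- precomputed seconds remaining (~167 days)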

The fact that Cassandra doesn't try to force meaning on writetime actually 
enables some great (but dangerous) data models where you use the writetime as 
the value for leaderboards and the like - dangerous, so don't try it unless you 
actually understand why it works.
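
To be clear, I'm only sketching the shape of that pattern here - the table and 
column names are made up, and it's exactly the kind of thing you shouldn't copy 
without understanding the conflict-resolution rules:

-- Write the score *as* the timestamp; last-write-wins then keeps whichever
-- write carried the highest score, with no read-before-write.
UPDATE leaderboard USING TIMESTAMP 987654   -- the score itself
  SET marker = 1 WHERE player = 'alice';

-- Read the score back via writetime():
SELECT player, writetime(marker) AS best_score
FROM leaderboard WHERE player = 'alice';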


-- 
Jeff Jirsa


> On Dec 28, 2016, at 1:15 PM, Voytek Jarnot  wrote:
> 
> >It's not clear to me why for your use case you would want to manipulate the 
> >timestamps as you're loading the records unless you're concerned about 
> >conflicting writes getting applied in the correct order.
> 
> Simple use-case: want to load historical data, want to use TWCS, want to use 
> TTL.
> 
> Scenario:
> Importing data using standard write path (inserts)
> Using timestamp to give TWCS something to work with (import records contain a 
> created-on timestamp from which I populate "using timestamp")
> Need records to expire according to TTL
> Don't want to calculate TTL for every insert individually (obviously what I 
> want and what I get differ)
> I'm importing in chrono order, so TWCS should be able to keep things from 
> getting out of hand.
> 
> >I think in general timestamp manipulation is caveat utilitor.
> 
> Yeah; although I'd probably choose stronger words. TWCS (and perhaps DTCS?) 
> appears to treat writetimes as timestamps; the rest of Cassandra appears to 
> treat them as integers.
> 
> 
>> On Wed, Dec 28, 2016 at 2:50 PM, Eric Stevens  wrote:
>> The purpose of timestamps is to guarantee out-of-order conflicting writes 
>> are resolved as last-write-wins.  Cassandra doesn't really expect you to be 
>> writing timestamps with wide variations from record to record.  Indeed, if 
>> you're doing this, it'll violate some of the assumptions in places such as 
>> time windowed / date tiered compaction.  It's possible to dodge those 
>> landmines but it would be hard to know if you got it wrong.
>> 
>> I think in general timestamp manipulation is caveat utilitor.  It's not 
>> clear to me why for your use case you would want to manipulate the 
>> timestamps as you're loading the records unless you're concerned about 
>> conflicting writes getting applied in the correct order. 
>> 
>> Probably worth a footnote in the documentation indicating that if you're 
>> doing both USING TTL and USING TIMESTAMP, the two don't relate to each 
>> other.  At rest, TTL'd records get written with an expiration timestamp, not 
>> a delta from the writetime.
>> 
>>> On Wed, Dec 28, 2016 at 9:38 AM Voytek Jarnot  
>>> wrote:
>>> It appears that, when inserting with "using ttl [foo] and timestamp 
>>> [bar]", the TTL does not take the provided timestamp into account.
>>> 
>>> In other words, the TTL starts at insert time, not at the time specified by 
>>> the timestamp.
>>> 
>>> Similarly, if inserting with just "using timestamp [bar]" and relying on 
>>> the table's default_time_to_live property, the timestamp is again ignored 
>>> in terms of TTL expiration.
>>> 
>>> Seems like a bug to me, but I'm guessing this is intended behavior?
>>> 
>>> Use-case is importing data (some of it historical) and setting the 
>>> timestamp manually (based on a timestamp within the data itself). Anyone 
>>> familiar with any work-arounds that don't rely on calculating a TTL 
>>> client-side for each record?
> 




Re: Query

2016-12-30 Thread Work
Actually, "noSQL" is a misleading misnomer. With C* you have CQL which is 
adapted from SQL syntax and purpose.

For a poster boy, try Netflix.

Regards,

James 

Sent from my iPhone

> On Dec 30, 2016, at 4:59 AM, Sikander Rafiq  wrote:
> 
> Thanks for your comments/suggestions.
> 
> 
> Yes, I understand my project's needs and requirements. It certainly needs to 
> handle huge data sets, which is why I'm exploring what suits it.
> 
> 
> Though Cassandra is distributed, scalable and highly available, it is NoSQL, 
> which means the SQL part is missing and needs to be handled.
> 
> 
> 
> Can anyone please tell me some big names who are using Cassandra to handle 
> their huge data sets, like Twitter, etc.?
> 
> 
> 
> Sent from Outlook
> 
> 
>  
> From: Edward Capriolo 
> Sent: Friday, December 30, 2016 5:53 AM
> To: user@cassandra.apache.org
> Subject: Re: Query
>  
> You should start with understanding your needs. Once you understand your needs 
> you can pick the software that fits them. Starting with a software stack 
> is backwards.
> 
>> On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater  
>> wrote:
>> I wasn’t familiar with Gizzard either so I thought I’d take a look. The 
>> first thing on their GitHub readme is:
>> NB: This project is currently not recommended as a base for new consumers.
>> (And no commits since 2013)
>> 
>> So, Cassandra definitely looks like a better choice as your datastore for a 
>> new project.
>> 
>> Cheers
>> Ben
>> 
>>> On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar  
>>> wrote:
>>> I am not that familiar with Gizzard, but with Gizzard + MySQL you have 
>>> multiple moving parts in the system that need to be managed separately. You'll 
>>> need the MySQL expert for MySQL and the Gizzard expert to manage the 
>>> distributed part. It can be argued that long term this will have a higher 
>>> administration cost.
>>> 
>>> Cassandra's value add is its simple peer-to-peer architecture that is easy 
>>> to manage - a single database solution that is distributed, scalable, 
>>> highly available, etc. In other words, once you gain expertise in Cassandra, 
>>> you get everything in one package.
>>> 
>>> regards
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq  
>>> wrote:
>>> Hi,
>>> 
>>> I'm exploring Cassandra for handling large data sets for a mobile app, but 
>>> I'm not clear where it stands.
>>> 
>>> 
>>> If we use MySQL as the underlying database, Gizzard for building custom 
>>> distributed databases (with arbitrary storage technology), and Memcached for 
>>> highly queried data, then where does Cassandra fit in?
>>> 
>>> 
>>> 
>>> I have read that Twitter uses both Cassandra and Gizzard. Please explain 
>>> to me where Cassandra will act.
>>> 
>>> 
>>> Thanks in advance.
>>> 
>>> 
>>> Regards,
>>> 
>>> Sikander
>>> 
>>> 
>>> 
>>> Sent from Outlook
>>> 
>>> 
>>> 
>>> -- 
>>> http://khangaonkar.blogspot.com/
> 


Re: Read efficiency question

2016-12-30 Thread Voytek Jarnot
Thank you Janne.  Yes, these are random-access (scatter) reads - I've
decided on option 1, having also considered (as you wrote) that it will
never make sense to look at ranges of key3.
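
For anyone who finds this thread later, a quick sketch of the two layouts being 
compared (table and column names here are illustrative, not my actual schema):

-- Option 1: key3 is part of the partition key; every read must supply all three
CREATE TABLE opt1 (
  key1 text, key2 text, key3 text, val text,
  PRIMARY KEY ((key1, key2, key3))
);

-- Option 2: key3 is a clustering column; rows sharing (key1, key2) live in one
-- partition, which is what makes range queries on key3 possible
CREATE TABLE opt2 (
  key1 text, key2 text, key3 text, val text,
  PRIMARY KEY ((key1, key2), key3)
);

-- Valid against opt2 only:
SELECT * FROM opt2
WHERE key1 = 'a' AND key2 = 'b' AND key3 >= '1' AND key3 < '5';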

On Fri, Dec 30, 2016 at 3:40 AM, Janne Jalkanen 
wrote:

> In practice, the performance you’re getting is likely to be impacted by
> your reading patterns.  If you do a lot of sequential reads where key1 and
> key2 stay the same, and only key3 varies, then you may be getting better
> performance out of the second option due to hitting the row and disk caches
> more often. If you are doing a lot of scatter reads, then you’re likely to
> get better performance out of the first option, because the reads will be
> distributed more evenly to multiple nodes.  It also depends on how large
> rows you’re planning to use, as this will directly impact things like
> compaction, which has an overall impact on the speed of the entire cluster.  For
> just a few values of key3, I doubt there would be much difference in
> performance, but if key3 has a cardinality of say, a million, you might be
> better off with option 1.
>
> As always the advice is - benchmark your intended use case - put a few
> hundred gigs of mock data to a cluster, trigger compactions and do perf
> tests for different kinds of read/write loads. :-)
>
> (Though if I didn’t know what my read pattern would be, I’d probably go
> for option 1 purely on a gut feeling if I was sure I would never need range
> queries on key3; shorter rows *usually* are a bit better for performance,
> compaction, etc.  Really wide rows can sometimes be a headache
> operationally.)
>
> May you have energy and success!
> /Janne
>
>
>
> On 28 Dec 2016, at 16:44, Manoj Khangaonkar  wrote:
>
> In the first case, the partitioning is based on key1,key2,key3.
>
> In the second case, partitioning is based on key1 , key2. Additionally you
> have a clustering key, key3. This means within a partition you can do range
> queries on key3 efficiently. That is the difference.
>
> regards
>
> On Tue, Dec 27, 2016 at 7:42 AM, Voytek Jarnot 
> wrote:
>
>> Wondering if there's a difference when querying by primary key between
>> the two definitions below:
>>
>> primary key ((key1, key2, key3))
>> primary key ((key1, key2), key3)
>>
>> In terms of read speed/efficiency... I don't have much of a reason
>> otherwise to prefer one setup over the other, so would prefer the most
>> efficient for querying.
>>
>> Thanks.
>>
>
>
>
> --
> http://khangaonkar.blogspot.com/
>
>
>


Announcement: Atlanta Meetup, January 10th

2016-12-30 Thread SEAN_R_DURITY
Down and Durity - Cassandra Admin Discussion

Now, you are running several Cassandra clusters (or leaning heavily that way). 
How do you deploy them, monitor them, and do various other administrative 
tasks? Come and join in a discussion and let's learn from each other.

Sean Durity, our facilitator for the evening, is a Lead Cassandra Administrator 
(aka "lord of the (C*) rings") from The Home Depot, where he administers 
several production clusters. They range from 6 to 112 nodes, and he has worked on 
Cassandra versions 1.0.8 - 2.1 (and the accompanying DataStax Enterprise 
versions). He will share tips and tools from his 3+ years of hands-on Cassandra 
experience and hopes to learn even more from the rest of the group. Expect a 
lively discussion with many take-aways.

More details or RSVP:
https://www.meetup.com/atlcassandra/events/236498977/


Sean Durity









RE: Query

2016-12-30 Thread SEAN_R_DURITY
A few of the many companies that rely on Cassandra are mentioned here:
http://cassandra.apache.org
Apple, Netflix, Weather Channel, etc.
(Not nearly as good as the Planet Cassandra list that DataStax used to 
maintain. Boo for the Apache/DataStax squabble!)

DataStax has a list of many case studies, too, with their enterprise version of 
Cassandra:
http://www.datastax.com/resources/casestudies


Sean Durity

From: Sikander Rafiq [mailto:hafiz_ra...@hotmail.com]
Sent: Friday, December 30, 2016 8:00 AM
To: user@cassandra.apache.org
Subject: Re: Query


Thanks for your comments/suggestions.



Yes, I understand my project's needs and requirements. It certainly needs to 
handle huge data sets, which is why I'm exploring what suits it.



Though Cassandra is distributed, scalable and highly available, it is NoSQL, 
which means the SQL part is missing and needs to be handled.



Can anyone please tell me some big names who are using Cassandra to handle 
their huge data sets, like Twitter, etc.?





Sent from Outlook


From: Edward Capriolo
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org
Subject: Re: Query

You should start with understanding your needs. Once you understand your needs 
you can pick the software that fits them. Starting with a software stack is 
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
thing on their GitHub readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar wrote:
I am not that familiar with Gizzard, but with Gizzard + MySQL you have 
multiple moving parts in the system that need to be managed separately. You'll 
need the MySQL expert for MySQL and the Gizzard expert to manage the 
distributed part. It can be argued that long term this will have a higher 
administration cost.
Cassandra's value add is its simple peer-to-peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available, etc. In other words, once you gain expertise in Cassandra, you get 
everything in one package.
regards




On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq wrote:

Hi,

I'm exploring Cassandra for handling large data sets for a mobile app, but I'm 
not clear where it stands.



If we use MySQL as the underlying database, Gizzard for building custom 
distributed databases (with arbitrary storage technology), and Memcached for 
highly queried data, then where does Cassandra fit in?



I have read that Twitter uses both Cassandra and Gizzard. Please explain to me 
where Cassandra will act.



Thanks in advance.



Regards,

Sikander




Sent from Outlook


--
http://khangaonkar.blogspot.com/








Re: Query

2016-12-30 Thread Sikander Rafiq
Thanks for your comments/suggestions.


Yes, I understand my project's needs and requirements. It certainly needs to 
handle huge data sets, which is why I'm exploring what suits it.


Though Cassandra is distributed, scalable and highly available, it is NoSQL, 
which means the SQL part is missing and needs to be handled.


Can anyone please tell me some big names who are using Cassandra to handle 
their huge data sets, like Twitter, etc.?



Sent from Outlook



From: Edward Capriolo 
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org
Subject: Re: Query

You should start with understanding your needs. Once you understand your needs 
you can pick the software that fits them. Starting with a software stack is 
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
thing on their GitHub readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar wrote:
I am not that familiar with Gizzard, but with Gizzard + MySQL you have 
multiple moving parts in the system that need to be managed separately. You'll 
need the MySQL expert for MySQL and the Gizzard expert to manage the 
distributed part. It can be argued that long term this will have a higher 
administration cost.

Cassandra's value add is its simple peer-to-peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available, etc. In other words, once you gain expertise in Cassandra, you get 
everything in one package.

regards





On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq wrote:

Hi,

I'm exploring Cassandra for handling large data sets for a mobile app, but I'm 
not clear where it stands.


If we use MySQL as the underlying database, Gizzard for building custom 
distributed databases (with arbitrary storage technology), and Memcached for 
highly queried data, then where does Cassandra fit in?


I have read that Twitter uses both Cassandra and Gizzard. Please explain to me 
where Cassandra will act.


Thanks in advance.


Regards,

Sikander



Sent from Outlook



--
http://khangaonkar.blogspot.com/



Re: How to change Replication Strategy and RF

2016-12-30 Thread techpyaasa .
Thanks a lot kurt Greaves

On Fri, Dec 30, 2016 at 5:58 AM, kurt Greaves  wrote:

>
> If you're already using the cluster in production and require no downtime,
> you should perform a datacenter migration first to change the RF to 3.
> Rough process would be as follows:
>
>1. Change keyspace to NetworkTopologyStrategy with RF=1. You shouldn't
>increase RF here as you will receive read failures as not all nodes have
>the data they own. You would have to wait for a repair to complete to stop
>any read failures.
>2. Configure your clients to use a LOCAL_* consistency and
>DCAwareRoundRobinPolicy for load balancing (with the current DC configured)
>3. Add a new datacenter, configure its replication to be 3.
>4. Rebuild the new datacenter by running nodetool rebuild  on
>each node in the new DC.
>5. Migrate your clients to use the new datacenter, by switching the
>contact points to nodes in the new DC and the load balancing policy DC to
>the new DC
>6. At this point you could increase the replication factor on the old
>DC to 3, and then run a repair. Once the repair successfully completes you
>should have 2 DCs that you can use. If you need the DCs in separate
>locations you could change this step to adding another DC in the desired
>other location and running rebuilds as per steps 2-4.
>
> - Kurt
>
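
In case it helps anyone else following this thread, the keyspace changes in 
steps 1, 3 and 6 above look roughly like the following (keyspace and datacenter 
names are placeholders for the real ones):

-- Step 1: switch to NetworkTopologyStrategy, keeping RF=1 in the existing DC
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 1};

-- Step 3: once the new datacenter's nodes have joined, give it RF=3
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 3};

-- Step 4: on each node in the new DC, stream the existing data from the old DC:
--   nodetool rebuild DC1

-- Step 6: raise the old DC to RF=3 as well, then run a repair
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};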


Re: Read efficiency question

2016-12-30 Thread Janne Jalkanen
In practice, the performance you’re getting is likely to be impacted by your 
reading patterns.  If you do a lot of sequential reads where key1 and key2 stay 
the same, and only key3 varies, then you may be getting better performance out 
of the second option due to hitting the row and disk caches more often. If you 
are doing a lot of scatter reads, then you’re likely to get better performance 
out of the first option, because the reads will be distributed more evenly to 
multiple nodes.  It also depends on how large rows you’re planning to use, as 
this will directly impact things like compaction, which has an overall impact 
on the speed of the entire cluster.  For just a few values of key3, I doubt 
there would be much difference in performance, but if key3 has a cardinality of 
say, a million, you might be better off with option 1.

As always the advice is - benchmark your intended use case - put a few hundred 
gigs of mock data to a cluster, trigger compactions and do perf tests for 
different kinds of read/write loads. :-)

(Though if I didn’t know what my read pattern would be, I’d probably go for 
option 1 purely on a gut feeling if I was sure I would never need range queries 
on key3; shorter rows *usually* are a bit better for performance, compaction, 
etc.  Really wide rows can sometimes be a headache operationally.)

May you have energy and success!
/Janne



On 28 Dec 2016, at 16:44, Manoj Khangaonkar  wrote:

In the first case, the partitioning is based on key1, key2, key3.

In the second case, partitioning is based on key1, key2. Additionally you have 
a clustering key, key3. This means within a partition you can do range queries 
on key3 efficiently. That is the difference.

regards

On Tue, Dec 27, 2016 at 7:42 AM, Voytek Jarnot  wrote:

Wondering if there's a difference when querying by primary key between the two 
definitions below:

primary key ((key1, key2, key3))
primary key ((key1, key2), key3)

In terms of read speed/efficiency... I don't have much of a reason otherwise to 
prefer one setup over the other, so would prefer the most efficient for 
querying.

Thanks.
-- http://khangaonkar.blogspot.com/