Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

2015-05-04 Thread Steve Robenalt
A few observations from what you've said so far:

1) IN clauses in CQL can hurt performance because a single statement can
touch sets of keys that are spread across the whole cluster (see the sketch
below this list).

2) We previously used m3.large instances in our cluster and would see
occasional read timeouts even at CL.ONE. We upgraded to i2.xlarge with
local SSD drive and no longer experience those problems.

3) You didn't state how you had your storage configured, but if you're
using EBS for your Cassandra partitions, it can seriously impact
performance due to network lag. If you're using local storage on your
instances (which is recommended), you should be using separate drives for
your data and commitlog partitions because the access patterns are very
different. SSD is the preferred local storage option (over spinning disks)
in any case.

4) If you have all of the above covered, you might also want to compare
CL.ONE read results with CL.QUORUM read results. The former will likely
perform much better.

Any of the issues above can cause excessive contention between nodes and
seriously degrade performance as traffic increases.
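
To make points 1 and 4 concrete, here is a rough sketch, assuming the
DataStax Java driver 2.0.x; the contact point, keyspace, and key list are
placeholders. It replaces a multi-key IN with one async read per key and
makes the consistency level an explicit knob so CL.ONE and CL.QUORUM runs
can be compared:

    import com.datastax.driver.core.*;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    public class ReadFanOut {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build(); // placeholder
            Session session = cluster.connect("my_ks");                              // placeholder

            // Instead of SELECT ... WHERE event_id IN (k1, k2, ...), which makes
            // the coordinator gather keys spread across the cluster, issue one
            // async read per key so each request can go straight to a replica.
            PreparedStatement ps = session.prepare("SELECT * FROM event WHERE event_id = ?");
            List<UUID> keys = new ArrayList<UUID>(); // fill with the keys to fetch

            // Flip this to ConsistencyLevel.QUORUM to compare the two runs.
            ConsistencyLevel cl = ConsistencyLevel.ONE;

            List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
            for (UUID key : keys) {
                futures.add(session.executeAsync(ps.bind(key).setConsistencyLevel(cl)));
            }
            for (ResultSetFuture f : futures) {
                Row row = f.getUninterruptibly().one(); // blocks until that read completes
                // ... consume row ...
            }
            cluster.close();
        }
    }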

These are just some of the first things that jump out. Others on the list
are a lot more experienced than I am with Cassandra performance and may
have additional advice. There are also quite a few good papers and videos
on Planet Cassandra and the YouTube channel covering performance, storage,
data models and the interactions between them.

Hope that helps,
Steve



Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

2015-05-03 Thread Erick Ramirez
Hello, there.

In relation to the Java driver, I would recommend updating to the latest
version, as there were a lot of issues reported in versions earlier than
2.0.9 where the driver was incorrectly marking nodes as down/not available.

In fact, there is a new version of the driver being released in the next
24-48 hours that reverts JAVA-425 to resolve this issue.

Cheers,
Erick

*Erick Ramirez*
About Me about.me/erickramirezonline

Make a difference today!
* Reduce your carbon footprint http://on.mash.to/1vZL7fX
* Give back to the community http://www.govolunteer.com.au
* Write free software http://www.opensource.org


On Wed, Apr 29, 2015 at 4:56 AM, dlu66061 dlu66...@yahoo.com wrote:

 Cassandra gurus, I am really puzzled by my observations, and hope to get
 some help explaining the results. Thanks in advance.

 I think it has always been advocated in the Cassandra community that
 de-normalization leads to better performance. I wanted to see how much
 performance improvement it can offer, but the results were totally
 opposite: performance degraded dramatically for simultaneous requests
 for the same set of data.

 *Environment:*

 I have a Cassandra cluster consisting of 3 AWS m3.large instances, with
 Cassandra 2.0.6 installed and pretty much default settings. My program is
 written in Java using Java Driver 2.0.8.

 *Normalized case:*

 I have two tables created with the following 2 CQL statements

 CREATE TABLE event (event_id UUID, time_token timeuuid, … 30 other
 attributes …, PRIMARY KEY (event_id))

 CREATE TABLE event_index (index_key text, time_token timeuuid, event_id
 UUID,   PRIMARY KEY (index_key, time_token))

 In my program, given the proper index_key and a token range
 (tokenLowerBound to tokenUpperBound), I first query the event_index table

 *Query 1:*

 SELECT * FROM event_index WHERE index_key IN (…) AND time_token >
 tokenLowerBound AND time_token <= tokenUpperBound ORDER BY time_token ASC
 LIMIT 2000

 to get a list of event_ids and then run the following CQL to get the event
 details.

 *Query 2:*

 SELECT * FROM event WHERE event_id IN (a list of event_ids from the above
 query)

 I repeat the above process, with updated token range from the previous
 run. This actually performs pretty well.

 In this normalized process, I have to *run 2 queries* to get the data: the
 first one should be very quick since it is getting a slice of an internally
 wide row. The second query may take long because it needs to hit up to 2000
 rows of the event table.
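
 A minimal sketch of this two-step read, assuming the 2.0.x Java driver, a
 connected Session, and placeholder names (indexKey, lower, upper); the
 second step fans out one async read per event_id as an alternative to one
 big IN list:

    import com.datastax.driver.core.*;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    public class NormalizedRead {
        // Runs Query 1 then Query 2 for one token window. In real code the
        // two statements would be prepared once and cached.
        static void readWindow(Session session, String indexKey, UUID lower, UUID upper) {
            PreparedStatement q1 = session.prepare(
                    "SELECT event_id, time_token FROM event_index " +
                    "WHERE index_key = ? AND time_token > ? AND time_token <= ? " +
                    "ORDER BY time_token ASC LIMIT 2000");
            ResultSet index = session.execute(
                    q1.bind(indexKey, lower, upper)
                      .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));

            List<UUID> ids = new ArrayList<UUID>();
            for (Row r : index) {
                ids.add(r.getUUID("event_id"));
                // r.getUUID("time_token") from the last row becomes the
                // next window's lower bound.
            }

            // Query 2: fetch details, one async read per event_id.
            PreparedStatement q2 = session.prepare("SELECT * FROM event WHERE event_id = ?");
            List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
            for (UUID id : ids) {
                futures.add(session.executeAsync(
                        q2.bind(id).setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)));
            }
            for (ResultSetFuture f : futures) {
                Row event = f.getUninterruptibly().one();
                // ... consume event ...
            }
        }
    }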

 *De-normalized case:*

 What if we attach the event detail to the index and run just 1 query? Like
 Query 1, would it be much faster, since it is also getting a slice of an
 internally wide row?

 I created a third table that merged the above two tables together. Notice
 the first three attributes and the PRIMARY KEY definition are exactly the
 same as the event_index table.

 CREATE TABLE event_index_with_detail (index_key text, time_token timeuuid,
 event_id UUID, … 30 other attributes …, PRIMARY KEY (index_key,
 time_token))

 Then I can just run the following query to achieve my goal, with the same
 index and token range as in query 1:

 *Query 3:*

 SELECT * FROM event_index_with_detail WHERE index_key IN (…) AND
 time_token > tokenLowerBound AND time_token <= tokenUpperBound ORDER BY
 time_token ASC LIMIT 2000

 *Performance observations*

 Using Java Driver 2.0.8, I wrote a program that runs Query 1 + Query 2 in
 the normalized case, or Query 3 in the denormalized case. All queries are
 set to the LOCAL_QUORUM consistency level.

 Then I created 1 or more instances of the program to simultaneously
 retrieve the SAME set of 1 million events stored in Cassandra. Each test
 runs for 5 minutes, and the results are shown below.



                 1 instance   5 instances   10 instances

 Normalized          89           315            417

 Denormalized       100           *43*           *3*

 Note that the unit of measure is the number of operations. So in the
 normalized case, the program runs 89 times and retrieves 178K events with a
 single instance, 315 times and 630K events across 5 instances (each
 instance gets about 126K events), and 417 times and 834K events across 10
 simultaneous instances (each instance gets about 83.4K events).

 For the de-normalized case, the performance is a little better in the
 single-instance case, where the program runs 100 times and retrieves
 200K events. However, it turns sharply south for multiple simultaneous
 instances. All 5 instances together completed only 43 operations, and all
 10 instances together completed only 3 operations. In the latter case, the
 log showed that 3 instances each retrieved 2000 events successfully, and
 the 7 other instances retrieved 0.

 In the de-normalized case, the program reported a lot of exceptions like
 the one below:

 com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
 timeout during read query at consistency LOCAL_QUORUM (2 responses were
 required but only 1 replica responded)
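
 A hedged sketch of two client-side knobs in the 2.0.x Java driver that can
 soften read timeouts like this one; the contact point is a placeholder, and
 DowngradingConsistencyRetryPolicy trades consistency for availability, so
 it is not a free fix:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.SocketOptions;
    import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

    public class TimeoutTuning {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1") // placeholder
                    // Retry failed reads at a lower CL instead of surfacing the timeout.
                    .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                    // Give replicas longer than the default 12s before the client gives up.
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(20000))
                    .build();
            cluster.close();
        }
    }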

 

Re: Denormalization

2013-01-28 Thread chandra Varahala
In my experience, we can design main column families and lookup column
families. The main column family holds all the denormalized data; each
lookup column family holds the row key of the main column family.

For example, the users column family holds all of a user's denormalized
data, and a lookup column family is named something like userByemail. A
first request to userByemail returns a unique key, which is the row key of
the User column family; a call to the User column family then returns all
the data. The other lookup column families work the same way.
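
A minimal sketch of that lookup-then-main read in CQL3 terms with the Java
driver; the table and column names (userByemail, users, user_id, email) are
assumptions:

    import com.datastax.driver.core.*;

    public class UserLookup {
        // Resolve a user via the lookup table, then fetch the full row.
        static Row findUserByEmail(Session session, String email) {
            PreparedStatement byEmail = session.prepare(
                    "SELECT user_id FROM userByemail WHERE email = ?");
            Row hit = session.execute(byEmail.bind(email)).one();
            if (hit == null) return null; // no such email

            PreparedStatement byId = session.prepare(
                    "SELECT * FROM users WHERE user_id = ?");
            return session.execute(byId.bind(hit.getUUID("user_id"))).one();
        }
    }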

-
Chandra




Re: Denormalization

2013-01-27 Thread Hiller, Dean
There is really a mix of denormalization and normalization; it really
depends on the specific use-cases. To get better help on the email list, a
more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck fredrik.l.stigb...@sitevision.se
wrote:

Hi.
Since denormalized data is a first-class citizen in Cassandra, how does
one handle updating denormalized data? E.g. if we have a USER cf with
name, email, etc., denormalize the user data into many other CFs, and
then update the information about a user (name, email, ...), what is the
best way to handle updating those user data properties, which might be
spread out over many CFs and many rows?

Regards
/Fredrik



Re: Denormalization

2013-01-27 Thread Fredrik Stigbäck
I don't have a current use-case. I was just curious how applications
handle it and how to think when modelling, since I guess denormalization
might increase the complexity of the application.

Fredrik





-- 
Fredrik Larsson Stigbäck
SiteVision AB Vasagatan 10, 107 10 Örebro
019-17 30 30


Re: Denormalization

2013-01-27 Thread Adam Venturella
In my experience, if you foresee needing to do a lot of updates where a
master record would need to propagate its changes to other
records, then in general a non-sql based data store may be the wrong fit
for your data.

If you have a lot of data that doesn't really change or is not linked in
some way to other rows (in Cassandra's case), then a non-sql based data
store could be a great fit.

Yes, you can do some fancy stuff to force things like Cassandra to behave
like an RDBMS, but it's at the cost of application complexity; more code,
more bugs.

I often end up mixing the data stores sql/non-sql to play to their
respective strengths.

If I start seeing a lot of related data, relational databases are really
good at solving that problem.





Re: Denormalization

2013-01-27 Thread Hiller, Dean
Things like PlayOrm exist to help you mix denormalized and normalized data.
There are more and more patterns out there for denormalization and normalization
that still allow for scalability. Here is one patterns page:

https://github.com/deanhiller/playorm/wiki/Patterns-Page

Dean



Re: Denormalization

2013-01-27 Thread Hiller, Dean
Oh, and check out the last pattern, Scalable equals only index, which can allow
you to still have normalized data; the pattern does just enough denormalization
that you can

 1.  Update just two pieces of info (the user's email, for instance, and the
Xref table's email as well).
 2.  Allow everyone else to have foreign references into that piece (everyone
references the guid, not the email, while the xref table maps email to guid for
your use). This can be quite a common pattern, actually, when you may be having
issues denormalizing (see the sketch below).
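
A rough sketch of that, assuming a Cassandra 2.0+ cluster with the 2.0.x Java
driver and made-up tables users(guid PRIMARY KEY, email, ...) and
users_by_email(email PRIMARY KEY, guid); an email change touches exactly two
logical pieces (plus clearing the stale xref row), while every other table
keeps referencing the guid:

    import com.datastax.driver.core.*;

    import java.util.UUID;

    public class EmailXref {
        // Change a user's email with a small, fixed number of mutations.
        static void changeEmail(Session session, UUID guid, String oldEmail, String newEmail) {
            BatchStatement batch = new BatchStatement();
            batch.add(session.prepare("UPDATE users SET email = ? WHERE guid = ?")
                    .bind(newEmail, guid));
            batch.add(session.prepare("INSERT INTO users_by_email (email, guid) VALUES (?, ?)")
                    .bind(newEmail, guid));
            batch.add(session.prepare("DELETE FROM users_by_email WHERE email = ?")
                    .bind(oldEmail));
            // Every other table references the guid, so nothing else is rewritten.
            session.execute(batch);
        }
    }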

Dean



Re: Denormalization

2013-01-27 Thread Edward Capriolo
One technique is to build a tool on the client side that takes the event
and produces N mutations. In C*, writes are cheap, so essentially you
re-write everything on all changes.
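
A rough sketch of that idea, assuming the 2.0.x Java driver against
Cassandra 2.0 (SimpleStatement with bind values needs native protocol v2)
and made-up view tables; one logical event fans out into N parallel writes:

    import com.datastax.driver.core.*;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    public class FanOutWriter {
        // One logical "user renamed" event becomes N mutations, one per
        // denormalized view that embeds the name.
        static void renameUser(Session session, UUID userId, String newName) {
            List<Statement> mutations = new ArrayList<Statement>();
            mutations.add(new SimpleStatement(
                    "UPDATE users SET name = ? WHERE user_id = ?", newName, userId));
            mutations.add(new SimpleStatement(
                    "UPDATE profile_by_user SET display_name = ? WHERE user_id = ?",
                    newName, userId));
            // ... one statement per additional view ...

            List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
            for (Statement m : mutations) {
                futures.add(session.executeAsync(m)); // writes are cheap; issue in parallel
            }
            for (ResultSetFuture f : futures) {
                f.getUninterruptibly(); // surface any failure
            }
        }
    }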




Re: Denormalization

2013-01-27 Thread Edward Capriolo
When I said that writes were cheap, I meant that in a normal case people
are making 2-10 inserts for what in a relational database might be one.
30K inserts is certainly not cheap.

Your use case with 30,000 inserts is probably a special case. Most
directory services that I am aware of (OpenLDAP, Active Directory, Sun
Directory Server) do eventually consistent master/slave and multi-master
replication. So no worries about having to background something. You just
want the replication to be fast enough so that when you call the employee
about to be fired into the office, by the time he leaves and gets home he
can not VPN in and rm -rf / your main file server :)


On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Sometimes this is true, sometimes not… We have a use case where we have
 an admin tool where we choose to do this denorm for ACL on permission
 checks, to make permission checks extremely fast.  That said, we have one
 issue with one object that has too many children (30,000), so when someone
 gives a user access to this one object with 30,000 children, we end up with
 a bad 60-second wait, and users ended up getting frustrated and trying to
 cancel (our plan, since admin activity hardly ever happens, is to do it on
 our background thread and just return immediately to the user and tell him
 his changes will take effect in 1 minute).  After all, admin changes are
 infrequent anyways.  This example demonstrates how sometimes it could
 almost burn you.

 I guess my real point is it really depends on your use cases ;).  In a lot
 of cases denorm can work, but in some cases it burns you, so you have to
 balance it all.  In 90% of our cases our denorm is working great, and for
 this one case we need to background the permission change, as we still LOVE
 the performance of our ACL checks.

 Ps. 30,000 writes in Cassandra is not cheap when done from one server ;),
 but in general parallelized writes are very fast for, like, 500.

 Later,
 Dean



Re: Denormalization

2013-01-27 Thread Hiller, Dean
Agreed, was just making sure others knew ;).

Dean
