Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled
in the normalized case, or Query 3 in the denormalized case. All queries are issued at the LOCAL_QUORUM consistency level. I then created 1 or more instances of the program to simultaneously retrieve the SAME set of 1 million events stored in Cassandra. Each test runs for 5 minutes, and the results are shown below.

                1 instance   5 instances   10 instances
  Normalized    89           315           417
  Denormalized  100          *43*          *3*

Note that the unit of measure is the number of operations. So in the normalized case, the program runs 89 times and retrieves 178K events for a single instance, 315 times and 630K events across 5 instances (each instance gets about 126K events), and 417 times and 834K events across 10 simultaneous instances (each instance gets about 83.4K events).

For the de-normalized case, the performance is a little better in the single-instance case, where the program runs 100 times and retrieves 200K events. However, it turns sharply south for multiple simultaneous instances. All 5 instances together completed only 43 operations successfully, and all 10 instances together completed only 3. In the latter case, the log showed that 3 instances each retrieved 2000 events successfully, while the other 7 instances retrieved 0.

In the de-normalized case, the program reported many exceptions like the ones below:

  com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
  com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)

I repeated the two cases back and forth several times, and the results remained the same. I also observed CPU usage on the 3 Cassandra servers, and it was much higher across the board for the de-normalized case.
                1 instance         5 instances        10 instances
  Normalized    7% usr, 2% sys     30% usr, 8% sys    40% usr, 10% sys
  Denormalized  44% usr, 0.3% sys  65% usr, 1% sys    70% usr, 2% sys

*Questions*

This is really not what I expected, and I am puzzled and have not figured out a good explanation.

- Why are there so many exceptions in the de-normalized case? I would think Cassandra should be able to handle simultaneous accesses to the same data. And why are there NO exceptions in the normalized case? The environments for the two cases are basically the same.
- Is an (internally) wide row only good for a small amount of data under each column name?
- Or is it an issue with the Java Driver?
- Or did I do something wrong?

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Denormalization-leads-to-terrible-rather-than-better-Cassandra-performance-I-am-really-puzzled-tp7600561.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
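For what it's worth, the ReadTimeoutException above is consistent with quorum arithmetic for a replication factor of 3 (the RF is inferred from the 3-node cluster and the "2 responses were required" message; it is not stated explicitly in the post). A quorum is floor(RF/2) + 1 replicas, so a LOCAL_QUORUM read fails as soon as a second replica cannot answer in time. A minimal sketch of that calculation:

```java
public class QuorumMath {
    // Quorum for a given replication factor: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        // With RF = 3, LOCAL_QUORUM needs 2 of 3 replicas to respond in time.
        // If only 1 responds before the timeout, the read fails, matching
        // "2 responses were required but only 1 replica responded".
        System.out.println(quorum(3)); // prints 2
        System.out.println(quorum(5)); // prints 3
    }
}
```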
Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled
This is really not what I expected, and I am puzzled and have not figured out a good explanation.

- Why are there so many exceptions in the de-normalized case? I would think Cassandra should be able to handle simultaneous accesses to the same data. Why are there NO exceptions for the normalized case? The environments for the two cases are basically the same.
- Is an (internally) wide row only good for a small amount of data under each column name?
- Or is it an issue with the Java Driver?
- Or did I do something wrong?
Re: Denormalization
In my experience, we can design main column families and lookup column families. The main column family holds all the denormalized data; each lookup column family holds the row key of the main column family's row. For example, all of a user's denormalized data lives in the Users column family, with a lookup column family named userByEmail. A first request to userByEmail returns the unique key, which is the row key into the Users column family; a second call to Users then returns all the data. The other lookup column families work the same way.

- Chandra

On Sun, Jan 27, 2013 at 8:53 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:

Agreed, was just making sure others knew ;).

Dean

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 6:51 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

When I said that writes were cheap, I was speaking of the normal case where people make 2-10 inserts for what in a relational database might be one. 30K inserts is certainly not cheap; your use case with 30,000 inserts is probably a special case. Most directory services that I am aware of (OpenLDAP, Active Directory, Sun Directory Server) do eventually consistent master/slave and multi-master replication. So no worries about having to background something. You just want the replication to be fast enough that when you call the employee about to be fired into the office, by the time he leaves and gets home he cannot VPN in and rm -rf / your main file server :)

On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:

Sometimes this is true, sometimes not... We have a use case with an admin tool where we chose to do this denorm for ACL permission checks to make them extremely fast.

That said, we have an issue with one object that has too many children (30,000), so when someone gives a user access to this one object with 30,000 children, we end up with a bad 60-second wait, and users ended up getting frustrated and trying to cancel (our plan, since admin activity hardly ever happens, is to do it on a background thread and return immediately to the user, telling him his changes will take effect in 1 minute). After all, admin changes are infrequent anyway. This example demonstrates how sometimes it can almost burn you.

I guess my real point is that it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change, as we still LOVE the performance of our ACL checks.

Ps. 30,000 writes in Cassandra is not cheap when done from one server ;), but in general parallelized writes are very fast for something like 500.

Later,
Dean

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 5:50 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially you re-write everything on all changes.

On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik
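The two-step lookup pattern described above (lookup column family first, then the main column family) can be sketched with in-memory maps standing in for the two column families. The names userByEmail and users are illustrative, taken from the example in the message, not a real schema:

```java
import java.util.HashMap;
import java.util.Map;

public class LookupPattern {
    // Main column family: row key (a unique id) -> denormalized user data.
    static Map<String, Map<String, String>> users = new HashMap<>();
    // Lookup column family: email -> row key of the main column family.
    static Map<String, String> userByEmail = new HashMap<>();

    static Map<String, String> findByEmail(String email) {
        // Step 1: the lookup CF returns the unique key.
        String rowKey = userByEmail.get(email);
        if (rowKey == null) return null;
        // Step 2: the main CF returns all the denormalized data.
        return users.get(rowKey);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("name", "Alice");
        row.put("email", "alice@example.com");
        users.put("guid-123", row);
        userByEmail.put("alice@example.com", "guid-123");

        System.out.println(findByEmail("alice@example.com").get("name")); // Alice
    }
}
```

In Cassandra this costs two reads per email lookup, but each is a single-row fetch by key, which is the access path the storage engine is best at.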
Denormalization
Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik
Re: Denormalization
There is really a mix of denormalization and normalization. It really depends on the specific use case. To get better help on the email list, a more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik
Re: Denormalization
I don't have a current use case. I was just curious how applications handle this, and how to think when modelling, since I guess denormalization might increase the complexity of the application.

Fredrik

2013/1/27 Hiller, Dean <dean.hil...@nrel.gov>:

There is really a mix of denormalization and normalization. It really depends on the specific use case. To get better help on the email list, a more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik

--
Fredrik Larsson Stigbäck
SiteVision AB
Vasagatan 10, 107 10 Örebro
019-17 30 30
Re: Denormalization
In my experience, if you foresee needing to do a lot of updates where a master record must propagate its changes to other records, then in general a non-SQL data store may be the wrong fit for your data. If you have a lot of data that doesn't really change, or is not linked in some way to other rows (in Cassandra's case), then a non-SQL data store could be a great fit. Yes, you can do some fancy stuff to force things like Cassandra to behave like an RDBMS, but it comes at the cost of application complexity: more code, more bugs. I often end up mixing SQL and non-SQL data stores to play to their respective strengths. If I start seeing a lot of related data, relational databases are really good at solving that problem.

On Sunday, January 27, 2013, Fredrik Stigbäck wrote:

I don't have a current use case. I was just curious how applications handle this, and how to think when modelling, since I guess denormalization might increase the complexity of the application.

Fredrik

2013/1/27 Hiller, Dean <dean.hil...@nrel.gov>:

There is really a mix of denormalization and normalization. It really depends on the specific use case. To get better help on the email list, a more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik

--
Fredrik Larsson Stigbäck
SiteVision AB
Vasagatan 10, 107 10 Örebro
019-17 30 30
Re: Denormalization
Things like PlayOrm exist to help you with the half-and-half of denormalized and normalized data. There are more and more patterns out there for denormalizing and normalizing while still allowing for scalability. Here is one patterns page: https://github.com/deanhiller/playorm/wiki/Patterns-Page

Dean

From: Adam Venturella <aventure...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 3:44 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

In my experience, if you foresee needing to do a lot of updates where a master record must propagate its changes to other records, then in general a non-SQL data store may be the wrong fit for your data. If you have a lot of data that doesn't really change, or is not linked in some way to other rows (in Cassandra's case), then a non-SQL data store could be a great fit. Yes, you can do some fancy stuff to force things like Cassandra to behave like an RDBMS, but it comes at the cost of application complexity: more code, more bugs. I often end up mixing SQL and non-SQL data stores to play to their respective strengths. If I start seeing a lot of related data, relational databases are really good at solving that problem.

On Sunday, January 27, 2013, Fredrik Stigbäck wrote:

I don't have a current use case. I was just curious how applications handle this, and how to think when modelling, since I guess denormalization might increase the complexity of the application.

Fredrik

2013/1/27 Hiller, Dean <dean.hil...@nrel.gov>:

There is really a mix of denormalization and normalization. It really depends on the specific use case. To get better help on the email list, a more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik

--
Fredrik Larsson Stigbäck
SiteVision AB
Vasagatan 10, 107 10 Örebro
019-17 30 30
Re: Denormalization
Oh, and check out the last pattern, "Scalable equals only index", which can allow you to still have normalized data. The pattern does just enough denormalization that you can:

1. Update just two pieces of info (the user's email, for instance, and the Xref table's email as well).
2. Allow everyone else to have foreign references into that piece (everyone references the guid, not the email... while the xref table maps email to guid for your use... this can be quite a common pattern when you are having issues denormalizing).

Dean

From: Adam Venturella <aventure...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 3:44 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

In my experience, if you foresee needing to do a lot of updates where a master record must propagate its changes to other records, then in general a non-SQL data store may be the wrong fit for your data. If you have a lot of data that doesn't really change, or is not linked in some way to other rows (in Cassandra's case), then a non-SQL data store could be a great fit. Yes, you can do some fancy stuff to force things like Cassandra to behave like an RDBMS, but it comes at the cost of application complexity: more code, more bugs. I often end up mixing SQL and non-SQL data stores to play to their respective strengths. If I start seeing a lot of related data, relational databases are really good at solving that problem.

On Sunday, January 27, 2013, Fredrik Stigbäck wrote:

I don't have a current use case. I was just curious how applications handle this, and how to think when modelling, since I guess denormalization might increase the complexity of the application.

Fredrik

2013/1/27 Hiller, Dean <dean.hil...@nrel.gov>:

There is really a mix of denormalization and normalization. It really depends on the specific use case. To get better help on the email list, a more specific use case may be appropriate.

Dean

On 1/27/13 2:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik

--
Fredrik Larsson Stigbäck
SiteVision AB
Vasagatan 10, 107 10 Örebro
019-17 30 30
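The "Scalable equals only index" idea above can be sketched with in-memory maps standing in for the two column families. The field and key names are illustrative; the point is that because every other record references the immutable guid rather than the email, an email change touches exactly two rows (the user row and the xref row):

```java
import java.util.HashMap;
import java.util.Map;

public class XrefPattern {
    // Main CF: users keyed by an immutable GUID; all other CFs reference the GUID.
    static Map<String, Map<String, String>> usersByGuid = new HashMap<>();
    // Xref CF: email -> GUID, so lookups by email remain a single-row read.
    static Map<String, String> guidByEmail = new HashMap<>();

    // Only two rows change: the user row and the xref row. No fan-out to
    // every CF that references this user, because they all hold the GUID.
    static void changeEmail(String guid, String newEmail) {
        Map<String, String> user = usersByGuid.get(guid);
        guidByEmail.remove(user.get("email"));
        user.put("email", newEmail);
        guidByEmail.put(newEmail, guid);
    }

    public static void main(String[] args) {
        Map<String, String> alice = new HashMap<>();
        alice.put("name", "Alice");
        alice.put("email", "alice@old.example");
        usersByGuid.put("guid-123", alice);
        guidByEmail.put("alice@old.example", "guid-123");

        changeEmail("guid-123", "alice@new.example");
        System.out.println(guidByEmail.get("alice@new.example")); // guid-123
    }
}
```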
Re: Denormalization
One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially you re-write everything on all changes.

On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik
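The event-to-N-mutations idea can be sketched as a small fan-out builder. The column family names and the string form of a mutation are purely illustrative stand-ins for real driver statements:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FanOut {
    // CFs that each carry a denormalized copy of user data (hypothetical names).
    static final List<String> USER_CFS =
            List.of("users", "users_by_email", "posts_by_user");

    // Turn one user-update event into one mutation per denormalizing CF,
    // per changed field: re-write everything on all changes.
    static List<String> mutationsFor(String userId, Map<String, String> changed) {
        List<String> mutations = new ArrayList<>();
        for (String cf : USER_CFS) {
            for (Map.Entry<String, String> e : changed.entrySet()) {
                mutations.add(cf + ": set " + e.getKey() + "=" + e.getValue()
                        + " where user=" + userId);
            }
        }
        return mutations;
    }

    public static void main(String[] args) {
        // One email change fans out to all three CFs.
        System.out.println(mutationsFor("u1", Map.of("email", "new@example.com")));
    }
}
```

In a real client, each string would be a bound statement, and the N mutations could be issued in parallel or grouped in a batch.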
Re: Denormalization
When I said that writes were cheap, I was speaking of the normal case where people make 2-10 inserts for what in a relational database might be one. 30K inserts is certainly not cheap; your use case with 30,000 inserts is probably a special case. Most directory services that I am aware of (OpenLDAP, Active Directory, Sun Directory Server) do eventually consistent master/slave and multi-master replication. So no worries about having to background something. You just want the replication to be fast enough that when you call the employee about to be fired into the office, by the time he leaves and gets home he cannot VPN in and rm -rf / your main file server :)

On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:

Sometimes this is true, sometimes not... We have a use case with an admin tool where we chose to do this denorm for ACL permission checks to make them extremely fast.

That said, we have an issue with one object that has too many children (30,000), so when someone gives a user access to this one object with 30,000 children, we end up with a bad 60-second wait, and users ended up getting frustrated and trying to cancel (our plan, since admin activity hardly ever happens, is to do it on a background thread and return immediately to the user, telling him his changes will take effect in 1 minute). After all, admin changes are infrequent anyway. This example demonstrates how sometimes it can almost burn you.

I guess my real point is that it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change, as we still LOVE the performance of our ACL checks.

Ps. 30,000 writes in Cassandra is not cheap when done from one server ;), but in general parallelized writes are very fast for something like 500.

Later,
Dean

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 5:50 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially you re-write everything on all changes.

On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik
Re: Denormalization
Agreed, was just making sure others knew ;).

Dean

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 6:51 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

When I said that writes were cheap, I was speaking of the normal case where people make 2-10 inserts for what in a relational database might be one. 30K inserts is certainly not cheap; your use case with 30,000 inserts is probably a special case. Most directory services that I am aware of (OpenLDAP, Active Directory, Sun Directory Server) do eventually consistent master/slave and multi-master replication. So no worries about having to background something. You just want the replication to be fast enough that when you call the employee about to be fired into the office, by the time he leaves and gets home he cannot VPN in and rm -rf / your main file server :)

On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:

Sometimes this is true, sometimes not... We have a use case with an admin tool where we chose to do this denorm for ACL permission checks to make them extremely fast.

That said, we have an issue with one object that has too many children (30,000), so when someone gives a user access to this one object with 30,000 children, we end up with a bad 60-second wait, and users ended up getting frustrated and trying to cancel (our plan, since admin activity hardly ever happens, is to do it on a background thread and return immediately to the user, telling him his changes will take effect in 1 minute). After all, admin changes are infrequent anyway. This example demonstrates how sometimes it can almost burn you.

I guess my real point is that it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change, as we still LOVE the performance of our ACL checks.

Ps. 30,000 writes in Cassandra is not cheap when done from one server ;), but in general parallelized writes are very fast for something like 500.

Later,
Dean

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Sunday, January 27, 2013 5:50 PM
To: user@cassandra.apache.org
Subject: Re: Denormalization

One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially you re-write everything on all changes.

On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck <fredrik.l.stigb...@sitevision.se> wrote:

Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g., if we have a USER cf with name, email etc. and denormalize user data into many other CFs, and then update the information about a user (name, email...), what is the best way to handle updating those user data properties, which might be spread out over many CFs and many rows?

Regards
/Fredrik