Re: Correct way to design a cassandra database
I you have a way to partition tables, relational can be ok. Thing of a business that has trillions of clients as customers and clients have a whole slew of things they are related to. Partitioning by client can be a good way to go. Here are some patterns we have seen in nosql and perhaps they can help your situation…. https://github.com/deanhiller/playorm/wiki/Patterns-Page Later, Dean From: David Mohl d...@dave.cxmailto:d...@dave.cx Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Friday, December 21, 2012 4:49 AM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: Correct way to design a cassandra database Hello! I've recently started learning cassandra but still have troubles understanding the best way to design a cassandra database. I've posted my question already on stackoverflow but because this would very likely result in a discussion, it got closed. Orginal question here: http://stackoverflow.com/questions/13975868/correct-way-to-design-a-cassandra-database Assuming you have 3 types of objects: User, Photo and Album. Obviously a photo belongs to a user and can be part of a album. For querying, assume we just want to order by last goes first. Paging by 10 elements should be possible. Would you go like every document has all the informations needed for a correct output. Something like this: -- User | -- Name | -- ... | -- Photos | -- Photoname | -- Uploaded at Or go a more relational way (while having a secondary index on the belongs_to columns: -- User (userid is the row key) | -- Name | -- ... -- Photoid | -- belongs_to (userid) | -- belongs_to_album (albumid) | -- ... -- Albumid | -- belongs_to (userid) | -- ... Another way that came in my mind would be kind of a mix: -- User | -- Name | -- ... | -- Photoids (e.g. 1,2,3,4,5) | -- Albumids (e.g. 1,2,3,4,5) -- Photoid (photoid is the row key) | -- Name | -- Uploaded at | -- ... -- Albumid (albumid is the row key) | -- Name | -- Photoids (e.g. 1,2,3,4,5) | -- ... When using a random partitioner, the last example would be (IMO) the way to go. I can query the user object (out of a session id or something) and would get all the row keys I need for fetching photo / album data. However this would result in veeery large columns. Another down point would be inconsistency and identification problems. A photo (or a album) could not be identified by the row itself. Example: If I fetch a photo with ID 3456, I don't know in which albums it is part nor which user owns it. Adding this kind of information would result in a fairly large stack of points I have to alter on creation / update. The second example has all the information needed. However, if I want to fetch all photos that are part of album x, I have to query by a secondary index that COULD contain millions of entries over the whole cluster. And I guess I can forget the random partitioner on this example. Am I thinking to relational? It'd be great to hear some other opinions on this topic --- David
Re: Correct way to design a cassandra database
. On Fri, Dec 21, 2012 at 5:07 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I you have a way to partition tables, relational can be ok. Thing of a business that has trillions of clients as customers and clients have a whole slew of things they are related to. Partitioning by client can be a good way to go. Here are some patterns we have seen in nosql and perhaps they can help your situation…. https://github.com/deanhiller/playorm/wiki/Patterns-Page Later, Dean From: David Mohl d...@dave.cxmailto:d...@dave.cx Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Friday, December 21, 2012 4:49 AM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: Correct way to design a cassandra database Hello! I've recently started learning cassandra but still have troubles understanding the best way to design a cassandra database. I've posted my question already on stackoverflow but because this would very likely result in a discussion, it got closed. Orginal question here: http://stackoverflow.com/questions/13975868/correct-way-to-design-a-cassandra-database Assuming you have 3 types of objects: User, Photo and Album. Obviously a photo belongs to a user and can be part of a album. For querying, assume we just want to order by last goes first. Paging by 10 elements should be possible. Would you go like every document has all the informations needed for a correct output. Something like this: -- User | -- Name | -- ... | -- Photos | -- Photoname | -- Uploaded at Or go a more relational way (while having a secondary index on the belongs_to columns: -- User (userid is the row key) | -- Name | -- ... -- Photoid | -- belongs_to (userid) | -- belongs_to_album (albumid) | -- ... -- Albumid | -- belongs_to (userid) | -- ... Another way that came in my mind would be kind of a mix: -- User | -- Name | -- ... | -- Photoids (e.g. 1,2,3,4,5) | -- Albumids (e.g. 1,2,3,4,5) -- Photoid (photoid is the row key) | -- Name | -- Uploaded at | -- ... -- Albumid (albumid is the row key) | -- Name | -- Photoids (e.g. 1,2,3,4,5) | -- ... When using a random partitioner, the last example would be (IMO) the way to go. I can query the user object (out of a session id or something) and would get all the row keys I need for fetching photo / album data. However this would result in veeery large columns. Another down point would be inconsistency and identification problems. A photo (or a album) could not be identified by the row itself. Example: If I fetch a photo with ID 3456, I don't know in which albums it is part nor which user owns it. Adding this kind of information would result in a fairly large stack of points I have to alter on creation / update. The second example has all the information needed. However, if I want to fetch all photos that are part of album x, I have to query by a secondary index that COULD contain millions of entries over the whole cluster. And I guess I can forget the random partitioner on this example. Am I thinking to relational? It'd be great to hear some other opinions on this topic --- David
Re: Correct way to design a cassandra database
Ok.. So here is my latest thinking... Including that index: CREATE TABLE Users ( user_name text, password text, PRIMARY KEY (user_name) ); ^ Same as before CREATE TABLE Photos( user_name text, photo_id uuid, created_time timestamp, data text, PRIMARY KEY (user_name, photo_id, created_time) ) WITH CLUSTERING ORDER BY (created_time DESC); ^ Note the addition of a photo id and using that in the PK def with the created_time Data is a JSON like this: { thumbnail: url, standard_resolution:url } CREATE TABLE PhotosAlbums ( user_name text, album_name text, poster_image_url text, data text PRIMARY KEY (user_name, album_name) ); ^ Same as before, data represents a JSON array of the photos: [{photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}] CREATE TABLE PhotosAlbumsIndex ( user_name text, photo_id uuid, album_name text, created_time timestamp PRIMARY KEY (user_name, photo_id, album_name) ); The create_time column here is because you need to have at least 1 column that is not part of the PK. Or that's what it looks like in my quick test. ^ Each photo added to an album needs to be added to this index row As before, your application will need to keep the order of the array in tact as your users modify the order of things. Now however if they delete a photo you need to fetch the PhotoAlbums the photo existed in and update them accordingly: SELECT * FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid This should return to you all of the albums that the photo was a part of. Now you need to: SELECT * FROM PhotosAlbums where user_name = the_user and album_name IN (name1, name2, name3 ) name1,2,3 are the album names you selected from the PhotosAlbumsIndex query So now you have all of the photo albums, you would then iterate over those in your application, deserializing the JSON data locating the photo ID was was removed and taking it out of the array, then reserializing to JSON and updating the record. When that is complete you need to remove the Photo from the PhotosAlbumsIndex. Now there is where I het stuck a little.. because this will fail: DELETE FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid; It seems to want the album name as well since it's part of the PK. Admittedly, I don't know how to get around that and just delete everything where the first 2 components of the PK are true. You would already possess the list of album names though, so it could be BATCH that you need to perform for the deletes, specifying 1 delete per album_name. Anyway, that's my current thinking. I would love to know if it's possible to get around the DELETE issue another way. On Fri, Dec 21, 2012 at 7:15 AM, Adam Venturella aventure...@gmail.comwrote: Hmmm it just occurred to me that in my examples, there is no convenient way to delete a photo and also remove that photo from the albums it is a part of. As it stands, you would need to iterate over all of the users albums to locate the photo and remove it; that's no good. Probably need another table that holds just the photo / album identifiers, an index. So when the user deletes a photo, you ask the index which albums that photo belongs too and just fetch those to update the album with that photo removed. :: mobile emails :: On Dec 21, 2012, at 3:50, David Mohl d...@dave.cx wrote: Hello! I've recently started learning cassandra but still have troubles understanding the best way to design a cassandra database. I've posted my question already on stackoverflow but because this would very likely result in a discussion, it got closed. Orginal question here: http://stackoverflow.com/questions/13975868/correct-way-to-design-a-cassandra-database Assuming you have 3 types of objects: User, Photo and Album. Obviously a photo belongs to a user and can be part of a album. For querying, assume we just want to order by last goes first. Paging by 10 elements should be possible. Would you go like every document has all the informations needed for a correct output. Something like this: -- User | -- Name | -- ... | -- Photos | -- Photoname | -- Uploaded at Or go a more relational way (while having a secondary index on the belongs_to columns: -- User (userid is the row key) | -- Name | -- ... -- Photoid | -- belongs_to (userid) | -- belongs_to_album (albumid) | -- ... -- Albumid | -- belongs_to (userid) | -- ... Another way that came in my mind would be kind of a mix: -- User | -- Name | -- ... | -- Photoids (e.g. 1,2,3,4,5) | -- Albumids (e.g. 1,2,3,4,5) -- Photoid (photoid is the row key
Re: Correct way to design a cassandra database
One more link that might be helpful. It's a similar system to photo's but instead of Photos/Albums it's Songs/Playlists: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts. It's not exactly 1:1 but it covers related concepts in making it work. On Fri, Dec 21, 2012 at 8:02 AM, Adam Venturella aventure...@gmail.comwrote: Ok.. So here is my latest thinking... Including that index: CREATE TABLE Users ( user_name text, password text, PRIMARY KEY (user_name) ); ^ Same as before CREATE TABLE Photos( user_name text, photo_id uuid, created_time timestamp, data text, PRIMARY KEY (user_name, photo_id, created_time) ) WITH CLUSTERING ORDER BY (created_time DESC); ^ Note the addition of a photo id and using that in the PK def with the created_time Data is a JSON like this: { thumbnail: url, standard_resolution:url } CREATE TABLE PhotosAlbums ( user_name text, album_name text, poster_image_url text, data text PRIMARY KEY (user_name, album_name) ); ^ Same as before, data represents a JSON array of the photos: [{photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}] CREATE TABLE PhotosAlbumsIndex ( user_name text, photo_id uuid, album_name text, created_time timestamp PRIMARY KEY (user_name, photo_id, album_name) ); The create_time column here is because you need to have at least 1 column that is not part of the PK. Or that's what it looks like in my quick test. ^ Each photo added to an album needs to be added to this index row As before, your application will need to keep the order of the array in tact as your users modify the order of things. Now however if they delete a photo you need to fetch the PhotoAlbums the photo existed in and update them accordingly: SELECT * FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid This should return to you all of the albums that the photo was a part of. Now you need to: SELECT * FROM PhotosAlbums where user_name = the_user and album_name IN (name1, name2, name3 ) name1,2,3 are the album names you selected from the PhotosAlbumsIndex query So now you have all of the photo albums, you would then iterate over those in your application, deserializing the JSON data locating the photo ID was was removed and taking it out of the array, then reserializing to JSON and updating the record. When that is complete you need to remove the Photo from the PhotosAlbumsIndex. Now there is where I het stuck a little.. because this will fail: DELETE FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid; It seems to want the album name as well since it's part of the PK. Admittedly, I don't know how to get around that and just delete everything where the first 2 components of the PK are true. You would already possess the list of album names though, so it could be BATCH that you need to perform for the deletes, specifying 1 delete per album_name. Anyway, that's my current thinking. I would love to know if it's possible to get around the DELETE issue another way. On Fri, Dec 21, 2012 at 7:15 AM, Adam Venturella aventure...@gmail.comwrote: Hmmm it just occurred to me that in my examples, there is no convenient way to delete a photo and also remove that photo from the albums it is a part of. As it stands, you would need to iterate over all of the users albums to locate the photo and remove it; that's no good. Probably need another table that holds just the photo / album identifiers, an index. So when the user deletes a photo, you ask the index which albums that photo belongs too and just fetch those to update the album with that photo removed. :: mobile emails :: On Dec 21, 2012, at 3:50, David Mohl d...@dave.cx wrote: Hello! I've recently started learning cassandra but still have troubles understanding the best way to design a cassandra database. I've posted my question already on stackoverflow but because this would very likely result in a discussion, it got closed. Orginal question here: http://stackoverflow.com/questions/13975868/correct-way-to-design-a-cassandra-database Assuming you have 3 types of objects: User, Photo and Album. Obviously a photo belongs to a user and can be part of a album. For querying, assume we just want to order by last goes first. Paging by 10 elements should be possible. Would you go like every document has all the informations needed for a correct output. Something like this: -- User | -- Name | -- ... | -- Photos | -- Photoname | -- Uploaded at Or go a more relational way (while having a secondary index on the belongs_to columns: -- User (userid is the row key) | -- Name
Re: Correct way to design a cassandra database
You could store the order as the first part of a composite string say first picture as A and second as B. To insert one between call it AA. If you shuffle alot the strings could get really long. Might be better to store the order in a separate column. Neither solution mentioned deals with concurrent access well. On Friday, December 21, 2012, Adam Venturella aventure...@gmail.com wrote: One more link that might be helpful. It's a similar system to photo's but instead of Photos/Albums it's Songs/Playlists: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts. It's not exactly 1:1 but it covers related concepts in making it work. On Fri, Dec 21, 2012 at 8:02 AM, Adam Venturella aventure...@gmail.com wrote: Ok.. So here is my latest thinking... Including that index: CREATE TABLE Users ( user_name text, password text, PRIMARY KEY (user_name) ); ^ Same as before CREATE TABLE Photos( user_name text, photo_id uuid, created_time timestamp, data text, PRIMARY KEY (user_name, photo_id, created_time) ) WITH CLUSTERING ORDER BY (created_time DESC); ^ Note the addition of a photo id and using that in the PK def with the created_time Data is a JSON like this: { thumbnail: url, standard_resolution:url } CREATE TABLE PhotosAlbums ( user_name text, album_name text, poster_image_url text, data text PRIMARY KEY (user_name, album_name) ); ^ Same as before, data represents a JSON array of the photos: [{photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}] CREATE TABLE PhotosAlbumsIndex ( user_name text, photo_id uuid, album_name text, created_time timestamp PRIMARY KEY (user_name, photo_id, album_name) ); The create_time column here is because you need to have at least 1 column that is not part of the PK. Or that's what it looks like in my quick test. ^ Each photo added to an album needs to be added to this index row As before, your application will need to keep the order of the array in tact as your users modify the order of things. Now however if they delete a photo you need to fetch the PhotoAlbums the photo existed in and update them accordingly: SELECT * FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid This should return to you all of the albums that the photo was a part of. Now you need to: SELECT * FROM PhotosAlbums where user_name = the_user and album_name IN