Re: data modeling from batch_mutate point of view

aaron morton Thu, 11 Apr 2013 03:29:36 -0700

> b) the "batch_mutate" advantages are better, for the communication 
> "client<=>coordinator node" __and__ for the communications "coordinator 
> node<=>replicas".
Yes. A single row mutation can write to many CFs.
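
To make that concrete, here is a rough sketch of the shape a batch_mutate call takes in the Thrift API: a map of row key -> column family name -> list of mutations. It uses plain Python dicts and tuples as stand-ins for the Thrift Mutation objects, and the CF names, keys, and values are all made up for illustration. With option b the whole update sits under one row key, with option a it is split over ten.

```python
from collections import defaultdict


def build_mutation_map(writes):
    """Group (row_key, cf_name, column_name, value) tuples into the nested
    row key -> column family -> list-of-columns shape used by batch_mutate."""
    mutation_map = defaultdict(lambda: defaultdict(list))
    for row_key, cf_name, column_name, value in writes:
        mutation_map[row_key][cf_name].append((column_name, value))
    # Convert to plain dicts so the result is easy to inspect.
    return {key: dict(cfs) for key, cfs in mutation_map.items()}


# Hypothetical index CF names, for illustration only.
INDEX_CFS = ["idx_%d" % i for i in range(10)]

# Option b: every index CF is updated under the same row key.
option_b = build_mutation_map(
    [("folder42", cf, ("some_value", 1365676176, "file7"), "") for cf in INDEX_CFS]
)

# Option a: each index CF is updated under its own row key.
option_a = build_mutation_map(
    [("folder42:%s:some_value" % cf, cf, (1365676176, "file7"), "") for cf in INDEX_CFS]
)

print(len(option_b))  # 1 row key, covering 10 CFs
print(len(option_a))  # 10 row keys, one CF each
```

Both maps can still be sent in a single batch_mutate call; the difference is only in how many distinct row keys the coordinator has to route.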


> Is there any experience out there about such data modeling (option_a vs 
> option_b) from the batch_mutate perspective ?
> Thanks.
I would not worry about the internal network lag as much as creating hot rows 
in the model. Sometimes it makes sense for an entity to map to rows in several 
CFs that use the same key, e.g. user info or a blog post. However, it is 
normally bad when many entities require storing data on the same row, e.g. all 
blog posts have to update one row. 

From my understanding of what you are doing I would look to spread out the 
index entries to use different row keys. If the indexes are small you may get 
away with using the same key, but I would start with spreading it out. 
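
As a small illustration of the difference, the sketch below counts how many index writes land on each row key under the two key schemes. The ":" separator, the function names, and the sample entries are assumptions made up for the example, not anything from your schema.

```python
from collections import Counter


def spread_key(folder_id, property_name, property_value):
    # Spread-out scheme: one row per (folder, property, value).
    return "%s:%s:%s" % (folder_id, property_name, property_value)


def shared_key(folder_id, property_name, property_value):
    # Shared scheme: every index entry for a folder lands on the same key.
    return folder_id


def writes_per_row(key_fn, entries):
    """Count how many index writes hit each row key."""
    return Counter(key_fn(*entry) for entry in entries)


# Hypothetical index entries: (folder_id, property_name, property_value).
entries = [
    ("folder1", "status", "draft"),
    ("folder1", "status", "final"),
    ("folder1", "note", "5"),
    ("folder1", "note", "3"),
    ("folder2", "status", "draft"),
]

spread = writes_per_row(spread_key, entries)
shared = writes_per_row(shared_key, entries)

print(max(spread.values()))  # 1: no row takes more than one write
print(max(shared.values()))  # 4: folder1 takes most of the writes
```

The same count run over a realistic sample of your data would show whether the shared key stays small enough to get away with.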

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:27 AM, DE VITO Dominique <dominique.dev...@thalesgroup.com> 
wrote:

> Thanks Aaron.
>  
> It helped.
>  
> Let me rephrase my questions a little. They are about the impact of data 
> modeling on the "batch_mutate" advantages.
>  
> I have one CF for storing data, and ~10 (all different) CF used for indexing 
> that data.
>  
> When adding a piece of data, I need to add indexes too, so I need to add 
> columns to one row in each of the 10 indexing CFs => 2 main designs are 
> possible for adding these new indexes.
>  
> a) all the updated 10 rows of indexing CF have different rowkeys
> b) all the updated 10 rows of indexing CF have all the same rowkey
>  
> AFAIK, this has an effect on batch_mutate:
>  
> a) the "batch_mutate" advantages stop at the coordinator node. The advantage 
> appears for the communication "client<=>coordinator node"
> b) the "batch_mutate" advantages are better, for the communication 
> "client<=>coordinator node" __and__ for the communications "coordinator 
> node<=>replicas".
>  
> So, to summarize:
>  
> a) CFs with little repeated data (good), but the coordinator node needs to 
> communicate with different replicas for the different rowkeys
> b) CFs with more denormalization, repeating some data again and again across 
> composite columns, but batch_mutate performs better (good) all the way to 
> the replicas, and not only up to the coordinator node.
>  
> Each option has one pro and one con.
>  
> Is there any experience out there about such data modeling (option_a vs 
> option_b) from the batch_mutate perspective ?
> Thanks.
>  
> Dominique
>  
>  
>  
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Tuesday, 9 April 2013 10:12
> To: user@cassandra.apache.org
> Subject: Re: data modeling from batch_mutate point of view
>  
> So, one alternative design for indexing CF could be:
> rowkey = folder_id
> colname = (indexed value, timestamp, file_id)
> colvalue = ""
>  
> If you always search in a folder what about 
> rowkey = <folder_id : property_name : property_value>
> colname = <file_id>
>  
> (That's closer to secondary indexes in cassandra with the addition of the 
> folder_id)
>  
> According to pro vs con, is the alternative design more or less interesting ?
> IMHO it's normally better to spread the rows and consider how they grow over 
> time. 
> You can send updates for multiple rows in the same batch mutation. 
>  
> Hope that helps. 
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>  
> @aaronmorton
> http://www.thelastpickle.com
>  
> On 9/04/2013, at 3:57 AM, DE VITO Dominique 
> <dominique.dev...@thalesgroup.com> wrote:
> 
> 
> Hi,
>  
> I have a use case that sounds like storing data associated with files. So, I 
> store them with the CF:
> rowkey = (folder_id, file_id)
> colname = property name (about the file corresponding to file_id)
> colvalue = property value
>  
> And I have CF for "manual" indexing:
> rowkey = (folder_id, indexed value)
> colname = (timestamp, file_id)
> colvalue = ""
>  
> like
> rowkey = (folder_id, note_of_5) or (folder_id, some_status)
> colname = (some_date, some_filename)
> colvalue = ""
>  
> I have many CF for indexing, as I index according to different (file) 
> properties.
>  
> So, one alternative design for indexing CF could be:
> rowkey = folder_id
> colname = (indexed value, timestamp, file_id)
> colvalue = ""
>  
> Alternative design:
> * pro: same rowkey for all indexing CFs => **all** indexing CFs could be 
> updated through one batch_mutate
> * con: repeating the "indexed value" (1st colname part) again and again (= a 
> string of up to 20 characters)
>  
> According to pro vs con, is the alternative design more or less interesting ?
>  
> Thanks.
>  
> Dominique
>  
>
