RE: data modeling from batch_mutate point of view

DE VITO Dominique Tue, 09 Apr 2013 07:27:35 -0700

Thanks Aaron.

It helped.


Let me rephrase my question a little. It is about the impact of data modeling
on the benefits of "batch_mutate".

I have one CF for storing data, and ~10 (all different) CFs used for indexing 
that data.

When adding a piece of data, I need to add index entries too, which means
adding columns to one row in each of the 10 indexing CFs. Two main designs are
possible for adding these new index entries:

a) the 10 updated rows of the indexing CFs all have different rowkeys
b) the 10 updated rows of the indexing CFs all have the same rowkey

AFAIK, this affects batch_mutate:

a) the "batch_mutate" advantage stops at the coordinator node: it only applies
to the "client<=>coordinator node" communication.
b) the "batch_mutate" advantage goes further: it applies to the
"client<=>coordinator node" communication __and__ to the "coordinator
node<=>replicas" communication.

So, to summarize:

a) the CFs repeat little data (good), but the coordinator node needs to
communicate with different replicas, depending on the different rowkeys.
b) the CFs are more denormalized, repeating some data again and again in
composite columns, but batch_mutate performs better (good) all the way to the
replicas, and not only up to the coordinator node.

Each option has one pro and one con.
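
To make the two options concrete, here is a minimal sketch in plain Python. It
does not call any Cassandra client: nested dicts stand in for the
batch_mutate mutation map (rowkey -> CF -> list of mutations), and the CF
names, ids and values are all made up for illustration. It only counts how many
distinct rowkeys one logical write touches, which is what decides how many
replica sets the coordinator has to contact.

```python
# Sketch only: plain dicts stand in for the batch_mutate mutation map
# (rowkey -> column family -> list of mutations). All names are hypothetical.
from collections import defaultdict

INDEX_CFS = ["idx_%d" % i for i in range(10)]  # 10 hypothetical indexing CFs


def build_mutation_map(rowkey_for_cf, column_for_cf):
    """Group one new column per indexing CF by rowkey."""
    mutation_map = defaultdict(lambda: defaultdict(list))
    for cf in INDEX_CFS:
        mutation_map[rowkey_for_cf(cf)][cf].append(column_for_cf(cf))
    return mutation_map


folder_id, file_id, ts = "folder42", "file7", 1365500000

# Option a: the rowkey contains the indexed value, so each CF gets its own rowkey.
option_a = build_mutation_map(
    rowkey_for_cf=lambda cf: (folder_id, "value_for_" + cf),
    column_for_cf=lambda cf: (ts, file_id),
)

# Option b: the rowkey is the folder only; the indexed value moves into the colname.
option_b = build_mutation_map(
    rowkey_for_cf=lambda cf: folder_id,
    column_for_cf=lambda cf: ("value_for_" + cf, ts, file_id),
)

print(len(option_a), "distinct rowkeys in option a")
print(len(option_b), "distinct rowkey in option b")
```

In this toy example option a touches 10 rowkeys and option b touches 1, for the
same 10 new columns.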

Is there any experience out there with such data modeling (option a vs 
option b) from the batch_mutate perspective?
Thanks.

Dominique



From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Tuesday, 9 April 2013 10:12
To: user@cassandra.apache.org
Subject: Re: data modeling from batch_mutate point of view

So, one alternative design for indexing CF could be:
rowkey = folder_id
colname = (indexed value, timestamp, file_id)
colvalue = ""

If you always search within a folder, what about:
rowkey = <folder_id : property_name : property_value>
colname = <file_id>

(That's closer to how secondary indexes work in Cassandra, with the addition of
the folder_id.)
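
A minimal sketch of that layout in plain Python, with no Cassandra client
involved. The helper names, the ":" separator and the sample values are
assumptions for illustration only; a real key would need a separator (or a
composite type) that cannot appear inside the values.

```python
# Sketch only: builds (rowkey, colname) pairs for the suggested index layout.
# Helper names, separator and sample data are hypothetical.

def index_row_key(folder_id, property_name, property_value, sep=":"):
    """rowkey = <folder_id : property_name : property_value>"""
    return sep.join([folder_id, property_name, property_value])


def index_entries(folder_id, file_id, properties):
    """Return one (rowkey, colname) pair per indexed property of a file."""
    return [
        (index_row_key(folder_id, name, str(value)), file_id)
        for name, value in sorted(properties.items())
    ]


entries = index_entries("folder42", "file7", {"note": 5, "status": "draft"})
for rowkey, colname in entries:
    print(rowkey, "->", colname)
```

Each indexed property lands in its own row, so the rows spread across the
cluster, and all of the pairs can still be sent in one batch mutation.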

Given the pro and the con, is the alternative design more or less attractive?
IMHO it's normally better to spread the rows and consider how they grow over 
time.
You can send updates for multiple rows in the same batch mutation.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 9/04/2013, at 3:57 AM, DE VITO Dominique 
<dominique.dev...@thalesgroup.com> wrote:


Hi,

My use case amounts to storing data associated with files. I store that data
in the following CF:
rowkey = (folder_id, file_id)
colname = property name (about the file corresponding to file_id)
colvalue = property value

And I have CFs for "manual" indexing:
rowkey = (folder_id, indexed value)
colname = (timestamp, file_id)
colvalue = ""

for example:
rowkey = (folder_id, note_of_5) or (folder_id, some_status)
colname = (some_date, some_filename)
colvalue = ""

I have many CFs for indexing, as I index according to different (file) 
properties.

So, one alternative design for indexing CF could be:
rowkey = folder_id
colname = (indexed value, timestamp, file_id)
colvalue = ""

Alternative design:
* pro: same rowkey for all indexing CFs => **all** indexing CFs could be
updated through one batch_mutate
* con: repeating the "indexed value" (first colname part) again and again (= a
string of up to 20 characters)
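
A rough back-of-the-envelope sketch of that con, in plain Python. It only
counts the raw bytes of the repeated indexed value and ignores everything else
(per-column overhead, compression, how the storage engine actually lays out
rows), so treat the numbers as an order of magnitude. The function name and
the sample figures are made up for illustration.

```python
# Sketch only: raw bytes spent repeating the indexed value in every colname.
# Ignores per-column overhead and compression; names and figures are hypothetical.

def repeated_value_bytes(num_columns, indexed_value):
    """Bytes used if indexed_value is stored once per column."""
    return num_columns * len(indexed_value.encode("utf-8"))


worst_case_value = "x" * 20  # a string of up to 20 characters
extra = repeated_value_bytes(1000000, worst_case_value)
print(extra, "bytes for 1,000,000 index columns")
```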

Given the pro and the con, is the alternative design more or less attractive?

Thanks.

Dominique
