Re: Re: Data model storage optimization

2018-07-30 Thread James Shaw
Things to consider:
whether the row size is large
whether it is updated a lot (in Cassandra an update is actually an insert)
whether the workload is read-heavy
overall read performance

If the row size is large, you may consider a separate user_detail table: add
an id column to all tables and merge/join by id on the application side. But
you pay a read price: a second query against user_detail.
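
For example, a minimal CQL sketch of that layout, reusing the thread's
user_by_name table plus a hypothetical user_detail table (column types are
guesses, not from the thread):

-- Keep only the keys and an id in the per-key table...
CREATE TABLE user_by_name (
    time_bucket text,
    username text,
    ts timestamp,
    id uuid,                 -- reference into user_detail
    PRIMARY KEY ((time_bucket, username), ts)
);

-- ...and move the wide columns into one shared table.
CREATE TABLE user_detail (
    id uuid PRIMARY KEY,
    request text,
    email text
);

-- The first query finds the ids; the second (the extra read price) fetches
-- the wide columns. The merge/join happens in the application.
SELECT id FROM user_by_name WHERE time_bucket = ? AND username = ?;
SELECT request, email FROM user_detail WHERE id = ?;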

Just my 2 cents, hope it helps.

Thanks,

James


On Sun, Jul 29, 2018 at 11:20 PM, onmstester onmstester wrote:

>
> How many rows on average per partition?
>
> Around 10K.
>
>
> Let me get this straight: you are bifurcating your partitions on either
> email or username, potentially doubling the data, because you don't have a
> way to manage a central system of record for users?
>
> We are just analyzing the output logs of a "perfectly" running
> application, so no one will let me change its data design. I thought this
> might be a more general problem for Cassandra users, where someone both:
> 1. needed to access an identical set of columns by multiple keys (all the
> keys must be present in each row), and
> 2. had a storage limit (TTL * input rate would amount to some TBs).
> I know there is a strict rule in Cassandra data modeling, "never use
> foreign keys; sacrifice disk instead", but has anyone ever been forced to
> do such a thing, and how?
>
>


Fwd: Re: Data model storage optimization

2018-07-29 Thread onmstester onmstester
> How many rows on average per partition?

Around 10K.

> Let me get this straight: you are bifurcating your partitions on either
> email or username, potentially doubling the data, because you don't have a
> way to manage a central system of record for users?

We are just analyzing the output logs of a "perfectly" running application,
so no one will let me change its data design. I thought this might be a more
general problem for Cassandra users, where someone both:
1. needed to access an identical set of columns by multiple keys (all the
keys must be present in each row), and
2. had a storage limit (TTL * input rate would amount to some TBs).
I know there is a strict rule in Cassandra data modeling, "never use foreign
keys; sacrifice disk instead", but has anyone ever been forced to do such a
thing, and how?
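
To make the storage constraint concrete with made-up numbers (the thread
gives none): at 5,000 rows/s and roughly 200 bytes per row under a 30-day
TTL, live data is about 5,000 × 200 B × 2,592,000 s ≈ 2.6 TB per copy
before replication; duplicating every row into a second table doubles that.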

Re: Data model storage optimization

2018-07-29 Thread Rahul Singh
How many rows on average per partition?

Let me get this straight: you are bifurcating your partitions on either
email or username, potentially doubling the data, because you don't have a
way to manage a central system of record for users?

I would do this (my opinion):
Migrate to a single sign-on system that uses one or the other. Map and
migrate your data to use a single record as "identity".
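
A minimal CQL sketch of what such a central system of record could look
like (hypothetical table and column names, not schema from this thread):

-- One row per user identity.
CREATE TABLE user_identity (
    user_id uuid PRIMARY KEY,
    username text,
    email text
);

-- Lookup tables so that either key resolves to the same identity.
CREATE TABLE user_id_by_name (
    username text PRIMARY KEY,
    user_id uuid
);

CREATE TABLE user_id_by_email (
    email text PRIMARY KEY,
    user_id uuid
);

-- Event data is then stored once, keyed by user_id, instead of being
-- duplicated under both username and email partitions.
CREATE TABLE events_by_user (
    time_bucket text,
    user_id uuid,
    ts timestamp,
    request text,
    PRIMARY KEY ((time_bucket, user_id), ts)
);

The trade-off is one extra lookup (username or email to user_id) on the
read path.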

I know that seems painful, but I _hate_ perpetuating bad design because
someone, in the past, present, or future, chooses not to solve the problem
but to work around it.

This is not a storage optimization problem - it’s a data architecture problem.

Rahul
On Jul 28, 2018, 3:11 AM -0400, onmstester onmstester wrote:
> The current data model is described as table_name:
> ((partition_key), clustering_key), other_column1, other_column2, ...
>
> user_by_name: ((time_bucket, username)), ts, request, email
> user_by_mail: ((time_bucket, email)), ts, request, username
>
> The reason both keys (username, email) are repeated in both tables is that
> there may be different usernames with the same email, or different emails
> with the same username. The queries for this data model are:
> 1. username = X
> 2. mail = Y
> 3. username = X and mail = Y (we query one of the tables and, because the
> result contains a small number of records, filter on the other column)
>
> This data model wastes a lot of storage.
> I thought about using a UUID, hash code, or sequence to handle this, but I
> can't keep track of old vs. new records (the ones that already have a
> UUID).
> Any recommendation on optimizing the data model to save storage?
>
> Sent using Zoho Mail
>
>


Data model storage optimization

2018-07-28 Thread onmstester onmstester
The current data model is described as table_name:
((partition_key), clustering_key), other_column1, other_column2, ...

user_by_name: ((time_bucket, username)), ts, request, email
user_by_mail: ((time_bucket, email)), ts, request, username

The reason both keys (username, email) are repeated in both tables is that
there may be different usernames with the same email, or different emails
with the same username. The queries for this data model are:
1. username = X
2. mail = Y
3. username = X and mail = Y (we query one of the tables and, because the
result contains a small number of records, filter on the other column)

This data model wastes a lot of storage.
I thought about using a UUID, hash code, or sequence to handle this, but I
can't keep track of old vs. new records (the ones that already have a UUID).
Any recommendation on optimizing the data model to save storage?

Sent using Zoho Mail
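
For reference, read as concrete CQL (assuming ts is the clustering column,
which the shorthand does not state, and guessing column types), the model
above would be roughly:

CREATE TABLE user_by_name (
    time_bucket text,
    username text,
    ts timestamp,
    request text,
    email text,
    PRIMARY KEY ((time_bucket, username), ts)
);

CREATE TABLE user_by_mail (
    time_bucket text,
    email text,
    ts timestamp,
    request text,
    username text,
    PRIMARY KEY ((time_bucket, email), ts)
);

-- Query 1 (and symmetrically query 2); time_bucket must also be supplied,
-- or iterated over, because it is part of the partition key:
SELECT * FROM user_by_name WHERE time_bucket = ? AND username = ?;

-- Query 3: read one table and filter the small result on the other column,
-- either in the application or within the partition via ALLOW FILTERING:
SELECT * FROM user_by_name
WHERE time_bucket = ? AND username = ? AND email = ? ALLOW FILTERING;

Every row carries ts, request, and both keys in both tables, which is the
duplication the question is about.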