1122: {
gender: MALE
birthdate: 1987.11.09
name: Alfred Tester
pwd: e72c504dc16c8fcd2fe8c74bb492affa
alias1: [email protected]
alias2: [email protected]
alias3: [email protected]
}
...and you can use secondary indexes to query on anything.
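To make the idea concrete, here is a rough in-memory sketch in Python of a row keyed by customer ID with an index on each alias column. It is not Cassandra client code: the IndexedStore class, its methods, and the example.com logins are all made up for illustration. Note that with three separate alias columns, a login lookup may have to try each index in turn.

```python
from collections import defaultdict

ALIAS_COLUMNS = ("alias1", "alias2", "alias3")  # column names from the row above


class IndexedStore:
    """Toy model (hypothetical): rows keyed by customer ID plus per-column indexes."""

    def __init__(self, indexed_columns):
        self.rows = {}  # customer ID -> {column name: value}
        self.indexes = {col: defaultdict(set) for col in indexed_columns}

    def put(self, key, columns):
        """Store a row and keep every index in step with it."""
        old = self.rows.get(key, {})
        for col, index in self.indexes.items():
            if col in old:
                index[old[col]].discard(key)  # drop the stale index entry
            if col in columns:
                index[columns[col]].add(key)
        self.rows[key] = dict(columns)

    def find_by(self, column, value):
        """Return all rows whose indexed column equals value."""
        keys = self.indexes[column].get(value, ())
        return [self.rows[k] for k in sorted(keys)]

    def find_by_login(self, login):
        """Try each indexed alias column until one matches."""
        for col in self.indexes:
            hits = self.find_by(col, login)
            if hits:
                return hits[0]
        return None


store = IndexedStore(ALIAS_COLUMNS)
store.put(1122, {
    "gender": "MALE",
    "birthdate": "1987.11.09",
    "name": "Alfred Tester",
    "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
    "alias1": "alfred@example.com",  # placeholder logins, not from the thread
    "alias2": "alf@example.com",
    "alias3": "tester@example.com",
})
user = store.find_by_login("alf@example.com")
```

The customer data is stored once, so a password change touches a single row; the cost moves to the index lookups on the read path.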
Maxim
On 11/17/2011 4:08 PM, Maciej Miklas wrote:
Hello all,
I need your help designing a data structure for a simple login service.
It holds about 100,000,000 customers, and each one can have about 10
different logins - this results in 1,000,000,000 different logins.
Each customer has the following data:
- one to many login names as string, max 20 UTF-8 characters long
- ID as long - one customer has only one ID
- gender
- birth date
- name
- password as MD5
The login process needs to find a user by login name.
Data in Cassandra is replicated - this is necessary to obtain all
required login data in a single call. We also expect low write
traffic and heavy read traffic, so round trips for reading data should
be avoided.
Below I've described two possible Cassandra data models based on an
example: we have two users, the first user has three logins and the
second user has two logins
A) Skinny rows
- the row key contains the login name - this is the main search criterion
- login data is replicated - each possible login is stored as a single
row which contains all user data - 10 logins for a single customer
create 10 rows, where each row has a different key and the same content
// the first 3 rows have different keys and the same replicated data
[email protected] {
id: 1122
gender: MALE
birthdate: 1987.11.09
name: Alfred Tester
pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
[email protected] {
id: 1122
gender: MALE
birthdate: 1987.11.09
name: Alfred Tester
pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
[email protected] {
id: 1122
gender: MALE
birthdate: 1987.11.09
name: Alfred Tester
pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
// the following two rows again hold the same data, for the second customer
[email protected] {
id: 1133
gender: MALE
birthdate: 1997.02.01
name: Manfredus Maximus
pwd: e44c504ff16c8fcd2fe8c74bb492adda
},
[email protected] {
id: 1133
gender: MALE
birthdate: 1997.02.01
name: Manfredus Maximus
pwd: e44c504ff16c8fcd2fe8c74bb492adda
}
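As a rough illustration of model A, the sketch below uses a plain Python dict in place of the column family. The helper names and the example.com logins are hypothetical. It shows the trade-off: a login is a single key lookup, but every change to the customer data has to be written once per login.

```python
def put_customer(rows, logins, data):
    """Write one full copy of the customer data under every login name."""
    for login in logins:
        rows[login] = dict(data)


def find_by_login(rows, login):
    """Single key lookup: all login data comes back in one call."""
    return rows.get(login)


rows = {}  # stand-in for the column family: login name -> customer columns
alfred = {
    "id": 1122,
    "gender": "MALE",
    "birthdate": "1987.11.09",
    "name": "Alfred Tester",
    "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
}
# Placeholder logins, not taken from the thread.
alfred_logins = ["alfred@example.com", "alf@example.com", "tester@example.com"]
put_customer(rows, alfred_logins, alfred)

# A password change has to rewrite every replicated row. The caller must
# already know all logins of the customer, which this model does not store.
put_customer(rows, alfred_logins, dict(alfred, pwd="0" * 32))
```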
B) Rows grouped by alphabetical prefix
- The number of rows is limited - for example, one row per first letter
of the login name
- Each row contains all logins which begin with the row key - the row
with key 'a' contains all logins which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows - this might have
a positive performance impact (??)
- To avoid super columns, each row contains the columns directly: the
column name is the user login and the column value is the corresponding
data in a kind of serialized form (I would like to keep it human readable)
a {
[email protected]:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
[email protected]@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
[email protected]@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
},
m {
[email protected]:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
},
r {
[email protected]:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
}
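Model B can be sketched the same way, again with a plain Python dict standing in for the column family. The function names and the example.com logins are hypothetical; the field order follows the serialized strings above. One assumption to flag: splitting on ';' only works as long as no field (for instance the name) contains a semicolon.

```python
FIELDS = ("id", "gender", "birthdate", "name", "pwd")  # order used in the example


def serialize(data):
    """Join the fields into the human-readable 'id;gender;...' form."""
    return ";".join(str(data[f]) for f in FIELDS)


def deserialize(value):
    """Split the string back into fields; every value comes back as a string."""
    return dict(zip(FIELDS, value.split(";")))


def row_key(login):
    """The row key is the first letter of the login name."""
    return login[0].lower()


def put_login(rows, login, data):
    rows.setdefault(row_key(login), {})[login] = serialize(data)


def find_login(rows, login):
    value = rows.get(row_key(login), {}).get(login)
    return deserialize(value) if value is not None else None


rows = {}  # row key (letter) -> {login name: serialized customer data}
manfred = {
    "id": 1133,
    "gender": "MALE",
    "birthdate": "1997.02.01",
    "name": "Manfredus Maximus",
    "pwd": "e44c504ff16c8fcd2fe8c74bb492adda",
}
# Placeholder logins, not taken from the thread.
for login in ("manfred@example.com", "rex@example.com"):
    put_login(rows, login, manfred)

found = find_login(rows, "rex@example.com")
```

With first letters as row keys there are only a few dozen rows for 1,000,000,000 logins, so each row would carry a very large number of columns and the load would follow the letter distribution of the login names.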
Which solution is better, especially for read performance? Do
you have a better idea?
Thanks,
Maciej