Column Family per User

2012-04-18 Thread Trevor Francis
Our application has users that can write upwards of 50 million records per 
day. However, they all write the same format of records (20 fields…columns). 
Should I put each user in their own column family, even though the column 
family schema will be the same per user?

Would this help with dimensioning, if each user is querying their keyspace and 
only their keyspace?


Trevor Francis




Re: Column Family per User

2012-04-18 Thread Janne Jalkanen

Each CF takes a fair chunk of memory regardless of how much data it has, so 
this is probably not a good idea if you have lots of users. Also, using a 
single CF means that compression is likely to work better (more redundant data).

However, Cassandra distributes the load across different nodes based on the row 
key, and writes scale roughly linearly with the number of nodes. So you should 
be fine as long as you can make sure that no single row gets overly burdened by 
writes: 50 million writes/day to a single row would always go to the same 
nodes, but that is on the order of 600 writes/second/node, which shouldn't 
really pose a problem, IMHO. The main problem is that if a single row gets lots 
of columns it'll start to slow down at some point, and your row caches become 
less useful, as they cache the entire row.

Keep your rows suitably sized and you should be fine. To partition the data, 
you can either distribute it across a few CFs based on use, or use some other 
distribution method (like a row key of user:1234:00, where the 00 is the 
hour-of-the-day).
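
As a minimal sketch, that hour-of-the-day row key could be built like this in 
Python (the user:1234:00 format is from the paragraph above; the helper itself 
is purely illustrative):

from datetime import datetime

def row_key(user_id, now=None):
    # Append the hour-of-day so one user's writes spread across 24 rows.
    now = now or datetime.utcnow()
    return "user:%s:%02d" % (user_id, now.hour)

# row_key(1234) at 00:xx UTC -> "user:1234:00"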

(There's a great article by Aaron Morton on how wide rows impact performance at 
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, 
running your own tests to determine the optimal setup is recommended.)

/Janne




Re: Column Family per User

2012-04-18 Thread Trevor Francis
Janne,


Of course, I am new to the Cassandra world, so it is taking some getting used 
to, working out how everything translates into my MySQL head.

We are building an enterprise application that will ingest log information and 
provide metrics and trending based upon the data contained in the logs. The 
application is transactional in nature such that a record will be written to a 
log and our system will need to query that record and assign two values to it 
in addition to using the information to develop trending metrics. 

The logs are being fed into Cassandra by Flume.

Each of our users will be assigned their own piece of hardware that generates 
these log events, some of which can peak at up to 2500 transactions per second 
for a couple of hours. The log entries are around 150 bytes each and contain 
around 20 different pieces of information. Neither we nor our users are 
interested in running any queries across the entire database. Users are only 
concerned with the data that their particular piece of hardware generates.

Should I just set up a single column family with 20 columns, the first of which 
is the row key, and make that row key the username of the user?

We would also need probably 2 more columns to store Value A and Value B 
assigned to that particular record.

Our metrics will be something like this: for this particular user, during 
this particular timeframe, what is the average of field X? We would then store 
that value, from which we can generate historical trending over the course of a 
week. We will do this every 15 minutes.

Any suggestions on where I should head to start my journey into Cassandra for 
my particular application?


Trevor Francis





Re: Column Family per User

2012-04-18 Thread Dave Brosius
Your design should be built around how you want to query. If you are only 
querying by user, then having the user as part of the row key makes sense.

To manage row size, you should think of a row as being a bucket of time. 
Cassandra supports a large (but not unbounded) row size. You might say that 
this row is for user fred for the month of April, or if that's too much, 
perhaps the row is for user fred for the day 4/18/12. To do this you can use 
composite keys to hold both pieces of information in the key: (user, bucketpos).

The nice thing is that once the time period has come and gone, that row is 
complete, and you can perform background jobs against that row and store 
summary information for that time period.
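
A tiny illustration of that (user, bucketpos) composite row key in Python; the 
fred example is from above, and the helper itself is hypothetical:

from datetime import date

def day_bucket(user, day=None):
    # One row per user per day, e.g. ("fred", "2012-04-18").
    day = day or date.today()
    return (user, day.isoformat())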

Re: Column Family per User

2012-04-18 Thread Trevor Francis
I am trying to grasp this concept, so let me try a scenario.

Let's say I have 5 data points being captured in the log file. Here would be a 
typical table schema in MySQL:

Id, Username, Time, Wind, Rain, Sunshine

Select * from table; would reveal:

1, george, 2012-04-12T12:22:23.293, 55, 45, 10
2, george, 2012-04-12T12:22:24.293, 45, 25, 25
3, george, 2012-04-12T12:22:25.293, 35, 15, 11
4, george, 2012-04-12T12:22:26.293, 55, 65, 16
5, george, 2012-04-12T12:22:27.293, 12, 5, 22

And it would just continue from there adding rows as log files are imported.

A select * from table where sunshine=16 would yield:

4, george, 2012-04-12T12:22:26.293, 55, 65, 16

  
Now, you are saying that in Cassandra, instead of having a bunch of rows 
containing ordered information (which is what I would have), I would have a 
single row with multiple columns:

George | 2012-04-12T12:22:23.293, Wind=55 | 2012-04-12T12:22:23.293, Rain=45 | 
2012-04-12T12:22:23.293, Sunshine=10 | …continued…

So George would be the row key and the columns would be the actual data. The 
data would be oriented horizontally, vs. vertically (MySQL).

So for instance, log generation in our application isn't linear; it peaks at 
certain times of the day. A user generating 2500 entries/second at peak would 
typically generate 60M log entries per day. Multiply that by 20 data pieces and 
you are looking at 1.2B columns in a given day for that user. Assuming we batch 
the writes every minute, can a node handle this sort of load?

Also, can we rotate the row every day? Would it make more sense to rotate 
hourly? At peak, hourly rotation would decrease the row size to 180M data 
points vs. 1.2B.

At most, we may only have 500 users on our platform. That means that if we did 
hourly row rotation, that would be 12,000 rows per day, each at most 180M 
columns wide.


Am I grasping this concept properly?

Trevor Francis



Re: Column Family per User

2012-04-18 Thread Dave Brosius
Yes, in this Cassandra model time wouldn't be a column value; it would be part 
of the column name. Depending on how you want to access your data (give me all 
data points for time X) and how many separate datapoints you have for time X, 
you might consider packing all the data for a time in one column through 
composite columns:

column name: 2012-04-12T12:22:23.293/55/45/10

(where / is a human-readable representation of the composite separator). In 
this case there wouldn't actually be a value; the data is just encoded in the 
column name. Obviously, if you are storing dozens of separate datapoints for a 
timestamp then this gets out of hand quickly, and perhaps you need to go back 
to column names in time/fieldname format with a real value. The advantage, 
though, of the composite name is that you eliminate all that constant blather 
about 'Wind', 'Rain', 'Sunshine' in your data and only hold real data (granted, 
compression will probably help here, but not having it at all is even better).

As for row size, obviously that takes some experimentation on your part. You 
can bucket a row to be any time frame you want. If you feel that 15 minutes is 
the correct length of time given the amount of data you will write, then use 15 
minutes. If it's 1 hour, use 1 hour. The only thing you have to figure out is a 
'bucket time' definition that you understand; likely it's the timestamp of when 
that time period starts.

As for 'rotating the row', perhaps it's just semantics, but there really is no 
such concept. You are at some point in time, and you want to write some data to 
the database. The steps are:

1) get the user
2) get the timestamp of the current bucket based on 'now'
3) build a composite key
4) insert the data with that key

Whether that row existed before or is a new row has no bearing on your client 
code.
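
A short sketch of those four steps, assuming a pycassa-style client (the 
ColumnFamily.insert(key, {column: value}) call is pycassa's convention; the 
bucketing helper and column layout here are illustrative, not from this thread):

import time

def write_point(cf, user, field, value, bucket_secs=3600):
    # 1) the user arrives as a parameter
    now = time.time()
    bucket = int(now - now % bucket_secs)   # 2) timestamp of the current bucket
    row_key = "%s:%d" % (user, bucket)      # 3) build the (user, bucket) key
    cf.insert(row_key, {"%f/%s" % (now, field): str(value)})   # 4) insert

Passing bucket_secs=900 would give the 15-minute rows discussed later in the 
thread; whether the row existed before makes no difference to this code.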


Re: Column Family per User

2012-04-18 Thread Trevor Francis
Regarding rotating: I was thinking of the concept of log rotation, where you 
write to a file for a specific period of time, then create a new file and write 
to that. So yes, it closes one row and opens another.

Since I will be generating analytics every 15 minutes, it would make sense to 
me to bucket a row every 15 minutes. Since I would only have at most 500 users, 
this doesn't strike me as too many rows in a given day (48,000). Potential 
downsides to doing this?

Since I am analyzing 20 separate data points for a given log entry, it would 
make sense that querying on a specific metric (wind, rain, sunshine) would be 
easier if the data were separated. However, couldn't we build composite columns 
for time and field, so that all that is left in the value is the data?

So the composite row key would be:

(george, 2012-04-12T12:20)

And Columns would be: 

12:22:23.293/Wind

12:22:23.293/Rain

12:22:23.293/Sunshine

Data would be:
55
45
10


Or the columns could just be 12:22:23.293, with the data packed into the value:

Wind/55/45/35

Or something like that. Am I headed in the right direction?


Trevor Francis


On Apr 18, 2012, at 3:10 PM, Janne Jalkanen wrote:

 
 Hi!
 
 A simple model to do this would be
 
 * ColumnFamily Data
   * key: userid
   * column: Composite( timestamp, entrytype ) = value
 
 For example, userid janne would have columns:

 (2012-04-12T12:22:23.293, speed) = 24
 (2012-04-12T12:22:23.293, temperature) = 12.4
 (2012-04-12T12:22:23.293, direction) = 356
 (2012-04-12T12:22:23.295, speed) = 24.1
 (2012-04-12T12:22:23.295, temperature) = 12.3
 (2012-04-12T12:22:23.295, direction) = 352
 
 Note that Cassandra does not require you to know which columns you're going 
 to put in it (unlike MySQL). You can declare types ahead of time if you know 
 what they are, but if you need to start adding a new column, just start 
 writing it and Cassandra will do the right thing.
 
 However, there are a few points which you might want to consider:
 * Using ISO dates for timestamps has a minor problem: if two events occur 
 during the same millisecond, they'll overwrite each other. This is why most 
 time series in C* use TimeUUIDs, which contain a millisecond timestamp + a 
 random component. 
 (http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/)
 * This will generate timestamp x entrytype columns. So for 2500 entries/second 
 and 20 columns this means about 2500*20 = 50,000 column writes per second 
 (granted, you will most probably batch the writes). You will need to 
 performance-test your cluster to see if this schema is right for you. If not, 
 you might want to try and see how you can distribute the keys differently, 
 e.g. by bucketing the data somehow. However, I recommend that you build a 
 first shot of your app structure, then load-test it until it breaks; that 
 should give you a pretty good understanding of what exactly Cassandra is doing.
 
 To do the analytics, multiple options are possible; a popular one is to run 
 MapReduce queries using a tool like Apache Pig at regular intervals. DataStax 
 has good documentation, and you probably want to take a look at their offering 
 as well, since they have pretty good Hadoop/MapReduce support for Cassandra.
 
 CLI syntax to try with:
 
 create keyspace DataTest with 
 placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and 
 strategy_options = {replication_factor:1};
 use DataTest;
 create column family Data with key_validation_class=UTF8Type and 
 comparator='CompositeType(UUIDType,UTF8Type)';
 
 Then start writing using your fav client.
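
 For instance, a minimal write against that schema with pycassa (a Python 
 client; in pycassa, Python tuples map to CompositeType column names, and 
 uuid.uuid1() produces a time-based UUID - treat the details as a sketch under 
 those assumptions, not as the only way to do it):

 import uuid
 import pycassa

 pool = pycassa.ConnectionPool('DataTest', ['localhost:9160'])
 data = pycassa.ColumnFamily(pool, 'Data')

 ts = uuid.uuid1()   # TimeUUID: timestamp + unique node/clock bits, no overwrites
 data.insert('janne', {
     (ts, 'speed'): '24',
     (ts, 'temperature'): '12.4',
     (ts, 'direction'): '356',
 })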
 
 /Janne
 
 

Re: Column Family per User

2012-04-18 Thread Dave Brosius
It seems to me you are on the right track. Finding the right balance of number 
of rows vs. row width is the part that will take the most experimentation.