Seeking advice on Schema and Caching

2011-11-15 Thread Aditya Narayan
Hi

I need to add a 'search users' functionality to my application. (The trigger
for fetching searched items (like Google instant search) fires once 3
letters have been typed in.)

For this, I make a CF with String-type keys. Each such key is made of the
first 3 letters of a user's name.

Thus all names starting with 'Mar-' are stored in a single row (with
key='Mar'). The column names are the remaining letters of the names: a name
'Marcos' is stored under row key 'Mar' with column name 'cos', and the
userId is stored as the column value. Since there could be many users with
the same name, I would have multiple userIds (of users named Marcos) to
store inside column name 'cos' under key 'Mar'. Thus,

1. A supercolumn seems a better fit for my use case (so that the ids of
users with the same name may fit as sub-columns inside a super-column), but
since supercolumns are discouraged I want to use an alternative schema for
this use case if possible. Could you suggest some ideas on this?

2. Another thing: I would like to row-cache this CF so that when the user
types in the next character & the query is consequently made again, the
row is retrieved from the cache without touching the DB. While searching
for a single username, the query (as part of making instantaneous
suggestions) is expected to be made at least 2-3 times. One may suggest
fetching all the columns starting with the queried string & then filtering
at the application level, but what about fetching just the exact number of
columns (ids/names of users) I need to show to the user? Instead of keeping
hundreds of columns in the application layer, what about keeping them in
the DB cache? The space allotted for the cache would be very small, so that
a row remains in cache only for a short time (enough to serve only the
duration of a single user search).
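As a sketch of the key/column derivation described above (plain Java; the class and method names are illustrative, and no Cassandra client call is shown):

```java
// Derive the row key (first 3 letters) and column name (remaining letters)
// for the proposed search index. Assumes names are at least 3 characters.
public class NameIndex {
    public static String[] indexEntry(String userName) {
        String lower = userName.toLowerCase();
        return new String[] { lower.substring(0, 3), lower.substring(3) };
    }

    public static void main(String[] args) {
        String[] e = indexEntry("Marcos");
        System.out.println(e[0] + " / " + e[1]); // prints "mar / cos"
    }
}
```

Lower-casing the name is an assumption here, so that 'Marcos' and 'marcos' land in the same row.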


Re: Seeking advice on Schema and Caching

2011-11-15 Thread Aditya Narayan
Any insights on this ?

On Tue, Nov 15, 2011 at 9:40 PM, Quintero quinteros8...@gmail.com wrote:



 [snip]



Re: Seeking advice on Schema and Caching

2011-11-15 Thread Aditya Narayan
Hi Ben,

Solr, as I understand it, is for implementing full-text search within
documents, but in my case, as of now, I just need to implement search on
user names, which seems to be easily provided by Cassandra, as user names
(as column names) may be sorted alphabetically within rows. I am splitting
these rows by the first three characters of the username. Thus all user
names starting with 'Mar' are stored in a row with key 'Mar'. Column
values store the userId of that user.

So Cassandra seems to fully satisfy my needs for this. The only issue I'm
having is how to deal with multiple users of the same name. Super columns
seem to fit appropriately, but I really want to avoid them since they are
seriously discouraged by everyone.


On Wed, Nov 16, 2011 at 3:19 AM, Ben Gambley ben.gamb...@intoscience.comwrote:

 Hi Aditya

 Not sure the best way to do in Cassandra but have you considered using
 apache solr - you could then include just the row keys pointing back
 to Cassandra where the actual data is.

 Solr seems quite capable of performing google like searches and is fast.



 Cheers
 Ben

 On 16/11/2011, at 1:50 AM, Aditya Narayan ady...@gmail.com wrote:

  [snip]



Re: Seeking advice on Schema and Caching

2011-11-15 Thread Aditya Narayan
Regarding the first option that you suggested through composite columns:
can I store the username & id both in the column name and keep the column
valueless? Will I be able to retrieve both the username and the id from
the composite column name?

Thanks a lot

On Wed, Nov 16, 2011 at 10:56 AM, Aditya Narayan ady...@gmail.com wrote:

 Got the first option that you suggested.

 However, in the second one, are you suggesting to use, e.g., key='Marcos'
 & store columns, for all users of that name, containing the userId,
 inside that row? That way it would have to read multiple rows while the
 user is doing a single search.


 On Wed, Nov 16, 2011 at 10:47 AM, samal samalgo...@gmail.com wrote:


  [snip]
 


 Aditya,

 Have you given any thought to composite columns [1]? I think they can
 help you solve your problem of multiple users with the same name.

 mar:{
   {cos,unique_user_id}:unique_user_id,
   {cos,1}:1,
   {cos,2}:2,
   {cos,3}:3,

 //  {utf8,timeUUID}:timeUUID,
 }
 OR
 you can try wide rows indexing user name to ID's

 marcos{
user1:' ',
user2:' ',
user3:' '
 }

 [1] http://www.slideshare.net/edanuff/indexing-in-cassandra
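Samal's composite-column idea (and the valueless-column question it raises) can be sketched by packing both parts into one column name string. The ':' delimiter and class name are illustrative assumptions; real composite columns, as in the linked slides, serialize the parts with typed comparators rather than string concatenation:

```java
// Pack the name-remainder and the userId into one column name, so users who
// share a name get distinct, range-queryable columns without supercolumns.
public class CompositeName {
    public static String encode(String remainder, long userId) {
        return remainder + ':' + userId;
    }

    // Both parts are recoverable from the column name, so the column value
    // can stay empty.
    public static Object[] decode(String columnName) {
        int i = columnName.lastIndexOf(':');
        return new Object[] { columnName.substring(0, i),
                              Long.parseLong(columnName.substring(i + 1)) };
    }

    public static void main(String[] args) {
        System.out.println(encode("cos", 42)); // prints "cos:42"
    }
}
```

A slice query on the prefix 'cos:' under row 'mar' would then return one column per user named Marcos.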





Store profile pics of users in Cassandra or file system ?

2011-11-11 Thread Aditya Narayan
Would it be recommended to store the profile pics of users of an
application in Cassandra? Or would the file system be a better way to go?
I came across an interesting paper which advocates storing blobs sized up
to 1 MB in the DB. I was planning to store the image bytes in the same row
that contains the other information of the user, so that all the related
data of a user could be retrieved at once. But I realized that Cassandra
would replicate this data multiple times, which makes this storage
expensive. This would lead me to using a separate CF with RF=1, which
removes the encouraging factor of keeping all the user-related data in
the same place.

So what would be a good strategy to store the profile pics of users? (The
image size would be around 70*70 px.)

What are the pros and cons in terms of performance, storage space
requirements etc ?
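The replication cost the question worries about is simple arithmetic; a hedged sketch with assumed numbers (a ~5 KB 70*70 JPEG, 1M users):

```java
// Back-of-envelope estimate of replicated blob storage: each image is stored
// once per replica, so total bytes = bytesPerImage * users * replicationFactor.
public class BlobCost {
    public static long replicatedBytes(long bytesPerImage, long users, int rf) {
        return bytesPerImage * users * rf;
    }

    public static void main(String[] args) {
        // e.g. ~5 KB per avatar, 1M users, RF=3 -> 15,000,000,000 bytes (~15 GB)
        System.out.println(replicatedBytes(5_000, 1_000_000, 3));
    }
}
```

At those sizes the replicated overhead is modest, which is one reason the cited paper's ~1 MB threshold may favour keeping small images in the DB.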


Re: Store profile pics of users in Cassandra or file system ?

2011-11-11 Thread Aditya Narayan
just forgot to add the paper link, if this is useful at all: To BLOB or Not
To BLOB: Large Object Storage in a Database or a Filesystem:
http://research.microsoft.com/apps/pubs/default.aspx?id=64525

On Sat, Nov 12, 2011 at 12:34 AM, Aditya Narayan ady...@gmail.com wrote:

 [snip]





Re: Concatenating ids with extension to keep multiple rows related to an entity in a single CF

2011-11-04 Thread Aditya Narayan
The data in the different rows of an entity is all of a similar type and
serves different features, but still has almost the same storage and
retrieval needs, thus I wanted to put them in one CF and reduce the number
of column families.

From my knowledge, CompositeType exists for column names as an alternative
way to implement something similar to supercolumns; are there any built-in
Cassandra features to design composite keys from two provided Integer ids?

Is my approach correct and recommended if I need to keep multiple rows
related to an entity in a single CF?

On Fri, Nov 4, 2011 at 10:11 AM, Tyler Hobbs ty...@datastax.com wrote:

 On Thu, Nov 3, 2011 at 3:48 PM, Aditya Narayan ady...@gmail.com wrote:

 I am concatenating  two Integer ids through bitwise operations(as
 described below) to create a single primary key of type long. I wanted to
 know if this is a good practice. This would help me in keeping multiple
 rows of an entity in a single column family by appending different
 extensions to the entityId.
 Are there better ways ? My Ids are of type Integer(4 bytes).


  public static final long makeCompositeKey(int k1, int k2){
      return (long) k1 << 32 | k2;
  }


 You could use an actual CompositeType(IntegerType, IntegerType), but it
 would use a little extra space and not buy you much.

 It doesn't sound like this is the case for you, but if you have several
 distinct types of rows, you should consider using separate column families
 for them rather than putting them all into one big CF.

 --
 Tyler Hobbs
 DataStax http://datastax.com/




Concatenating ids with extension to keep multiple rows related to an entity in a single CF

2011-11-03 Thread Aditya Narayan
I am concatenating two Integer ids through bitwise operations (as described
below) to create a single primary key of type long. I wanted to know if
this is a good practice. This would help me in keeping multiple rows of an
entity in a single column family by appending different extensions to the
entityId.
Are there better ways? My ids are of type Integer (4 bytes).


public static final long makeCompositeKey(int k1, int k2){
    return (long) k1 << 32 | k2;
}
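Note that if k2 can be negative, the widening conversion sign-extends it and corrupts the high half, so a mask is needed; the matching decode recovers both halves. A sketch (an illustration, not from the thread):

```java
public class CompositeKey {
    // Pack two ints into one long; the mask stops sign extension of k2.
    public static long make(int k1, int k2) {
        return ((long) k1 << 32) | (k2 & 0xFFFFFFFFL);
    }

    // Recover the original halves.
    public static int high(long key) { return (int) (key >>> 32); }
    public static int low(long key)  { return (int) key; }

    public static void main(String[] args) {
        long key = make(7, -3);
        System.out.println(high(key) + ", " + low(key)); // prints "7, -3"
    }
}
```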


Re: Cassandra Cluster Admin - phpMyAdmin for Cassandra

2011-11-01 Thread Aditya Narayan
Yes that would be pretty nice feature to see!



On Mon, Oct 31, 2011 at 10:45 PM, Ertio Lew ertio...@gmail.com wrote:

 Thanks so much SebWajam for this great piece of work!

 Is there a way to set a data type for displaying the column names/values
 of a CF? It seems that your project always uses a String serializer for
 any piece of data; however, most of the time in real-world cases this is
 not true. So can we configure which serializer to use while reading the
 data, so that the data may be properly identified by your project &
 delivered in a readable format?


 On Mon, Aug 22, 2011 at 7:17 AM, SebWajam sebast...@wajam.com wrote:

 Hi,

 I'm working on this project for a few months now and I think it's mature
 enough to post it here:
  Cassandra Cluster Admin on GitHub:
  https://github.com/sebgiroux/Cassandra-Cluster-Admin

 Basically, it's a GUI for Cassandra. If you're like me and used MySQL for
 a while (and still using it!), you get used to phpMyAdmin and its simple
 and easy to use user interface. I thought it would be nice to have a
 similar tool for Cassandra and I couldn't find any, so I build my own!

 Supported actions:

- Keyspace manipulation (add/edit/drop)
- Column Family manipulation (add/edit/truncate/drop)
- Row manipulation on column family and super column family
(insert/edit/remove)
- Basic data browser to navigate in the data of a column family
(seems to be the favorite feature so far)
- Support Cassandra 0.8+ atomic counters
- Support management of multiple Cassandra clusters

 Bug report and/or pull request are always welcome!






Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

2011-10-29 Thread Aditya Narayan
...so that I can retrieve them through a single query.

For reading columns from two CFs you need two queries, right?




On Sat, Oct 29, 2011 at 9:53 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Why not use 2 CFs?

 On Fri, Oct 28, 2011 at 9:42 PM, Aditya Narayan ady...@gmail.com wrote:
  I need to keep the data of some entities in a single CF but split into
  two rows for each entity. One row contains overview information for the
  entity & another row contains detailed information about the entity. I
  want to keep both rows in a single CF so they may be retrieved in a
  single query when required together.
 
  Now the problem I am facing is that I want to cache only the first type
  of rows (ie, the overview-containing rows) & avoid the second type of
  rows (that contain large data) from getting into the cache.
 
  Is there a way I can manipulate such filtering of cache-entering rows
  from a single CF?
 
 
 



Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

2011-10-29 Thread Aditya Narayan
@Mohit:
I have stated the example scenarios in my first post under this heading.
Also, I have stated above why I want to split that data into two rows &,
like Ikeda stated below, I too am trying to prevent the frequently accessed
rows from being bloated with large data & want to prevent that data from
entering the cache as well.

Okay so as most know this practice is called a wide row - we use them quite
 a lot. However, as your schema shows it will cache (while being active) all
 the row in memory.  One way we got around this issue was to basically create
 some materialized views of any more common data so we can easily get to the
 minimum amount of information required without blowing too much memory with
 the larger representations.

Yes, exactly this is the problem I am facing, but I want to keep both
types (common + large/detailed) of data in a single CF so that it could
serve 'two materialized views'.



 My perspective is that indexing some of the higher levels of data would be
 the way to go - Solr or elastic search for distributed or if you know you
 only need it local just use a caching solution like ehcache

What do you mean exactly by 'indexing some of the higher levels of data'?

Thank you guys!




 Anthony


 On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:

  [snip]




Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

2011-10-29 Thread Aditya Narayan
Thanks Zach, nice idea!

And what about looking at, maybe, some custom caching solutions, leaving
aside Cassandra caching?



On Sun, Oct 30, 2011 at 2:00 AM, Zach Richardson 
j.zach.richard...@gmail.com wrote:

 Aditya,

 Depending on how often you have to write to the database, you could
 perform dual writes to two different column families, one that has
 summary + details in it, and one that only has the summary.

 This way you can get everything with one query, or the summary with
 one query, this should also help optimize your caching.

 The question here would of course be whether or not you have a read- or
 write-heavy workload. Since you seem to be concerned about the caching,
 it sounds like you have more of a read-heavy workload and wouldn't pay
 too heavily for the dual writes.

 Zach
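Zach's dual-write idea above can be modelled in plain Java; the maps stand in for the two column families, and a real implementation would instead issue a batch mutation through whatever client library is in use:

```java
import java.util.HashMap;
import java.util.Map;

// Every write goes to both a summary-only CF (small, cacheable) and a full CF
// (summary + details), so each page view needs only one read.
public class DualWrite {
    static final Map<String, Map<String, String>> summaryCf = new HashMap<>();
    static final Map<String, Map<String, String>> fullCf = new HashMap<>();

    public static void write(String key, Map<String, String> summary,
                             Map<String, String> details) {
        summaryCf.put(key, new HashMap<>(summary));
        Map<String, String> full = new HashMap<>(summary);
        full.putAll(details);
        fullCf.put(key, full);
    }

    public static void main(String[] args) {
        write("blog1", Map.of("title", "t"), Map.of("body", "b"));
        System.out.println(summaryCf.get("blog1").keySet());
    }
}
```

The trade-off is exactly the one Zach names: storage and write amplification in exchange for single-read pages and a cache that only ever holds small rows.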


 On Sat, Oct 29, 2011 at 2:21 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  On Sat, Oct 29, 2011 at 11:23 AM, Aditya Narayan ady...@gmail.com
 wrote:
  @Mohit:
  I have stated the example scenarios in my first post under this heading.
  Also I have stated above why I want to split that data in two rows 
 like
  Ikeda below stated, I'm too trying out to prevent the frequently
 accessed
  rows being bloated with large data  want to prevent that data from
 entering
  cache as well.
 
  I think you are missing the point. You don't get any benefit
  (performance, access); you are already breaking it into 2 rows.
 
  Also, I don't know of any way you can selectively keep rows or keys in
  the cache. Other than having some background job that keeps the cache
  hot with those keys/rows, your only option is keeping it in a different
  CF, since you are already breaking a row into 2 rows.
 
 
   [snip]
 
 
 



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-14 Thread Aditya Narayan
Thanks Aaron & Chris, I appreciate your help.

With a dedicated CF for counters, in addition to the issue pointed out by
Chris, the major drawback I see is that I can't read, *in a single query*,
the counters along with the regular-columns row, which is widely required
by my application.
My use case is storing & reading the 'views count' of a post along with
the other post details (like post content, postedBy etc.) in my
application. I wanted to store the views count (*counter column*) along
with the details of the post.


On Thu, Jul 14, 2011 at 10:20 PM, Chris Burroughs chris.burrou...@gmail.com
 wrote:

 On 07/13/2011 03:57 PM, Aaron Morton wrote:
  You can always use a dedicated CF for the counters, and use the same row
 key.

 Of course one could do this.  The problem is you are now spending ~2x
 disk space on row keys, and app specific client code just became more
 complicated.



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-11 Thread Aditya Narayan
Oops, that's really disheartening & it could seriously impact our plans
for going live in the near future. Without this facility I guess counters
currently have very little usefulness.

On Mon, Jul 11, 2011 at 8:16 PM, Chris Burroughs
chris.burrou...@gmail.comwrote:

 On 07/10/2011 01:09 PM, Aditya Narayan wrote:
  Is there any target version in near future for which this has been
 promised
  ?

 The ticket is problematic in that it would -- unless someone has a
 clever new idea -- require breaking Thrift compatibility to add it to
 the API. Which is unfortunate, since it would be so useful.

 If it's in the 0.8.x series it will only be through CQL.



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-10 Thread Aditya Narayan
Thanks for info.

Is there any target version in near future for which this has been promised
?

On Sun, Jul 10, 2011 at 9:12 PM, Sasha Dolgy sdo...@gmail.com wrote:

 No, it's not possible.

 To achieve it, there are two options ... contribute to the issue or
 wait for it to be resolved ...

 https://issues.apache.org/jira/browse/CASSANDRA-2614

 -sd

 On Sun, Jul 10, 2011 at 5:04 PM, Aditya Narayan ady...@gmail.com wrote:
  Is it now possible to store counters in the standard column families
 along
  with non counter type columns ? How to achieve this ?



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-10 Thread Aditya Narayan
Cool. I am looking forward to the addition of this very much required
facility to Cassandra.


On Sun, Jul 10, 2011 at 11:01 PM, samal sa...@wakya.in wrote:

 Yes, maybe 0.8.2.

 The current version needs the specific validation class CounterColumn
 for a counter CF, which only counts [+, -, does not replace], whereas a
 normal CF simply adds or replaces.


 On Sun, Jul 10, 2011 at 10:39 PM, Aditya Narayan ady...@gmail.com wrote:

 [snip]






Design for 'Most viewed Discussions' in a forum

2011-05-18 Thread Aditya Narayan
For a discussions forum, I need to show a page of most-viewed discussions.

To implement this, I maintain a count of views of each discussion & when
the views count of a discussion passes a certain threshold limit, the
discussion id is added to a row of most-viewed discussions.

This row of most-viewed discussions contains columns with Integer names &
values containing serialized lists of the ids of all discussions whose
views count equals the Integer name of that column.

Thus if the view count of a discussion increases, I'll need to move its id
from the serialized list in some column to the serialized list in another
column whose name represents the updated views count of that discussion.

Thus I can get the most-viewed discussions by getting the appropriate
number of columns from one end of this Integer-sorted row.



I wanted to get feedback from you all, to know if this is a good design.

Thanks
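The bucket-move described above can be modelled with a sorted map standing in for the Integer-sorted row (names are illustrative; a real implementation would mutate columns in Cassandra rather than an in-memory map):

```java
import java.util.*;

// Column name = view count, column value = the set of discussion ids with
// that count (the "serialized list" in the post).
public class MostViewed {
    static final NavigableMap<Integer, Set<String>> row = new TreeMap<>();

    // Move an id from its old count bucket to the count+1 bucket.
    public static void bumpView(String id, int oldCount) {
        Set<String> old = row.get(oldCount);
        if (old != null) {
            old.remove(id);
            if (old.isEmpty()) row.remove(oldCount);
        }
        row.computeIfAbsent(oldCount + 1, k -> new TreeSet<>()).add(id);
    }

    // Top-N: read columns from the high end of the integer-sorted row.
    public static List<String> top(int n) {
        List<String> out = new ArrayList<>();
        for (Set<String> ids : row.descendingMap().values()) {
            for (String id : ids) {
                if (out.size() == n) return out;
                out.add(id);
            }
        }
        return out;
    }
}
```

In Cassandra the move is two mutations (rewrite the old column's list, rewrite the new one), which is where the flush-timing concern in the follow-up message comes from.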


Re: Design for 'Most viewed Discussions' in a forum

2011-05-18 Thread Aditya Narayan
I would arrange the memtable flush period such that the time period for
which these most-viewed discussions are generated equals the memtable
flush period, so that the entire row of most-viewed discussions on a topic
is in one or at most two memtables/SSTables.
This would also help minimize having several versions of the same column
in the row parts in different SSTables.


On Wed, May 18, 2011 at 11:04 PM, Aditya Narayan ady...@gmail.com wrote:

 [snip]








Re: Design for 'Most viewed Discussions' in a forum

2011-05-18 Thread Aditya Narayan
Thanks Victor!

Aren't there any good ways using Cassandra alone?

On Wed, May 18, 2011 at 11:41 PM, openvictor Open openvic...@gmail.comwrote:

 Have you thought about using another kind of database, one which
 supports volatile content, for example?

 I am currently thinking about doing something similar. The best and
 simplest option at the moment that I can think of is Redis. In Redis you
 have the option of querying keys with wildcards. Your problem can be
 done by just inserting a UUID into Redis for a certain amount of time
 (the best is to tailor this amount of time as an inverse function of the
 number of keys existing in Redis).

 *With Redis*
 What I would do: I cut time down into pieces of X minutes (15 minutes,
 for example, by truncating a timestamp). Let timestampN be the timestamp
 for the period of time ([N, N+15]), and let Topic1, Topic2 be two
 topics. Then:

 One or more people will view Topic1, then Topic2, then again Topic1 in
 this period of 15 minutes (HINCRBY is the increment):

 HINCRBY topics:Topic1:timestampN viewcount 1
 HINCRBY topics:Topic2:timestampN viewcount 1
 HINCRBY topics:Topic1:timestampN viewcount 1

 Then you just query in the following way:

 MGET topics:*:timestampN

 * is the wildcard; you order by viewcount and you have what you are
 asking for!
 This is a simplified version of what you should do, but personally I
 really like the combination of Cassandra and Redis.


 Victor

 2011/5/18 Aditya Narayan ady...@gmail.com

 [snip]










Re: Splitting the data of a single blog into 2 CFs (to implement effective caching) according to views.

2011-03-08 Thread Aditya Narayan
Yes Aaron, I thought about that, but that doesn't seem to be just a small
amount of data either (it contains text); but yes, we can consider doing
so later as we find the need for it.

Thank you both!



On Tue, Mar 8, 2011 at 2:25 PM, aaron morton aa...@thelastpickle.com wrote:

 You could duplicate the data from CF1 in CF2 as well (use a batch_mutation
 through whatever client you have). So when serving the second page you only
 need to read one row from CF2.


 Aaron

 On 8/03/2011, at 8:13 PM, Norman Maurer wrote:

 Yeah, this makes sense as far as I can tell.


 Bye,
 Norman


 2011/3/8 Aditya Narayan ady...@gmail.com


 My application displays a list of several blogs' overview data (like
 blogTitle/ nameOfBlogger/ shortDescription for each blog) on the 1st page (in
 a very similar manner to Digg's newsfeed), and when the user selects a
 particular blog to see, the application takes him to that specific blog's
 full page view, which displays the entire data of the blog.

 Thus I am trying to split a blog's data into *two rows*, in two **different
 CFs** (one CF is row-cached (with less data in each row) and
 another (with each row holding the entire remaining blog data) without caching).

 Data for the 1st page view (like titles and other overview data of a blog) is
 put in a row in the 1st CF. This CF is cached so as to improve the performance
 of heavily read data. Only the data from the cached CF is read for the 1st page. The
 other remaining data (the bulk of the blog text and the entire comments data)
 is stored as another row in the 2nd CF. For the 2nd page, **rows from both of the
 two CFs have to be read**. This will take two read operations.

 Does this seem to be a good design ?






Does the memtable replace the old version of column with the new overwriting version or is it just a simple append ?

2011-03-08 Thread Aditya Narayan
Do overwrites of newly written columns (that are present in the
memtable) *replace the old column*, or is it just a simple append?

I am trying to understand whether, if I update these columns very
frequently (while they are in the memtable), the read performance of
these columns gets affected, since Cassandra would have to read so many
versions of the same column. If it is a replacement of the old
column, then I guess reads will be much better, since they need to see
just a single existing version of the column.

Thanks
Aditya Narayan


Re: Does the memtable replace the old version of column with the new overwriting version or is it just a simple append ?

2011-03-08 Thread Aditya Narayan
So this means that in a memtable only the most recent version of a
column will reside? For this implementation, while writing to the
memtable Cassandra will see if there are other versions and will
overwrite them (reconciliation while writing)?

I know that different SSTables may have different versions of the
same column (and for them reconciliation will happen at read time).

Thanks Narendra!

On 3/9/11, Narendra Sharma narendra.sha...@gmail.com wrote:
 Multiple writes for the same key and column will result in overwriting of the column
 in a memtable. Basically, multiple updates for the same (key, column) are
 reconciled based on the column's timestamp. This happens per memtable. So if
 a memtable is flushed to an sstable, this rule will be valid for the next
 memtable.
 Note that sstables are immutable. So different sstables may have different
 versions of the same (key, column), and the reconciliation of those happens
 during reads (read repair). This is one reason reads are slower than writes:
 conflict resolution happens during the read.

 Hope this answers the question!

 Thanks,
 -Naren
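
 Narendra's point can be modelled in a few lines of Python (a toy dict
 standing in for the memtable; only the newest-timestamp-wins rule is taken
 from the answer above, the rest is illustrative):

```python
# Per-memtable reconciliation: an overwrite of the same (key, column)
# replaces the in-memory value if its timestamp is newer, so only one
# version of a column ever lives in a given memtable.

memtable = {}  # (row_key, column_name) -> (value, timestamp)

def write(key, column, value, timestamp):
    existing = memtable.get((key, column))
    if existing is None or timestamp >= existing[1]:
        memtable[(key, column)] = (value, timestamp)  # reconcile on write

write("user1", "name", "Marcos", timestamp=1)
write("user1", "name", "Marco", timestamp=2)   # newer overwrite wins
write("user1", "name", "stale", timestamp=0)   # older timestamp is ignored
```

 Across SSTables there is no such replacement, which is why the read path
 must merge versions from every SSTable that contains the row.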

 On Tue, Mar 8, 2011 at 10:44 PM, Aditya Narayan ady...@gmail.com wrote:

 Do overwrites of newly written columns (that are present in the
 memtable) *replace the old column*, or is it just a simple append?

 I am trying to understand whether, if I update these columns very
 frequently (while they are in the memtable), the read performance of
 these columns gets affected, since Cassandra would have to read so many
 versions of the same column. If it is a replacement of the old
 column, then I guess reads will be much better, since they need to see
 just a single existing version of the column.

 Thanks
 Aditya Narayan




Splitting the data of a single blog into 2 CFs (to implement effective caching) according to views.

2011-03-07 Thread Aditya Narayan
My application displays a list of several blogs' overview data (like
blogTitle/ nameOfBlogger/ shortDescription for each blog) on the 1st page (in
a very similar manner to Digg's newsfeed), and when the user selects a
particular blog to see, the application takes him to that specific blog's
full page view, which displays the entire data of the blog.

Thus I am trying to split a blog's data into *two rows*, in two **different
CFs** (one CF is row-cached (with less data in each row) and
another (with each row holding the entire remaining blog data) without caching).

Data for the 1st page view (like titles and other overview data of a blog) is
put in a row in the 1st CF. This CF is cached so as to improve the performance
of heavily read data. Only the data from the cached CF is read for the 1st page. The
other remaining data (the bulk of the blog text and the entire comments data)
is stored as another row in the 2nd CF. For the 2nd page, **rows from both of the
two CFs have to be read**. This will take two read operations.

Does this seem to be a good design ?
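
The split can be modelled with two plain dicts standing in for the two CFs
(CF names and field names here are illustrative, not actual client calls):

```python
blog_overview_cf = {}  # row-cached CF: blog_id -> small overview columns
blog_content_cf = {}   # uncached CF:   blog_id -> bulky body/comments columns

def save_blog(blog_id, title, blogger, summary, body, comments):
    # one batch mutation would write both rows in a real cluster
    blog_overview_cf[blog_id] = {"title": title, "blogger": blogger,
                                 "summary": summary}
    blog_content_cf[blog_id] = {"body": body, "comments": comments}

def front_page(blog_ids):
    """Page 1 touches only the small, cacheable CF."""
    return [blog_overview_cf[b] for b in blog_ids]

def full_view(blog_id):
    """The full page needs one read from each CF (two reads total)."""
    row = dict(blog_overview_cf[blog_id])  # read 1: cached overview row
    row.update(blog_content_cf[blog_id])   # read 2: bulky content row
    return row
```

As Aaron suggests in the reply above this design can be reduced to one read
by duplicating the overview columns into the content CF.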


What would be a good strategy for Storing the large text contents like blog posts in Cassandra.

2011-03-06 Thread Aditya Narayan
What would be a good strategy to store large text content (blog posts
of around 1500-3000 characters) in Cassandra? I need to store these
blog posts along with their metadata like bloggerId and blogTags. I am
planning to store this data in a single row, giving each
attribute a single column. So one blog per row. Is using a single
column for a large blog post like this a good strategy?

Next, I also need to store the blogComments, which I am planning to
store, all of them, in another single row: 1 comment per column. Thus the
entire information about a single comment, like commentBody and
commenter, would be serialized (using Google Protocol Buffers) and
stored in a single column.
For storing the number of likes of each comment itself, I am planning to
keep a counter column, in the same row, for each comment, that will
hold a number specifying the number of 'likes' of that comment.

Any suggestions on the above design are highly appreciated. Thanks.
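
A minimal sketch of the proposed comments row, with json standing in for
Protocol Buffers and made-up column-name patterns (comment:*, likes:*);
a real counter column would live in Cassandra, here it is just an integer:

```python
import json

comments_row = {}  # column name -> column value; one such row per blog

def add_comment(comment_id, body, commenter):
    # serialize the whole comment into a single column value
    comments_row["comment:%s" % comment_id] = json.dumps(
        {"body": body, "commenter": commenter})
    # parallel per-comment like count (a counter column in Cassandra)
    comments_row["likes:%s" % comment_id] = 0

def like(comment_id):
    comments_row["likes:%s" % comment_id] += 1

add_comment("c1", "Nice post", "alice")
like("c1")
```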


Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.

2011-03-06 Thread Aditya Narayan
Thanks Aaron!!

I didn't know about the upcoming facility for built-in counters. This
sounds really great for my use case!! Could you let me know where
I can read more about this, if it has been blogged about somewhere?

I'll go forward with the one (entire) blog per column design.

Thanks



On Mon, Mar 7, 2011 at 5:10 AM, Aaron Morton aa...@thelastpickle.com wrote:
 Sounds reasonable, one CF for the blog post one CF for the comments. You 
 could also use a single CF if you will often read the blog and the comments 
 at the same time. The best design is the one that suits how your app works, 
 try one and be prepared to change.

 Note that counters are only in the 0.8 trunk and are still under development, 
 they are not going to be released for a couple of months.

 Your per-column data size is nothing to be concerned about.

 Hope that helps.
 Aaron

 On 7/03/2011, at 6:35 AM, Aditya Narayan ady...@gmail.com wrote:

 What would be a good strategy to store large text content (blog posts
 of around 1500-3000 characters) in Cassandra? I need to store these
 blog posts along with their metadata like bloggerId and blogTags. I am
 planning to store this data in a single row, giving each
 attribute a single column. So one blog per row. Is using a single
 column for a large blog post like this a good strategy?

 Next, I also need to store the blogComments, which I am planning to
 store, all of them, in another single row: 1 comment per column. Thus the
 entire information about a single comment, like commentBody and
 commenter, would be serialized (using Google Protocol Buffers) and
 stored in a single column.
 For storing the number of likes of each comment itself, I am planning to
 keep a counter column, in the same row, for each comment, that will
 hold a number specifying the number of 'likes' of that comment.

 Any suggestions on the above design are highly appreciated. Thanks.



Splitting a single row into multiple

2011-02-23 Thread Aditya Narayan
Does it make any difference if I split a row that needs to be
accessed together into two or three rows and then read those multiple
rows?
(Assume the keys of all three rows are known to me programmatically,
since I split columns by certain categories.)
Would the performance be any better if all three were just a single row?

I guess the performance should be the same in both cases; the columns
remain the same in quantity and in their spread across several SSTables.


Re: Splitting a single row into multiple

2011-02-23 Thread Aditya Narayan
Thanks Aaron. I was looking at splitting the rows so that I could use
a standard CF instead of a super CF, but your argument also makes sense.



On Thu, Feb 24, 2011 at 1:19 AM, Aaron Morton aa...@thelastpickle.com wrote:
 AFAIK performance in the single-row case will be better. A multiget may require
 multiple seeks and reads in an sstable, versus obviously a single seek and
 read for a single row. Multiplied by the number of sstables that contain row
 data.

 Using the key cache would reduce the seeks.

 If it makes sense in your app do it. In general though try to model data so a 
 single row read gets what you need.

 Aaron

 On 24/02/2011, at 5:59 AM, Aditya Narayan ady...@gmail.com wrote:

 Does it make any difference if I split a row that needs to be
 accessed together into two or three rows and then read those multiple
 rows?
 (Assume the keys of all three rows are known to me programmatically,
 since I split columns by certain categories.)
 Would the performance be any better if all three were just a single row?

 I guess the performance should be the same in both cases; the columns
 remain the same in quantity and in their spread across several SSTables.



Re: Confused about get_slice SliceRange behavior with bloom filter

2011-02-14 Thread Aditya Narayan
Thanks Sylvain,

I guess I might have misunderstood the meaning of column_index_size_in_kb.
My previous understanding was: it is the threshold size a row has
to pass, after which its columns will be indexed.

If I have understood it correctly, it implies the size of the blocks
(containing columns) that are kept together under the same index entry. So if you
make it high, a large number of columns will need to be deserialized for a
single column access in that block. And if you make it lower than optimal,
the index size will grow, right?

So I guess we should vary it depending on the size of our columns and not
the size of rows!? I have valueless columns for my use case.




On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne sylv...@datastax.com wrote:

 As aaron said, if the whole row is under 64k, it won't matter. But since
 you spoke of a very wide row, I'll assume the whole will be much more than
 64k.

 If so, the row is indexed by blocks (of 64k, configurable). Then the read
 performance depends on how many of those blocks are needed for the query,
 since each block potentially means a seek (potentially, because some blocks
 could happen to be sequential on disk). So if the columns you ask for are
 really randomly distributed, then yes, the bigger the row is, the bigger
 the chance is to have to hit many blocks, and the bigger the chance is for
 these blocks to be far apart on disk.

 --
 Sylvain
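
 The block arithmetic Sylvain describes can be sketched as a back-of-envelope
 model (offsets are in KB; one seek per distinct block touched is the
 worst-case assumption here, since sequential blocks may avoid seeks):

```python
import random

def blocks_touched(column_offsets_kb, block_kb=64):
    """Count the distinct column_index_size_in_kb blocks covering the
    byte offsets of the queried columns: the worst-case seek count."""
    return len({offset // block_kb for offset in column_offsets_kb})

random.seed(0)
# 200 columns scattered randomly across a ~1 GB row: close to one
# block (and potentially one seek) per column...
wide = blocks_touched(random.sample(range(1024 * 1024), 200))
# ...whereas 200 columns clustered near each other share a few blocks.
clustered = blocks_touched(range(0, 200))
```

 This is why row length matters for random-access slices but barely matters
 when the requested columns are adjacent.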

 On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan ady...@gmail.com wrote:

 Jonathan,
 If I ask for around 150-200 columns (totally random, not sequential) from a
 very wide row that contains a million or even more columns, then is
 the read performance of the SliceQuery operation affected by, or does it depend
 on, the length of the row? (For my use case, I would use the column-names
 list for this SliceQuery operation.)


 Thanks
 Aditya


 On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Sun, Feb 13, 2011 at 12:37 AM, E S tr1skl...@yahoo.com wrote:
  I've gotten myself really confused by
  http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
 someone can
  help me understand what the io behavior of this operation would be.
 
  When I do a get_slice for a column range, will it seek to every
 SSTable?  I had
  thought that it would use the bloom filter on the row key so that it
 would only
  do a seek to SSTables that have a very high probability of containing
 columns
  for that row.

 Yes.

  In the linked doc above, it seems to say that it is only used for
  exact column names.  Am I misunderstanding this?

 Yes.  You may be confusing multi-row behavior with multi-column.

  On a related note, if instead of using a SliceRange I provide an
 explicit list
  of columns, will I have to read all SSTables that have values for the
 columns

 Yes.

  or is it smart enough to stop after finding a value from the most
 recent
  SSTable?

 There is no way to know which value is most recent without having to
 read it first.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com






Re: Confused about get_slice SliceRange behavior with bloom filter

2011-02-14 Thread Aditya Narayan
Thanks for the clarifications..

On Mon, Feb 14, 2011 at 6:13 PM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Mon, Feb 14, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote:

 Thanks Sylvain,

 I guess I might have misunderstood the meaning of column_index_size_in_kb.
 My previous understanding was: it is the threshold size a row has
 to pass, after which its columns will be indexed.


 It is the size of the index 'bucket'. But given that there is no point in
 having an index with only one entry, it is true that it is also the threshold
 after which rows start to be indexed.



 If I have understood it correctly, it implies the size of the blocks
 (containing columns) that are kept together under the same index entry. So if you
 make it high, a large number of columns will need to be deserialized for a
 single column access in that block. And if you make it lower than optimal,
 the index size will grow, right?


 yes


 So I guess we should vary it depending on the size of our columns and
 not the size of rows!? I have valueless columns for my use case.


 Yes, it depends mainly on the size of your columns. But if you have big
 rows, even with very tiny columns, you may still not want to put too small
 a value there. In general I would really make careful tests with your workload
 before changing the value of column_index_size_in_kb to see if it does make
 a difference. Not sure there is much to gain here.

 --
 Sylvain






 On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne 
 sylv...@datastax.comwrote:

 As aaron said, if the whole row is under 64k, it won't matter. But
 since you spoke of a very wide row, I'll assume the whole will be much more
 than 64k.

 If so, the row is indexed by blocks (of 64k, configurable). Then the read
 performance depends on how many of those blocks are needed for the query,
 since each block potentially means a seek (potentially, because some blocks
 could happen to be sequential on disk). So if the columns you ask for are
 really randomly distributed, then yes, the bigger the row is, the bigger
 the chance is to have to hit many blocks, and the bigger the chance is for
 these blocks to be far apart on disk.

 --
 Sylvain

 On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan ady...@gmail.comwrote:

 Jonathan,
 If I ask for around 150-200 columns (totally random not sequential) from
 a very wide row that contains more than a million or even more columns 
 then,
 is the read performance of the SliceQuery operation affected by or depends
 on the length of the row ?? (For my use case, I would use the column names
 list for this SliceQuery operation).


 Thanks
 Aditya


 On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis jbel...@gmail.comwrote:

 On Sun, Feb 13, 2011 at 12:37 AM, E S tr1skl...@yahoo.com wrote:
  I've gotten myself really confused by
  http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
 someone can
  help me understand what the io behavior of this operation would be.
 
  When I do a get_slice for a column range, will it seek to every
 SSTable?  I had
  thought that it would use the bloom filter on the row key so that it
 would only
  do a seek to SSTables that have a very high probability of containing
 columns
  for that row.

 Yes.

  In the linked doc above, it seems to say that it is only used for
  exact column names.  Am I misunderstanding this?

 Yes.  You may be confusing multi-row behavior with multi-column.

  On a related note, if instead of using a SliceRange I provide an
 explicit list
  of columns, will I have to read all SSTables that have values for the
 columns

 Yes.

  or is it smart enough to stop after finding a value from the most
 recent
  SSTable?

 There is no way to know which value is most recent without having to
 read it first.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com








Re: Confused about get_slice SliceRange behavior with bloom filter

2011-02-13 Thread Aditya Narayan
Jonathan,
If I ask for around 150-200 columns (totally random, not sequential) from a
very wide row that contains a million or even more columns, then is
the read performance of the SliceQuery operation affected by, or does it depend
on, the length of the row? (For my use case, I would use the column-names
list for this SliceQuery operation.)


Thanks
Aditya

On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Sun, Feb 13, 2011 at 12:37 AM, E S tr1skl...@yahoo.com wrote:
  I've gotten myself really confused by
  http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
 someone can
  help me understand what the io behavior of this operation would be.
 
  When I do a get_slice for a column range, will it seek to every SSTable?
  I had
  thought that it would use the bloom filter on the row key so that it
 would only
  do a seek to SSTables that have a very high probability of containing
 columns
  for that row.

 Yes.

  In the linked doc above, it seems to say that it is only used for
  exact column names.  Am I misunderstanding this?

 Yes.  You may be confusing multi-row behavior with multi-column.

  On a related note, if instead of using a SliceRange I provide an explicit
 list
  of columns, will I have to read all SSTables that have values for the
 columns

 Yes.

  or is it smart enough to stop after finding a value from the most recent
  SSTable?

 There is no way to know which value is most recent without having to
 read it first.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Merging the rows of two column families(with similar attributes) into one ??

2011-02-12 Thread Aditya Narayan
What if the caching requirements and sorting needs of the two kinds of data
are very similar? Is it preferable to go with a single CF in
those cases?


Regards
Aditya

 On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com wrote:

 I read somewhere that having more column families is not a good idea, as
 it consumes more memory and causes more compactions to occur

 This is primarily true, but not in every case.

 But the caching requirements may be different as they cater to two
 different features.

 This is a great reason to *not* merge them.  Besides the key and row
 caches,
 don't forget about the OS buffer cache.

 Is it recommended to merge these two column families into one ??
 Thoughts
 ?

 No, this sounds like an anti-pattern to me.  The overhead from having
 two
 separate CFs is not that high.

 --
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library


Re: Merging the rows of two column families(with similar attributes) into one ??

2011-02-12 Thread Aditya Narayan
Any comments/view points on this?


On Sat, Feb 12, 2011 at 5:05 PM, Aditya Narayan ady...@gmail.com wrote:

What if the caching requirements and sorting needs of the two kinds of data
are very similar? Is it preferable to go with a single CF in
those cases?


Regards
Aditya


   On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com
   wrote:
 
  I read somewhere that more no of column families is not a good idea
 as
  it consumes more memory and more compactions to occur
 
  This is primarily true, but not in every case.
 
  But the caching requirements may be different as they cater to two
  different features.
 
  This is a great reason to *not* merge them.  Besides the key and row
  caches,
  don't forget about the OS buffer cache.
 
  Is it recommended to merge these two column families into one ??
  Thoughts
  ?
 
  No, this sounds like an anti-pattern to me.  The overhead from having
  two
  separate CFs is not that high.
 
  --
  Tyler Hobbs
  Software Engineer, DataStax
  Maintainer of the pycassa Cassandra Python client library



Re: Calculating the size of rows in KBs

2011-02-10 Thread Aditya Narayan
Thank you Aaron!!

But, if you are reading partial rows (from rows that otherwise contain several
thousands of **valueless** columns), do the column indexes help in
making the reads faster and more efficient than if the columns were not
valueless?
Perhaps, because Cassandra would only need to look up whether the asked-for
column names exist in the indexes for that row/key, and would not need to
deserialize the blocks in the SSTables searching for column values. Am I
thinking about this the right way?


-Aditya



On Fri, Feb 11, 2011 at 1:54 AM, Aaron Morton aa...@thelastpickle.com wrote:
 If you want to get the byte size of a particular row you will need to read it 
 all back.

 If you connect with JConsole and look at your column families, there are
 attributes for the max, min and mean row sizes.

 In general the entire row only exists in memory when it is contained in the 
 first Memtable it's written to. It may then be partially or fully read from 
 disk during subsequent reads or compactions.

 The on-disk format described here may help:
 http://wiki.apache.org/cassandra/ArchitectureSSTable

 Hope that helps
 Aaron
 On 10/02/2011, at 11:56 PM, Aditya Narayan ady...@gmail.com wrote:

How can I get or calculate the size of rows/columns? What are the
overheads on memory for each column/row?



Re: Does variation in no of columns in rows over the column family has any performance impact ?

2011-02-07 Thread Aditya Narayan
Thanks for the detailed explanation Peter! Definitely cleared my doubts !



On Mon, Feb 7, 2011 at 1:52 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Does huge variation in no. of columns in rows, over the column family
 has *any* impact on the performance ?

 Can I have like just 100 columns in some rows and like hundred
 thousands of columns in another set of rows, without any downsides ?

 If I interpret your question the way I think you mean it, then no,
 Cassandra doesn't do anything with the data such that the smaller
 rows are somehow directly less efficient because there are other rows
 that are bigger. It doesn't affect the on-disk format or the on-disk
 efficiency of accessing the rows.

 However, there are almost always indirect effects when it comes to
 performance, in and particular storage systems. In the case of
 Cassandra, the *variation* itself should not impose a direct
 performance penalty, but there are potential other effects. For
 example the row cache is only useful for small works, so if you are
 looking to use the row cache the huge rows would perhaps prevent that.
 This could be interpreted as a performance impact on the smaller rows
 by the larger rows Compaction may become more expensive due to
 e.g. additional GC pressure resulting from
 large-but-still-within-in-memory-limits rows being compacted (or not,
 depending on JVM/GC settings). There is also the effect of cache
 locality as data set grows, and the cache locality for the smaller
 rows will likely be worse than had they been in e.g. a separate CF.

 Those are just three random example; I'm just trying to make the point
 that without any downsides is a very strong and blanket requirement
 for making the decision to mix small rows with larger ones.

 --
 / Peter Schuller



Column Sorting of integer names

2011-02-04 Thread Aditya Narayan
Is there any way to sort columns with integer names in descending order?


Regards
-Aditya


Re: Using Cassandra to store files

2011-02-04 Thread Aditya Narayan
I am also looking at possible solutions to store PDFs and Word documents.

But why wouldn't you store them in the filesystem instead of a database,
unless your files are too small, in which case it would be recommended
to use a database?

-Aditya


On Fri, Feb 4, 2011 at 5:30 PM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 We are doing this with cassandra.
 But we cache a lot. We get around 20 writes/s and 1k reads/s (~ 100Mbit/s)
 for that particular CF but only 1% of them hit our cassandra cluster (5
 nodes, rf=3).

 /Daniel
 On Feb 4, 2011, at 9:37 AM, Brendan Poole wrote:

 Hi Daniel

 When you say "We are doing this", do you mean via NFS or Cassandra?

 Thanks

 Brendan





 Signature.jpg Brendan Poole
  Systems Developer
   NewLaw Solicitors
  Helmont House
  Churchill Way
  Cardiff
  brendan.po...@new-law.co.uk
  029 2078 4283
  www.new-law.co.uk

 


 From: Daniel Doubleday [mailto:daniel.double...@gmx.net]
 Sent: 03 February 2011 17:21
 To: user@cassandra.apache.org
 Subject: Re: Using Cassandra to store files

 Hundreds of thousands doesn't sound too bad. Good old NFS would do with an
 ok directory structure.
 We are doing this. Our documents are pretty small though (a few kb). We have
 around 40M right now with around 300GB total.
 Generally the problem is that much data usually means that cassandra becomes
 io bound during repairs and compactions even if your hot dataset would fit
 in the page cache. There are efforts to overcome this and 0.7 will help with
 repair problems but for the time being you have to have quite some headroom
 in terms of io performance to handle these situations.
 Here is a related post:
 http://comments.gmane.org/gmane.comp.db.cassandra.user/11190

 On Feb 3, 2011, at 1:33 PM, Brendan Poole wrote:

 Hi

 Would anyone recommend using Cassandra for storing hundreds of thousands of
 documents in Word/PDF format? The manual says it can store documents under
 64MB with no issue but was wondering if anyone is using it for this specific
 purpose. Would it be efficient/reliable, and is there anything I need to
 bear in mind?

 Thanks in advance




 P Please consider the environment before printing this e-mail
 Important - The information contained in this email (and any attached files)
 is confidential and may be legally privileged and protected by law.

 The intended recipient is authorised to access it. If you are not the
 intended recipient, please notify the sender immediately and delete or
 destroy all copies. You must not disclose the contents of this email to
 anyone. Unauthorised use, dissemination, distribution, publication or
 copying of this communication is prohibited.

 NewLaw Solicitors does not accept any liability for any inaccuracies or
 omissions in the contents of this email that may have arisen as a result of
 transmission. This message and any attachments are believed to be free of
 any virus or defect that might affect any computer system into which it is
 received and opened. However, it is the responsibility of the recipient to
 ensure that it is virus free; therefore, no responsibility is accepted for
 any loss or damage in any way arising from its use.

 NewLaw Solicitors is the trading name of NewLaw Legal Ltd, a limited company
 registered in England and Wales with registered number 07200038.
 NewLaw Legal Ltd is regulated by the Solicitors Regulation Authority whose
 website is http://www.sra.org.uk

 The registered office of NewLaw Legal Ltd is at Helmont House, Churchill
 Way, Cardiff, CF10 2HE. Tel: 0845 756 6870, Fax: 0845 756 6871, Email:
 i...@new-law.co.uk. www.new-law.co.uk.

 We use the word ‘partner’ to refer to a shareowner or director of the
 company, or an employee or consultant of the company who is a lawyer with
 equivalent standing and qualifications. A list of the directors is displayed
 at the above address, together with a list of those persons who are
 designated as partners.




Re: Using Cassandra to store files

2011-02-04 Thread Aditya Narayan
yes, definitely a database for the mapping, of course!

On Fri, Feb 4, 2011 at 11:17 PM, buddhasystem potek...@bnl.gov wrote:

 Even when storage is in NFS, Cassandra can still be quite useful as a file
 catalog. Your physical storage can change, move etc. Therefore, it's a good
 idea to provide mapping of logical names to physical store points (which in
 fact can be many). This is a standard technique used in mass storage.




Re: Sorting in time order without using TimeUUID type column names

2011-02-04 Thread Aditya Narayan
Thanks Aaron,

Yes I can put the column names without using the userId in the
timeline row, and when I want to retrieve the row corresponding to
that column name, I will attach the userId to get the row key.

Yes, I'll store it as a long, and I guess I'll have to write a custom
comparator type (ReversedIntegerType) to sort those longs in
descending order.

Regards
Aditya


On Sat, Feb 5, 2011 at 6:24 AM, aaron morton aa...@thelastpickle.com wrote:
 IMHO if you know the time of the event, store the time as a long rather
 than a UUID. It will make it easier to get back to a
 time and make it easier for you to compare columns. TimeUUIDs have a pseudo
 random part as well as the time part; it could be set to a constant. But why
 bother if you know the absolute time.

 I'm not sure what the ReminderCountOfThisUser is for, and as Sylvain says 
 there is no need for the user name if this is in a row just for the user.

 Hope that helps.
 Aaron

 On 4 Feb 2011, at 01:32, Aditya Narayan wrote:

 If I use TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser
 as the key pattern for the rows of reminders, then I am storing the key,
 just as it is, as the column name, and thus column values need not
 contain a link to the row containing the reminder details.

 I think UserId would be required along with timestamp in the key
 pattern to provide uniqueness to the key as there may be several
 reminders generated by users on the application, at the same time.

 But my question is about whether it is really advisable to even
 generate the keys like this pattern ... instead of going with
 timeuuids ?
 Are there are any downsides which I am not perhaps not aware of ?



 On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com 
 wrote:
 On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote:

 Hey all,

 I want to store some columns that are reminders to the users on my
 application, in time sorted order in a row(timeline row of the user).

 Would it be recommended to store these reminder columns in the
 timeline row with column names like: combination of timestamp(of time
 when the reminder gets due) + UserId+ Reminders Count of that user;
 Column Name= TimestampOfDueTimeInFuture: UserId :
 ReminderCountOfThisUser

 If you have one row by user (which is a good idea), why keep the UserId in
 the column name ?


 Then what comparator could I use to sort them in order of their
 due time ? This comparator should be able to sort numbers in descending
 order. (I guess the ascii type would give the opposite order.) (Reminders need
 to be sorted in the timeline in the order of their due time.)

 *The* solution is to write a custom comparator.
 Have a look at http://www.datastax.com/docs/0.7/data_model/column_families
 and http://www.sodeso.nl/?p=421 for instance.

 As a side note, the fact that the comparator sorts in ascending order when
 you need descending order wouldn't be that much of a problem, since you can
 always do slice queries in reversed order. But even then, asciiType is not a
 very satisfying solution, as you would have to be careful about the padding
 of your timestamp for it to work correctly. So again, a custom comparator is
 the way to go.

 Basically, I am trying to avoid the 16-byte-long TimeUUIDs, first because
 they are too long, and because the above key pattern already guarantees me
 a unique key/Id for the reminder row.


 Thanks
 Aditya Narayan

 --
 Sylvain
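
The padding caveat Sylvain raises is easy to trip over: with asciiType,
unpadded decimal timestamps compare lexicographically, not numerically. A
small illustrative Python check (not Cassandra code) of why zero-padding
fixes it:

```python
def padded(ts_millis: int, width: int = 13) -> str:
    """Zero-pad a decimal timestamp so string order matches numeric order."""
    return str(ts_millis).zfill(width)

# Unpadded: "999" sorts AFTER "10000" lexicographically, which is wrong.
wrong = sorted(["999", "10000"])

# Padded: string order now agrees with numeric order.
right = sorted([padded(999), padded(10000)])
```

With padding, 999 correctly sorts before 10000; without it, the comparison
is character by character and "1" < "9" decides the order.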




Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-03 Thread Aditya Narayan
Thanks Tyler!


On Thu, Feb 3, 2011 at 12:06 PM, Tyler Hobbs ty...@datastax.com wrote:
 On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote:

 Can I have some more feedback about my schema, perhaps somewhat more
 critical/harsh ?

 It sounds reasonable to me.

 Since you're writing/reading all of the subcolumns at the same time, I would
 opt for a standard column with the tags serialized into a column value.

 I don't think you need to worry about row lengths here.

 Depending on the reminder size and how many times it's likely to be repeated
 in the timeline, you could explore denormalizing a bit more by storing the
 reminders in the timelines themselves, perhaps with a separate row per
 (user, tag) combination.  This would cut down on your seeks quite a bit, but
 it may not be necessary at this point (or at all).

 --
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library




Sorting in time order without using TimeUUID type column names

2011-02-03 Thread Aditya Narayan
Hey all,

I want to store some columns that are reminders to the users on my
application, in time-sorted order in a row (the user's timeline row).

Would it be recommended to store these reminder columns in the
timeline row with column names that are a combination of the timestamp (of
the time when the reminder gets due) + UserId + the reminder count of that user;
Column Name = TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser

Then what comparator could I use to sort them in order of their
due time ? This comparator should be able to sort numbers in descending
order. (I guess the ascii type would give the opposite order.) (Reminders need
to be sorted in the timeline in the order of their due time.)

Basically, I am trying to avoid the 16-byte-long TimeUUIDs, first because
they are too long, and because the above key pattern already guarantees me
a unique key/Id for the reminder row.


Thanks
Aditya Narayan
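
The proposed column-name pattern can be sketched client-side as a string
composite. A minimal Python illustration (the separator, padding width, and
helper names are my assumptions, not a fixed API):

```python
def reminder_name(due_ts_millis: int, user_id: str, count: int) -> str:
    """Compose 'dueTimestamp:userId:reminderCount', zero-padding the
    timestamp so plain string comparison follows time order."""
    return f"{due_ts_millis:013d}:{user_id}:{count}"

def parse_reminder_name(name: str):
    """Split a composite column name back into its three parts."""
    ts, user_id, count = name.split(":")
    return int(ts), user_id, int(count)

names = sorted([
    reminder_name(1296800000000, "u42", 7),
    reminder_name(1296700000000, "u42", 3),
])
```

Reading the slice in reversed order then gives the descending
(newest-first) view.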


Re: Sorting in time order without using TimeUUID type column names

2011-02-03 Thread Aditya Narayan
If I use : TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser
as the key pattern for the rows of reminders, then I am storing the key,
just as it is, as the column name, and thus the column values need not
contain a link to the row containing the reminder details.

I think the UserId would be required along with the timestamp in the key
pattern to make the key unique, as there may be several
reminders generated by users on the application at the same time.

But my question is whether it is really advisable to
generate the keys with this pattern ... instead of going with
timeuuids ?
Are there any downsides which I am perhaps not aware of ?



On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote:

 Hey all,

 I want to store some columns that are reminders to the users on my
 application, in time sorted order in a row(timeline row of the user).

 Would it be recommended to store these reminder columns in the
 timeline row with column names like: combination of timestamp(of time
 when the reminder gets due) + UserId+ Reminders Count of that user;
 Column Name= TimestampOfDueTimeInFuture: UserId :
 ReminderCountOfThisUser

 If you have one row by user (which is a good idea), why keep the UserId in
 the column name ?


 Then what comparator could I use to sort them in order of the their
 due time ? This comparator should be able to sort no. in descending
 order.(I guess ascii type would do the opposite order) (Reminders need
 to be sorted in the timeline in the order of their due time.)

 *The* solution is to write a custom comparator.
 Have a look at http://www.datastax.com/docs/0.7/data_model/column_families
 and http://www.sodeso.nl/?p=421 for instance.

 As a side note, the fact that the comparator sorts in ascending order when
 you need descending order wouldn't be that much of a problem, since you can
 always do slice queries in reversed order. But even then, asciiType is not a
 very satisfying solution, as you would have to be careful about the padding
 of your timestamp for it to work correctly. So again, a custom comparator is
 the way to go.

 Basically I am trying to avoid 16 bytes long timeUUID first because
 they are too long and the above defined key pattern is guaranteeing me
 a unique key/Id for the reminder row always.


 Thanks
 Aditya Narayan

 --
 Sylvain



Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferable to have a supercolumn family, or just a standard
column family with all the subcolumns' data serialized into a single
column ?

Thanks
Aditya Narayan
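
For the serialize-into-one-column alternative, the ~8 subcolumn values can be
packed into a single column value, e.g. as JSON. A minimal sketch (the JSON
format and the sample names are assumptions for illustration):

```python
import json

# Eight-or-so small values that are written and read together,
# packed into one standard column value instead of one supercolumn.
subcolumns = {"tag1": "rowkey-a", "tag2": "rowkey-b", "tag3": "rowkey-c"}
column_value = json.dumps(subcolumns, sort_keys=True)

# Reading the column back yields all "subcolumns" in one deserialization.
restored = json.loads(column_value)
```

The trade-off is that the packed value can only be read and written as a
whole, which matches this use case since all subcolumns are written at once.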


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Actually, I am trying to use Cassandra to display to the users of my
application the list of all Reminders set by themselves, for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on the application. The reminders need to be
presented to the user in chronological order, like a news feed.
Each reminder has certain tags associated with it (so that, at
times, the user may also choose to see the reminders filtered by tags, in
chronological order).

So I thought of a schema something like this:-

-Each Reminder's details may be stored as a separate row in a column family.
-For presenting the timeline of reminders set by a user, the user's
timeline row would contain the Ids/Keys
(of the Reminder rows) as the supercolumn names, and the subcolumns
inside those supercolumns would contain the list of tags associated
with the particular reminder. All tags are set at once during the first
write. The no. of tags (subcolumns) will be around 8 at maximum.

Any comments, suggestions and feedback on the schema design are requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan ady...@gmail.com wrote:
 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan



Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
I think you got exactly what I wanted to convey, except for a few
things I want to clarify:

I was thinking of a single row containing all reminders (not split
by day). A history of the reminders needs to be maintained for some time.
After a certain time (say 3 or 6 months) they may be deleted via the ttl
facility.

While presenting the reminders timeline to the user, the latest
supercolumns (around 50 from the start/end) will be picked up and
their subcolumn values will be compared to the tags the user has chosen
to see; corresponding to the filtered subcolumn values (tags), the
rows of the reminder details would be picked up.

Is a supercolumn a preferable choice for this ? Could there be a better
schema than this ?


-Aditya Narayan
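
The tag filtering described above (scan the latest ~50 supercolumns, keep
those whose subcolumn tags match the user's chosen tags) is simple to
express client-side. A sketch with illustrative names:

```python
def filter_by_tags(recent, chosen_tags):
    """recent: list of (reminder_id, tags) pairs, newest first.
    Return reminder ids whose tags intersect the chosen set,
    preserving newest-first order."""
    chosen = set(chosen_tags)
    return [rid for rid, tags in recent if chosen & set(tags)]

recent = [("r3", ["work", "urgent"]), ("r2", ["home"]), ("r1", ["work"])]
matches = filter_by_tags(recent, ["work"])
```

The matched ids are then used as row keys for the second lookup into the
reminder-details column family.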



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote:
 To reiterate, so I know we're both on the same page, your schema would be
 something like this:

 - A column family (as you describe) to store the details of a reminder. One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names would be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for a
 day, that would only contain sub column names with the tags you're looking
 for? Then based upon the column names returned, you'd look-up the reminders.

 That seems like a solid schema to me.

 Bill-
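
 Bill's per-day row key plus TimeUUID column names could be sketched in
 Python like this (the exact key layout, including the day prefix, is my
 reading of his description, not a fixed format):

```python
import uuid
import datetime

def daily_timeline_key(user_id: str, day: datetime.date) -> str:
    """Row key bucketed per user per day, e.g. '20110202:user42'."""
    return f"{day.strftime('%Y%m%d')}:{user_id}"

row_key = daily_timeline_key("user42", datetime.date(2011, 2, 2))
column_name = uuid.uuid1()  # a version-1 (time-based) UUID naming one reminder
```

 Each day's reminders for a user then land in their own modest-sized row,
 sorted by the time component of the TimeUUID column names.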

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside that supercolumns could contain the list of tags associated
 with particular reminder. All tags set at once during first write. The
 no of tags(subcolumns) will be around 8 maximum.

 Any comments, suggestions and feedback on the schema design are
 requested..

 Thanks
 Aditya Narayan


 On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com  wrote:

 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan




Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
You got me wrong, perhaps..

I am already splitting the row on a per-user basis, of course; otherwise
the schema won't make sense for my usage. The row contains only
*reminders of a single user*, sorted in chronological order. The
reminder Ids are stored as supercolumn names and the subcolumns contain tags
for that reminder.



On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote:
 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot of
 load (don't know the system) for that single node. Why wouldn't you split it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

 I was thinking of a single row containing all reminders (  not split
 by day). History of the reminders need to be maintained for some time.
 After certain time (say 3 or 6 months) they may be deleted by ttl
 facility.

 While presenting the reminders timeline to the user, latest
 supercolumns like around 50 from the start_end will be picked up and
 their subcolumns values will be compared to the Tags user has chosen
 to see and, corresponding to the filtered subcolumn values(tags), the
 rows of the reminder details would be picked up..

 Is supercolumn a preferable choice for this ? Can there be a better
 schema than this ?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside that supercolumns could contain the list of tags associated
 with particular reminder. All tags set at once during first write. The
 no of tags(subcolumns) will be around 8 maximum.

 Any comments, suggestions and feedback on the schema design are
 requested..

 Thanks
 Aditya Narayan


 On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com
  wrote:

 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan





Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
@Bill
Thank you Bill!

@Cassandra users
Can others also leave their suggestions and comments about my schema, please.
Also my question about whether to use a superColumn or alternatively,
just store the data (that would otherwise be stored in subcolumns) as
serialized into a single column in standard type column family.

Thanks

-Aditya Narayan



On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com wrote:
 I did not understand before... sorry.

 Again, depending upon how many reminders you have for a single user, this
 could be a long/wide row. Again, it really comes down to how many reminders
 are we talking about and how often will they be read/written. While a single
 row can contain millions (maybe more) columns, that doesn't mean it's a good
 idea.

 I'm working on a logging system with Cassandra and ran into this same type
 of problem. Do I put all of the messages for a single system into a single
 row keyed off that system's name? I quickly came to the answer of no and
 now I break my row keys into POSIX_timestamp:system where my timestamps are
 buckets for every 5 minutes. This nicely distributes the load across the
 nodes in my system.

 Bill-
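
 Bill's 5-minute bucketing of row keys can be sketched as follows (bucket
 size and key layout follow his description; the helper name is mine):

```python
def bucketed_key(system: str, ts_epoch: int, bucket_secs: int = 300) -> str:
    """Row key 'POSIX_timestamp:system', with the timestamp floored to a
    5-minute bucket so writes spread across many rows (and thus nodes)."""
    bucket = ts_epoch - (ts_epoch % bucket_secs)
    return f"{bucket}:{system}"

key = bucketed_key("web01", 1296653125)
```

 All messages within the same 5-minute window for a system share one row;
 the next window starts a fresh row, keeping any single row bounded.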

 On 02/02/2011 11:18 AM, Aditya Narayan wrote:

 You got me wrong perhaps..

 I am already splitting the row on per user basis ofcourse, otherwise
 the schema wont make sense for my usage. The row contains only
 *reminders of a single user* sorted in chronological order. The
 reminder Id are stored as supercolumn name and subcolumn contain tags
 for that reminder.



 On Wed, Feb 2, 2011 at 9:19 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot
 of
 load (don't know the system) for that single node. Why wouldn't you split
 it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s
 of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

 I was thinking of a single row containing all reminders (    not split
 by day). History of the reminders need to be maintained for some time.
 After certain time (say 3 or 6 months) they may be deleted by ttl
 facility.

 While presenting the reminders timeline to the user, latest
 supercolumns like around 50 from the start_end will be picked up and
 their subcolumns values will be compared to the Tags user has chosen
 to see and, corresponding to the filtered subcolumn values(tags), the
 rows of the reminder details would be picked up..

 Is supercolumn a preferable choice for this ? Can there be a better
 schema than this ?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would
 be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names
 would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for
 a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column
 family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside

Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Can I have some more feedback about my schema, perhaps somewhat more
critical/harsh ?


Thanks again,
Aditya Narayan

On Wed, Feb 2, 2011 at 10:27 PM, Aditya Narayan ady...@gmail.com wrote:
 @Bill
 Thank you BIll!

 @Cassandra users
 Can others also leave their suggestions and comments about my schema, please.
 Also my question about whether to use a superColumn or alternatively,
 just store the data (that would otherwise be stored in subcolumns) as
 serialized into a single column in standard type column family.

 Thanks

 -Aditya Narayan



 On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com 
 wrote:
 I did not understand before... sorry.

 Again, depending upon how many reminders you have for a single user, this
 could be a long/wide row. Again, it really comes down to how many reminders
 are we talking about and how often will they be read/written. While a single
 row can contain millions (maybe more) columns, that doesn't mean it's a good
 idea.

 I'm working on a logging system with Cassandra and ran into this same type
 of problem. Do I put all of the messages for a single system into a single
 row keyed off that system's name? I quickly came to the answer of no and
 now I break my row keys into POSIX_timestamp:system where my timestamps are
 buckets for every 5 minutes. This nicely distributes the load across the
 nodes in my system.

 Bill-

 On 02/02/2011 11:18 AM, Aditya Narayan wrote:

 You got me wrong perhaps..

 I am already splitting the row on per user basis ofcourse, otherwise
 the schema wont make sense for my usage. The row contains only
 *reminders of a single user* sorted in chronological order. The
 reminder Id are stored as supercolumn name and subcolumn contain tags
 for that reminder.



 On Wed, Feb 2, 2011 at 9:19 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot
 of
 load (don't know the system) for that single node. Why wouldn't you split
 it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s
 of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

 I was thinking of a single row containing all reminders (    not split
 by day). History of the reminders need to be maintained for some time.
 After certain time (say 3 or 6 months) they may be deleted by ttl
 facility.

 While presenting the reminders timeline to the user, latest
 supercolumns like around 50 from the start_end will be picked up and
 their subcolumns values will be compared to the Tags user has chosen
 to see and, corresponding to the filtered subcolumn values(tags), the
 rows of the reminder details would be picked up..

 Is supercolumn a preferable choice for this ? Can there be a better
 schema than this ?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would
 be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names
 would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for
 a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column
 family.
 -For presenting the timeline