Seeking advice on Schema and Caching
Hi, I need to add 'search users' functionality to my application (the fetch for suggestions, like Google instant search, is triggered once 3 letters have been typed). For this I make a CF with String-type keys, where each key is the first 3 letters of a user's name, so all names starting with 'Mar-' are stored in a single row (with key = 'Mar'). The column names are framed from the remaining letters of the names; thus a name 'Marcos' will be stored under row key 'Mar' with column name 'cos'. The id will be stored as the column value. Since there could be many users with the same name, I would have multiple userIds (of users named Marcos) to store inside column name 'cos' under key 'Mar'. Thus:

1. A supercolumn seems a better fit for my use case (the ids of users with the same name could fit as sub-columns inside a super-column), but since supercolumns are not encouraged, I want an alternative schema for this use case if possible. Could you suggest some ideas on this?

2. Another thing: I would like to row-cache this CF so that when the user types in the next character and the query is made again, the row can be retrieved from the cache without touching the DB. While searching for a single username, the query (as part of making instantaneous suggestions) is expected to be made at least 2-3 times. One may suggest fetching all the columns starting with the queried string and filtering at the application level, but what about fetching just the exact number of columns (ids/names of users) I need to show to the user? Instead of keeping all the hundreds of columns in the application layer, what about keeping them in the DB cache? The space allotted to the cache would be very small, so a row would remain cached only for a short time (just enough to serve the duration of a single search)?
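As a minimal sketch of the key/column split described above (the class and method names are my own, the 3-letter prefix length is the value assumed in the question, and this is not any Cassandra client API):

```java
// Sketch of the row-key/column-name split described above. Class and
// method names are illustrative; PREFIX_LEN = 3 matches the question.
public class NameIndex {
    static final int PREFIX_LEN = 3;

    // Returns {rowKey, columnName} for a user name, e.g. "Marcos" -> {"mar", "cos"}.
    public static String[] keyAndColumn(String userName) {
        String name = userName.toLowerCase();
        int split = Math.min(PREFIX_LEN, name.length());
        return new String[] { name.substring(0, split), name.substring(split) };
    }

    public static void main(String[] args) {
        String[] kc = keyAndColumn("Marcos");
        System.out.println(kc[0] + " / " + kc[1]); // mar / cos
    }
}
```

With this split, the instant-search query becomes a column slice on the row keyed by the typed prefix.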
Re: Seeking advice on Schema and Caching
Any insights on this ?
Re: Seeking advice on Schema and Caching
Hi Ben, Solr, as I understand, is for implementing full-text search capability within documents, but in my case, as of now, I just need to implement search on user names, which seems to be easily provided by Cassandra since user names (as column names) can be sorted alphabetically within rows. I am splitting these rows by the first three characters of the username, so all user names starting with 'Mar' are stored in a row with key 'Mar', and column values store the userId of that user. So Cassandra seems to fully satisfy my needs for this. The only issue I'm having is how to deal with multiple users of the same name. Super columns seem to fit appropriately, but I really want to avoid them since they are seriously discouraged by everyone.

On Wed, Nov 16, 2011 at 3:19 AM, Ben Gambley ben.gamb...@intoscience.com wrote: Hi Aditya, Not sure of the best way to do this in Cassandra, but have you considered using Apache Solr? You could then include just the row keys pointing back to Cassandra, where the actual data is. Solr seems quite capable of performing Google-like searches, and is fast. Cheers, Ben
Re: Seeking advice on Schema and Caching
Regarding the first option that you suggested (composite columns): can I store both the username and id in the column name and keep the column valueless? Will I be able to retrieve both the username and id from the composite column name? Thanks a lot.

On Wed, Nov 16, 2011 at 10:56 AM, Aditya Narayan ady...@gmail.com wrote: Got the first option that you suggested. However, in the second one, are you suggesting, for e.g., key='Marcos' storing columns containing a userId for each user of that name inside that row? That way it would have to read multiple rows while the user is doing a single search.

On Wed, Nov 16, 2011 at 10:47 AM, samal samalgo...@gmail.com wrote: Aditya, have you given any thought to composite columns [1]? I think they can help you solve your problem of multiple users with the same name.
mar: {
    {cos, unique_user_id}: unique_user_id,
    {cos, 1}: 1,
    {cos, 2}: 2,
    {cos, 3}: 3,
    // i.e. column name is {utf8, timeUUID}, value is the timeUUID
}

OR you can try wide rows indexing user name to IDs:

marcos: { user1: '', user2: '', user3: '' }

[1] http://www.slideshare.net/edanuff/indexing-in-cassandra
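If it helps, here is a rough in-memory illustration of how composite column names of the form {suffix, userId} sort and slice. A TreeMap stands in for the sorted row, and the '\0' separator is an assumption of this sketch, not how CompositeType actually encodes components:

```java
import java.util.*;

// In-memory illustration of composite column names {suffix, userId}.
// A TreeMap stands in for the sorted row; the '\0' separator is an
// assumption of this sketch, not CompositeType's real encoding.
public class CompositeSim {
    static String col(String suffix, String userId) {
        return suffix + '\0' + userId;
    }

    // All columns whose first component equals the given suffix,
    // i.e. every user id stored under that name remainder.
    public static SortedMap<String, String> slice(NavigableMap<String, String> row, String suffix) {
        return row.subMap(col(suffix, ""), col(suffix, "\uffff"));
    }

    public static NavigableMap<String, String> sampleRow() { // row key "mar"
        NavigableMap<String, String> row = new TreeMap<>();
        row.put(col("cos", "id1"), "");  // Marcos, first user
        row.put(col("cos", "id2"), "");  // Marcos, second user
        row.put(col("tin", "id3"), "");  // Martin
        return row;
    }

    public static void main(String[] args) {
        System.out.println(slice(sampleRow(), "cos").size()); // 2
    }
}
```

The point is that all ids for one name remainder cluster together in sorted order, so a single slice answers "every Marcos" without super columns.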
Store profile pics of users in Cassandra or file system ?
Would it be recommended to store the profile pics of an application's users in Cassandra, or would the file system be a better way to go? I came across an interesting paper which advocates storing blobs sized up to 1 MB in the DB. I was planning to store the image bytes in the same row that contains the other information of the user, so that all the related data of a user could be retrieved at once. But I realized that since Cassandra would replicate this data multiple times, this storage would be expensive. That leads me to consider a separate CF with RF=1, which removes the encouraging factor of keeping all the user-related data in the same place. So what would be a good strategy to store the profile pics of users? (The image size would be around 70x70 px.) What are the pros and cons in terms of performance, storage space requirements, etc.?
Re: Store profile pics of users in Cassandra or file system ?
Just forgot to add the paper link, if this is useful at all: To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem, http://research.microsoft.com/apps/pubs/default.aspx?id=64525
Re: Concatenating ids with extension to keep multiple rows related to an entity in a single CF
The data in different rows of an entity is all of a similar type but serves different features, yet has almost the same storage and retrieval needs, so I wanted to put them in one CF and reduce the number of column families. From my knowledge, CompositeType existed for columns as an alternative way to implement something similar to supercolumns; are there any built-in Cassandra features to design composite keys from two provided Integer ids? Is my approach correct and recommended if I need to keep multiple rows related to an entity in a single CF?

On Fri, Nov 4, 2011 at 10:11 AM, Tyler Hobbs ty...@datastax.com wrote: On Thu, Nov 3, 2011 at 3:48 PM, Aditya Narayan ady...@gmail.com wrote: I am concatenating two Integer ids through bitwise operations (as described below) to create a single primary key of type long. I wanted to know if this is a good practice. This would help me keep multiple rows of an entity in a single column family by appending different extensions to the entityId. Are there better ways? My ids are of type Integer (4 bytes).

    public static final long makeCompositeKey(int k1, int k2) {
        return (long) k1 << 32 | k2;
    }

You could use an actual CompositeType(IntegerType, IntegerType), but it would use a little extra space and not buy you much. It doesn't sound like this is the case for you, but if you have several distinct types of rows, you should consider using separate column families for them rather than putting them all into one big CF. -- Tyler Hobbs, DataStax http://datastax.com/
Concatenating ids with extension to keep multiple rows related to an entity in a single CF
I am concatenating two Integer ids through bitwise operations (as described below) to create a single primary key of type long. I wanted to know if this is a good practice. This would help me keep multiple rows of an entity in a single column family by appending different extensions to the entityId. Are there better ways? My ids are of type Integer (4 bytes).

    public static final long makeCompositeKey(int k1, int k2) {
        return (long) k1 << 32 | k2;
    }
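For what it's worth, a sketch of the same idea with an unsigned mask added, so a negative k2 doesn't sign-extend over k1's bits, plus the matching decode (class and accessor names are mine, not from the thread):

```java
// Sketch of the composite-key idea with an unsigned mask so a negative
// k2 doesn't sign-extend over k1's bits, plus the matching decode.
// Class and accessor names are illustrative.
public class Keys {
    public static long makeCompositeKey(int k1, int k2) {
        return ((long) k1 << 32) | (k2 & 0xFFFFFFFFL);
    }

    public static int firstPart(long key) {
        return (int) (key >>> 32); // unsigned shift recovers k1
    }

    public static int secondPart(long key) {
        return (int) key; // truncation recovers k2, sign included
    }

    public static void main(String[] args) {
        long key = makeCompositeKey(7, -1);
        System.out.println(firstPart(key) + " " + secondPart(key)); // 7 -1
    }
}
```

Without the mask, `(long) k1 << 32 | k2` with a negative k2 sets all of k1's bits, so the mask matters whenever the second id can be negative.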
Re: Cassandra Cluster Admin - phpMyAdmin for Cassandra
Yes, that would be a pretty nice feature to see!

On Mon, Oct 31, 2011 at 10:45 PM, Ertio Lew ertio...@gmail.com wrote: Thanks so much SebWajam for this great piece of work! Is there a way to set a data type for displaying the column names/values of a CF? It seems that your project always uses StringSerializer for any piece of data; however, most of the time in real-world cases this is not true, so can we somehow configure which serializer to use while reading the data, so that the data may be properly identified by your project and delivered in a readable format?

On Mon, Aug 22, 2011 at 7:17 AM, SebWajam sebast...@wajam.com wrote: Hi, I've been working on this project for a few months now and I think it's mature enough to post it here: Cassandra Cluster Admin on GitHub, https://github.com/sebgiroux/Cassandra-Cluster-Admin

Basically, it's a GUI for Cassandra. If you're like me and used MySQL for a while (and are still using it!), you got used to phpMyAdmin and its simple and easy-to-use user interface. I thought it would be nice to have a similar tool for Cassandra and I couldn't find any, so I built my own! Supported actions:

- Keyspace manipulation (add/edit/drop)
- Column family manipulation (add/edit/truncate/drop)
- Row manipulation on column family and super column family (insert/edit/remove)
- Basic data browser to navigate in the data of a column family (seems to be the favorite feature so far)
- Support for Cassandra 0.8+ atomic counters
- Support for management of multiple Cassandra clusters

Bug reports and/or pull requests are always welcome!
Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE
..so that I can retrieve them through a single query. For reading columns from two CFs you need two queries, right?

On Sat, Oct 29, 2011 at 9:53 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Why not use 2 CFs?

On Fri, Oct 28, 2011 at 9:42 PM, Aditya Narayan ady...@gmail.com wrote: I need to keep the data of some entities in a single CF, but split into two rows per entity. One row contains overview information for the entity; the other row contains detailed information about the entity. I want to keep both rows in a single CF so they may be retrieved in a single query when required together. Now the problem I am facing is that I want to cache only the first type of rows (i.e., the overview rows) and prevent the second type (which contain large data) from getting into the cache. Is there a way I can filter which rows from a single CF enter the cache?
Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE
@Mohit: I have stated the example scenarios in my first post under this heading. Also, I have stated above why I want to split that data into two rows; like Ikeda stated below, I too am trying to prevent the frequently accessed rows from being bloated with large data, and want to prevent that data from entering the cache as well.

[Anthony Ikeda wrote:] Okay, so as most know, this practice is called a wide row; we use them quite a lot. However, as your schema shows, it will cache (while being active) all of the row in memory. One way we got around this issue was to basically create some materialized views of the more common data, so we can easily get to the minimum amount of information required without blowing too much memory on the larger representations.

Yes, exactly this is the problem I am facing, but I want to keep both types (common + large/detailed) of data in a single CF so that it could serve 'two materialized views'.

[Anthony:] My perspective is that indexing some of the higher levels of data would be the way to go: Solr or Elasticsearch for distributed, or if you know you only need it locally, just use a caching solution like Ehcache.

What do you mean exactly by indexing some of the higher levels of data? Thank you guys! -- Anthony
Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE
Thanks Zach, nice idea! And what about looking at, maybe, some custom caching solutions, leaving aside Cassandra caching?

On Sun, Oct 30, 2011 at 2:00 AM, Zach Richardson j.zach.richard...@gmail.com wrote: Aditya, depending on how often you have to write to the database, you could perform dual writes to two different column families: one that has summary + details in it, and one that has only the summary. This way you can get everything with one query, or the summary with one query; this should also help optimize your caching. The question here would of course be whether you have a read-heavy or write-heavy workload. Since you seem to be concerned about the caching, it sounds like you have more of a read-heavy workload and wouldn't pay too heavily for the dual writes. Zach

On Sat, Oct 29, 2011 at 2:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think you are missing the point. You don't get any benefit (performance, access); you are already breaking it into 2 rows. Also, I don't know of any way you can selectively keep rows or keys in the cache. Other than having some background job that keeps the cache hot with those keys/rows, you have only one option: keeping it in a different CF, since you are already breaking a row into 2 rows.
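Zach's dual-write idea could look roughly like this; the two HashMaps merely stand in for the summary CF and the summary+details CF, and every name here is illustrative rather than a Cassandra API:

```java
import java.util.*;

// Sketch of the dual-write suggestion: every save writes a small
// summary row to one CF and summary+details to a second CF. The two
// maps stand in for column families; none of this is a Cassandra API.
public class DualWrite {
    final Map<String, Map<String, String>> summaryCf = new HashMap<>(); // cached, hot
    final Map<String, Map<String, String>> fullCf = new HashMap<>();    // uncached

    public void save(String entityId, Map<String, String> summary, Map<String, String> details) {
        summaryCf.put(entityId, new HashMap<>(summary));
        Map<String, String> full = new HashMap<>(summary); // one read serves the detail view
        full.putAll(details);
        fullCf.put(entityId, full);
    }

    public static void main(String[] args) {
        DualWrite store = new DualWrite();
        Map<String, String> s = new HashMap<>();
        s.put("title", "Hello");
        Map<String, String> d = new HashMap<>();
        d.put("body", "Long text...");
        store.save("blog1", s, d);
        System.out.println(store.summaryCf.get("blog1").size() + " " + store.fullCf.get("blog1").size()); // 1 2
    }
}
```

The trade-off is exactly as Zach says: duplicated writes and storage in exchange for each view being a single read, with only the small summary CF competing for cache space.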
Re: Storing counters in the standard column families along with non-counter columns ?
Thanks Aaron & Chris, I appreciate your help. With a dedicated CF for counters, in addition to the issue pointed out by Chris, the major drawback I see is that I can't read, *in a single query*, the counters along with the regular-column row, which is widely required by my application. My use case is storing and reading the 'views count' of a post along with other post details (like post content, postedBy, etc.) in my application. I wanted to store the views count (a *counter column*) along with the details of the post.

On Thu, Jul 14, 2011 at 10:20 PM, Chris Burroughs chris.burrou...@gmail.com wrote: On 07/13/2011 03:57 PM, Aaron Morton wrote: You can always use a dedicated CF for the counters, and use the same row key. Of course one could do this. The problem is you are now spending ~2x disk space on row keys, and app-specific client code just became more complicated.
Re: Storing counters in the standard column families along with non-counter columns ?
Oops, that's really very disheartening, and it could seriously impact our plans for going live in the near future. Without this facility I guess counters currently have very little usefulness.

On Mon, Jul 11, 2011 at 8:16 PM, Chris Burroughs chris.burrou...@gmail.com wrote: On 07/10/2011 01:09 PM, Aditya Narayan wrote: Is there any target version in the near future for which this has been promised? The ticket is problematic in that it would, unless someone has a clever new idea, require breaking Thrift compatibility to add it to the API. Which is unfortunate, since it would be so useful. If it's in the 0.8.x series it will only be through CQL.
Re: Storing counters in the standard column families along with non-counter columns ?
Thanks for the info. Is there any target version in the near future for which this has been promised?

On Sun, Jul 10, 2011 at 9:12 PM, Sasha Dolgy sdo...@gmail.com wrote: No, it's not possible. To achieve it, there are two options: contribute to the issue, or wait for it to be resolved: https://issues.apache.org/jira/browse/CASSANDRA-2614 -sd

On Sun, Jul 10, 2011 at 5:04 PM, Aditya Narayan ady...@gmail.com wrote: Is it now possible to store counters in the standard column families along with non-counter-type columns? How can this be achieved?
Re: Storing counters in the standard column families along with non-counter columns ?
Cool. I am very much looking forward to the addition of this much-required facility to Cassandra.

On Sun, Jul 10, 2011 at 11:01 PM, samal sa...@wakya.in wrote: Yes, maybe 0.8.2. The current version needs the specific validation class CounterColumn for a counter CF, which only counts [+, -, no replace], whereas a normal CF simply adds or replaces.
Design for 'Most viewed Discussions' in a forum
For a discussions forum, I need to show a page of the most viewed discussions. To implement this, I maintain a count of views of each discussion, and when the views count of a discussion passes a certain threshold, the discussion id is added to a row of most viewed discussions. This row contains columns with Integer names, whose values contain serialized lists of the ids of all discussions whose views count equals the Integer name of that column. Thus, if the view count of a discussion increases, I'll need to move its id from the serialized list in one column to the serialized list in another column whose name represents the updated views count of that discussion. I can then get the most viewed discussions by fetching the appropriate number of columns from one end of this Integer-sorted row. I wanted to get feedback from you all, to know if this is a good design. Thanks
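A rough in-memory model of the row described above, with a TreeMap standing in for the Integer-sorted row of count columns (all names are illustrative; this is not Cassandra code):

```java
import java.util.*;

// In-memory model of the "most viewed" row: column name = view count,
// column value = the set of discussion ids at that count. A TreeMap
// stands in for the Integer-sorted row; all names are illustrative.
public class MostViewed {
    final NavigableMap<Integer, Set<String>> row = new TreeMap<>();

    // On a new view, move the id from its old count-column to count+1.
    public void recordView(String discussionId, int oldCount) {
        Set<String> old = row.get(oldCount);
        if (old != null) {
            old.remove(discussionId);
            if (old.isEmpty()) row.remove(oldCount);
        }
        row.computeIfAbsent(oldCount + 1, k -> new TreeSet<>()).add(discussionId);
    }

    // Read the appropriate number of ids from the high end of the sorted row.
    public List<String> top(int n) {
        List<String> out = new ArrayList<>();
        for (Set<String> ids : row.descendingMap().values())
            for (String id : ids) {
                if (out.size() == n) return out;
                out.add(id);
            }
        return out;
    }

    public static void main(String[] args) {
        MostViewed mv = new MostViewed();
        mv.recordView("d1", 0);
        mv.recordView("d1", 1); // d1 now at count 2
        mv.recordView("d2", 0); // d2 at count 1
        System.out.println(mv.top(2)); // [d1, d2]
    }
}
```

This makes the read-modify-write visible: every view touches two columns (remove from the old list, add to the new), which is the main cost of the design under concurrent updates.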
Re: Design for 'Most viewed Discussions' in a forum
I would arrange the memtable flush period so that the time period for which these most viewed discussions are generated equals the memtable flush period, so that the entire row of most viewed discussions on a topic is in one, or at most two, memtables/SSTables. This would also help minimize having several versions of the same column in row fragments across different SSTables.
Re: Design for 'Most viewed Discussions' in a forum
Thanks Victor! Aren't there any good ways of doing this with Cassandra alone?

On Wed, May 18, 2011 at 11:41 PM, openvictor Open openvic...@gmail.com wrote: Have you thought about using another kind of database, one which supports volatile content for example? I am currently thinking about doing something similar. The best and simplest option I can think of at the moment is Redis. In Redis you have the option of querying keys with wildcards. Your problem can be solved by just inserting a UUID into Redis for a certain amount of time (ideally, tailoring this amount of time as an inverse function of the number of keys existing in Redis).

With Redis, what I would do: I cut time into pieces of X minutes (15 minutes, for example, by truncating a timestamp). Let timestampN be the timestamp for the period [N, N+15], and let Topic1 and Topic2 be two topics. One or more people view Topic1, then Topic2, then again Topic1 in this 15-minute period (HINCRBY [1] is the increment):

    HINCRBY topics:Topic1:timestampN viewcount 1
    HINCRBY topics:Topic2:timestampN viewcount 1
    HINCRBY topics:Topic1:timestampN viewcount 1

Then you just query in the following way: MGET [2] topics:*:timestampN (* is the wildcard), you order by viewcount, and you have what you are asking for! This is a simplified version of what you should do, but personally I really like the combination of Cassandra and Redis. Victor

[1] http://redis.io/commands/hincrby
[2] http://redis.io/commands/mget
Re: Splitting the data of a single blog into 2 CFs (to implement effective caching) according to views.
Yes Aaron, I thought about that, but that doesn't seem to be just a small amount of data either (it contains text); we can consider doing so later as we find the need for it. Thank you both!

On Tue, Mar 8, 2011 at 2:25 PM, aaron morton aa...@thelastpickle.com wrote: You could duplicate the data from CF1 in CF2 as well (use a batch_mutation through whatever client you have), so when serving the second page you only need to read one row from CF2. Aaron

On 8/03/2011, at 8:13 PM, Norman Maurer wrote: Yeah, this makes sense as far as I can tell. Bye, Norman

2011/3/8 Aditya Narayan ady...@gmail.com: My application displays a list of several blogs' overview data (like blogTitle / nameOfBlogger / shortDescription for each blog) on the 1st page (very much like Digg's newsfeed), and when the user selects a particular blog to see, the application takes him to that blog's full-page view, which displays the entire data of the blog. Thus I am trying to split a blog's data into *two rows*, in two *different CFs* (one CF is row-cached, with less data per row, and the other, with each row holding the entire remaining blog data, is not cached). Data for the 1st-page view (titles and other overview data of a blog) is put in a row in the 1st CF. This CF is cached so as to improve the performance of heavily read data; only the data from the cached CF is read for the 1st page. The remaining data (the bulk of the blog text and the entire comments data) is stored as another row in the 2nd CF. For the 2nd page, *rows from both CFs have to be read*, which takes two read operations. Does this seem to be a good design?
Does the memtable replace the old version of a column with the new overwriting version, or is it just a simple append?
Do overwrites of newly written columns (that are still present in the memtable) *replace the old column*, or are they a simple append? I am trying to understand whether, if I update these columns very frequently while they are in the memtable, the read performance of these columns suffers because Cassandra has to read many versions of the same column. If the write replaces the old column, then I guess reads will be much better, since only a single existing version of the column needs to be seen. Thanks Aditya Narayan
Re: Does the memtable replace the old version of a column with the new overwriting version, or is it just a simple append?
So this means that in a memtable only the most recent version of a column will reside? For this implementation, while writing to the memtable Cassandra will check for other versions and overwrite them (reconciliation at write time)? I know that different SSTables may have different versions of the same column (and for those, reconciliation happens at read time). Thanks Narendra! On 3/9/11, Narendra Sharma narendra.sha...@gmail.com wrote: Multiple writes for the same key and column will result in overwriting of the column in a memtable. Basically, multiple updates for the same (key, column) are reconciled based on the column's timestamp. This happens per memtable, so if a memtable is flushed to an sstable, this rule applies to the next memtable. Note that sstables are immutable. So different sstables may have different versions of the same (key, column), and the reconciliation of that happens during read (read repair). This is why reads are slower than writes: conflict resolution happens during the read. Hope this answers the question! Thanks, -Naren
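Narendra's description of per-memtable reconciliation can be modelled in a few lines. This is a toy assuming the timestamp-wins semantics described in the reply; it is not Cassandra's actual implementation:

```python
# Toy model of memtable overwrite semantics: later writes to the same
# (row key, column name) replace earlier ones, resolved by timestamp.
# Illustrative only; mimics the behaviour described in the thread.

class Memtable:
    def __init__(self):
        self.data = {}  # (row_key, column_name) -> (value, timestamp)

    def write(self, row_key, column_name, value, timestamp):
        key = (row_key, column_name)
        current = self.data.get(key)
        # Reconcile on write: keep the column with the highest timestamp.
        if current is None or timestamp >= current[1]:
            self.data[key] = (value, timestamp)

    def read(self, row_key, column_name):
        entry = self.data.get((row_key, column_name))
        return entry[0] if entry else None
```

Note how a stale write (lower timestamp) is dropped and the table holds exactly one entry per (key, column), which is why frequent in-memtable updates do not pile up versions.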
Splitting the data of a single blog into 2 CFs (to implement effective caching) according to views.
My application displays a list of several blogs' overview data (blogTitle, nameOfBlogger, shortDescription for each blog) on the first page (much like Digg's newsfeed). When the user selects a particular blog, the application takes him to that blog's full-page view, which displays the blog's entire data. So I am trying to split a blog's data into *two rows* in two *different CFs*: one CF is row-cached, with a small amount of data in each row, and the other, with each row holding the entire remaining blog data, is uncached. Data for the first-page view (titles and other overview data of a blog) is put in a row in the first CF. This CF is cached so as to improve the performance of heavily read data; only data from the cached CF is read for the first page. The remaining data (the bulk of the blog text and all comment data) is stored as another row in the second CF. For the second page, *rows from both CFs have to be read*, which takes two read operations. Does this seem like a good design?
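The proposed read pattern, as an in-memory sketch (dicts stand in for the two CFs; all names are illustrative): the listing page touches only the small, cacheable CF, while the full view costs one read per CF.

```python
# Sketch of the two-CF split: a row-cached "overview" CF for the listing
# page and an uncached "content" CF for the full view. Dicts stand in
# for the CFs; names are illustrative.

blog_overview = {}   # cached CF: blog_id -> small columns
blog_content = {}    # uncached CF: blog_id -> bulky columns

def save_blog(blog_id, title, author, description, body, comments):
    blog_overview[blog_id] = {'title': title, 'author': author,
                              'description': description}
    blog_content[blog_id] = {'body': body, 'comments': comments}

def first_page(blog_ids):
    # Listing page touches only the small, cacheable rows.
    return [blog_overview[b] for b in blog_ids]

def full_view(blog_id):
    # Full view needs two reads, one per CF.
    return {**blog_overview[blog_id], **blog_content[blog_id]}
```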
What would be a good strategy for Storing the large text contents like blog posts in Cassandra.
What would be a good strategy to store large text content (blog posts of around 1500-3000 characters) in Cassandra? I need to store these blog posts along with their metadata, like bloggerId and blogTags. I am looking to store this data in a single row, giving each attribute a single column, so one blog per row. Is using a single column for a large blog post like this a good strategy? Next, I also need to store the blog comments, which I am planning to store together in another single row, one comment per column. The entire information about a single comment, like commentBody and commenter, would be serialized (using Google Protocol Buffers) and stored in a single column. For the number of likes of each comment, I am planning to keep a counter column in the same row for each comment, holding the number of 'likes' of that comment. Any suggestions on the above design are highly appreciated. Thanks.
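A sketch of the comment-row layout under discussion. The thread mentions Google Protocol Buffers; json is used here purely as a stand-in serializer, and the likes dict stands in for Cassandra counter columns:

```python
import json

# One row of blog comments: column name = comment id, column value =
# serialized comment blob. json is a stand-in for the Protocol Buffers
# serialization mentioned in the thread; all names are illustrative.

comments_row = {}    # comment_id -> serialized blob
comment_likes = {}   # comment_id -> count (stand-in for counter columns)

def add_comment(comment_id, body, commenter):
    comments_row[comment_id] = json.dumps(
        {'body': body, 'commenter': commenter})

def like(comment_id):
    comment_likes[comment_id] = comment_likes.get(comment_id, 0) + 1

def get_comment(comment_id):
    comment = json.loads(comments_row[comment_id])
    comment['likes'] = comment_likes.get(comment_id, 0)
    return comment
```

Keeping the blob and its like counter under the same row key means a single row read fetches a comment page plus its counts.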
Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.
Thanks Aaron!! I didn't know about the upcoming facility for built-in counters. This sounds really great for my use case! Could you let me know where I can read more about this, if it has been blogged about somewhere? I'll go forward with the one-(entire)-blog-per-column design. Thanks On Mon, Mar 7, 2011 at 5:10 AM, Aaron Morton aa...@thelastpickle.com wrote: Sounds reasonable: one CF for the blog post, one CF for the comments. You could also use a single CF if you will often read the blog and the comments at the same time. The best design is the one that suits how your app works; try one and be prepared to change. Note that counters are only in the 0.8 trunk and are still under development; they are not going to be released for a couple of months. Your per-column data size is nothing to be concerned about. Hope that helps. Aaron
Splitting a single row into multiple
Does it make any difference if I split a row that needs to be accessed together into two or three rows and then read those multiple rows? (Assume the keys of all three rows are known to me programmatically, since I split columns by certain categories.) Would the performance be any better if all three were just a single row? I guess the performance should be the same in both cases: the columns remain the same in quantity, spread across several SSTables.
Re: Splitting a single row into multiple
Thanks Aaron. I was looking at splitting the rows so that I could use a standard CF instead of a super CF, but your argument also makes sense. On Thu, Feb 24, 2011 at 1:19 AM, Aaron Morton aa...@thelastpickle.com wrote: AFAIK performance in the single-row case will be better. A multiget may require multiple seeks and reads in an sstable, versus obviously a single seek and read for a single row, multiplied by the number of sstables that contain row data. Using the key cache would reduce the seeks. If it makes sense in your app, do it. In general, though, try to model data so a single row read gets what you need. Aaron
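Aaron's seek argument can be put as back-of-envelope arithmetic. This is an illustrative worst-case model, not measured numbers:

```python
# Worst-case seek count for a multiget, per Aaron's reasoning: each row
# read may cost one seek in every sstable holding data for that row.
# Purely illustrative arithmetic.

def worst_case_seeks(num_rows, sstables_per_row):
    return num_rows * sstables_per_row

# Splitting one row into three, each spread over 2 sstables:
split_seeks = worst_case_seeks(3, 2)    # up to 6 seeks
single_seeks = worst_case_seeks(1, 2)   # up to 2 seeks
```

The key cache removes the index-lookup seeks but not the data reads, which is why the single-row model still tends to win.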
Re: Confused about get_slice SliceRange behavior with bloom filter
Thanks Sylvain, I guess I might have misunderstood the meaning of column_index_size_in_kb. My previous understanding was that it is the threshold size a row must pass before its columns are indexed. If I have understood correctly, it is the size of the blocks (containing columns) that are kept together under the same index entry. So if you make it high, a large number of columns will need to be deserialized for a single column access in that block; and if you make it lower than optimal, the index size will grow, right? So I guess we should vary it depending on the size of our columns and not the size of rows? I have valueless columns for my use case. On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne sylv...@datastax.com wrote: As said by aaron, if the whole row is under 64k, it won't matter. But since you spoke of a very wide row, I'll assume the whole will be much more than 64k. If so, the row is indexed by blocks (of 64k, configurable). Then the read performance depends on how many of those blocks are needed for the query, since each block potentially means a seek (potentially, because some blocks could happen to be sequential on disk). So if the columns you ask for are really randomly distributed, then yes, the bigger the row is, the bigger the chance of having to hit many blocks, and the bigger the chance of those blocks being far apart on disk. -- Sylvain On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan ady...@gmail.com wrote: Jonathan, If I ask for around 150-200 columns (totally random, not sequential) from a very wide row that contains a million or even more columns, is the read performance of the SliceQuery operation affected by the length of the row? (For my use case, I would use the column names list for this SliceQuery operation.)
Thanks Aditya On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis jbel...@gmail.com wrote: On Sun, Feb 13, 2011 at 12:37 AM, E S tr1skl...@yahoo.com wrote: I've gotten myself really confused by http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping someone can help me understand what the io behavior of this operation would be. When I do a get_slice for a column range, will it seek to every SSTable? I had thought that it would use the bloom filter on the row key so that it would only do a seek to SSTables that have a very high probability of containing columns for that row. Yes. In the linked doc above, it seems to say that it is only used for exact column names. Am I misunderstanding this? Yes. You may be confusing multi-row behavior with multi-column. On a related note, if instead of using a SliceRange I provide an explicit list of columns, will I have to read all SSTables that have values for the columns? Yes. Or is it smart enough to stop after finding a value from the most recent SSTable? There is no way to know which value is most recent without having to read it first. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Confused about get_slice SliceRange behavior with bloom filter
Thanks for the clarifications. On Mon, Feb 14, 2011 at 6:13 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Feb 14, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote: Thanks Sylvain, I guess I might have misunderstood the meaning of column_index_size_in_kb. My previous understanding was that it is the threshold size a row must pass before its columns are indexed. It is the size of the index 'bucket'. But given that there is no point in having an index with only one entry, it is true that it is also the threshold after which rows start to be indexed. If I have understood correctly, it is the size of the blocks (containing columns) that are kept together under the same index entry. So if you make it high, a large number of columns will need to be deserialized for a single column access in that block; and if you make it lower than optimal, the index size will grow, right? Yes. So I guess we should vary it depending on the size of our columns and not the size of rows? I have valueless columns for my use case. Yes, it depends mainly on the size of your columns. But if you have big rows, even with very tiny columns, you may still not want to put too small a value there. In general, I would make careful tests with your workload before changing the value of column_index_size_in_kb, to see if it does make a difference. Not sure there is much to gain here. -- Sylvain
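Sylvain's point about index blocks can be made concrete with rough arithmetic. The assumed numbers (column size, block size) are illustrative only:

```python
# Rough model of the column_index_size_in_kb trade-off: a wide row is
# indexed in blocks (64 KB by default), and a read of randomly placed
# columns may touch many distinct blocks. Numbers are illustrative.

def index_blocks(row_size_kb, block_kb=64):
    """How many index blocks a row of the given size spans."""
    return max(1, -(-row_size_kb // block_kb))  # ceiling division

def max_blocks_touched(columns_requested, row_size_kb, block_kb=64):
    """Upper bound on blocks read when requested columns are random."""
    return min(columns_requested, index_blocks(row_size_kb, block_kb))

# E.g. a 1M-column row of ~50-byte columns is roughly 48,800 KB, i.e.
# hundreds of 64 KB blocks, so a 200-column random read can in the worst
# case touch 200 distinct blocks (and up to that many seeks).
```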
Re: Confused about get_slice SliceRange behavior with bloom filter
Jonathan, If I ask for around 150-200 columns (totally random, not sequential) from a very wide row that contains a million or even more columns, is the read performance of the SliceQuery operation affected by the length of the row? (For my use case, I would use the column names list for this SliceQuery operation.) Thanks Aditya
Re: Merging the rows of two column families(with similar attributes) into one ??
What if the caching requirements and sorting needs of the two kinds of data are very similar; is it preferable to go with a single CF in those cases? Regards Aditya On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com wrote: I read somewhere that a larger number of column families is not a good idea, as it consumes more memory and causes more compactions. This is primarily true, but not in every case. But the caching requirements may be different, as they cater to two different features. This is a great reason to *not* merge them. Besides the key and row caches, don't forget about the OS buffer cache. Is it recommended to merge these two column families into one? Thoughts? No, this sounds like an anti-pattern to me. The overhead from having two separate CFs is not that high. -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: Merging the rows of two column families(with similar attributes) into one ??
Any comments/viewpoints on this? -- On Sat, Feb 12, 2011 at 5:05 PM, Aditya Narayan ady...@gmail.com wrote: What if the caching requirements and sorting needs of the two kinds of data are very similar; is it preferable to go with a single CF in those cases? Regards Aditya
Re: Calculating the size of rows in KBs
Thank you Aaron!! But if you are reading partial rows (that otherwise contain several thousands of *valueless* columns), do the column indexes help make the reads faster/more efficient than if the columns were not valueless? Perhaps, because Cassandra would only need to look up whether the asked-for column names exist in the indexes for that row/key, it would not need to deserialize the blocks in the SSTables searching for column values. Am I thinking the right way? -Aditya On Fri, Feb 11, 2011 at 1:54 AM, Aaron Morton aa...@thelastpickle.com wrote: If you want to get the byte size of a particular row, you will need to read it all back. If you connect with JConsole and look at your column families, there are attributes for the max, min and mean row sizes. In general, the entire row only exists in memory when it is contained in the first Memtable it's written to. It may then be partially or fully read from disk during subsequent reads or compactions. The on-disk format described here may help: http://wiki.apache.org/cassandra/ArchitectureSSTable Hope that helps Aaron On 10/02/2011, at 11:56 PM, Aditya Narayan ady...@gmail.com wrote: How can I get or calculate the size of rows/columns? What are the overheads on memory for each column/row?
Re: Does variation in the number of columns per row across the column family have any performance impact?
Thanks for the detailed explanation Peter! Definitely cleared my doubts! On Mon, Feb 7, 2011 at 1:52 PM, Peter Schuller peter.schul...@infidyne.com wrote: Does huge variation in the number of columns per row across the column family have *any* impact on performance? Can I have just 100 columns in some rows and hundreds of thousands of columns in another set of rows, without any downsides? If I interpret your question the way I think you mean it, then no, Cassandra doesn't do anything with the data such that the smaller rows are somehow directly less efficient because other rows are bigger. It doesn't affect the on-disk format or the on-disk efficiency of accessing the rows. However, there are almost always indirect effects when it comes to performance, in particular in storage systems. In the case of Cassandra, the *variation* itself should not impose a direct performance penalty, but there are potential other effects. For example, the row cache is only useful for small rows, so if you are looking to use the row cache, the huge rows would perhaps prevent that. This could be interpreted as a performance impact on the smaller rows by the larger rows. Compaction may become more expensive due to e.g. additional GC pressure resulting from large-but-still-within-memory-limits rows being compacted (or not, depending on JVM/GC settings). There is also the effect of cache locality as the data set grows, and the cache locality for the smaller rows will likely be worse than had they been in e.g. a separate CF. Those are just three random examples; I'm just trying to make the point that "without any downsides" is a very strong and blanket requirement for making the decision to mix small rows with larger ones. -- / Peter Schuller
Column Sorting of integer names
Is there any way to sort columns with integer names in descending order? Regards -Aditya
Re: Using Cassandra to store files
I am also looking at possible solutions to store PDFs and Word documents. But why won't you store them in the filesystem instead of a database, unless your files are very small, in which case a database would be recommended? -Aditya On Fri, Feb 4, 2011 at 5:30 PM, Daniel Doubleday daniel.double...@gmx.net wrote: We are doing this with cassandra. But we cache a lot. We get around 20 writes/s and 1k reads/s (~100Mbit/s) for that particular CF, but only 1% of them hit our cassandra cluster (5 nodes, rf=3). /Daniel On Feb 4, 2011, at 9:37 AM, Brendan Poole wrote: Hi Daniel When you say "We are doing this", do you mean via NFS or Cassandra? Thanks Brendan Brendan Poole Systems Developer NewLaw Solicitors Helmont House Churchill Way Cardiff brendan.po...@new-law.co.uk 029 2078 4283 www.new-law.co.uk From: Daniel Doubleday [mailto:daniel.double...@gmx.net] Sent: 03 February 2011 17:21 To: user@cassandra.apache.org Subject: Re: Using Cassandra to store files Hundreds of thousands doesn't sound too bad. Good old NFS would do with an OK directory structure. We are doing this. Our documents are pretty small though (a few kb). We have around 40M right now, with around 300GB total. Generally the problem is that that much data usually means Cassandra becomes io-bound during repairs and compactions, even if your hot dataset would fit in the page cache. There are efforts to overcome this, and 0.7 will help with the repair problems, but for the time being you have to have quite some headroom in terms of io performance to handle these situations. Here is a related post: http://comments.gmane.org/gmane.comp.db.cassandra.user/11190 On Feb 3, 2011, at 1:33 PM, Brendan Poole wrote: Hi Would anyone recommend using Cassandra for storing hundreds of thousands of documents in Word/PDF format? The manual says it can store documents under 64MB with no issue, but I was wondering if anyone is using it for this specific purpose.
Would it be efficient/reliable, and is there anything I need to bear in mind? Thanks in advance Brendan Poole Systems Developer NewLaw Solicitors
Re: Using Cassandra to store files
Yes, definitely a database for the mapping, of course! On Fri, Feb 4, 2011 at 11:17 PM, buddhasystem potek...@bnl.gov wrote: Even when storage is in NFS, Cassandra can still be quite useful as a file catalog. Your physical storage can change, move, etc. Therefore, it's a good idea to provide a mapping of logical names to physical store points (which in fact can be many). This is a standard technique used in mass storage.
Re: Sorting in time order without using TimeUUID type column names
Thanks Aaron. Yes, I can put the column names without the userId in the timeline row, and when I want to retrieve the row corresponding to a column name, I will attach the userId to get the row key. Yes, I'll store it as a long; I guess I'll have to write a custom comparator type (ReversedIntegerType) to sort those longs in descending order. Regards Aditya On Sat, Feb 5, 2011 at 6:24 AM, aaron morton aa...@thelastpickle.com wrote: IMHO if you know the time of the event, store the time as a long rather than a UUID. It will make it easier to get back to a time and easier for you to compare columns. TimeUUIDs have a pseudo-random part as well as the time part; it could be set to a constant, but why bother if you know the absolute time. I'm not sure what the ReminderCountOfThisUser is for, and as Sylvain says, there is no need for the user name if this is in a row just for the user. Hope that helps. Aaron On 4 Feb 2011, at 01:32, Aditya Narayan wrote: If I use TimestampOfDueTimeInFuture:UserId:ReminderCountOfThisUser as the key pattern for the rows of reminders, then I am storing the key, just as it is, as the column name, and thus column values need not contain a link to the row containing the reminder details. I think UserId would be required along with the timestamp in the key pattern to provide uniqueness, as there may be several reminders generated by users on the application at the same time. But my question is whether it is really advisable to generate keys with this pattern instead of going with TimeUUIDs. Are there any downsides which I am perhaps not aware of? On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote: Hey all, I want to store some columns that are reminders to the users of my application, in time-sorted order in a row (the timeline row of the user). Would it be recommended to store these reminder columns in the timeline row with column names that combine the timestamp (of when the reminder falls due), the UserId, and the reminder count of that user: Column Name = TimestampOfDueTimeInFuture:UserId:ReminderCountOfThisUser? If you have one row per user (which is a good idea), why keep the UserId in the column name? Then what comparator could I use to sort them in order of their due time? This comparator should be able to sort numbers in descending order (I guess ascii type would do the opposite order). (Reminders need to be sorted in the timeline in the order of their due time.) *The* solution is to write a custom comparator. Have a look at http://www.datastax.com/docs/0.7/data_model/column_families and http://www.sodeso.nl/?p=421 for instance. As a side note, the fact that the comparator sorts in ascending order when you need descending order shouldn't be that much of a problem, since you can always do slice queries in reversed order. But even then, AsciiType is not a very satisfying solution, as you would have to be careful about the padding of your timestamps for it to work correctly. So again, a custom comparator is the way to go. Basically I am trying to avoid 16-byte TimeUUIDs, first because they are too long, and because the above key pattern guarantees me a unique key/id for the reminder row. Thanks Aditya Narayan -- Sylvain
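The padding caveat Sylvain raises is worth seeing concretely: under a lexical (AsciiType-style) sort, unpadded timestamps compare character by character, so '100' sorts before '99'. A small sketch follows; the key layout matches the thread, but the pad width is an assumption for illustration:

```python
# Demonstrates why unpadded timestamps break AsciiType ordering, and how
# zero-padding fixes it. Descending order can then come from slicing in
# reversed order rather than a custom comparator. Pad width (13 digits)
# is illustrative.

def reminder_column_name(due_ts, user_id, count):
    # Pad the timestamp so lexical order equals numeric order.
    return '%013d:%s:%d' % (due_ts, user_id, count)

unpadded = sorted(['99:u1:0', '100:u2:0'])      # lexical: '1' < '9'
padded = sorted([reminder_column_name(99, 'u1', 0),
                 reminder_column_name(100, 'u2', 0)])
newest_first = sorted(padded, reverse=True)      # reversed slice order
```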
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Thanks Tyler! On Thu, Feb 3, 2011 at 12:06 PM, Tyler Hobbs ty...@datastax.com wrote: On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote: Can I have some more feedback about my schema, perhaps somewhat more critical/harsh? It sounds reasonable to me. Since you're writing/reading all of the subcolumns at the same time, I would opt for a standard column with the tags serialized into a column value. I don't think you need to worry about row lengths here. Depending on the reminder size and how many times it's likely to be repeated in the timeline, you could explore denormalizing a bit more by storing the reminders in the timelines themselves, perhaps with a separate row per (user, tag) combination. This would cut down on your seeks quite a bit, but it may not be necessary at this point (or at all). -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Sorting in time order without using TimeUUID type column names
Hey all, I want to store some columns that are reminders to the users on my application, in time-sorted order in a row (the timeline row of the user). Would it be recommended to store these reminder columns in the timeline row with column names like a combination of the timestamp (of the time when the reminder gets due) + UserId + the reminder count of that user; Column Name = TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser? Then what comparator could I use to sort them in order of their due time? This comparator should be able to sort numbers in descending order. (I guess AsciiType would give the opposite order.) (Reminders need to be sorted in the timeline in the order of their due time.) Basically I am trying to avoid 16-byte-long TimeUUIDs, first because they are too long, and because the above key pattern always guarantees me a unique key/id for the reminder row. Thanks Aditya Narayan
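A quick client-side sketch of why padding matters for the AsciiType idea: lexicographic order only agrees with numeric time order when the timestamp is zero-padded to a fixed width. (The column-name layout, the 13-digit width, and the IDs below are hypothetical, just for illustration.)

```python
TS_WIDTH = 13  # hypothetical fixed width; fits millisecond epoch timestamps

def reminder_column_name(due_ts_ms, user_id, reminder_count):
    # Build "paddedTimestamp:userId:count" so that AsciiType's
    # lexicographic comparison agrees with numeric time order.
    return "%s:%s:%s" % (str(due_ts_ms).zfill(TS_WIDTH), user_id, reminder_count)

earlier = reminder_column_name(1296680000000, "user42", 7)
later = reminder_column_name(1296680001000, "user42", 8)
assert earlier < later   # padded strings sort like the numbers they encode
assert "99" > "100"      # unpadded strings do not: "9" > "1" lexicographically
```

The same padding concern is what the reply below raises about using AsciiType instead of a custom comparator.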
Re: Sorting in time order without using TimeUUID type column names
If I use TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser as the key pattern for the rows of reminders, then I am storing the key, just as it is, as the column name, and thus the column values need not contain a link to the row containing the reminder details. I think the UserId would be required along with the timestamp in the key pattern to make the key unique, as there may be several reminders generated by users on the application at the same time. But my question is whether it is really advisable to generate the keys with this pattern ... instead of going with TimeUUIDs? Are there any downsides which I am perhaps not aware of? On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote: Hey all, I want to store some columns that are reminders to the users on my application, in time-sorted order in a row (the timeline row of the user). Would it be recommended to store these reminder columns in the timeline row with column names like a combination of the timestamp (of the time when the reminder gets due) + UserId + the reminder count of that user; Column Name = TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser? If you have one row per user (which is a good idea), why keep the UserId in the column name? Then what comparator could I use to sort them in order of their due time? This comparator should be able to sort numbers in descending order. (I guess AsciiType would give the opposite order.) (Reminders need to be sorted in the timeline in the order of their due time.) *The* solution is to write a custom comparator. Have a look at http://www.datastax.com/docs/0.7/data_model/column_families and http://www.sodeso.nl/?p=421 for instance. As a side note, the fact that the comparator sorts in ascending order when you need descending order shouldn't be that much of a problem, since you can always do slice queries in reversed order. But even then, AsciiType is not a very satisfying solution, as you would have to be careful about the padding of your timestamp for it to work correctly. So again, a custom comparator is the way to go. Basically I am trying to avoid 16-byte-long TimeUUIDs, first because they are too long, and because the above key pattern always guarantees me a unique key/id for the reminder row. Thanks Aditya Narayan -- Sylvain
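The ordering a custom comparator would impose server-side can be sketched client-side as a sort key. This is only an illustration of the comparison logic, not Cassandra's comparator API; the "timestamp:userId:count" column-name format is the hypothetical one discussed in this thread.

```python
def due_time_descending_key(column_name):
    # Assumed format "timestamp:userId:count": order by the numeric
    # timestamp, newest first -- the order a descending custom comparator
    # (or an ascending one read via a reversed slice query) would give.
    return -int(column_name.split(":", 1)[0])

columns = ["0000000100:u1:1", "0000000090:u2:3", "0000000250:u1:2"]
newest_first = sorted(columns, key=due_time_descending_key)
assert newest_first == ["0000000250:u1:2", "0000000100:u1:1", "0000000090:u2:3"]
```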
Re: Sorting in time order without using TimeUUID type column names
If I use TimestampOfDueTimeInFuture | UserId | ReminderCountOfThisUser as the key pattern for the rows of reminders, then I am storing the key, just as it is, as the column name, and thus the column values need not contain a link to the row containing the reminder details. I think the UserId would be required along with the timestamp in the key pattern to make the key unique, as there may be several reminders generated at the same time by other users on the application. But my question is whether it is really advisable to generate the keys with this pattern ... instead of going with TimeUUIDs? Are there any downsides which I am perhaps not aware of? On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote: Hey all, I want to store some columns that are reminders to the users on my application, in time-sorted order in a row (the timeline row of the user). Would it be recommended to store these reminder columns in the timeline row with column names like a combination of the timestamp (of the time when the reminder gets due) + UserId + the reminder count of that user; Column Name = TimestampOfDueTimeInFuture : UserId : ReminderCountOfThisUser? If you have one row per user (which is a good idea), why keep the UserId in the column name? Then what comparator could I use to sort them in order of their due time? This comparator should be able to sort numbers in descending order. (I guess AsciiType would give the opposite order.) (Reminders need to be sorted in the timeline in the order of their due time.) *The* solution is to write a custom comparator. Have a look at http://www.datastax.com/docs/0.7/data_model/column_families and http://www.sodeso.nl/?p=421 for instance. As a side note, the fact that the comparator sorts in ascending order when you need descending order shouldn't be that much of a problem, since you can always do slice queries in reversed order. But even then, AsciiType is not a very satisfying solution, as you would have to be careful about the padding of your timestamp for it to work correctly. So again, a custom comparator is the way to go. Basically I am trying to avoid 16-byte-long TimeUUIDs, first because they are too long, and because the above key pattern always guarantees me a unique key/id for the reminder row. Thanks Aditya Narayan -- Sylvain
Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Hey all, I need to store supercolumns, each with around 8 subcolumns; all the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows. Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into single column(s)? Thanks Aditya Narayan
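For the "serialize into a single column" option, one plausible sketch packs what would have been the ~8 subcolumn values into one standard column value. (JSON is an assumption here, not something the thread specifies; any compact serialization would do, and the tag/row-key names are hypothetical.)

```python
import json

def pack_subcolumns(subcolumns):
    # What would have been ~8 subcolumns becomes one standard column value.
    return json.dumps(subcolumns, sort_keys=True, separators=(",", ":"))

def unpack_subcolumns(value):
    return json.loads(value)

# Hypothetical tag -> reminder-row-key mapping.
tags = {"tag1": "reminder-row-abc", "tag2": "reminder-row-def"}
assert unpack_subcolumns(pack_subcolumns(tags)) == tags
```

Since the whole group is always written and read together, the single serialized column costs one column read/write instead of eight, at the price of rewriting the whole value on any change.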
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves, for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has got certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline of reminders to the user, the timeline row of each user would contain the IDs/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside those supercolumns could contain the list of tags associated with the particular reminder, all tags set at once during the first write. The number of tags (subcolumns) will be around 8 at maximum. Any comments, suggestions and feedback on the schema design are welcome. Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan ady...@gmail.com wrote: Hey all, I need to store supercolumns, each with around 8 subcolumns; all the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows. Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into single column(s)? Thanks Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time; after a certain time (say 3 or 6 months) they may be deleted via the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start/end) will be picked up, and their subcolumn values will be compared to the tags the user has chosen to see; corresponding to the filtered subcolumn values (tags), the rows with the reminder details would be picked up. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders. The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me. Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves, for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has got certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline of reminders to the user, the timeline row of each user would contain the IDs/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside those supercolumns could contain the list of tags associated with the particular reminder, all tags set at once during the first write. The number of tags (subcolumns) will be around 8 at maximum. Any comments, suggestions and feedback on the schema design are welcome. Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan ady...@gmail.com wrote: Hey all, I need to store supercolumns, each with around 8 subcolumns; all the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows. Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into single column(s)? Thanks Aditya Narayan
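The "pick the latest ~50 supercolumns, then filter by the chosen tags" step described above can be sketched at the application level like this. (The slice shape and all names are hypothetical; the slice is assumed to arrive already in reverse-chronological order, e.g. from a reversed slice query.)

```python
def filter_timeline_by_tags(timeline_slice, wanted_tags, limit=50):
    # timeline_slice: [(reminder_id, {subcolumn_name: tag_value, ...}), ...]
    # Keep the IDs of reminders carrying at least one of the wanted tags.
    matching = []
    for reminder_id, subcolumns in timeline_slice:
        if wanted_tags.intersection(subcolumns.values()):
            matching.append(reminder_id)
            if len(matching) == limit:
                break
    return matching

timeline = [("r3", {"t1": "work"}), ("r2", {"t1": "home"}),
            ("r1", {"t1": "work", "t2": "urgent"})]
assert filter_timeline_by_tags(timeline, {"work"}) == ["r3", "r1"]
```

The matching reminder IDs would then be used as row keys for a multiget against the reminder-details column family.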
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
You got me wrong perhaps... I am already splitting the row on a per-user basis, of course; otherwise the schema won't make sense for my usage. The row contains only *reminders of a single user*, sorted in chronological order. The reminder IDs are stored as supercolumn names, and the subcolumns contain the tags for that reminder. On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote: Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (I don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry, cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill- On 02/02/2011 10:43 AM, Aditya Narayan wrote: I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time; after a certain time (say 3 or 6 months) they may be deleted via the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start/end) will be picked up, and their subcolumn values will be compared to the tags the user has chosen to see; corresponding to the filtered subcolumn values (tags), the rows with the reminder details would be picked up. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders. The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me. Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves, for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has got certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline of reminders to the user, the timeline row of each user would contain the IDs/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside those supercolumns could contain the list of tags associated with the particular reminder, all tags set at once during the first write. The number of tags (subcolumns) will be around 8 at maximum. Any comments, suggestions and feedback on the schema design are welcome. Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan ady...@gmail.com wrote: Hey all, I need to store supercolumns, each with around 8 subcolumns; all the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows. Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into single column(s)? Thanks Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
@Bill Thank you Bill! @Cassandra users Can others also leave their suggestions and comments about my schema, please? Also my question about whether to use a supercolumn or, alternatively, just store the data (that would otherwise be stored in subcolumns) serialized into a single column in a standard-type column family. Thanks -Aditya Narayan On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com wrote: I did not understand before... sorry. Again, depending upon how many reminders you have for a single user, this could be a long/wide row. Again, it really comes down to how many reminders we are talking about and how often they will be read/written. While a single row can contain millions (maybe more) columns, that doesn't mean it's a good idea. I'm working on a logging system with Cassandra and ran into this same type of problem: do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of no, and now I break my row keys into POSIX_timestamp:system, where my timestamps are buckets for every 5 minutes. This nicely distributes the load across the nodes in my system. Bill- On 02/02/2011 11:18 AM, Aditya Narayan wrote: You got me wrong perhaps... I am already splitting the row on a per-user basis, of course; otherwise the schema won't make sense for my usage. The row contains only *reminders of a single user*, sorted in chronological order. The reminder IDs are stored as supercolumn names, and the subcolumns contain the tags for that reminder. On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote: Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (I don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry, cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill- On 02/02/2011 10:43 AM, Aditya Narayan wrote: I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time; after a certain time (say 3 or 6 months) they may be deleted via the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start/end) will be picked up, and their subcolumn values will be compared to the tags the user has chosen to see; corresponding to the filtered subcolumn values (tags), the rows with the reminder details would be picked up. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders. Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves, for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has got certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline of reminders to the user, the timeline row of each user would contain the IDs/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside
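Bill's bucketing scheme from his logging example (row keys of the form POSIX_timestamp:system, with 5-minute buckets) can be sketched as follows; the helper name is hypothetical:

```python
BUCKET_SECONDS = 300  # 5-minute buckets, as in the logging example

def bucketed_row_key(posix_ts, system):
    # Round the timestamp down to its bucket so writes spread across many
    # rows (and therefore nodes) instead of piling into one hot row.
    return "%d:%s" % (posix_ts - posix_ts % BUCKET_SECONDS, system)

assert bucketed_row_key(1296668701, "web01") == "1296668700:web01"
# Any timestamp within the same 5-minute window maps to the same row key.
assert bucketed_row_key(1296668999, "web01") == bucketed_row_key(1296668701, "web01")
```

The same idea would apply to the reminders timeline: bucketing a user's row by a time window caps row width while keeping a whole window's columns together.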
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Can I have some more feedback about my schema, perhaps somewhat more critical/harsh? Thanks again, Aditya Narayan On Wed, Feb 2, 2011 at 10:27 PM, Aditya Narayan ady...@gmail.com wrote: @Bill Thank you Bill! @Cassandra users Can others also leave their suggestions and comments about my schema, please? Also my question about whether to use a supercolumn or, alternatively, just store the data (that would otherwise be stored in subcolumns) serialized into a single column in a standard-type column family. Thanks -Aditya Narayan On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com wrote: I did not understand before... sorry. Again, depending upon how many reminders you have for a single user, this could be a long/wide row. Again, it really comes down to how many reminders we are talking about and how often they will be read/written. While a single row can contain millions (maybe more) columns, that doesn't mean it's a good idea. I'm working on a logging system with Cassandra and ran into this same type of problem: do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of no, and now I break my row keys into POSIX_timestamp:system, where my timestamps are buckets for every 5 minutes. This nicely distributes the load across the nodes in my system. Bill- On 02/02/2011 11:18 AM, Aditya Narayan wrote: You got me wrong perhaps... I am already splitting the row on a per-user basis, of course; otherwise the schema won't make sense for my usage. The row contains only *reminders of a single user*, sorted in chronological order. The reminder IDs are stored as supercolumn names, and the subcolumns contain the tags for that reminder. On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote: Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (I don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry, cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill- On 02/02/2011 10:43 AM, Aditya Narayan wrote: I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time; after a certain time (say 3 or 6 months) they may be deleted via the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start/end) will be picked up, and their subcolumn values will be compared to the tags the user has chosen to see; corresponding to the filtered subcolumn values (tags), the rows with the reminder details would be picked up. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders. The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me. Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves, for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has got certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline