Why don't you start off with a single small Cassandra server, as you usually do with MySQL?
For any website just starting out, the load is initially minimal and grows at a slow pace. People usually start their MySQL-based sites on a single server (often a VPS, not even a dedicated server) running as both app server and DB server, get quite far with this setup alone, and only when they feel the need do they separate the DB onto its own VPS. This is how a startup expects things to go when planning resource procurement.

But from what I have seen so far, it's very different with Cassandra. People usually recommend starting with at least a 3-node cluster, on dedicated servers, with lots of RAM; 4 GB or 8 GB of RAM is what they suggest to start with. So does Cassandra require more hardware resources than MySQL for a website to deliver similar performance and serve a similar load, traffic, and amount of data? I understand Cassandra's higher storage requirements due to replication, but what about the other hardware resources?

Can't we start off Cassandra-based apps just like MySQL ones, beginning with 1 or 2 VPSes and adding more whenever there's a need? Renting dedicated servers with lots of RAM from the very beginning may be viable for well-funded startups, but not for all.
Re: Schema question: query to support "find which of these 500 email ids have been registered"
Sorry for the confusion created. I need to store emails registered for just a single application, so my data model would fit into just a single row. But is storing a hundred million columns (column name size = 8 bytes; column value size = 4 bytes) in a single row a good idea? I am very much tempted to store it in a single row, but I have also heard it is recommended to keep a row within tens of MBs for optimal performance.
Re: Schema question: query to support "find which of these 500 email ids have been registered"
What if I spread these columns across 20 rows? Then I have to query each of those 20 rows for the 500 columns, but this still seems a better solution than either one row for all columns or a separate row for each email id!? On Fri, Jul 27, 2012 at 11:36 AM, Aklin_81 asdk...@gmail.com wrote: Sorry for the confusion created. I need to store emails registered for just a single application, so my data model would fit into just a single row. But is storing a hundred million columns (column name size = 8 bytes; column value size = 4 bytes) in a single row a good idea? I am very much tempted to store it in a single row, but I have also heard it is recommended to keep a row within tens of MBs for optimal performance.
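The 20-row spread proposed above can be sketched as follows (the bucket count and the `registered_emails:` row-key prefix are illustrative assumptions, not from the thread): hash each email id to a bucket row, and group a 500-id lookup batch by bucket so each bucket row is sliced once.

```java
import java.util.*;

public class EmailBuckets {
    static final int BUCKETS = 20; // assumed bucket count from the proposal above

    // Deterministically map an email id to one of the bucket rows.
    public static int bucketFor(String email) {
        return Math.floorMod(email.hashCode(), BUCKETS);
    }

    // Illustrative row key; the "registered_emails:" prefix is an assumption.
    public static String rowKeyFor(String email) {
        return "registered_emails:" + bucketFor(email);
    }

    // Group a batch of lookups by bucket so each of the 20 rows is queried once.
    public static Map<Integer, List<String>> groupByBucket(List<String> emails) {
        Map<Integer, List<String>> byBucket = new HashMap<>();
        for (String e : emails) {
            byBucket.computeIfAbsent(bucketFor(e), k -> new ArrayList<>()).add(e);
        }
        return byBucket;
    }
}
```

For scale: 100 million columns split 20 ways is about 5 million columns per row, roughly 60 MB of raw name+value bytes (at 12 bytes each) before per-column overhead, so a larger bucket count is easy to swap in if that sits too close to the "tens of MBs" guideline.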
Re: What is the future of supercolumns ?
Hmm.. it would be great if the supercolumn API remains. Also, I believe we could replace the full functionality of supercolumns with composite column names if this issue (related to reading multiple column ranges) were resolved: https://issues.apache.org/jira/browse/CASSANDRA-2710 On Sun, Jan 8, 2012 at 5:45 AM, Brandon Williams dri...@gmail.com wrote: On Sat, Jan 7, 2012 at 5:42 PM, Rustam Aliyev rus...@code.az wrote: My suggestion is simple: don't use any deprecated stuff out there. In practically any case there is a good reason why it's deprecated. SuperColumns are not deprecated. The supercolumn API will remain: https://issues.apache.org/jira/browse/CASSANDRA-3237 -Brandon
Re: What is the future of supercolumns ?
Any comments please? On Thu, Jan 5, 2012 at 11:07 AM, Aklin_81 asdk...@gmail.com wrote: I have seen supercolumn usage discouraged most of the time. However, sometimes supercolumns seem to fit a scenario most appropriately, not only in terms of how the data is stored but also in how it is retrieved. Some of the queries supported by SCs can do things no alternative schema can (like recently, I asked about the equivalent of retrieving a list of (full) supercolumns by name through composite columns; unfortunately there was no way to do it without reading lots of extra columns). So I am really confused: 1. Should I not use supercolumns at all, however appropriate, or do I just need to be careful in judging whether supercolumns fit my use case? 2. Are there any performance concerns with supercolumns even when they are used most appropriately, e.g. when you need to retrieve the entire supercolumn every time and the number of subcolumns varies between 0-10? (I don't write all the subcolumns inside a supercolumn at once, though; does this matter?) 3. What is their future? Are they going to be deprecated, or maybe enhanced later?
Re: What is the future of supercolumns ?
I read all the columns inside a supercolumn at any one time, but as for writing them, I write the columns at different times. I don't need to update them, except that they die after their TTL period of 60 days. But since supercolumns are going to be deprecated, I don't know if it is really advisable to use them right now. I believe that if it were possible to do wildcard querying over a list of column names, supercolumn use cases could easily be replaced by normal columns. Could that be practical, in future? On Sat, Jan 7, 2012 at 8:05 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: Please realize that I do not make any decisions here and I am not part of the core Cassandra developer team. What has been said before is that they will most likely go away and at least under the hood be replaced by composite columns. Jonathan has however stated that he would like the supercolumn API/abstraction to remain, at least for backwards compatibility. Please understand that under the hood, supercolumns are merely groups of columns serialized as a single block of data. The fact that there is a specialized and hardcoded way to serialize these column groups into supercolumns is a problem, however, and they should probably go away to make space for a more generic implementation allowing more flexible data structures and less code specific to one special data structure. Today there are tons of extra code to deal with the slight differences in serialization and features of supercolumns vs columns, and hopefully most of that could go away if things were structured a bit differently. I also hope that we keep APIs allowing simple access to groups of key/value pairs to simplify application logic, as working with just columns can add a lot of application code which should not be needed. If you almost always need all or mostly all of the columns in a supercolumn, and you normally update all of them at the same time, they will most likely be faster than normal columns.
Processing-wise, you will actually do a bit more work on serialization/deserialization of SCs, but the I/O part will usually be better grouped and require fewer operations. I think we did some benchmarks on some heavy use cases with ~30 small columns per SC some time back, and I think we ended up with SCs being 10-20% faster. Terje On Jan 5, 2012, at 2:37 PM, Aklin_81 wrote: I have seen supercolumn usage discouraged most of the time. However, sometimes supercolumns seem to fit a scenario most appropriately, not only in terms of how the data is stored but also in how it is retrieved. Some of the queries supported by SCs can do things no alternative schema can (like recently, I asked about the equivalent of retrieving a list of (full) supercolumns by name through composite columns; unfortunately there was no way to do it without reading lots of extra columns). So I am really confused: 1. Should I not use supercolumns at all, however appropriate, or do I just need to be careful in judging whether supercolumns fit my use case? 2. Are there any performance concerns with supercolumns even when they are used most appropriately, e.g. when you need to retrieve the entire supercolumn every time and the number of subcolumns varies between 0-10? (I don't write all the subcolumns inside a supercolumn at once, though; does this matter?) 3. What is their future? Are they going to be deprecated, or maybe enhanced later?
What is the future of supercolumns ?
I have seen supercolumn usage discouraged most of the time. However, sometimes supercolumns seem to fit a scenario most appropriately, not only in terms of how the data is stored but also in how it is retrieved. Some of the queries supported by SCs can do things no alternative schema can (like recently, I asked about the equivalent of retrieving a list of (full) supercolumns by name through composite columns; unfortunately there was no way to do it without reading lots of extra columns). So I am really confused: 1. Should I not use supercolumns at all, however appropriate, or do I just need to be careful in judging whether supercolumns fit my use case? 2. Are there any performance concerns with supercolumns even when they are used most appropriately, e.g. when you need to retrieve the entire supercolumn every time and the number of subcolumns varies between 0-10? (I don't write all the subcolumns inside a supercolumn at once, though; does this matter?) 3. What is their future? Are they going to be deprecated, or maybe enhanced later?
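The composite-column-name replacement discussed in this thread can be pictured with a sorted map: encode "supercolumn:subcolumn" into a single column name, and read a whole "supercolumn" back as a contiguous name range, mirroring how Cassandra keeps columns sorted by name. A minimal sketch (the `:` separator and key names are illustrative assumptions, not Cassandra's actual composite encoding):

```java
import java.util.*;

public class CompositeNames {
    // Columns of one row, ordered by composite name ("super:sub").
    private final NavigableMap<String, String> row = new TreeMap<>();

    public void put(String superName, String subName, String value) {
        row.put(superName + ":" + subName, value);
    }

    // Read the full "supercolumn": a contiguous slice of composite names.
    public SortedMap<String, String> readSuper(String superName) {
        String start = superName + ":";
        // '\uffff' sorts after any practical sub-column name character.
        return row.subMap(start, true, superName + ":\uffff", false);
    }
}
```

What it cannot model is the thread's sticking point (CASSANDRA-2710): fetching several such name ranges by name in one query, which is exactly what a get-supercolumns-by-name call gives you.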
Fast lookups for userId to username and vice versa
I need to create a mapping from userId(s) to username(s) that provides a fast lookup service. I also need a mapping from username to userId in order to implement search functionality in my application. What could be a good strategy to implement this? (I would welcome suggestions to use other technologies if they are really worthwhile for my case.)
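One common shape for this (an assumption on my part, not from the thread) is two lookup tables keyed in opposite directions, both written at sign-up time. A minimal in-memory sketch of that shape, with plain maps standing in for the two column families:

```java
import java.util.*;

public class UserDirectory {
    // Stand-ins for two column families keyed in opposite directions:
    // userId -> username, and username -> userId.
    private final Map<Long, String> nameById = new HashMap<>();
    private final Map<String, Long> idByName = new HashMap<>();

    // Register both directions together, as a sign-up would.
    public boolean register(long userId, String username) {
        if (idByName.containsKey(username)) return false; // name taken
        nameById.put(userId, username);
        idByName.put(username, userId);
        return true;
    }

    public String usernameOf(long userId) { return nameById.get(userId); }
    public Long userIdOf(String username) { return idByName.get(username); }
}
```

In a distributed store the two writes are not atomic, so the username-to-id direction is usually written first to reserve the name.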
Re: Cassandra Cluster Admin - phpMyAdmin for Cassandra
Is there a way to configure the serializers to use while displaying the stored data? Thanks Aklin On Tue, Nov 1, 2011 at 5:02 PM, Aditya Narayan ady...@gmail.com wrote: Yes, that would be a pretty nice feature to see! On Mon, Oct 31, 2011 at 10:45 PM, Ertio Lew ertio...@gmail.com wrote: Thanks so much SebWajam for this great piece of work! Is there a way to set a data type for displaying the column names/values of a CF? It seems that your project always uses StringSerializer for every piece of data; however, most of the time in real-world cases this is not true, so can we somehow configure which serializer to use while reading the data, so that the data can be properly identified by your project and delivered in a readable format? On Mon, Aug 22, 2011 at 7:17 AM, SebWajam sebast...@wajam.com wrote: Hi, I'm working on this project for a few months now and I think it's mature enough to post it here: Cassandra Cluster Admin on GitHub https://github.com/sebgiroux/Cassandra-Cluster-Admin Basically, it's a GUI for Cassandra. If you're like me and used MySQL for a while (and still using it!), you got used to phpMyAdmin and its simple and easy-to-use user interface. I thought it would be nice to have a similar tool for Cassandra and I couldn't find any, so I built my own! Supported actions: - Keyspace manipulation (add/edit/drop) - Column family manipulation (add/edit/truncate/drop) - Row manipulation on column family and super column family (insert/edit/remove) - Basic data browser to navigate the data of a column family (seems to be the favorite feature so far) - Support for Cassandra 0.8+ atomic counters - Support for management of multiple Cassandra clusters Bug reports and/or pull requests are always welcome!
-- View this message in context: Cassandra Cluster Admin - phpMyAdmin for Cassandra http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Cluster-Admin-phpMyAdmin-for-Cassandra-tp6709930p6709930.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Reading a bunch of rows pointed out by column names of columns in some row
Is there a way to ask for just the column names in a row, to get just a list of column names (and not the entire HColumn<String, String> objects)? I am using Hector. I have a row of valueless columns (whose column names are keys to another set of rows). I wish to retrieve a list of those column names and send it in another query to retrieve the corresponding rows pointed to by each of those columns. If I do a normal query I get a list of columns like List<HColumn<String, String>>, but I need to pass just the column names to the next query, as a List<String>. Is there any method in Java which could transform this list in a better way than iterating over each column and extracting its name into another list? What would be the best way to do such queries?
Re: Reading a bunch of rows pointed out by column names of columns in some row
Anyone who can share what is the most recommended way of doing this? On Sat, Feb 19, 2011 at 7:12 PM, Aklin_81 asdk...@gmail.com wrote: Is there a way to ask for just the column names in a row, to get just a list of column names (and not the entire HColumn<String, String> objects)? I am using Hector. I have a row of valueless columns (whose column names are keys to another set of rows). I wish to retrieve a list of those column names and send it in another query to retrieve the corresponding rows pointed to by each of those columns. If I do a normal query I get a list of columns like List<HColumn<String, String>>, but I need to pass just the column names to the next query, as a List<String>. Is there any method in Java which could transform this list in a better way than iterating over each column and extracting its name into another list? What would be the best way to do such queries?
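Absent a built-in Hector helper for this, the usual answer is a small mapping utility: stream the columns and collect the names. A sketch, with Hector's HColumn stood in by a minimal hypothetical interface so the snippet is self-contained:

```java
import java.util.*;
import java.util.stream.Collectors;

public class ColumnNames {
    // Minimal stand-in for Hector's HColumn<String, String>;
    // only the name accessor is needed here.
    public interface NamedColumn {
        String getName();
    }

    // Collect just the names, ready to pass as keys to the next query.
    public static List<String> namesOf(List<? extends NamedColumn> columns) {
        return columns.stream().map(NamedColumn::getName).collect(Collectors.toList());
    }
}
```

It is still an iteration under the hood, but it keeps the call site to one line.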
Re: Frequent updates of freshly written columns
Compaction does not 'mutate' the SSTable files; it 'merges' several SSTable files into one, with new indexes and merged data rows, deleting tombstones. Thus you reclaim your disk space. On Fri, Feb 18, 2011 at 7:34 PM, James Churchman jameschurch...@gmail.com wrote: but a compaction will mutate the sstables and reclaim the space (eventually)? james On 18 Feb 2011, at 08:36, Sylvain Lebresne wrote: On Fri, Feb 18, 2011 at 8:14 AM, Aklin_81 asdk...@gmail.com wrote: Are very freshly written columns in a row's memtable efficiently updated/overwritten by edited/new column values? After the memtable is flushed, are those columns (edited + unedited ones) stored together on disk (in the same blocks!?) as if they were written in one single operation or at the same time? I know that if old columns are edited then several copies of the same column will be dispersed across different SSTables; what about fresh columns? Are there any disadvantages to frequently updating fresh columns present in the memtable? The SSTables are immutable but the memtables are not. As long as you update/overwrite a column that is still in a memtable, it is simply replaced in memory (so it's as efficient as it gets). In other words, when the memtable is flushed, only the last version of the column goes in. -- Sylvain
Re: Frequent updates of freshly written columns
Sylvain, I also need to store data that is frequently updated, the same column being updated several times during each user session, at each action by the user. But this data is not very fresh, so when I update this column frequently there would be many versions of the same column in several SSTables! Reading this type of data would not be too efficient, I guess, as the row would be totally scattered! Could there be any better strategy to store such data in Cassandra? (Since the column holds aggregate data obtained from all actions of the user, I need to update that same column again and again.) Another doubt of mine: when an old column has been updated and exists in the memtable, but other versions of the column exist in SSTables, do reads also scan the SSTables for that column after the memtable, or are they smart enough to say that this column is the most recent one?
On Fri, Feb 18, 2011 at 8:54 PM, James Churchman jameschurch...@gmail.com wrote: ok great, thanks for the exact clarification On 18 Feb 2011, at 14:11, Aklin_81 wrote: Compaction does not 'mutate' the SSTable files; it 'merges' several SSTable files into one, with new indexes and merged data rows, deleting tombstones. Thus you reclaim your disk space. On Fri, Feb 18, 2011 at 7:34 PM, James Churchman jameschurch...@gmail.com wrote: but a compaction will mutate the sstables and reclaim the space (eventually)? james On 18 Feb 2011, at 08:36, Sylvain Lebresne wrote: On Fri, Feb 18, 2011 at 8:14 AM, Aklin_81 asdk...@gmail.com wrote: Are very freshly written columns in a row's memtable efficiently updated/overwritten by edited/new column values? After the memtable is flushed, are those columns (edited + unedited ones) stored together on disk (in the same blocks!?) as if they were written in one single operation or at the same time? I know that if old columns are edited then several copies of the same column will be dispersed across different SSTables; what about fresh columns? Are there any disadvantages to frequently updating fresh columns present in the memtable? The SSTables are immutable but the memtables are not. As long as you update/overwrite a column that is still in a memtable, it is simply replaced in memory (so it's as efficient as it gets). In other words, when the memtable is flushed, only the last version of the column goes in. -- Sylvain
Frequent updates of freshly written columns
Are very freshly written columns in a row's memtable efficiently updated/overwritten by edited/new column values? After the memtable is flushed, are those columns (edited + unedited ones) stored together on disk (in the same blocks!?) as if they were written in one single operation or at the same time? I know that if old columns are edited then several copies of the same column will be dispersed across different SSTables; what about fresh columns? Are there any disadvantages to frequently updating fresh columns present in the memtable?
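Sylvain's answer in this thread can be pictured with a toy model (not Cassandra code): while a column is still in the memtable, a rewrite replaces the in-memory entry, so only the last version reaches each flushed "SSTable"; only columns rewritten after a flush end up with versions in multiple tables, and reads check the memtable first, then SSTables newest-first.

```java
import java.util.*;

public class MemtableModel {
    // In-memory columns for one row; overwrites replace in place.
    private final Map<String, String> memtable = new TreeMap<>();
    // Each flush produces one immutable "SSTable".
    private final List<Map<String, String>> sstables = new ArrayList<>();

    public void write(String name, String value) {
        memtable.put(name, value); // a fresh column is updated in place
    }

    public void flush() {
        // Only the last version of each column goes to "disk".
        sstables.add(Collections.unmodifiableMap(new TreeMap<>(memtable)));
        memtable.clear();
    }

    public int sstableCount() { return sstables.size(); }

    public String readLatest(String name) {
        if (memtable.containsKey(name)) return memtable.get(name);
        for (int i = sstables.size() - 1; i >= 0; i--) {
            if (sstables.get(i).containsKey(name)) return sstables.get(i).get(name);
        }
        return null;
    }
}
```

The model deliberately ignores timestamps, tombstones, and compaction; it only illustrates the last-write-wins behavior of the memtable described above.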
Re: Column name size
Would be interested in your findings, Patrik! ... I too was searching for something similar a few days back, for column names that contained the userIds of users on my application. UUIDs, which seem to be the most widely recognized (perhaps!) solution, are 16 bytes, but that definitely seems like too much for heavily denormalized databases. It makes sense to try to reduce the size: storage may be cheap, but you also need to cache your data, so (I guess) a reduction in the size of your column names, ids etc. could actually matter a lot, depending on the size of your actual data/column values. Some people here suggested I try solutions like Zookeeper or Snowflake to generate sequential ids that could be used as an alternative to UUIDs in some cases. Regards Asil On Fri, Feb 11, 2011 at 3:36 PM, Patrik Modesto patrik.mode...@gmail.com wrote: Hi all! I'm wondering if the size of a column name could matter for a large dataset in Cassandra (I mean lots of rows). For example, what if I have a row with 10 columns, each with a 10-byte value and a 10-byte name? Is half the row size just the column names and the other half the data (not counting storage overhead)? What if I have 10M of these rows? Is there a difference? Should I use some 3-byte codes for a column name to save memory/bandwidth? Thanks, Patrik
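Patrik's example is easy to put to numbers: with 10-byte names and 10-byte values, names are exactly half the raw payload, and shrinking names to 3 bytes removes 35% of it. A quick calculator (per-column storage overhead, which Cassandra also adds, is deliberately ignored):

```java
public class NameOverhead {
    // Raw name+value bytes for a dataset, ignoring per-column storage overhead.
    public static long rawBytes(long rows, int colsPerRow, int nameLen, int valueLen) {
        return rows * colsPerRow * (long) (nameLen + valueLen);
    }
}
```

For 10M rows of 10 columns: 10-byte names and values give 2,000,000,000 raw bytes, of which half is names; 3-byte names bring the total down to 1,300,000,000.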
Re: Calculating the size of rows in KBs
I think it does not deserialize the entire list of columns in the row (though that is the case with subcolumns in a supercolumn). In the case of standard columns, only the blocks on disk containing the values of the columns being asked for are read and deserialized to get the values. On Sat, Feb 12, 2011 at 10:51 AM, Stu Hood stuh...@gmail.com wrote: Does it also mean that the whole row will be deserialized when a query comes just for one column? No, it does not mean that: at most column_index_size_in_kb will be read to read a single column, independent of where that column is in the row. On the other hand, with the row cache enabled, it is as if everything in the row is needed, so the entire row will be read. There is talk of improving this limitation of the row cache: see https://issues.apache.org/jira/browse/CASSANDRA-1956 On Fri, Feb 11, 2011 at 6:00 PM, buddhasystem potek...@bnl.gov wrote: Does it also mean that the whole row will be deserialized when a query comes just for one column? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Calculating-the-size-of-rows-in-KBs-tp6011243p6017870.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Finding the intersection results of column sets of two rows
Between the two rows where I need to find the common columns: I will not have more than 200 columns (in 99% of cases) in the 1st row, but the 2nd row where I need to find these columns may have around a million valueless columns. A point to note: these calculations are all done while **writing data to the database that has been collected from the presentation layer**, not while presenting data. I use the results of the intersection to find the rows (pointed to by the names of the common columns) that I should write to. The calculations are done after a post is submitted by a user in a discussion forum; this is used to find the mutual connections in a group and write to the rows pointed to by the common columns. The 1st row represents a user's connection list, which will not be more than 100-250 columns in my case; the 2nd row represents the members of a group, which may contain a million columns as I said. I find the mutual connections in a group (by finding the common columns of the above two rows) and then write to the rows of those users. Can't I run a batch query asking, in the 2nd row, for all the columns that I picked up from the 1st row? Is there any better way? Asil On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote: Thanks Aaron & Shaun, I think my question might have been unclear to some of you, so I will explain my problem (and the solution I thought of) again for the sake of clarity: Consider I have 2 rows. The 1st row contains 60-70 columns and the 2nd row contains hundreds of thousands of columns. Both column sets are valueless. I need to find just the **common column names** in the two rows. **Both rows are known to me.** So what I plan to do is pick up all the **column names** of the 1st row (60-70 columns) and just ask for them in the 2nd row; whatever column names I get back are my result. Would there be any problem with this solution? This is how I expect to get the common column names.
Please do not consider it a JOIN case, as that leads to unnecessary confusion; I just need the common column names from valueless columns in the two rows. Aaron, actually the intersection data is very much context-based. So say there are 10 million rows in CF A and 1 million in CF B; the intersection data would then contain 10 million * 1 million rows. This would involve a very huge, unaffordable amount of denormalization. And finding the columns in the client would require pulling unnecessary columns, like pulling 100,000 columns from a row of which only 60-70 are required. Shaun, I hope my clarification above has clarified things a bit. Yes, the rows of which I need to find common columns are known to me. Thank you all, Asil On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote: In theory, you should be able to do joins by creating an extra column in one column family, holding the foreign key of the matching row in the other family. This assumes that the info you are joining on is available in both CFs (is not some sort of functional transformation). I have just found that the implementation for secondary indexes is not yet very close to optimal for more complex joins involving multiple indexes; I'm not sure if that affects you, as you didn't say what you are joining on. -- Shaun On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote: Is it possible for you to denormalise and write all the intersection values? Will depend on how many I guess. The other alternative is to pull back more data than you need and do the intersection in code in the client. Hope that helps. Aaron On 7/02/2011, at 7:11 AM, Aklin_81 asdk...@gmail.com wrote: Hi, @buddhasystem: yes, that's a well-known solution. But obviously, since MySQL couldn't satisfy my needs, I am here. My question is in the context of Cassandra: is it possible to achieve the intersection result set of columns in two rows in the way I spoke about?
@Edward: yes, that I know, but how does that fit here for obtaining the common columns of two rows? Thanks for your comments.. -Asil On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote: Hello, If the amount of data is _that_ small, you'll have a much easier life with MySQL, which supports the join procedure -- because that's exactly what you want to achieve. asil klin wrote: Hi all, I want to procure the intersection of the column sets of two rows (from 2 different column families). To achieve the intersection results, can I first retrieve all columns (around 300) from the first row and just query by those column names in the second row (which contains at most 100,000 columns)? I am using the results at write time, not before presentation to the user, so latency won't be much of a concern while writing. Is it the proper
Re: Finding the intersection results of column sets of two rows
Thank you so much Aaron!! On Wed, Feb 9, 2011 at 2:11 AM, Aaron Morton aa...@thelastpickle.com wrote: Makes sense: use a get_slice() against the second row and pass in the column names. Should be fine. If you run into performance issues, look at slice_buffer_size and column_index_size in the config. Aaron On 9/02/2011, at 5:16 AM, Aklin_81 asdk...@gmail.com wrote: Between the two rows where I need to find the common columns: I will not have more than 200 columns (in 99% of cases) in the 1st row, but the 2nd row where I need to find these columns may have around a million valueless columns. A point to note: these calculations are all done while **writing data to the database that has been collected from the presentation layer**, not while presenting data. I use the results of the intersection to find the rows (pointed to by the names of the common columns) that I should write to. The calculations are done after a post is submitted by a user in a discussion forum; this is used to find the mutual connections in a group and write to the rows pointed to by the common columns. The 1st row represents a user's connection list, which will not be more than 100-250 columns in my case; the 2nd row represents the members of a group, which may contain a million columns as I said. I find the mutual connections in a group (by finding the common columns of the above two rows) and then write to the rows of those users. Can't I run a batch query asking, in the 2nd row, for all the columns that I picked up from the 1st row? Is there any better way? Asil On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote: Thanks Aaron & Shaun, I think my question might have been unclear to some of you, so I will explain my problem (and the solution I thought of) again for the sake of clarity: Consider I have 2 rows. The 1st row contains 60-70 columns and the 2nd row contains hundreds of thousands of columns. Both column sets are valueless.
I need to find just the **common column names** in the two rows. **Both rows are known to me.** So what I plan to do is pick up all the **column names** of the 1st row (60-70 columns) and just ask for them in the 2nd row; whatever column names I get back are my result. Would there be any problem with this solution? This is how I expect to get the common column names. Please do not consider it a JOIN case, as that leads to unnecessary confusion; I just need the common column names from valueless columns in the two rows. Aaron, actually the intersection data is very much context-based. So say there are 10 million rows in CF A and 1 million in CF B; the intersection data would then contain 10 million * 1 million rows. This would involve a very huge, unaffordable amount of denormalization. And finding the columns in the client would require pulling unnecessary columns, like pulling 100,000 columns from a row of which only 60-70 are required. Shaun, I hope my clarification above has clarified things a bit. Yes, the rows of which I need to find common columns are known to me. Thank you all, Asil On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote: In theory, you should be able to do joins by creating an extra column in one column family, holding the foreign key of the matching row in the other family. This assumes that the info you are joining on is available in both CFs (is not some sort of functional transformation). I have just found that the implementation for secondary indexes is not yet very close to optimal for more complex joins involving multiple indexes; I'm not sure if that affects you, as you didn't say what you are joining on. -- Shaun On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote: Is it possible for you to denormalise and write all the intersection values? Will depend on how many I guess. The other alternative is to pull back more data than you need and do the intersection in code in the client. Hope that helps.
Aaron On 7/02/2011, at 7:11 AM, Aklin_81 asdk...@gmail.com wrote: Hi, @buddhasystem: yes, that's a well-known solution. But obviously, since MySQL couldn't satisfy my needs, I am here. My question is in the context of Cassandra: is it possible to achieve the intersection result set of columns in two rows in the way I spoke about? @Edward: yes, that I know, but how does that fit here for obtaining the common columns of two rows? Thanks for your comments.. -Asil On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote: Hello, If the amount of data is _that_ small, you'll have a much easier life with MySQL, which supports the join procedure -- because that's exactly what you want to achieve. asil klin wrote: Hi all, I want to procure the intersection of columns
Finding the intersection results of column sets of two rows
Hi all, I want to procure the intersection of the column sets of two rows (from 2 different column families). To achieve the intersection results, can I first retrieve all columns (around 300) from the first row and just query by those column names in the second row (which contains at most 100,000 columns)? I am using the results at write time, not before presentation to the user, so latency won't be much of a concern while writing. Is this the proper way to procure the intersection results of two rows? Would love to hear your comments.. - Regards, Asil
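The plan in this thread, reading the ~300 names from the first row and then slicing the second row by exactly those names, is a set intersection performed by name lookup. Its client-side equivalent, with plain sets of names standing in for the Hector/Thrift result types, is a one-liner:

```java
import java.util.*;

public class ColumnIntersection {
    // Names present in both rows: models get_slice(row2, names = row1Names),
    // where the server returns only the names that exist in row2.
    public static Set<String> commonNames(Set<String> row1Names, Set<String> row2Names) {
        Set<String> common = new LinkedHashSet<>(row1Names);
        common.retainAll(row2Names);
        return common;
    }
}
```

The server-side version in the thread is preferable precisely because only the ~300 candidate names cross the wire, not the million-column second row.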
Re: Finding the intersection results of column sets of two rows
Hi, @buddhasystem: yes, that's a well-known solution. But obviously, since MySQL couldn't satisfy my needs, I am here. My question is in the context of Cassandra: is it possible to achieve the intersection result set of the columns in two rows in the way I described? @Edward: yes, that I know, but how does that fit here for obtaining the common columns of two rows? Thanks for your comments.. -Asil On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote: Hello, If the amount of data is _that_ small, you'll have a much easier life with MySQL, which supports the join procedure -- because that's exactly what you want to achieve. asil klin wrote: Hi all, I want to procure the intersection of the column sets of two rows (from 2 different column families). To achieve the intersection results, can I first retrieve all columns (around 300) from the first row, and then query by those column names in the second row (which contains a maximum of 100,000 columns)? I am using the results at write time, not before presentation to the user, so latency won't be much of a concern while writing. Is this the proper way to procure the intersection results of two rows? Would love to hear your comments.. - Regards, Asil You can use multi-get when fetching lists of already known keys to optimize your round-trip time.
Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up
Thanks so much Ryan for the links; I'll definitely take them into consideration. Just another thought that came to my mind: perhaps it may be beneficial to store (or duplicate) some of the data, like the login credentials and particularly the userId-to-user's-name mapping, etc. (which is very heavily read), in a fast MyISAM table. This could also solve the key-generation problem through auto-generated, unique, sequential primary keys; I could use the same keys for the Cassandra rows for that user. And since Cassandra reads are relatively slow, it makes sense to store data like the userId-to-name mapping in MyISAM, as this data would be required after almost all queries to the database. Regards -Asil On Fri, Feb 4, 2011 at 10:14 PM, Ryan King r...@twitter.com wrote: On Thu, Feb 3, 2011 at 9:12 PM, Aklin_81 asdk...@gmail.com wrote: Thanks Matthew and Ryan, The main inspiration behind me trying to generate ids sequentially is to reduce the size of the userId, since I am using it for heavy denormalization. UUIDs are 16 bytes long, but I could also have a unique id in just 4 bytes, and since this is a one-time process when the user signs up, it makes sense to try cutting down the space requirements, if it is feasible without any downsides(!?). I am also attaching userIds to the ids of the user's other data on my application. If I could reduce the userId size, then I could also reduce the size of the other ids, drastically cutting down the space requirements. [Sorry, this question is not directly related to Cassandra, but I think Cassandra factors in here because of its tunable consistency] Don't generate these ids in Cassandra. Use something like Snowflake [1], Flickr's ticket servers [2], or ZooKeeper sequential nodes. -ryan 1. http://github.com/twitter/snowflake 2. http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
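Since Snowflake comes up in the replies, here is a hedged sketch of the idea it implements: pack a millisecond timestamp, a worker id, and a per-millisecond sequence into one 64-bit integer, so ids roughly sort by time without any coordination between workers. The 41/10/12-bit field layout follows Snowflake's published scheme, but this toy omits its handling of sequence overflow and backwards-moving clocks.

```python
# Hedged sketch of a snowflake-style 64-bit id:
#   41 bits of milliseconds since a custom epoch,
#   10 bits of worker id, 12 bits of per-millisecond sequence.
import time

EPOCH_MS = 1288834974657  # Snowflake's custom epoch (Nov 2010)

class IdGenerator:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0

    def next_id(self, now_ms=None):
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        if now_ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence counter
        else:
            self.seq = 0
            self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq
```

Because the timestamp occupies the high bits, ids from different workers still compare roughly in creation order, which is why the replies describe them as "roughly sorted by time (but not sequential)".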
Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up
Hi all, To generate new keys/userIds for new users of my application, I am thinking of using a simple synchronized counter that keeps track of the number of users registered on my application; when a new user signs up, he can be allotted the next available id. Since Cassandra is eventually consistent, is this advisable to implement with Cassandra? Then again, I could also use a stronger consistency level like QUORUM or ALL for this purpose. Please let me know your thoughts and suggestions.. Regards Asil
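A toy illustration of the risk the replies to this question point at: in an eventually consistent store, a read-then-increment counter can hand the same id to two concurrent sign-ups that hit different replicas before either write has propagated. The dicts below merely simulate two unsynced replicas; this is not Cassandra API code.

```python
# Two replicas of the 'user count' that have not yet synced with each
# other (plain dicts standing in for eventually consistent replicas).
replica_a = {"count": 0}
replica_b = {"count": 0}

def allot_user_id(replica):
    # Read-then-increment with no coordination across replicas.
    replica["count"] += 1
    return replica["count"]

# Two users sign up concurrently and are served by different replicas:
id1 = allot_user_id(replica_a)
id2 = allot_user_id(replica_b)
assert id1 == id2 == 1  # both users were allotted key 1: a collision
```

Reading at QUORUM/ALL narrows but does not remove the window, since the read and the subsequent write are still not one atomic step, which is why the replies suggest an external allocator (Snowflake, ticket servers, ZooKeeper) instead.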
Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up
Thanks Matthew and Ryan, The main inspiration behind me trying to generate ids sequentially is to reduce the size of the userId, since I am using it for heavy denormalization. UUIDs are 16 bytes long, but I could also have a unique id in just 4 bytes, and since this is a one-time process when the user signs up, it makes sense to try cutting down the space requirements, if it is feasible without any downsides(!?). I am also attaching userIds to the ids of the user's other data on my application. If I could reduce the userId size, then I could also reduce the size of the other ids, drastically cutting down the space requirements. [Sorry, this question is not directly related to Cassandra, but I think Cassandra factors in here because of its tunable consistency] Regards Asil On Fri, Feb 4, 2011 at 1:09 AM, Ryan King r...@twitter.com wrote: You could also consider Snowflake: http://github.com/twitter/snowflake which gives you ids that roughly sort by time (but aren't sequential). -ryan On Thu, Feb 3, 2011 at 11:13 AM, Matthew E. Kennedy matt.kenn...@spadac.com wrote: Unless you need your user identifiers to be sequential for some reason, I would save yourself the headache of this kind of complexity and just use UUIDs if you have to generate an identifier. On Feb 3, 2011, at 2:03 PM, Aklin_81 wrote: Hi all, To generate new keys/userIds for new users of my application, I am thinking of using a simple synchronized counter that keeps track of the number of users registered on my application; when a new user signs up, he can be allotted the next available id. Since Cassandra is eventually consistent, is this advisable to implement with Cassandra? Then again, I could also use a stronger consistency level like QUORUM or ALL for this purpose. Please let me know your thoughts and suggestions.. Regards Asil -- @rk
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
@Roshan Yes, I thought about that, but then I wouldn't be able to use the RandomPartitioner. @Aaron Do you mean like this: 'timeUUID + row_key' as the super column names? Then, when retrieving the row_key from this column name, will I be required to parse the name? How do I do that exactly? Some issues: - Will you have time collisions? No, I mostly won't have time collisions. If they happen in 1% of cases, I don't mind. - Not sure what you are storing in the super columns, but there are limitations. I would be storing a maximum of 5 subcolumns inside and would be retrieving them all together. - If you are using Cassandra 0.7, have you looked at the secondary indexes? Yes, I did, but I think they are not helpful in my case. This is what I am trying to do: ** This is from an older post I made earlier on the mailing list: I am working on a project for a questions/answers forum that allows a user to follow questions on certain topics from his followees. I want to build the user's news feed comprising only those questions that have been posted by his followees and tagged with the topics he is following. A simple news-feed design that shows all the posts from the network would be easy to build with Cassandra, by executing fast writes to all of a user's followers about the user's post. But for my application there is an additional filter of 'followed topics' (i.e., the user receives posts created by his followees on topics the user is following). I was thinking of implementing it this way: Initially write to all followers the postID of posts from their network, by adding a super column to the rows of all followers in the News-feed super column family, with the super column name as a timestamp (for sorting by time) and 5 subcolumns containing the topic tags of that post. At read time, compare the subcolumn values with the topics the user is following; if they match, show the post. 
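The read-time check described above (compare a post's topic subcolumns against the user's followed topics, and always show posts that carry no topic tags) can be sketched like this. The data shapes are assumptions for illustration, not the actual column family layout.

```python
def visible_posts(feed, followed_topics):
    """feed: list of (post_id, topic_tags) pairs, already time-sorted.
    topic_tags is a set of up to 5 tags; an empty set means the post
    carries no topic subcolumns and is shown unconditionally."""
    return [post_id for post_id, tags in feed
            if not tags or tags & followed_topics]

feed = [
    ("post1", {"python", "cassandra"}),
    ("post2", {"cooking"}),
    ("post3", set()),  # untagged post: always shown
]
print(visible_posts(feed, {"cassandra"}))  # ['post1', 'post3']
```

The unfiltered "all posts from followees" view mentioned below is then just the same feed without the topic check.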
(I would be required to fetch the list of followed topics of the user at read time; hence, should I store the topic list as a super column in this News-feed super column family only?) An important point to note is that, often, a post will have zero subcolumns, which means the post has to be shown without validating against the user's list of followed topics. There is another view that allows users to see all the posts from their followees (without topic filters). In this case no checking of subcolumns for topics will be performed. I got good insights from Tyler on this, but he was recommending an approach which, although beneficial for read performance, involves very heavy denormalization, like 70-80x. I currently fear that approach and would like to test this one. ** Any comments or feedback greatly appreciated.. thanks so much! On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: It's possible that I am misunderstanding the question in some way. The row keys can be time UUIDs, and with those row keys as column names, you can use the comparator TimeUUIDType to have them sorted by time automatically. On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton aa...@thelastpickle.com wrote: You could make the time a fixed-width integer and prefix your row keys with it, then set the comparator to ascii or UTF8. Some issues: - Will you have time collisions? - Not sure what you are storing in the super columns, but there are limitations: http://wiki.apache.org/cassandra/CassandraLimitations - If you are using Cassandra 0.7, have you looked at the secondary indexes? http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes If you provide some more info on the problem you're trying to solve, we may be able to help some more. 
Cheers Aaron On 14 Jan 2011, at 04:27 PM, Aklin_81 asdk...@gmail.com wrote: I would like to keep the references to other rows as the names of super columns and sort those super columns according to time. Is there any way I could implement that? Thanks in advance! -- Roshan Blog: http://roshandawrani.wordpress.com/ Twitter: @roshandawrani http://twitter.com/roshandawrani Skype: roshandawrani
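On the 'timeUUID + row_key' super column names asked about above: one way to make such a name parseable is to concatenate the UUID's 16 raw bytes with a fixed-width key and split on that boundary at read time. The 4-byte big-endian key below is an assumption for illustration; note also that in Cassandra it is the TimeUUIDType comparator, not raw byte order, that provides the time sorting.

```python
# Hedged sketch of a composite super column name: 16 bytes of a
# time-based UUID followed by a fixed 4-byte big-endian row key, so the
# name can be split back into its parts at read time.
import struct
import uuid

def make_name(time_uuid, row_key):
    # row_key is assumed to fit in an unsigned 32-bit integer.
    return time_uuid.bytes + struct.pack(">I", row_key)

def parse_name(name):
    u = uuid.UUID(bytes=name[:16])          # first 16 bytes: the UUID
    (row_key,) = struct.unpack(">I", name[16:])  # last 4 bytes: the key
    return u, row_key

u = uuid.uuid1()                  # version-1 (time-based) UUID
name = make_name(u, 4242)
parsed_uuid, parsed_key = parse_name(name)
assert parsed_uuid == u and parsed_key == 4242
```

Because the name is fixed-width (20 bytes), no delimiter is needed and keys that happen to contain arbitrary bytes still round-trip safely.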
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I too believed so! But I'm not totally sure. On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote: I am not sure, but I guess that all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUIDs will not be random but sequential according to time... I am not sure anyway... On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani roshandawr...@gmail.com wrote: On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote: @Roshan Yes, I thought about that, but then I wouldn't be able to use the RandomPartitioner. Can you please expand a bit on this? What is this restriction? Can you point me to some relevant documentation on this? Thanks.
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I just read that Cassandra internally creates an MD5 hash of the key that is used for distributing the load, by sending each row to the node responsible for the range within which that MD5 hash falls. So even when we create sequential keys, their MD5 hashes are not close together, and hence the rows are not sent to the same node. This was my misunderstanding of the concept. Sorry for creating confusion! So.. with this, I think I will be able to use a timeUUID as the row key!? Aaron, if you could kindly share your views on my responses to your queries above. On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: I am not clear on what you guys are trying to do and say :-) So, let's take some specifics... Say you want to create rows in some column family (say CF_A), and as you create them, you want to store their row keys as column names in some other column family (say CF_B) - possibly for filtering keys based on time later, etc... Now your rows in CF_A may be keyed on a TimeUUID, and if you store these keys as column names in CF_B, which has the TimeUUID comparator, then you get your column names time-sorted automatically. Now CF_A may be split across nodes - is that of any concern to you? Are you expecting any storage relationship between the column names of CF_B and the rows of CF_A? rgds, Roshan On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote: I too believed so! But I'm not totally sure. On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote: I am not sure, but I guess that all the rows of a certain time range will go to just one node and will not be evenly distributed, because the timeUUIDs will not be random but sequential according to time... I am not sure anyway...
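The corrected understanding in this message can be demonstrated in a few lines: the RandomPartitioner places a row by the MD5 hash of its key, so even strictly sequential keys scatter across the token ring. The 4-node ring with equal token ranges below is an illustrative assumption, not a real cluster topology.

```python
# Small demonstration of why sequential row keys still spread across
# nodes under the RandomPartitioner: rows are placed by the MD5 hash of
# the key, and consecutive keys hash to widely scattered tokens.
import hashlib

def token(key):
    # RandomPartitioner-style token: the 128-bit MD5 of the row key.
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

NUM_NODES = 4      # illustrative ring of 4 equal token ranges
RING = 2 ** 128

def node_for(key):
    return token(key) * NUM_NODES // RING

# Strictly sequential row keys still scatter across the ring:
nodes_hit = {node_for(f"user{i}") for i in range(100)}
print(len(nodes_hit) > 1)  # True: the keys did not pile onto one node
```

By contrast, an order-preserving partitioner places rows by the raw key, which is exactly the hot-spot scenario Rajkumar worried about earlier in the thread.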
Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?
No, you do not need to shut up, please! :) You may be clearing up further misconceptions of mine on the topic! Anyway, the link between the 1st and 2nd paragraphs was that since the distribution of rows among nodes is not affected by the key itself (as you rightly said) but by the MD5 hash of the key, I can use just about any key, including a TimeUUIDType key (which would be helpful in my case), with the RandomPartitioner. On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote: On Fri, Jan 14, 2011 at 8:51 PM, Aklin_81 asdk...@gmail.com wrote: I just read that Cassandra internally creates an MD5 hash of the key that is used for distributing the load, by sending each row to the node responsible for the range within which that MD5 hash falls. So even when we create sequential keys, their MD5 hashes are not close together, and hence the rows are not sent to the same node. This was my misunderstanding of the concept. Sorry for creating confusion! So.. with this, I think I will be able to use a timeUUID as the row key!? Now, what really is the link between your corrected understanding and the conclusion in the 2nd paragraph? :-) I miss the link you are using to come from paragraph 1 to paragraph 2. Just because you use a time UUID as the row key, there is no storage guarantee because of that. The distribution and ordering of rows across nodes is based only on what partitioner you are using - it is not (only) related to the type of the key. Maybe I should just shut up now, as I don't seem to be understanding your requirement :-)
Is there any way I could use keys of other rows as column names that could be sorted according to time ?
I would like to keep the references to other rows as the names of super columns and sort those super columns according to time. Is there any way I could implement that? Thanks in advance!