Why don't you start off with a single small Cassandra server, as you usually do with MySQL?

2013-08-27 Thread Aklin_81
For any website just starting out, the load is minimal initially and grows
at a slow pace. People usually start their MySQL-based sites with a single
server (and a VPS at that, not a dedicated server) running as both app
server and DB server, usually get quite far with this setup, and only when
they feel the need do they separate the DB from the app server by giving it
a separate VPS. This is how a startup expects things to go when planning
resource procurement.

But from what I have seen so far, it's something very different with
Cassandra. People usually recommend starting out with at least a 3-node
cluster (on dedicated servers) with lots and lots of RAM; 4GB or 8GB RAM is
what they suggest to start with. So does Cassandra require more hardware
resources than MySQL for a website to deliver similar performance and serve
a similar load/traffic and the same amount of data? I understand the higher
storage requirements of Cassandra due to replication, but what about other
hardware resources?

Can't we start off with Cassandra-based apps just like with MySQL, starting
with 1 or 2 VPSes and adding more whenever there's a need? Renting dedicated
servers with lots of RAM right from the beginning may be viable for very
well funded startups, but not for all.


Re: Schema question: query to support "find which of these 500 email ids have been registered"

2012-07-27 Thread Aklin_81
Sorry for the confusion created. I need to store emails registered
for just a single application. So although my data model would fit
into just a single row, is storing a hundred million columns (column
name size = 8 bytes; column value size = 4 bytes) in a single row a
good idea? I am very tempted to store it all in one row, but I have
also heard it is recommended to keep a row within tens of MBs for
optimal performance.


Re: Schema question: query to support "find which of these 500 email ids have been registered"

2012-07-27 Thread Aklin_81
What if I spread these columns across 20 rows? Then I have to query
each of these 20 rows for the 500 columns, but this still seems better
than either the one-row-for-all-columns or the one-row-per-email-id
approach!?

On Fri, Jul 27, 2012 at 11:36 AM, Aklin_81 asdk...@gmail.com wrote:
 Sorry for the confusion created. I need to store emails registered
 for just a single application. So although my data model would fit
 into just a single row, is storing a hundred million columns (column
 name size = 8 bytes; column value size = 4 bytes) in a single row a
 good idea? I am very tempted to store it all in one row, but I have
 also heard it is recommended to keep a row within tens of MBs for
 optimal performance.
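
A minimal sketch of the 20-row bucketing idea, assuming Hector's Java API
and an 8-byte hash of the email as the column name (as described above);
the CF name, bucket count, and serializers are assumptions, not from the
thread. Each registered email is one tiny column in one of 20 bucket rows,
and a membership check for 500 emails groups the hashes by bucket and
issues at most 20 named-column slice queries:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import me.prettyprint.cassandra.serializers.IntegerSerializer;
    import me.prettyprint.cassandra.serializers.LongSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class EmailBuckets {
        private static final int NUM_BUCKETS = 20;           // hypothetical bucket count
        private static final String CF = "RegisteredEmails";  // hypothetical CF name

        // Bucket row key derived from the 8-byte email hash used as column name.
        static String rowKeyFor(long emailHash) {
            return "emails_" + Math.abs(emailHash % NUM_BUCKETS);
        }

        // Register one email: a tiny column (8-byte name, 4-byte value).
        static void register(Keyspace ks, long emailHash, int userId) {
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            m.insert(rowKeyFor(emailHash), CF, HFactory.createColumn(
                    emailHash, userId, LongSerializer.get(), IntegerSerializer.get()));
        }

        // Which of these (e.g. 500) email hashes are registered?
        // Group by bucket, then ask each bucket row only for its own names.
        static List<Long> registered(Keyspace ks, List<Long> hashes) {
            Map<String, List<Long>> byBucket = new HashMap<String, List<Long>>();
            for (Long h : hashes) {
                List<Long> bucket = byBucket.get(rowKeyFor(h));
                if (bucket == null) {
                    bucket = new ArrayList<Long>();
                    byBucket.put(rowKeyFor(h), bucket);
                }
                bucket.add(h);
            }
            List<Long> found = new ArrayList<Long>();
            for (Map.Entry<String, List<Long>> e : byBucket.entrySet()) {
                SliceQuery<String, Long, Integer> q = HFactory.createSliceQuery(
                        ks, StringSerializer.get(), LongSerializer.get(),
                        IntegerSerializer.get());
                q.setColumnFamily(CF);
                q.setKey(e.getKey());
                q.setColumnNames(e.getValue().toArray(new Long[0]));
                for (HColumn<Long, Integer> c : q.execute().get().getColumns()) {
                    found.add(c.getName());  // a name coming back => registered
                }
            }
            return found;
        }
    }

The 500-id lookup then costs at most NUM_BUCKETS round trips, and raising
the bucket count shrinks each row proportionally.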


Re: What is the future of supercolumns ?

2012-01-07 Thread Aklin_81
Hmm.. it would be great if the supercolumns API remains. Also I
believe we could replace the full functionality of supercolumns with
composite column names if this issue (related to reading multiple
column ranges) is resolved:
https://issues.apache.org/jira/browse/CASSANDRA-2710


On Sun, Jan 8, 2012 at 5:45 AM, Brandon Williams dri...@gmail.com wrote:
 On Sat, Jan 7, 2012 at 5:42 PM, Rustam Aliyev rus...@code.az wrote:
 My suggestion is simple: don't use any deprecated stuff out there. In
 practically any case there is a good reason why it's deprecated.


 SuperColumns are not deprecated.

 The supercolumn API will remain:
 https://issues.apache.org/jira/browse/CASSANDRA-3237

 -Brandon


Re: What is the future of supercolumns ?

2012-01-06 Thread Aklin_81
Any comments, please?

On Thu, Jan 5, 2012 at 11:07 AM, Aklin_81 asdk...@gmail.com wrote:
 I have seen supercolumn usage discouraged most of the time.
 However, sometimes supercolumns seem to fit the scenario most
 appropriately, not only in terms of how the data is stored but also in
 terms of how it is retrieved. Some of the queries supported by SCs are
 uniquely capable of doing tasks that no alternative schema could do.
 (For example, I recently asked about getting the equivalent of
 retrieving a list of (full) supercolumns by name through the use of
 composite columns; unfortunately there was no way to do this without
 reading lots of extra columns.)

 So I am really confused about:

 1. Should I really not use supercolumns in any case at all, however
 appropriate they seem, or do I just need to be careful in verifying
 that supercolumns fit my use case, or what!?

 2. Are there any performance concerns with supercolumns even in the
 cases where they are used most appropriately? For example, when you
 need to retrieve the entire supercolumn every time and the max number
 of subcolumns varies between 0 and 10.
 (I don't write all the subcolumns inside a supercolumn at once,
 though! Does this also matter?)

 3. What is their future? Are they going to be deprecated or maybe
 enhanced later?


Re: What is the future of supercolumns ?

2012-01-06 Thread Aklin_81
I read the entire columns inside a supercolumn at any time, but as for
writing them, I write the columns at different times. I don't have any
need to update them, except that they die after their TTL period of 60
days. But since they are going to be deprecated, I don't know if it
would really be advisable to use them right now.

I believe that if it were possible to do wildcard querying for a list of
column names, then the supercolumn use cases could easily be replaced by
normal columns. Could that be practically possible in the future?

On Sat, Jan 7, 2012 at 8:05 AM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Please realize that I do not make any decisions here and I am not part of the 
 core Cassandra developer team.

 What has been said before is that they will most likely go away and at least 
 under the hood be replaced by composite columns.

 Jonathan has, however, stated that he would like the supercolumn 
 API/abstraction to remain, at least for backwards compatibility.

 Please understand that under the hood, supercolumns are merely groups of 
 columns serialized as a single block of data.


 The fact that there is a specialized and hardcoded way to serialize these 
 column groups into supercolumns is a problem however and they should probably 
 go away to make space for a more generic implementation allowing more 
 flexible data structures and less code specific for one special data 
 structure.

 Today there is a ton of extra code to deal with the slight differences in 
 serialization and features of supercolumns vs. columns, and hopefully most of 
 that could go away if things got structured a bit differently.

 I also hope that we keep APIs to allow simple access to groups of key/value 
 pairs to simplify application logic as working with just columns can add a 
 lot of application code which should not be needed.

 If you almost always need all or mostly all of the columns in a supercolumn, 
 and you normally update all of them at the same time, they will most likely 
 be faster than normal columns.

 Processing-wise, you will actually do a bit more work on 
 serialization/deserialization of SCs, but the I/O part will usually be better 
 grouped/require fewer operations.

 I think we did some benchmarks on some heavy use cases with ~30 small columns 
 per SC some time back, and I think we ended up with SCs being 10-20% faster.


 Terje

 On Jan 5, 2012, at 2:37 PM, Aklin_81 wrote:

 I have seen supercolumn usage discouraged most of the time.
 However, sometimes supercolumns seem to fit the scenario most
 appropriately, not only in terms of how the data is stored but also in
 terms of how it is retrieved. Some of the queries supported by SCs are
 uniquely capable of doing tasks that no alternative schema could do.
 (For example, I recently asked about getting the equivalent of
 retrieving a list of (full) supercolumns by name through the use of
 composite columns; unfortunately there was no way to do this without
 reading lots of extra columns.)

 So I am really confused about:

 1. Should I really not use supercolumns in any case at all, however
 appropriate they seem, or do I just need to be careful in verifying
 that supercolumns fit my use case, or what!?

 2. Are there any performance concerns with supercolumns even in the
 cases where they are used most appropriately? For example, when you
 need to retrieve the entire supercolumn every time and the max number
 of subcolumns varies between 0 and 10.
 (I don't write all the subcolumns inside a supercolumn at once,
 though! Does this also matter?)

 3. What is their future? Are they going to be deprecated or maybe
 enhanced later?



What is the future of supercolumns ?

2012-01-04 Thread Aklin_81
I have seen supercolumn usage discouraged most of the time.
However, sometimes supercolumns seem to fit the scenario most
appropriately, not only in terms of how the data is stored but also in
terms of how it is retrieved. Some of the queries supported by SCs are
uniquely capable of doing tasks that no alternative schema could do.
(For example, I recently asked about getting the equivalent of
retrieving a list of (full) supercolumns by name through the use of
composite columns; unfortunately there was no way to do this without
reading lots of extra columns.)

So I am really confused about:

1. Should I really not use supercolumns in any case at all, however
appropriate they seem, or do I just need to be careful in verifying
that supercolumns fit my use case, or what!?

2. Are there any performance concerns with supercolumns even in the
cases where they are used most appropriately? For example, when you
need to retrieve the entire supercolumn every time and the max number
of subcolumns varies between 0 and 10.
(I don't write all the subcolumns inside a supercolumn at once,
though! Does this also matter?)

3. What is their future? Are they going to be deprecated or maybe
enhanced later?
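
For reference, a hedged sketch of the composite-column alternative
discussed in this thread, assuming Hector's Composite support; the CF
(which would use a CompositeType(TimeUUIDType, UTF8Type) comparator) and
all names are hypothetical. A supercolumn keyed by postId with topic-tag
subcolumns becomes plain columns named (postId, tag):

    import java.util.UUID;

    import me.prettyprint.cassandra.serializers.CompositeSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.serializers.UUIDSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.Composite;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CompositeInsteadOfSuper {
        // One former subcolumn => one plain column named (postId, tag).
        // With a CompositeType(TimeUUIDType, UTF8Type) comparator, columns
        // stay grouped per post and time-sorted across posts.
        static void addTag(Keyspace ks, String feedRow, UUID postId, String tag) {
            Composite name = new Composite();
            name.addComponent(postId, UUIDSerializer.get());
            name.addComponent(tag, StringSerializer.get());
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            m.insert(feedRow, "NewsFeed",  // hypothetical CF
                    HFactory.createColumn(name, "", new CompositeSerializer(),
                            StringSerializer.get()));
        }
    }

Reading one whole former "supercolumn" back is then a single contiguous
slice over the (postId, *) range; reading many of them by name is exactly
the multiple-column-ranges read that CASSANDRA-2710 tracks.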


Fast lookups for userId to username and vice versa

2011-11-13 Thread Aklin_81
I need to create a mapping from userId(s) to username(s) that provides
a fast lookup service. I also need a mapping from username to userId in
order to implement search functionality in my application.

What would be a good strategy to implement this? (I would welcome
suggestions for any new technologies if they are really worthwhile for
my case.)
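
A common answer is to materialize both directions and write them together;
a minimal Hector sketch, where both CF names and the row layout are
assumptions:

    import me.prettyprint.cassandra.serializers.IntegerSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class UserLookup {
        // Write both mappings in one batch mutation so they stay in step.
        static void saveUser(Keyspace ks, int userId, String username) {
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            // Hypothetical CF "IdToName": row key = userId, column "name" = username.
            m.addInsertion(String.valueOf(userId), "IdToName",
                    HFactory.createStringColumn("name", username));
            // Hypothetical CF "NameToId": row key = normalized username, column "id".
            m.addInsertion(username.toLowerCase(), "NameToId",
                    HFactory.createColumn("id", userId,
                            StringSerializer.get(), IntegerSerializer.get()));
            m.execute();
        }
    }

Exact-match lookups in either direction are then single-row reads; prefix
search for usernames would need an extra index row with an ordered
comparator, which is beyond this sketch.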


Re: Cassandra Cluster Admin - phpMyAdmin for Cassandra

2011-11-12 Thread Aklin_81
Is there a way to configure the serializers used when displaying the
stored data?


Thanks
Aklin

On Tue, Nov 1, 2011 at 5:02 PM, Aditya Narayan ady...@gmail.com wrote:

 Yes, that would be a pretty nice feature to see!



 On Mon, Oct 31, 2011 at 10:45 PM, Ertio Lew ertio...@gmail.com wrote:

 Thanks so much SebWajam for this great piece of work!

 Is there a way to set a data type for displaying the column names/values
 of a CF? It seems that your project always uses StringSerializer for
 every piece of data; however, in most real-world cases this is not true.
 Can we configure which serializer to use while reading the data, so that
 the data may be properly identified by your project and delivered in a
 readable format?


 On Mon, Aug 22, 2011 at 7:17 AM, SebWajam sebast...@wajam.com wrote:

 Hi,

 I've been working on this project for a few months now and I think it's
 mature enough to post it here:
 Cassandra Cluster Admin on GitHub:
 https://github.com/sebgiroux/Cassandra-Cluster-Admin

 Basically, it's a GUI for Cassandra. If you're like me and used MySQL
 for a while (and are still using it!), you got used to phpMyAdmin and its
 simple and easy-to-use user interface. I thought it would be nice to have
 a similar tool for Cassandra and I couldn't find any, so I built my own!

 Supported actions:

- Keyspace manipulation (add/edit/drop)
- Column Family manipulation (add/edit/truncate/drop)
- Row manipulation on column family and super column family
(insert/edit/remove)
- Basic data browser to navigate in the data of a column family
(seems to be the favorite feature so far)
- Support Cassandra 0.8+ atomic counters
- Support management of multiple Cassandra clusters

 Bug reports and/or pull requests are always welcome!







Reading a bunch of rows pointed out by column names of columns in some row

2011-02-19 Thread Aklin_81
Is there a way to just ask for the column names in a row, to get just a
list of column names (and not the entire Column<String, String>)? I am
using Hector.

I have a row of valueless columns (whose column names are keys to
another set of rows) and I wish to retrieve a list of those column names
and send it in another query to retrieve the corresponding rows pointed
to by each of those columns.

Now if I do a normal query I get a list of columns like:
List<Column<String, String>>

but I would need to pass just a list of column names to the other query:
List<String>

Is there any method in Java that could transform this list more cleanly
than iterating over each column and extracting its name into another
list?
What would be the best way to do such queries?
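
Hector has no built-in projection from columns to names, so the usual
answer is a small loop; a hedged sketch (both CF names are hypothetical)
that also shows feeding the names into a multiget for the second step:

    import java.util.ArrayList;
    import java.util.List;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.MultigetSliceQuery;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class PointerRowReader {
        static void readPointedRows(Keyspace ks, String indexRowKey) {
            StringSerializer s = StringSerializer.get();
            // Step 1: slice the valueless index row.
            SliceQuery<String, String, String> sq = HFactory.createSliceQuery(ks, s, s, s);
            sq.setColumnFamily("Index");        // hypothetical CF
            sq.setKey(indexRowKey);
            sq.setRange(null, null, false, 10000);
            // Step 2: project List<HColumn<String,String>> down to just the names.
            List<String> keys = new ArrayList<String>();
            for (HColumn<String, String> c : sq.execute().get().getColumns()) {
                keys.add(c.getName());
            }
            // Step 3: multiget the rows those names point at.
            MultigetSliceQuery<String, String, String> mq =
                    HFactory.createMultigetSliceQuery(ks, s, s, s);
            mq.setColumnFamily("Data");         // hypothetical CF
            mq.setKeys(keys.toArray(new String[0]));
            mq.setRange(null, null, false, 100);
            for (Row<String, String, String> row : mq.execute().get()) {
                // process row.getKey() / row.getColumnSlice() ...
            }
        }
    }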


Re: Reading a bunch of rows pointed out by column names of columns in some row

2011-02-19 Thread Aklin_81
Anyone who can share the most recommended way of doing this?


On Sat, Feb 19, 2011 at 7:12 PM, Aklin_81 asdk...@gmail.com wrote:
 Is there a way to just ask for the column names in a row, to get just a
 list of column names (and not the entire Column<String, String>)? I am
 using Hector.

 I have a row of valueless columns (whose column names are keys to
 another set of rows) and I wish to retrieve a list of those column names
 and send it in another query to retrieve the corresponding rows pointed
 to by each of those columns.

 Now if I do a normal query I get a list of columns like:
 List<Column<String, String>>

 but I would need to pass just a list of column names to the other query:
 List<String>

 Is there any method in Java that could transform this list more cleanly
 than iterating over each column and extracting its name into another
 list?
 What would be the best way to do such queries?



Re: Frequent updates of freshly written columns

2011-02-18 Thread Aklin_81
Compaction does not 'mutate' the SSTable files; it 'merges' several
SSTable files into one, with new indexes, merged data rows and deleted
tombstones. Thus you reclaim your disk space.


On Fri, Feb 18, 2011 at 7:34 PM, James Churchman
jameschurch...@gmail.comwrote:

 but a compaction will mutate the sstables and reclaim the
 space (eventually)?


 james

 On 18 Feb 2011, at 08:36, Sylvain Lebresne wrote:

 On Fri, Feb 18, 2011 at 8:14 AM, Aklin_81 asdk...@gmail.com wrote:

 Are very freshly written columns in a row in the memtables efficiently
 updated/overwritten by edited/new column values?

 After the memtable is flushed, are those columns (edited + unedited ones)
 stored together on disk (in the same blocks!?) as if they were written in
 one single operation or at the same time? I know that if old columns are
 edited, several copies of the same column will be dispersed across
 different SSTables, but what about fresh columns?

 Are there any disadvantages to frequently updating fresh columns present
 in the memtable?


 The SSTables are immutable but the memtables are not. As long as you
 update/overwrite a column that is still in the memtable, it is simply
 replaced in memory (so it's as efficient as it gets).
 In other words, when the memtable is flushed, only the last version of
 the column goes in.

 --
 Sylvain





Re: Frequent updates of freshly written columns

2011-02-18 Thread Aklin_81
Sylvain,
I also need to store data that is frequently updated, the same column
being updated several times during each user session, at each action by
the user. But this data is not very fresh, and hence when I update this
column frequently there will be many versions of the same column in
several SSTable files!
Reading this type of data would not be too efficient, I guess, as the
row would be totally scattered!

Could there be any better strategy for storing such data in Cassandra?

(Since the column holds aggregate data obtained from all actions of the
user, I have the need to update that same column again and again.)


Another doubt: when an old column has been updated and exists in the
memtable, but other versions of the column exist in SSTables, do reads
also scan the SSTables for that column after the memtable, or is
Cassandra smart enough to know that this column is the most recent one?

On Fri, Feb 18, 2011 at 10:32 PM, Aklin_81 asdk...@gmail.com wrote:

 Sylvain,
 I also need to store data that is frequently updated, the same column
 being updated several times during each user session, at each action by
 the user. But this data is not very fresh, and hence when I update this
 column frequently there will be many versions of the same column in
 several SSTable files!
 Reading this type of data would not be too efficient, I guess, as the
 row would be totally scattered!

 Could there be any better strategy for storing such data in Cassandra?

 (Since the column holds aggregate data obtained from all actions of the
 user, I have the need to update that same column again and again.)


 Another doubt: when an old column has been updated and exists in the
 memtable, but other versions of the column exist in SSTables, do reads
 also scan the SSTables for that column after the memtable, or is
 Cassandra smart enough to know that this column is the most recent one?




 On Fri, Feb 18, 2011 at 8:54 PM, James Churchman jameschurch...@gmail.com 
 wrote:

 ok great, thanks for the exact clarification
 On 18 Feb 2011, at 14:11, Aklin_81 wrote:

 Compaction does not 'mutate' the SSTable files; it 'merges' several SSTable
 files into one, with new indexes, merged data rows and deleted tombstones.
 Thus you reclaim your disk space.


 On Fri, Feb 18, 2011 at 7:34 PM, James Churchman jameschurch...@gmail.com 
 wrote:

 but a compaction will mutate the sstables and reclaim the 
 space (eventually)?

 james
 On 18 Feb 2011, at 08:36, Sylvain Lebresne wrote:

 On Fri, Feb 18, 2011 at 8:14 AM, Aklin_81 asdk...@gmail.com wrote:

 Are very freshly written columns in a row in the memtables efficiently 
 updated/overwritten by edited/new column values?

 After the memtable is flushed, are those columns (edited + unedited ones) 
 stored together on disk (in the same blocks!?) as if they were written in 
 one single operation or at the same time? I know that if old columns are 
 edited, several copies of the same column will be dispersed across 
 different SSTables, but what about fresh columns?

 Are there any disadvantages to frequently updating fresh columns present 
 in the memtable?

 The SSTables are immutable but the memtables are not. As long as you 
 update/overwrite a column that is still in the memtable, it is simply replaced 
 in memory (so it's as efficient as it gets).
 In other words, when the memtable is flushed, only the last version of the 
 column goes in.
 --
 Sylvain





Frequent updates of freshly written columns

2011-02-17 Thread Aklin_81
Are very freshly written columns in a row in the memtables efficiently
updated/overwritten by edited/new column values?

After the memtable is flushed, are those columns (edited + unedited ones)
stored together on disk (in the same blocks!?) as if they were written in
one single operation or at the same time? I know that if old columns are
edited, several copies of the same column will be dispersed across
different SSTables, but what about fresh columns?

Are there any disadvantages to frequently updating fresh columns present
in the memtable?


Re: Column name size

2011-02-11 Thread Aklin_81
Would be interested in your findings, Patrik! ...

I too was searching for something similar a few days back, for column
names that contained the userIds of users on my application. UUIDs,
which seem to be the most widely recognized (perhaps!) solution, are 16
bytes, but that definitely seems like too much for heavily denormalized
databases. It definitely makes sense to try to reduce the size: storage
may be cheap, but you also need to cache your data, so (I guess)
reducing the size of your column names, ids, etc. could actually matter
a lot, depending on the size of your actual data/column values.

Some people here suggested that I try out solutions like ZooKeeper or
Snowflake to generate sequential ids, which could be used as an
alternative to UUIDs in some cases.

Regards
Asil




On Fri, Feb 11, 2011 at 3:36 PM, Patrik Modesto
patrik.mode...@gmail.com wrote:
 Hi all!

 I'm wondering whether the size of a column name matters for a large
 dataset in Cassandra (I mean lots of rows). For example, what if I have
 a row with 10 columns, each with a 10-byte value and a 10-byte name? Is
 half the row size just the column names and the other half the data
 (not counting storage overhead)? What if I have 10M of these rows? Is
 there a difference? Should I use some 3-byte codes for column names to
 save memory/bandwidth?

 Thanks,
 Patrik



Re: Calculating the size of rows in KBs

2011-02-11 Thread Aklin_81
I think it does not deserialize the entire list of columns in the row
(though that is the case for subcolumns within a supercolumn). For
standard columns, only the disk blocks containing the values of the
columns being asked for are read and deserialized.

On Sat, Feb 12, 2011 at 10:51 AM, Stu Hood stuh...@gmail.com wrote:
 Does it also mean that the whole row will be deserialized when a query
 comes
 just for one column?
 No, it does not mean that: at most column_index_size_in_kb will be read to
 read a single column, independent of where that column is in the row.
 On the other hand, with the row cache enabled, it is as if everything in the
 row is needed, so the entire row will be read. There is talk of improving
 this limitation of the row cache:
 see https://issues.apache.org/jira/browse/CASSANDRA-1956

 On Fri, Feb 11, 2011 at 6:00 PM, buddhasystem potek...@bnl.gov wrote:

 Does it also mean that the whole row will be deserialized when a query
 comes
 just for one column?





Re: Finding the intersection results of column sets of two rows

2011-02-08 Thread Aklin_81
Amongst the two rows where I need to find the common columns: I will not
have more than 200 columns (in 99% of cases) in the 1st row, but the 2nd
row where I need to find these columns may have even around a million
valueless columns.

A point to note: these calculations are all done while **writing data
collected from the presentation layer to the database**, not while
presenting data.

I am using the results of the intersection to find the rows (pointed to
by the names of the common columns) that I should write to. The
calculations are done after a post is submitted by a user in a
discussion forum. This is used to find the mutual connections in a
group and write to the rows pointed to by the common columns.
The 1st row represents a user's connection list, which is not going to
be more than 100-250 columns in my case, and the 2nd row represents the
members of a group, which may contain a million columns as I said.
I find the mutual connections in a group (by finding the common columns
in the above two rows) and then write to the rows of those users.

Can't I run a batch query asking the 2nd row for all the column names
that I picked up from the 1st row?

Is there any better way?

Asil



 On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote:

 Thanks Aaron & Shaun,

 **
 I think my question might have been unclear to some of you, so I will
 explain my problem (and the solution I thought of) again for the sake
 of clarity:

 Consider I have 2 rows. The 1st row contains 60-70 columns and the 2nd
 row contains hundreds of thousands of columns. Both column sets are
 valueless. I need to just find out the **common column names** in the
 two rows. **These two rows are known to me**. So what I plan to do is
 just pick up all the **column names** of the 1st row (60-70 columns)
 and ask for them in the 2nd row; whatever column names I get back are
 my result.
 Would there be any problem with this solution? This is how I am
 expecting to get the common column names.

 Please do not consider it a JOIN case, as that leads to unnecessary
 confusion; I just need the common column names from valueless columns
 in the two rows.

 

 Aaron, actually the intersection data is very much context based. So,
 say there are 10 million rows in CF A and 1 million in CF B; then the
 intersection data would contain 10 million * 1 million rows. This
 would involve huge and unaffordable amounts of denormalization.
 And finding the columns in the client would require pulling unnecessary
 columns, like pulling 100,000 columns from a row of which only 60-70
 are required.

 Shaun, I hope my clarification above has clarified things a bit. Yes,
 the rows whose common columns I need to find are known to me.


 Thank you all,
 Asil


 On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote:
 In theory, you should be able to do joins by creating an extra column in
 one column family, holding the foreign key of the matching row in the
 other family.

 This assumes that the info you are joining on is available in both CFs
 (and is not some sort of functional transformation).

 I have just found that the implementation of secondary indexes is not yet
 very close to optimal for more complex joins involving multiple indexes;
 I'm not sure if that affects you, as you didn't say what you are joining on.

 -- Shaun


 On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote:

 Is it possible for you to denormalise and write all the intersection
 values? That will depend on how many there are, I guess.

 The other alternative is to pull back more data than you need and do the
 intersection in code in the client.


 Hope that helps.
 Aaron
 On 7/02/2011, at 7:11 AM, Aklin_81 asdk...@gmail.com wrote:

 Hi,

 @buddhasystem: yes, that's a well-known solution. But obviously MySQL
 couldn't satisfy my needs, which is why I am here. My question is in the
 context of Cassandra: is it possible to achieve the intersection result
 set of columns in two rows in the way I spoke about?

 @Edward: yes, I know that, but how does that fit here for obtaining the
 common columns between two rows?

 Thanks for your comments..

 -Asil


 On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote:

 Hello,

 If the amount of data is _that_ small, you'll have a much easier life
 with MySQL, which supports the join procedure -- because that's exactly
 what you want to achieve.


 asil klin wrote:

 Hi all,

 I want to procure the intersection of the column sets of two rows (from
 2 different column families).

 To achieve the intersection result, can I first retrieve all the columns
 (around 300) from the first row and just query by those column names in
 the second row (which contains at most 100,000 columns)?

 I am using the results at write time, not before presentation to the
 user, so latency won't be much of a concern while writing.

 Is it the proper way to procure the intersection results of two rows?

Re: Finding the intersection results of column sets of two rows

2011-02-08 Thread Aklin_81
Thank you so much, Aaron!!

On Wed, Feb 9, 2011 at 2:11 AM, Aaron Morton aa...@thelastpickle.com wrote:
 Makes sense: use a get_slice() against the second row and pass in the column
 names. Should be fine.

 If you run into performance issues, look at slice_buffer_size and
 column_index_size in the config.

 Aaron


 On 9/02/2011, at 5:16 AM, Aklin_81 asdk...@gmail.com wrote:

 Amongst the two rows where I need to find the common columns: I will not
 have more than 200 columns (in 99% of cases) in the 1st row, but the 2nd
 row where I need to find these columns may have even around a million
 valueless columns.

 A point to note: these calculations are all done while **writing data
 collected from the presentation layer to the database**, not while
 presenting data.

 I am using the results of the intersection to find the rows (pointed to
 by the names of the common columns) that I should write to. The
 calculations are done after a post is submitted by a user in a
 discussion forum. This is used to find the mutual connections in a
 group and write to the rows pointed to by the common columns.
 The 1st row represents a user's connection list, which is not going to
 be more than 100-250 columns in my case, and the 2nd row represents the
 members of a group, which may contain a million columns as I said.
 I find the mutual connections in a group (by finding the common columns
 in the above two rows) and then write to the rows of those users.

 Can't I run a batch query asking the 2nd row for all the column names
 that I picked up from the 1st row?

 Is there any better way?

 Asil



 On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote:

 Thanks Aaron & Shaun,

 **
 I think my question might have been unclear to some of you, so I will
 explain my problem (and the solution I thought of) again for the sake
 of clarity:

 Consider I have 2 rows. The 1st row contains 60-70 columns and the 2nd
 row contains hundreds of thousands of columns. Both column sets are
 valueless. I need to just find out the **common column names** in the
 two rows. **These two rows are known to me**. So what I plan to do is
 just pick up all the **column names** of the 1st row (60-70 columns)
 and ask for them in the 2nd row; whatever column names I get back are
 my result.
 Would there be any problem with this solution? This is how I am
 expecting to get the common column names.

 Please do not consider it a JOIN case, as that leads to unnecessary
 confusion; I just need the common column names from valueless columns
 in the two rows.

 

 Aaron, actually the intersection data is very much context based. So,
 say there are 10 million rows in CF A and 1 million in CF B; then the
 intersection data would contain 10 million * 1 million rows. This
 would involve huge and unaffordable amounts of denormalization.
 And finding the columns in the client would require pulling unnecessary
 columns, like pulling 100,000 columns from a row of which only 60-70
 are required.

 Shaun, I hope my clarification above has clarified things a bit. Yes,
 the rows whose common columns I need to find are known to me.


 Thank you all,
 Asil


 On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote:
 In theory, you should be able to do joins by creating an extra column in
 one column family, holding the foreign key of the matching row in the
 other family.

 This assumes that the info you are joining on is available in both CFs
 (and is not some sort of functional transformation).

 I have just found that the implementation of secondary indexes is not yet
 very close to optimal for more complex joins involving multiple indexes;
 I'm not sure if that affects you, as you didn't say what you are joining on.

 -- Shaun


 On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote:

 Is it possible for you to denormalise and write all the intersection
 values? That will depend on how many there are, I guess.

 The other alternative is to pull back more data than you need and do the
 intersection in code in the client.


 Hope that helps.
 Aaron
 On 7/02/2011, at 7:11 AM, Aklin_81 asdk...@gmail.com wrote:

 Hi,

 @buddhasystem: yes, that's a well-known solution. But obviously MySQL
 couldn't satisfy my needs, which is why I am here. My question is in the
 context of Cassandra: is it possible to achieve the intersection result
 set of columns in two rows in the way I spoke about?

 @Edward: yes, I know that, but how does that fit here for obtaining the
 common columns between two rows?

 Thanks for your comments..

 -Asil


 On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote:

 Hello,

 If the amount of data is _that_ small, you'll have a much easier life
 with MySQL, which supports the join procedure -- because that's exactly
 what you want to achieve.


 asil klin wrote:

 Hi all,

 I want to procure the intersection of the column sets of two rows (from
 2 different column families).

Finding the intersection results of column sets of two rows

2011-02-06 Thread Aklin_81
Hi all,

I want to procure the intersection of the column sets of two rows (from
2 different column families).

To achieve the intersection result, can I first retrieve all the columns
(around 300) from the first row and just query by those column names in
the second row (which contains at most 100,000 columns)?

I am using the results at write time, not before presentation to the
user, so latency won't be much of a concern while writing.

Is it the proper way to procure the intersection results of two rows?

Would love to hear your comments..


-

Regards,
Asil
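
A hedged Hector sketch of the approach Aaron confirms earlier in this
digest (a get_slice against the second row with explicit column names);
both CF names are hypothetical. Read the ~300 names from row 1, then ask
row 2 for exactly those names; whatever comes back is the intersection:

    import java.util.ArrayList;
    import java.util.List;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class ColumnIntersection {
        static List<String> commonColumns(Keyspace ks, String userRow, String groupRow) {
            StringSerializer s = StringSerializer.get();
            // Pull all (~300) column names from the user's connection-list row.
            SliceQuery<String, String, String> q1 = HFactory.createSliceQuery(ks, s, s, s);
            q1.setColumnFamily("Connections");   // hypothetical CF
            q1.setKey(userRow);
            q1.setRange(null, null, false, 1000);
            List<String> names = new ArrayList<String>();
            for (HColumn<String, String> c : q1.execute().get().getColumns()) {
                names.add(c.getName());
            }
            // Ask the (possibly million-column) group row for exactly those
            // names; only names present in both rows come back.
            SliceQuery<String, String, String> q2 = HFactory.createSliceQuery(ks, s, s, s);
            q2.setColumnFamily("GroupMembers");  // hypothetical CF
            q2.setKey(groupRow);
            q2.setColumnNames(names.toArray(new String[0]));
            List<String> common = new ArrayList<String>();
            for (HColumn<String, String> c : q2.execute().get().getColumns()) {
                common.add(c.getName());
            }
            return common;
        }
    }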


Re: Finding the intersection results of column sets of two rows

2011-02-06 Thread Aklin_81
Hi,

@buddhasystem: yes, that's a well-known solution. But obviously MySQL
couldn't satisfy my needs, which is why I am here. My question is in the
context of Cassandra: is it possible to achieve the intersection result
set of columns in two rows in the way I spoke about?

@Edward: yes, I know that, but how does that fit here for obtaining the
common columns between two rows?

Thanks for your comments..

-Asil


On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote:

 Hello,

 If the amount of data is _that_ small, you'll have a much easier life with
 MySQL, which supports the join procedure -- because that's exactly what
 you want to achieve.


 asil klin wrote:

 Hi all,

 I want to procure the intersection of the column sets of two rows (from
 2 different column families).

 To achieve the intersection result, can I first retrieve all the columns
 (around 300) from the first row and just query by those column names in
 the second row (which contains at most 100,000 columns)?

 I am using the results at write time, not before presentation to the
 user, so latency won't be much of a concern while writing.

 Is it the proper way to procure the intersection results of two rows?

 Would love to hear your comments..


 -

 Regards,
 Asil





 You can use multiget when fetching lists of already-known keys to
 optimize your round-trip time.



Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up

2011-02-04 Thread Aklin_81
Thanks so much Ryan for the links; I'll definitely take them into
consideration.

Just another thought which came to my mind:-
perhaps it may be beneficial to store(or duplicate) some of the data
like the Login credentials  particularly userId to User's Name
mapping, etc (which is very heavily read), in a fast MyISAM table.
This could solve the problem of keys though auto-generated unique 
sequential primary keys. I could use the same keys for Cassandra rows
for that user. And also since Cassandra reads are relatively slow, it
makes sense to store data like userId to Name mapping in MyISAM as
this data would be required after almost all queries to the database.

Regards
-Asil



On Fri, Feb 4, 2011 at 10:14 PM, Ryan King r...@twitter.com wrote:
 On Thu, Feb 3, 2011 at 9:12 PM, Aklin_81 asdk...@gmail.com wrote:
 Thanks Matthew & Ryan,

 The main inspiration behind my trying to generate ids in a sequential
 manner is to reduce the size of the userId, since I am using it for
 heavy denormalization. UUIDs are 16 bytes long, but I can also have a
 unique id in just 4 bytes, and since this is just a one-time process
 when the user signs up, it makes sense to try cutting down the space
 requirements, if it is feasible without any downsides(!?).

 I am also attaching userIds to the ids of the user's other data in my
 application. If I could reduce the userId size, and with it the size of
 the other ids, I could drastically cut down the space requirements.


 [Sorry, this question is not directly related to Cassandra, but I
 think Cassandra factors in here because of its tuneable consistency]

 Don't generate these ids in cassandra. Use something like snowflake,
 flickr's ticket servers [2] or zookeeper sequential nodes.

 -ryan


 1. http://github.com/twitter/snowflake
 2. http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
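
For reference, a hedged JDBC sketch of the Flickr ticket-server scheme
from link [2] above (the table layout follows that post; the surrounding
Java code is an assumption): a one-row MyISAM table whose auto_increment
counter is bumped with REPLACE INTO, yielding compact sequential ids.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class TicketServer {
        // Per the Flickr post:
        //   CREATE TABLE Tickets32 (
        //     id   int unsigned NOT NULL auto_increment,
        //     stub char(1)      NOT NULL default '',
        //     PRIMARY KEY (id), UNIQUE KEY stub (stub)
        //   ) ENGINE = MyISAM;
        static long nextId(Connection conn) throws SQLException {
            Statement st = conn.createStatement();
            try {
                // REPLACE keeps the table at one row but bumps auto_increment.
                st.executeUpdate("REPLACE INTO Tickets32 (stub) VALUES ('a')");
                ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()");
                rs.next();
                return rs.getLong(1);  // fits in 4 bytes until ~4 billion ids
            } finally {
                st.close();
            }
        }
    }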



Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up

2011-02-03 Thread Aklin_81
Hi all,

To generate new keys/userIds for new users on my application, I am
thinking of using a simple synchronized counter that can keep track of
the number of users registered on my application; when a new user signs
up, he can be allotted the next available id.

Since Cassandra is eventually consistent, is this advisable to implement
with Cassandra? I could also use a stronger consistency level like
QUORUM or ALL for this purpose.


Please let me know your thoughts and suggestions..

Regards
Asil


Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up

2011-02-03 Thread Aklin_81
Thanks Matthew & Ryan,

The main inspiration behind my trying to generate ids in a sequential
manner is to reduce the size of the userId, since I am using it for
heavy denormalization. UUIDs are 16 bytes long, but I can also have a
unique id in just 4 bytes, and since this is just a one-time process
when the user signs up, it makes sense to try cutting down the space
requirements, if it is feasible without any downsides(!?).

I am also attaching userIds to the ids of the user's other data in my
application. If I could reduce the userId size, and with it the size of
the other ids, I could drastically cut down the space requirements.


[Sorry, this question is not directly related to Cassandra, but I
think Cassandra factors in here because of its tuneable consistency]

Regards
Asil


On Fri, Feb 4, 2011 at 1:09 AM, Ryan King r...@twitter.com wrote:
 You could also consider snowflake:

 http://github.com/twitter/snowflake

 which gives you ids that roughly sort by time (but aren't sequential).

 -ryan

 On Thu, Feb 3, 2011 at 11:13 AM, Matthew E. Kennedy
 matt.kenn...@spadac.com wrote:
 Unless you need your user identifiers to be sequential for some reason, I 
 would save yourself the headache of this kind of complexity and just use 
 UUIDs if you have to generate an identifier.

 On Feb 3, 2011, at 2:03 PM, Aklin_81 wrote:

 Hi all,
 To generate new keys/userIds for new users on my application, I am
 thinking of using a simple synchronized counter that can keep track of
 the number of users registered on my application; when a new user signs
 up, he can be allotted the next available id.

 Since Cassandra is eventually consistent, is this advisable to implement
 with Cassandra? I could also use a stronger consistency level like
 QUORUM or ALL for this purpose.


 Please let me know your thoughts and suggestions..

 Regards
 Asil





 --
 @rk



Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
@Roshan
Yes, I thought about that, but then I wouldn't be able to use the
RandomPartitioner.

@Aaron

Do you mean like this: 'timeUUID + row_key' as the supercolumn names?
Then, when retrieving the row_key from this column name, will I be
required to parse the name? How do I do that exactly?


Some issues:
- Will you have time collisions?
Mostly I won't have time collisions. If they happen in 1% of cases, I
don't mind.

- Not sure what you are storing in the super columns, but there are
limitations.
I would be storing at most 5 subcolumns inside and would be retrieving
them all together.

- If you are using cassandra 0.7, have you looked at the secondary indexes?

Yes I did, but I think they are not helpful in my case.

This is what I am trying to do:
**
This is from an older post that I made earlier on the mailing list:
I am working on a questions/answers forum that allows a user to follow
questions on certain topics from his followees.
I want to build the user's news feed, comprising only those questions
that have been posted by his followees and tagged with topics that he
is following.
A simple news-feed design that shows all the posts from the network
would be easy to build with Cassandra, by executing fast writes to all
of a user's followers about the user's post. But for my application
there is an additional filter of 'followed topics' (i.e., the user
receives posts created by his followees on topics the user is
following).

I was thinking of implementing it this way:
initially write the postID of posts from their network to all
followers, by adding a supercolumn to the rows of all followers in the
news-feed supercolumn family, with the supercolumn name a timestamp
(for sorting by time) and 5 subcolumns containing the topic tags of
that post.
At read time, compare the subcolumn values with the topics the user is
following; if they match, then show the post. (I would be required to
fetch the user's list of followed topics at read time; hence should I
store the topic list as a supercolumn in this news-feed supercolumn
family as well?)

An important point to note is that, often, a post will have zero
subcolumns, which would mean that the post has to be shown without
validating against the user's list of followed topics.

There is another view for users which allows them to see all the posts
from their followees (without topic filters). In this case no checking
of subcolumns for topics will be performed.

I got good insights from Tyler on this, but he was recommending an
approach which, although beneficial for read performance, relies on
heavy denormalization, like 70-80x. I currently fear that approach and
would like to test this one first.
**
Any comments or feedback greatly appreciated..

Thanks so much!

On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 It's possible that I am misunderstanding the question in some way.

 The row keys can be TimeUUIDs, and with those row keys as column names, you
 can use the TimeUUIDType comparator to have them sorted by time automatically.

 On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton
 aa...@thelastpickle.comwrote:

 You could make the time a fixed-width integer and prefix your row keys
 with it, then set the comparator to ascii or utf.

 Some issues:
 - Will you have time collisions?
 - Not sure what you are storing in the super columns, but there are
 limitations: http://wiki.apache.org/cassandra/CassandraLimitations
 - If you are using cassandra 0.7, have you looked at the secondary indexes?
 http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes

 If you provide some more info on the problem you're trying to solve, we may
 be able to help some more.

 Cheers
 Aaron


 On 14 Jan, 2011, at 04:27 PM, Aklin_81 asdk...@gmail.com wrote:

 I would like to keep references to other rows as the names of
 supercolumns and sort those supercolumns according to time.
 Is there any way I could implement that?

 Thanks in advance!




 --
 Roshan
 Blog: http://roshandawrani.wordpress.com/
 Twitter: @roshandawrani http://twitter.com/roshandawrani
 Skype: roshandawrani




Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
I too believed so! But I'm not totally sure.

On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote:
 I am not sure, but I guess all the rows of a certain time range will go
 to just one node and will not be evenly distributed, because the timeUUID
 will not be random but sequential according to time... I am not sure anyway...

 On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani
 roshandawr...@gmail.comwrote:

 On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote:

 @Roshan
 Yes, I thought about that, but then I wouldn't be able to use the
 RandomPartitioner.


 Can you please expand a bit on this? What is this restriction? Can you
 point me to some relevant documentation on this?

 Thanks.





Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
I just read that Cassandra internally creates an MD5 hash of the key
that is used for distributing load, by sending each key to the node
responsible for the range within which its MD5 hash falls. So even when
we create sequential keys, their MD5 hashes are not the same and hence
they are not sent to the same node. This was my misunderstanding of the
concept. Sorry for creating confusion!

So.. with this, I think I will be able to use a timeUUID as the row key!?

Aaron, if you could kindly share your views on my responses to your
queries above.




On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 I am not clear what you guys are trying to do and say :-)

 So, let's take some specifics...

 Say you want to create rows in some column family (say CF_A), and as you
 create them, you want to store their row key in column names in some other
 column family (say CF_B) - possibly for filtering keys based on time later,
 etc, etc...

 Now, your rows in CF_A may be keyed on a TimeUUID, and if you store these keys
 as column names in CF_B, which has a TimeUUID comparator, then you get your
 column names time-sorted automatically.

 Now CF_A may be split across nodes - is that of any concern to you?

 Are you expecting any storage relationship between column names of CF_B and
 rows of CF_A?

 rgds,
 Roshan

 On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote:

 I too believed so! But I'm not totally sure.

 On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote:
  I am not sure, but I guess all the rows of a certain time range will go
  to just one node and will not be evenly distributed, because the timeUUID
  will not be random but sequential according to time... I am not sure anyway...
 





Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
No, you do not need to shut up, please! :)
You may be clearing up more of my misconceptions on the topic!

Anyway, the link between the 1st and 2nd paragraphs was that since the
distribution of rows among nodes is not determined by the key itself (as
you rightly said) but by the MD5 hash of the key, I can use just about
any key, including a timeUUIDType key (which would be helpful in my
case), with the RandomPartitioner.



On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 On Fri, Jan 14, 2011 at 8:51 PM, Aklin_81 asdk...@gmail.com wrote:

 I just read that Cassandra internally creates an MD5 hash of the key
 that is used for distributing load, by sending each key to the node
 responsible for the range within which its MD5 hash falls. So even when
 we create sequential keys, their MD5 hashes are not the same and hence
 they are not sent to the same node. This was my misunderstanding of the
 concept. Sorry for creating confusion!

 So.. with this, I think I will be able to use a timeUUID as the row key!?


 Now, what really is the link between your corrected understanding and the
 conclusion in the 2nd para? :-)

 I miss the link you are using to come from para 1 to para 2.

 Just because you use a TimeUUID as the row key, there is no storage guarantee
 because of that. Distribution of rows and ordering across nodes is based only
 on what partitioner you are using - it is not (only) related to the type of
 the key.

 Maybe I should just shut up now, as I don't seem to be understanding your
 requirement :-)










Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-13 Thread Aklin_81
I would like to keep references to other rows as the names of
supercolumns and sort those supercolumns according to time.
Is there any way I could implement that?

Thanks in advance!
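
A hedged Hector sketch of the "timeUUID as supercolumn name" idea
suggested in the replies above; the CF and field names are hypothetical.
The CF's supercolumn comparator would be TimeUUIDType, so supercolumns
sort by time, and the referenced row's key travels in a subcolumn, so
nothing has to be parsed out of the supercolumn name itself:

    import java.util.Arrays;
    import java.util.UUID;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.serializers.UUIDSerializer;
    import me.prettyprint.cassandra.utils.TimeUUIDUtils;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class TimeSortedFeed {
        // Supercolumn name = fresh TimeUUID; subcolumn "postKey" holds the
        // key of the referenced row. CF "NewsFeed" is a hypothetical
        // supercolumn family with TimeUUIDType as its super comparator.
        static void addFeedEntry(Keyspace ks, String followerRow, String postRowKey) {
            UUID now = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            m.insert(followerRow, "NewsFeed", HFactory.createSuperColumn(
                    now,
                    Arrays.asList(HFactory.createStringColumn("postKey", postRowKey)),
                    UUIDSerializer.get(), StringSerializer.get(), StringSerializer.get()));
        }
    }

Slicing the row in reverse order then yields the newest entries first,
which matches the news-feed use case described in this thread.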