Re: Data Modeling- another question
Yes, you are right, it depends on the use case. I suggested the first is a better choice, not the only choice. JSON is better when a change to any field means re-writing the whole value without reading it first. I tend to use JSON where my data does not change, or changes very rarely, like storing denormalized JSON data for analytics. I prefer a CF with scoped column names for frequently updated fields:

{
  this.user.cart.category.p1.name: ''
  this.user.cart.category.p1.unit: ''
  this.user.cart.category.p1.desc: ''
  this.user.cart.category.p2.name: ''
  this.user.cart.category.p2.unit: ''
  this.user.cart.category.p2.desc: ''
}

Yes, you are right: it is really about understanding the app's data and its behavior, not JSON vs. columns, and designing the data model accordingly.

On Tue, Aug 28, 2012 at 12:20 PM, Guy Incognito dnd1...@gmail.com wrote:

I would respectfully disagree; what you have said is true, but it really depends on the use case.

1) Do you expect to be doing updates to individual fields of an item, or will you always update all fields at once? If you are doing separate updates then the first is definitely easier to handle.

2) Do you expect to do paging of the list? This will be easier with the JSON approach, as in the first your item may span across a page boundary - not an insurmountable problem by any means, but more complicated nonetheless. This is not an issue obviously if all your items have the same number of fields.

3) Do you expect to read or delete multiple items individually? You may have to do multiple reads/deletes of a row if the items are not adjacent to each other, as you cannot do 'disjoint' slices of columns at the moment. With the JSON approach you can just specify individual columns and you're done. Again, this is less of an issue if items have a known set of fields, but your list of columns to read/delete may get quite large fairly quickly.

The first is definitely better if you want to update individual fields; read-then-write is not a good idea in Cassandra.
But it is more complicated for most usage scenarios, so you have to work out if you really need the extra flexibility.

On 24/08/2012 13:54, samal wrote: The first is the better choice; each field can be updated separately (write only). With the second you have to take care of the JSON yourself (read first, modify, then write).

On Fri, Aug 24, 2012 at 5:45 PM, Roshni Rajagopal roshni.rajago...@wal-mart.com wrote: Hi, suppose I have a column family to associate a user to a dynamic list of items. I want to store 5-10 key pieces of information about each item; there are no specific sorting requirements. I have two options:

A) use composite columns

UserId1 : {
  itemid1:Name = Betty Crocker,
  itemid1:Descr = Cake,
  itemid1:Qty = 5,
  itemid2:Name = Nutella,
  itemid2:Descr = Choc spread,
  itemid2:Qty = 15
}

B) use a JSON with the data

UserId1 : {
  itemid1 = {name: Betty Crocker, descr: Cake, Qty: 5},
  itemid2 = {name: Nutella, descr: Choc spread, Qty: 15}
}

Which do you suggest would be better? Regards, Roshni

This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential ***
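The trade-off being debated above can be sketched with plain Python dictionaries standing in for a row (a toy model for illustration only, not Cassandra client code):

```python
import json

def update_composite(row, item_id, field, value):
    # Option A: composite columns. Updating one field is a single
    # blind column write -- no read required.
    row["%s:%s" % (item_id, field)] = value

def update_json(row, item_id, field, value):
    # Option B: one JSON blob per item. Updating one field means
    # read, deserialize, modify, re-serialize, write back.
    item = json.loads(row.get(item_id, "{}"))
    item[field] = value
    row[item_id] = json.dumps(item)

row_a, row_b = {}, {}
update_composite(row_a, "itemid1", "Qty", 5)
update_json(row_b, "itemid1", "Qty", 5)
```

The JSON path needs the read-modify-write cycle the thread warns about; the composite path does not.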
Re: Data Modeling- another question
The first is the better choice; each field can be updated separately (write only). With the second you have to take care of the JSON yourself (read first, modify, then write).

On Fri, Aug 24, 2012 at 5:45 PM, Roshni Rajagopal roshni.rajago...@wal-mart.com wrote: Hi, suppose I have a column family to associate a user to a dynamic list of items. I want to store 5-10 key pieces of information about each item; there are no specific sorting requirements. I have two options:

A) use composite columns

UserId1 : {
  itemid1:Name = Betty Crocker,
  itemid1:Descr = Cake,
  itemid1:Qty = 5,
  itemid2:Name = Nutella,
  itemid2:Descr = Choc spread,
  itemid2:Qty = 15
}

B) use a JSON with the data

UserId1 : {
  itemid1 = {name: Betty Crocker, descr: Cake, Qty: 5},
  itemid2 = {name: Nutella, descr: Choc spread, Qty: 15}
}

Which do you suggest would be better? Regards, Roshni
Re: Effect of rangequeries with RandomPartitioner
Inline responses below.

On Mon, Jul 9, 2012 at 10:18 AM, prasenjit mukherjee prasen@gmail.com wrote: Thanks Aaron for your response. Some follow-up questions/assumptions/clarifications:

1. With RandomPartitioner, on a given node, are the keys sorted by their hash values or original/unhashed keys?
Hash value.

2. With RandomPartitioner, on a given node, are the columns (for a given key) always sorted by their column names?
Yes, depends on the comparator.

3. From what I understand, token = hash(key) for RandomPartitioner, and hence any key-range queries will return bogus results.
Correct.

Although I believe column-range queries should succeed even with RP, since columns are always sorted by column names.
Correct, depends on the comparator.

-Thanks, Prasenjit

On Mon, Jul 9, 2012 at 12:17 AM, aaron morton aa...@thelastpickle.com wrote: For background: http://wiki.apache.org/cassandra/FAQ#range_rp It maps the start key to a token, and then scans X rows from there on CL number of nodes. Rows are stored in token order. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 7/07/2012, at 11:52 PM, prasenjit mukherjee wrote: Wondering how a range-query request is handled if RP is used. Will the receiving node do a fan-out to all the nodes in the ring, or will it just execute the range query on its own local partition? -- Sent from my mobile device
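The answers above (rows ordered by token, columns by comparator) can be illustrated with a toy token function; MD5 here is an assumption standing in for RandomPartitioner's real token computation, not the actual implementation:

```python
import hashlib

def token(key):
    # Toy stand-in for RandomPartitioner: the token is derived from a
    # hash of the key, so row order bears no relation to key order.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

row_keys = ["cherry", "apple", "banana"]
row_order = sorted(row_keys, key=token)       # on-disk row order: by token
col_order = sorted(["colC", "colA", "colB"])  # column order: by comparator
```

This is why key-range scans under RP walk hash order (hence "bogus" results for key ranges), while column slices within a row still work.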
Re: Supercolumn behavior on writes
"You can't 'invent' columns on the fly, everything has to be declared when you declare the column family." That's incorrect. You can define column names on the fly; only the validation class must be defined when declaring the CF.
Re: Supercolumn behavior on writes
I have just checked on the DataStax blog; CQL3 does not support it, as far as I am aware. But as a whole we can do it via a client library using CQL.

On Thu, Jun 14, 2012 at 9:12 AM, Dave Brosius dbros...@mebigfatguy.com wrote: Via Thrift, or a high-level client on Thrift; see as an example http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1

On 06/13/2012 11:08 PM, Greg Fausak wrote: Interesting. How do you do it? I have a version 2 CF that works fine. A version 3 table won't let me invent columns that don't exist yet (for composite tables). What's the trick?

cqlsh -3 cas1
use onplus;
cqlsh:onplus> select * from at_event where ac_event_id = 7690254;
 ac_event_id | ac_creation         | ac_event_type | ac_id | ev_sev
-------------+---------------------+---------------+-------+--------
     7690254 | 2011-07-23 00:11:47+ | SERV.CPE.CONN |    \N |      5
cqlsh:onplus> update at_event set wingy = 'toto' where ac_event_id = 7690254;
Bad Request: Unknown identifier wingy

This is what I used to create it:

// create the event column family; this contains the static
// part of the definition. Many additional columns can be specified
// in the port from relational; these would be mainly the at_event table
use onplus;
create columnfamily at_event (
  ac_event_id int PRIMARY KEY,
  ac_event_type text,
  ev_sev int,
  ac_id text,
  ac_creation timestamp
) with compression_parameters:sstable_compression = '';

-g

On Wed, Jun 13, 2012 at 9:36 PM, samal samalgo...@gmail.com wrote: "You can't 'invent' columns on the fly, everything has to be declared when you declare the column family." That's incorrect. You can define column names on the fly; only the validation class must be defined when declaring the CF.
Re: about multitenant datamodel
"Why do you think so? I'll let users create restricted CFs, and limit the number of CFs which users can create. Is it still a bad one?"

OK, got it: you want to limit the CFs a user can create to (assume) 2. What about 10k shared users creating 2 CFs each = 20k CFs, roughly 20 GB of memory used with no data in them. Do you think that is a good one? I think of your data model like S3 or shared hosting: limit keyspaces and CFs to a fixed number. In Cassandra the key and column name are very powerful; you can do anything you want and design the data model any way you want. Here is the approach I would probably take:

- Limit the user to keys; the user cannot create/delete CFs.
- All users share the same CFs.
- Give a unique signature (which MUST NOT clash) to each user, like *username==anyothermarker::[actual key name]*, UTF-8 only.
- Each user always prefixes this signature in all CFs when inserting and reading data.
- Like an S3 bucket, check the signature before creating a new one for a new user.
- Each key for a user is like a bucket; all its columns are the bucket data.

Eg 1)

profileCF {
  *user1==123456::*profile { /* user1 profile */ },
  *user2==444::*profile { /* user2 profile */ },
}

2)

activityCF {
  *user1==123456::*activity { /* user1 activity columns here */ },
  *user2==444::*activity { /* user2 activity columns here */ },
}

A marker CF keeps all the unique signatures for users, so it can be queried when creating a new one:

bucketMarkerCF {
  *user2==444*: { username: },
  *user1==2323*: { username: },
}

The problem with this approach is that users do not have the liberty to define their own data model. Good for fixed-pattern data: logging, hits, geodata.

/Samal

On Thu, 31 May 2012 06:44:05 +0900, aaron morton aa...@thelastpickle.com wrote: "- Do a lot of keyspaces cause some problems? (If I have 1,000 users, cassandra creates 1,000 keyspaces…)" It's not keyspaces, but the number of column families. Without storing any data each CF uses about 1MB of RAM. When they start storing and reading data they use more.
IMHO a model that allows external users to create CFs is a bad one. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 25/05/2012, at 12:52 PM, Toru Inoko wrote: Hi, all. I'm designing a data API service (like cassandra.io, but not using a dedicated server for each user) on Cassandra 1.1, on which users can run DML/DDL methods like CQL. The following is the API users can use (almost the same as the Cassandra API): create/read/delete ColumnFamilies/Rows/Columns. Now I'm thinking about a multitenant data model for that. My data model is like the following; I'm going to prepare a keyspace for each user as the user's tenant space.

| keyspace1 | --- | column family | (for user1) | ...
| keyspace2 | --- | column family | (for user2) | ...

My questions:
- Is this data model good for multitenancy?
- Do a lot of keyspaces cause problems? (If I have 1,000 users, Cassandra creates 1,000 keyspaces...)

Please help. Thank you in advance. Toru Inoko.
-- SCSK Corporation, Technology, Quality & Information Group, Technology Development Dept., Advanced Technology Section, Toru Inoko. Tel: 03-6438-3544. Mail: in...@ms.scsk.jp

-- With kind regards, Robin Verlangen *Software engineer* W http://www.robinverlangen.nl E ro...@us2.nl Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
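Samal's shared-CF signature scheme can be sketched as a small helper (the function name is hypothetical; the `username==marker::key` layout follows the examples in the thread):

```python
def tenant_key(username, marker, key_name):
    # Prefix every row key with the user's unique signature so that
    # all tenants can safely share the same column family.
    return "%s==%s::%s" % (username, marker, key_name)

profile_key = tenant_key("user1", "123456", "profile")
activity_key = tenant_key("user2", "444", "activity")
```

A marker CF would record each `username==marker` signature so clashes can be checked before a new one is issued, just as S3 checks bucket names.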
Re: How to include two nodes in Java code using Hector
I don't use Hector and don't know much about its internals, but this may help:

Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "host1:9160,host2:9160,host3:9160");

If you have a 2-node cluster with RF=2, your data will be present on both nodes. And if consistency level 2 is used, both nodes must be UP to read and write. It doesn't matter which node you connect to: if your data is present in the cluster it will be read directly or through a coordinator node. Read the Hector docs: http://hector-client.github.com/hector/build/html/documentation.html

/Samal

On Wed, Jun 6, 2012 at 8:35 AM, Prakrati Agrawal prakrati.agra...@mu-sigma.com wrote: But the data is distributed on the nodes (meaning 50% of data is on one node and 50% of data is on another node), so I need to specify the node IP address somewhere in the code. But where do I specify that is what I am clueless about. Please help me.

Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com

From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com] Sent: Tuesday, June 05, 2012 5:51 PM To: user@cassandra.apache.org Subject: RE: How to include two nodes in Java code using Hector

Use Consistency Level = 2. Regards, Harsh

From: Prakrati Agrawal [mailto:prakrati.agra...@mu-sigma.com] Sent: Tuesday, June 05, 2012 4:08 PM To: user@cassandra.apache.org Subject: How to include two nodes in Java code using Hector

Dear all, I am using a two-node Cassandra cluster. How do I code in Java using Hector to get data from both nodes? Please help.

Thanks and Regards, Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com

-- This email message may contain proprietary, private and confidential information. The information transmitted is intended only for the person(s) or entities to which it is addressed.
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited and may be illegal. If you received this in error, please contact the sender and delete the message from your system. Mu Sigma takes all reasonable steps to ensure that its electronic communications are free from viruses. However, given Internet accessibility, the Company cannot accept liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software.
Re: Adding a new node to Cassandra cluster
If you use the Thrift API, you have to maintain a lot of low-level code yourself which is already polished by high-level clients (Hector, pycassa). Also, with a high-level client you can easily switch between Thrift and the growing CQL.

On Mon, Jun 4, 2012 at 3:00 PM, R. Verlangen ro...@us2.nl wrote: You might consider using a higher-level client (like Hector indeed). If you don't want this you will have to write your own connection pool. For a start, take a look at Hector. But keep in mind that you might be reinventing the wheel.

2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.com: Hi, I am using the Thrift API and I am not able to find anything on the internet about how to configure it for multiple nodes. I am not using any proper client like Hector.

Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com

From: R. Verlangen [mailto:ro...@us2.nl] Sent: Monday, June 04, 2012 2:44 PM To: user@cassandra.apache.org Subject: Re: Adding a new node to Cassandra cluster

Hi there, when you speak to one node it will internally redirect the request to the proper node (local / external), but you won't be able to fail over on a crash of the localhost. For adding another node to the connection pool you should take a look at the documentation of your Java client. Good luck!

2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.com: Dear all, I successfully added a new node to my cluster, so now it's a 2-node cluster. But how do I mention it in my Java code? When I am retrieving data, it is retrieving only from the one node that I am specifying as localhost. How do I specify more than one node? Please help me. Thanks and Regards, Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com
Re: Cassandra Data Archiving
I believe you are talking about HDD space consumed by user-generated data which is no longer required after 15 days (but may be required later). The first option is TTL, which you don't want to use. The second, as Aaron pointed out, is snapshotting data; but the data still exists in the cluster, the snapshot is only used for backup.

I would think of using column family buckets: 15 days per bucket, 2 buckets a month. Create a new CF every 15th day with a timestamp marker, trip_offer_cf_[ts - ts % (86400*15)], and cache the CF name in the app for 15 days. After the 15th day the old CF bucket becomes read-only; no writes go into it. Snapshot the old bucket's data and delete that CF a few days later; this keeps the CF count fixed. Current CF count = n, bucketed CF count = b*n. Use a separate cluster for old-data analytics.

/Samal

On Fri, Jun 1, 2012 at 9:58 AM, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote: Problem statement: We are keeping daily generated data (user-generated content) in Cassandra, but our application is using only the last 15 days of data. So how can we archive data older than 15 days so that we can reduce load on the Cassandra ring? Note: we can't apply TTL, as this data may be needed in the future.

From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Friday, June 01, 2012 6:57 AM To: user@cassandra.apache.org Subject: Re: Cassandra Data Archiving

I'm not sure of your needs, but the simplest thing to consider is snapshotting and copying off-node. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 1/06/2012, at 12:23 AM, Shubham Srivastava wrote: I need to archive my Cassandra data into another permanent storage, with two intents: 1. To shed the unused data from the live data. 2. To use the archived data for analytics or as a potential source for a data warehouse.
Any recommendations for the same in terms of strategies or tools to use. Regards, Shubham Srivastava | Technical Lead - Technology Development | MakeMyTrip.com, 243 SP Infocity, Udyog Vihar Phase 1, Gurgaon, Haryana - 122 016, India
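Samal's 15-day bucket formula from the reply above, trip_offer_cf_[ts - ts % (86400*15)], can be sketched as follows (the helper name is made up for illustration):

```python
def bucket_cf_name(prefix, epoch_seconds, days=15):
    # Round the timestamp down to the start of its 15-day window,
    # so every write in that window targets the same column family.
    bucket = epoch_seconds - epoch_seconds % (86400 * days)
    return "%s_%d" % (prefix, bucket)

# Two timestamps inside the same 15-day window map to the same CF name.
name_a = bucket_cf_name("trip_offer_cf", 1338508800)
name_b = bucket_cf_name("trip_offer_cf", 1338508800 + 1000)
```

Once the window rolls over, the previous CF becomes read-only and can be snapshotted and later dropped, keeping the live CF count fixed.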
Re: Query on how to count the total number of rowkeys and columns in them
The default count is 100; you can set this to some max value, but that won't guarantee the actual count. Something like paging can help in counting: use the last key of one query as the start of the second query, with end as null and count as some value. But this ports data to the client, whereas we only need the count. Another solution (if the count is really necessary) is to keep a separate counter CF and increment it whenever a key is inserted into the other CF. I would not use the Thrift API directly; the client libraries are very mature [1], and CQL is also very good.

[1] http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range

/Samal

On Thu, May 24, 2012 at 11:52 AM, Prakrati Agrawal prakrati.agra...@mu-sigma.com wrote: Hi, I am trying to learn Cassandra and I have one doubt. I am using the Thrift API; to count the number of row keys I am using KeyRange to specify the row keys. To count all of them, I specify the start and end as "new byte[0]". But the count is set to 100 by default. How do I use this method to count the keys if I don't know the actual number of keys in my Cassandra database? Please help me.

Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com
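The paging idea above (use the last key of one page as the start of the next, remembering that range queries are start-inclusive) can be sketched against an in-memory key list standing in for the cluster; this is a toy model, not Thrift or pycassa code:

```python
def count_keys(all_keys, page_size=100):
    # Count every row key by paging through a sorted key space.
    # fetch() mimics a start-inclusive range query, so each page after
    # the first re-fetches the previous last key, which we drop.
    keys = sorted(all_keys)

    def fetch(start, limit):
        rows = [k for k in keys if k >= start]
        return rows[:limit]

    total, start = 0, ""
    while True:
        page = fetch(start, page_size)
        if start:                  # drop the re-fetched boundary key
            page = page[1:]
        if not page:
            return total
        total += len(page)
        start = page[-1]
```

Note that every page still travels to the client just to be counted, which is why the thread suggests a dedicated counter CF when counts are needed often.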
Re: RE Ordering counters in Cassandra
In some cases Cassandra is really good and in some cases it is not. The way I see your approach, you are recording all of your events under a single key, is that right? Not recommended. It can grow really big, and if you have a cluster of servers it will hit only one server all the time and overwhelm it, while the rest sit idle. I would figure out what similar events are occurring and then bucket by those events, e.g. whether an event occurred from iOS or Android. If I bucket by an iOS key and an Android key, the counter gives me all events that occurred from iOS or Android. Key concatenation can also be used to filter more deeply: ios#safari, android#chrome. A smaller number of columns per row will make the reverse index more efficient.

/Samal

On Mon, May 21, 2012 at 11:53 PM, Tamar Fraenkel ta...@tok-media.com wrote: Indeed I took the no-delete approach. If the time-bucket rows are not that big, this is a good temporary solution. I just finished implementation and am testing now on a small staging environment. So far so good. Tamar. Sent from my iPod

On May 21, 2012, at 9:11 PM, Filippo Diotalevi fili...@ntoklo.com wrote: Hi Tamar, the solution you propose is indeed a temporary solution, but it might be the best one. Which approach did you follow? I'm a bit concerned about the deletion approach, since in case of concurrent writes on the same counter you might lose the pointer to the column to delete. -- Filippo Diotalevi

On Monday, 21 May 2012 at 18:51, Tamar Fraenkel wrote: I also had a similar problem. I have a temporary solution, which is not the best, but may be of help. I have the counter CF to count events, but apart from that I hold a leaders CF:

leaders = {
  // key is time bucket
  // values are composites (rank, event) ordered by
  // descending order of the rank
  // set relevant TTL on columns
  time_bucket1 : {
    composite(1000, event1) :
    composite(999, event2) :
  },
  ...
}

Whenever I increment the counter for a specific event, I add a column in the time-bucket row of the leaders CF with the new value of the counter and the event name. There are two ways to go here: either delete the old column(s) for that event (with lower counters) from the leaders CF, or let them be. If you choose to delete, there is the complication of not having getAndSet for counters, so you may end up not deleting all the old columns. If you choose not to delete old columns and live with duplicate columns for events (each with a different count), it will make your query to retrieve leaders run longer. Anyway, when you need to retrieve the leaders, you can do a slice query on the leaders CF and ignore duplicate events in the client (I use Java). This will happen less if you do delete old columns. Another option is not to use Cassandra for this purpose; http://redis.io/ is a nice tool. Will be happy to hear your comments. Thanks,

*Tamar Fraenkel* Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956

On Mon, May 21, 2012 at 8:05 PM, Filippo Diotalevi fili...@ntoklo.com wrote: Hi Romain, thanks for your suggestion. When you say "build every day a ranking in a dedicated CF by iterating over events", do you mean: load all the columns for the specified row key, iterate over each column, and write a new column in the inversed index? That's my current approach, but since I have many of these wide rows (1 per day), the process is extremely slow, as it involves moving an entire row from Cassandra to the client, inverting every column, and sending the data back to create the inversed index. -- Filippo Diotalevi

On Monday, 21 May 2012 at 17:19, Romain HARDOUIN wrote: If I understand, you've got a data model which looks like this:

CF Events:
row1: {
  event1: 1050,
  event2: 1200,
  event3: 830,
  ...
}

You can't query on column values, but you can build a ranking every day in a dedicated CF by iterating over events:

create column family Ranking with comparator = 'LongType(reversed=true)' ...

CF Ranking:
rank: {
  1200: event2,
  1050: event1,
  830: event3,
  ...
}

Then you can make a top ten or whatever you want, because counter values will be sorted.

Filippo Diotalevi fili...@ntoklo.com wrote on 21/05/2012 16:59:43: Hi, I'm trying to understand the best design for a simple ranking use case. I have, in a row, a good number (10k to a few 100k) of counters; each one is counting the occurrences of an event. At the end of the day, I want to create a ranking of the most frequent events. What's the best approach to perform this task? The brute-force approach of retrieving the row and ordering it doesn't work well (the call usually times out, especially if Cassandra is also under load); I also don't know in advance the full set of event names (column names), so it's difficult to slice the get call. Is there any trick
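Romain's daily inversion can be sketched in a few lines; a descending Python sort stands in for the ordering that the 'LongType(reversed=true)' comparator gives you on disk:

```python
def build_ranking(event_counts):
    # Invert an {event: count} row into (count, event) pairs sorted by
    # descending count, as the reversed-comparator Ranking CF stores them.
    return sorted(((c, e) for e, c in event_counts.items()), reverse=True)

row = {"event1": 1050, "event2": 1200, "event3": 830}
ranking = build_ranking(row)
top_two = [event for count, event in ranking[:2]]
```

A top-N query then becomes a cheap slice of the first N columns instead of a sort of the whole wide row on the client.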
Re: Number of keyspaces
Not ideally; Cassandra now has global memtable tuning, but each CF still corresponds to memory in RAM. Year-wise CFs mean a CF will be in a read-only state for the next year, yet its memtable will still consume RAM.

On 22-May-2012 5:01 PM, Franc Carter franc.car...@sirca.org.au wrote: On Tue, May 22, 2012 at 9:19 PM, aaron morton aa...@thelastpickle.com wrote: It's more the number of CFs than keyspaces. "Oh - does increasing the number of column families affect performance? The design we are working on at the moment is considering using a column family per year. We were thinking this would isolate compactions to a more manageable size, as we don't update previous years. cheers"

Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 22/05/2012, at 6:58 PM, R. Verlangen wrote: Yes, it does. However, there's no real answer as to the limit: it depends on your hardware and cluster configuration. You might even want to search the archives of this mailing list; I remember this has been asked before. Cheers!

2012/5/21 Luís Ferreira zamith...@gmail.com: Hi, does the number of keyspaces affect the overall Cassandra performance? Cumprimentos, Luís Ferreira

-- With kind regards, Robin Verlangen www.robinverlangen.nl

-- *Franc Carter* | Systems architect | Sirca Ltd franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Astyanax Error
Host not found in client. On 22-May-2012 4:34 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Hi All, Can any one suggest me why i am getting this error in Astyanax NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0] No hosts to borrow from Thanks In Advance Abhijit
Re: RE Ordering counters in Cassandra
Secondary indexes are not supported for counters; besides, you must know the column name in advance to put a secondary index on a regular column.

On 22-May-2012 5:34 PM, Filippo Diotalevi fili...@ntoklo.com wrote: Thanks for all the answers, they definitely helped. Just out of curiosity, is there any underlying architectural reason why it's not possible to order a row based on its counter values? Or is it something that might be on the roadmap in the future? -- Filippo Diotalevi

On Tuesday, 22 May 2012 at 08:48, Romain HARDOUIN wrote: I mean iterate over each column -- more precisely, *bunches of columns* using slices -- and write new columns in the inversed index. Tamar's data model is made for real-time analysis. It's maybe overdesigned for a daily ranking. I agree with Samal, you should split your data across the space of tokens. Only the CF Ranking feeding would be affected, not the top-N queries.

Filippo Diotalevi fili...@ntoklo.com wrote on 21/05/2012 19:05:28: Hi Romain, thanks for your suggestion. When you say "build every day a ranking in a dedicated CF by iterating over events", do you mean: load all the columns for the specified row key, iterate over each column, and write a new column in the inversed index? That's my current approach, but since I have many of these wide rows (1 per day), the process is extremely slow, as it involves moving an entire row from Cassandra to the client, inverting every column, and sending the data back to create the inversed index.
Re: supercolumns with TTL columns not being compacted correctly
Data will remain till the next compaction but won't be available. Compaction will delete the old sstable and create a new one. On 22-May-2012 5:47 PM, Pieter Callewaert pieter.callewa...@be-mobile.be wrote: Hi, I've had my suspicions for some months, but I think I am sure about it now. Data is being written by the SSTableSimpleUnsortedWriter and loaded by the sstableloader. The data should be alive for 31 days, so I use the following logic: int ttl = 2678400; long timestamp = System.currentTimeMillis() * 1000; long expirationTimestampMS = (long) ((timestamp / 1000) + ((long) ttl * 1000)); And using this to write it: sstableWriter.newRow(bytes(entry.id)); sstableWriter.newSuperColumn(bytes(superColumn)); sstableWriter.addExpiringColumn(nameTT, bytes(entry.aggregatedTTMs), timestamp, ttl, expirationTimestampMS); sstableWriter.addExpiringColumn(nameCov, bytes(entry.observationCoverage), timestamp, ttl, expirationTimestampMS); sstableWriter.addExpiringColumn(nameSpd, bytes(entry.speed), timestamp, ttl, expirationTimestampMS); This works perfectly: data can be queried until 31 days have passed, then no results are given, as expected. But the data is still on disk until the sstables are recompacted. One of our nodes (we have 6 in total) has the following sstables: [cassandra@bemobile-cass3 ~]$ ls -hal /data/MapData007/HOS-* | grep G -rw-rw-r--. 1 cassandra cassandra 103G May 3 03:19 /data/MapData007/HOS-hc-125620-Data.db -rw-rw-r--. 1 cassandra cassandra 103G May 12 21:17 /data/MapData007/HOS-hc-163141-Data.db -rw-rw-r--. 1 cassandra cassandra 25G May 15 06:17 /data/MapData007/HOS-hc-172106-Data.db -rw-rw-r--. 1 cassandra cassandra 25G May 17 19:50 /data/MapData007/HOS-hc-181902-Data.db -rw-rw-r--. 1 cassandra cassandra 21G May 21 07:37 /data/MapData007/HOS-hc-191448-Data.db -rw-rw-r--. 1 cassandra cassandra 6.5G May 21 17:41 /data/MapData007/HOS-hc-193842-Data.db -rw-rw-r--. 1 cassandra cassandra 5.8G May 22 11:03 /data/MapData007/HOS-hc-196210-Data.db -rw-rw-r--. 1 cassandra cassandra 1.4G May 22 13:20 /data/MapData007/HOS-hc-196779-Data.db -rw-rw-r--. 1 cassandra cassandra 401G Apr 16 08:33 /data/MapData007/HOS-hc-58572-Data.db -rw-rw-r--. 1 cassandra cassandra 169G Apr 16 17:59 /data/MapData007/HOS-hc-61630-Data.db -rw-rw-r--. 1 cassandra cassandra 173G Apr 17 03:46 /data/MapData007/HOS-hc-63857-Data.db -rw-rw-r--. 1 cassandra cassandra 105G Apr 23 06:41 /data/MapData007/HOS-hc-87900-Data.db As you can see, the following files should be invalid: /data/MapData007/HOS-hc-58572-Data.db /data/MapData007/HOS-hc-61630-Data.db /data/MapData007/HOS-hc-63857-Data.db because they were all written more than a month ago. gc_grace is 0, so this should also not be a problem. As a test, I used forceUserDefinedCompaction on HOS-hc-61630-Data.db. The expected behavior is that an empty file is written, because all data in the sstable should be invalid. compactionstats gives: compaction type keyspace column family bytes compacted bytes total progress Compaction MapData007 HOS 11518215662 532355279724 2.16% And when I ls the directory I find this: -rw-rw-r--. 1 cassandra cassandra 3.9G May 22 14:12 /data/MapData007/HOS-tmp-hc-196898-Data.db The sstable is being copied 1-on-1 to a new one. What am I missing here? TTL works perfectly, but is it a problem because it is in a super column, and so never deleted from disk? Kind regards Pieter Callewaert | Web IT engineer Be-Mobile NV http://www.be-mobile.be/ | TouringMobilis http://www.touringmobilis.be/ Technologiepark 12b - 9052 Ghent - Belgium Tel + 32 9 330 51 80 | Fax + 32 9 330 51 81 | Cell + 32 473 777 121
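The unit mix in the expiration logic above is easy to get wrong: the write timestamp is in microseconds, the TTL is in seconds, and the local expiration time is in milliseconds. A minimal sketch of just the arithmetic (plain Java, no Cassandra dependency; the variable names mirror the snippet above):

```java
public class TtlUnits {
    public static void main(String[] args) {
        int ttl = 2678400;                                 // 31 days, in seconds
        long timestamp = System.currentTimeMillis() * 1000; // write timestamp, microseconds
        // local expiration time in milliseconds: now (ms) + ttl converted to ms
        long expirationTimestampMS = (timestamp / 1000) + ((long) ttl * 1000);

        // sanity check: expiration is exactly 31 days after the write time, in ms
        long deltaMs = expirationTimestampMS - timestamp / 1000;
        System.out.println(deltaMs == 2678400L * 1000L); // prints "true"
    }
}
```

Note that the `(long) ttl * 1000` cast matters: `ttl * 1000` as a plain `int` multiplication (2678400 * 1000) would overflow before being widened.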
Re: Astyanax Error
Are you able to connect through the CLI? Can you share your client code? On 22-May-2012 5:59 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Samal, But I am setting up the host. On Tue, May 22, 2012 at 5:30 PM, samal samalgo...@gmail.com wrote: Host not found in client. On 22-May-2012 4:34 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Hi All, Can anyone suggest why I am getting this error in Astyanax? NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0] No hosts to borrow from Thanks In Advance Abhijit -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395
Re: Cassandra 0.8.5: Column name mystery in create column family command
Change your comparator to UTF8Type. On 22-May-2012 4:32 PM, Roshan Dawrani roshandawr...@gmail.com wrote: Hi, I use Cassandra 0.8.5 and am suddenly noticing some strange behavior. I run a create column family command with some column metadata and it runs fine, but when I do describe keyspace, it shows me different column names for those index columns. a) Here is what I run: create column family UserTemplate with comparator=BytesType and column_metadata=[{*column_name: userid*, validation_class: UTF8Type, index_type: KEYS, index_name: TemplateUserIdIdx}, {*column_name: type*, validation_class: UTF8Type, index_type: KEYS, index_name: TemplateTypeIdx}]; b) This is what describe keyspace shows: ColumnFamily: UserTemplate Key Validation Class: org.apache.cassandra.db.marshal.BytesType ... ... Built indexes: [UserTemplate.TemplateTypeIdx, UserTemplate.TemplateUserIdIdx] Column Metadata: *Column Name: ff* Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Name: TemplateUserIdIdx Index Type: KEYS *Column Name: 0dfffaff* Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Name: TemplateTypeIdx Index Type: KEYS Does anyone see why this might be happening? I have created many such column families before and never run into this issue. -- Roshan http://roshandawrani.wordpress.com/
Re: Cassandra 0.8.5: Column name mystery in create column family command
I am not able to reproduce this in the CLI. On 22-May-2012 8:12 PM, Roshan Dawrani roshandawr...@gmail.com wrote: Can you please let me know why? Because I have created very similar column families many times with comparator = BytesType, and never run into this issue before. Here is an example: ColumnFamily: Client Key Validation Class: org.apache.cassandra.db.marshal.BytesType Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.BytesType ... ... Built indexes: [Client.ACUserIdIdx] Column Metadata: Column Name: userid (757365726964) Validation Class: org.apache.cassandra.db.marshal.LexicalUUIDType Index Name: ACUserIdIdx Index Type: KEYS On Tue, May 22, 2012 at 6:16 PM, samal samalgo...@gmail.com wrote: Change your comparator to UTF8Type. On 22-May-2012 4:32 PM, Roshan Dawrani roshandawr...@gmail.com wrote: Hi, I use Cassandra 0.8.5 and am suddenly noticing some strange behavior. I run a create column family command with some column metadata and it runs fine, but when I do describe keyspace, it shows me different column names for those index columns. a) Here is what I run: create column family UserTemplate with comparator=BytesType and column_metadata=[{*column_name: userid*, validation_class: UTF8Type, index_type: KEYS, index_name: TemplateUserIdIdx}, {*column_name: type*, validation_class: UTF8Type, index_type: KEYS, index_name: TemplateTypeIdx}]; b) This is what describe keyspace shows: ColumnFamily: UserTemplate Key Validation Class: org.apache.cassandra.db.marshal.BytesType ... ... Built indexes: [UserTemplate.TemplateTypeIdx, UserTemplate.TemplateUserIdIdx] Column Metadata: *Column Name: ff* Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Name: TemplateUserIdIdx Index Type: KEYS *Column Name: 0dfffaff* Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Name: TemplateTypeIdx Index Type: KEYS Does anyone see why this might be happening?
I have created many such column families before and never run into this issue. -- Roshan http://roshandawrani.wordpress.com/ -- Roshan http://roshandawrani.wordpress.com/
Re: supercolumns with TTL columns not being compacted correctly
Thanks, I didn't know about the two-stage removal process. On 23-May-2012 2:20 AM, Jonathan Ellis jbel...@gmail.com wrote: Correction: the first compaction after expiration + gcgs can remove it, even if it hasn't been turned into a tombstone previously. On Tue, May 22, 2012 at 9:37 AM, Jonathan Ellis jbel...@gmail.com wrote: Additionally, it will always take at least two compaction passes to purge an expired column: one to turn it into a tombstone, and a second (after gcgs) to remove it. On Tue, May 22, 2012 at 9:21 AM, Yuki Morishita mor.y...@gmail.com wrote: Data will not be deleted when those keys appear in other sstables outside of the compaction. This is to prevent obsolete data from appearing again. yuki On Tuesday, May 22, 2012 at 7:37 AM, Pieter Callewaert wrote: Hi Samal, Thanks for your time looking into this. I force the compaction by using forceUserDefinedCompaction on only that particular sstable. This guarantees that the new sstable being written only contains the data from the old sstable. The data in the sstable is more than 31 days old and gc_grace is 0, but still the data from the old sstable is being written to the new one, while I am 100% sure all the data is invalid. Kind regards, Pieter Callewaert From: samal [mailto:samalgo...@gmail.com] Sent: Tuesday 22 May 2012 14:33 To: user@cassandra.apache.org Subject: Re: supercolumns with TTL columns not being compacted correctly Data will remain till the next compaction but won't be available. Compaction will delete the old sstable and create a new one. On 22-May-2012 5:47 PM, Pieter Callewaert pieter.callewa...@be-mobile.be wrote: Hi, I've had my suspicions for some months, but I think I am sure about it now. Data is being written by the SSTableSimpleUnsortedWriter and loaded by the sstableloader.
The data should be alive for 31 days, so I use the following logic: int ttl = 2678400; long timestamp = System.currentTimeMillis() * 1000; long expirationTimestampMS = (long) ((timestamp / 1000) + ((long) ttl * 1000)); And using this to write it: sstableWriter.newRow(bytes(entry.id)); sstableWriter.newSuperColumn(bytes(superColumn)); sstableWriter.addExpiringColumn(nameTT, bytes(entry.aggregatedTTMs), timestamp, ttl, expirationTimestampMS); sstableWriter.addExpiringColumn(nameCov, bytes(entry.observationCoverage), timestamp, ttl, expirationTimestampMS); sstableWriter.addExpiringColumn(nameSpd, bytes(entry.speed), timestamp, ttl, expirationTimestampMS); This works perfectly, data can be queried until 31 days are passed, then no results are given, as expected. But the data is still on disk until the sstables are being recompacted: One of our nodes (we got 6 total) has the following sstables: [cassandra@bemobile-cass3 ~]$ ls -hal /data/MapData007/HOS-* | grep G -rw-rw-r--. 1 cassandra cassandra 103G May 3 03:19 /data/MapData007/HOS-hc-125620-Data.db -rw-rw-r--. 1 cassandra cassandra 103G May 12 21:17 /data/MapData007/HOS-hc-163141-Data.db -rw-rw-r--. 1 cassandra cassandra 25G May 15 06:17 /data/MapData007/HOS-hc-172106-Data.db -rw-rw-r--. 1 cassandra cassandra 25G May 17 19:50 /data/MapData007/HOS-hc-181902-Data.db -rw-rw-r--. 1 cassandra cassandra 21G May 21 07:37 /data/MapData007/HOS-hc-191448-Data.db -rw-rw-r--. 1 cassandra cassandra 6.5G May 21 17:41 /data/MapData007/HOS-hc-193842-Data.db -rw-rw-r--. 1 cassandra cassandra 5.8G May 22 11:03 /data/MapData007/HOS-hc-196210-Data.db -rw-rw-r--. 1 cassandra cassandra 1.4G May 22 13:20 /data/MapData007/HOS-hc-196779-Data.db -rw-rw-r--. 1 cassandra cassandra 401G Apr 16 08:33 /data/MapData007/HOS-hc-58572-Data.db -rw-rw-r--. 1 cassandra cassandra 169G Apr 16 17:59 /data/MapData007/HOS-hc-61630-Data.db -rw-rw-r--. 1 cassandra cassandra 173G Apr 17 03:46 /data/MapData007/HOS-hc-63857-Data.db -rw-rw-r--. 
1 cassandra cassandra 105G Apr 23 06:41 /data/MapData007/HOS-hc-87900-Data.db As you can see, the following files should be invalid: /data/MapData007/HOS-hc-58572-Data.db /data/MapData007/HOS-hc-61630-Data.db /data/MapData007/HOS-hc-63857-Data.db because they were all written more than a month ago. gc_grace is 0, so this should also not be a problem. As a test, I used forceUserDefinedCompaction on HOS-hc-61630-Data.db. The expected behavior is that an empty file is written, because all data in the sstable should be invalid. compactionstats gives: compaction type keyspace column family bytes compacted bytes total progress Compaction MapData007 HOS 11518215662 532355279724 2.16% And when I ls the directory I find this: -rw-rw-r--. 1 cassandra cassandra 3.9G May 22 14:12 /data/MapData007/HOS-tmp-hc-196898-Data.db The sstable
Re: Composite Column
It is like using your super column inside the column name. empKey{ employee1+name:XX, employee1+addr:X, employee2+name:X, employee2+addr:X } Here all of your employee details are attached to one domain, i.e. all of employee1's details will be *employee1+[anything; n numbers of columns]* comparator=CompositeType(UTF8Type1,UTF8Type2,...,n) /Samal On Thu, May 17, 2012 at 10:40 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Aaron, Actually I am looking for a scenario where super columns are replaced by composite columns. Say this is a data model using a super column: rowKey{ superKey1 { Name, Address, City,. } } Actually I am confused about how exactly the data model will look if we use composite columns instead of a super column. Thanks, Abhijit On Wed, May 16, 2012 at 2:56 PM, aaron morton aa...@thelastpickle.com wrote: Abhijit, Can you explain the data model a bit more. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 15/05/2012, at 10:32 PM, samal wrote: It is just a column with a JSON value On Tue, May 15, 2012 at 4:00 PM, samal samalgo...@gmail.com wrote: I have not used CC, but yes you can. Below is not a composite column; it is a column with a JSON hash value. The column value can be anything you like; data inside the value is not indexed. On Tue, May 15, 2012 at 9:27 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Is it possible to create this data model with the help of composite columns? User_Keys_By_Last_Name = { Engineering : {anderson, 1 : ac1263, anderson, 2 : 724f02, ... }, Sales : { adams, 1 : b32704, alden, 1 : 1553bd, ... }, } I am using Astyanax. Please suggest... -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395 -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395
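The key property of composite column names is that all columns sharing a first component sort contiguously, so one slice fetches everything that the super column used to hold. A hypothetical sketch (plain Java, no client library; the employee names and fields are invented) using a sorted map to stand in for a row under CompositeType(UTF8Type, UTF8Type):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class CompositeSlice {
    public static void main(String[] args) {
        // one row: column names are "component1:component2", sorted lexically,
        // which mirrors how CompositeType(UTF8Type, UTF8Type) orders them
        TreeMap<String, String> row = new TreeMap<>();
        row.put("employee1:addr", "12 Main St");
        row.put("employee1:name", "XX");
        row.put("employee2:addr", "34 High St");
        row.put("employee2:name", "YY");

        // slice all columns whose first component is "employee1":
        // start at "employee1:" and stop before "employee1;" (';' sorts just after ':')
        SortedMap<String, String> slice = row.subMap("employee1:", "employee1;");
        System.out.println(slice.keySet()); // prints "[employee1:addr, employee1:name]"
    }
}
```

A real composite comparator compares each component separately rather than concatenated strings, but the slicing behaviour is the same: one contiguous range per first component.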
Re: Composite Column
Cassandra data models are best known for denormalized data. You must have an entry for the column name in another column family. The example is just for illustration; you should not put everything in a single row, that is very inefficient. I do not use composite columns. The way I and others do it is to use complex columns. rowkey=username columns=subsets of user employee1={ name: . . previous_job_xxx_company:null previous_job_yyy_company:null } employee2={ name: . . previous_job::aaa_company:null previous_job::yyy_company:null } Here you get the entire row (the row size is small) and filter related details by marker [*previous_job::* is the marker here and *xxx_company* is the real value which we need; the column value is not required (that depends on the requirement)]. There is a very good presentation by the DataStax folks http://www.datastax.com/2011/07/video-data-modeling-workshop-from-cassandra-sf-2011 and one by Joe http://www.youtube.com/watch?v=EBjWlH4NPMA ; it will help you understand data modeling. @samalgorai On Thu, May 17, 2012 at 12:29 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Samal, Thanks buddy for interpreting. Now suppose I am inserting data in a column family using this data model dynamically; as a result the column names will be dynamic. Now consider there is an entry for *employee1* *name*d Smith, and I want to retrieve that value? Regards, Abhijit On Thu, May 17, 2012 at 12:03 PM, samal samalgo...@gmail.com wrote: It is like using your super column inside the column name. empKey{ employee1+name:XX, employee1+addr:X, employee2+name:X, employee2+addr:X } Here all of your employee details are attached to one domain, i.e. all of employee1's details will be *employee1+[anything; n numbers of columns]* comparator=CompositeType(UTF8Type1,UTF8Type2,...,n) /Samal On Thu, May 17, 2012 at 10:40 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Aaron, Actually I am looking for a scenario where super columns are replaced by composite columns. Say this is a data model using a super column: rowKey{ superKey1 { Name, Address, City,. } } Actually I am confused about how exactly the data model will look if we use composite columns instead of a super column. Thanks, Abhijit On Wed, May 16, 2012 at 2:56 PM, aaron morton aa...@thelastpickle.com wrote: Abhijit, Can you explain the data model a bit more. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 15/05/2012, at 10:32 PM, samal wrote: It is just a column with a JSON value On Tue, May 15, 2012 at 4:00 PM, samal samalgo...@gmail.com wrote: I have not used CC, but yes you can. Below is not a composite column; it is a column with a JSON hash value. The column value can be anything you like; data inside the value is not indexed. On Tue, May 15, 2012 at 9:27 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Is it possible to create this data model with the help of composite columns? User_Keys_By_Last_Name = { Engineering : {anderson, 1 : ac1263, anderson, 2 : 724f02, ... }, Sales : { adams, 1 : b32704, alden, 1 : 1553bd, ... }, } I am using Astyanax. Please suggest... -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395 -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395 -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395
Re: How can I implement 'LIKE operation in SQL' on values while querying a column family in Cassandra
You cannot query by a partial column value. You can only query by value if the column has a secondary index, and even then the column value must match exactly. As Tamar suggested, you can put the value in the column name, with a UTF8 comparator: { 'name_abhijit'='abhijit' 'name_abhishek'='abhishek' 'name_atul'='atul' } Here you can do a slice query on the column names and get the desired result. /samal On Tue, May 15, 2012 at 3:29 PM, selam selam...@gmail.com wrote: Map-reduce jobs may solve your problem for batch processing On Tue, May 15, 2012 at 12:49 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Tamar, Can you please illustrate a little bit with some sample code? It is highly appreciated. Thanks, On Tue, May 15, 2012 at 10:48 AM, Tamar Fraenkel ta...@tok-media.com wrote: I don't think this is possible; the best you can do is a prefix match, if your order is alphabetical. For example, I have a CF with comparator UTF8Type, and then I can do a slice query and bring all columns that start with a prefix: start with the prefix, and end with the prefix with its last char replaced by the next one in order (i.e. aaa-aab). Hope that helps. *Tamar Fraenkel* Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Tue, May 15, 2012 at 7:56 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: I don't know the exact value in a column, but I want to do partial matching to find all available values that match. I want to do the same kind of operation that the LIKE operator in SQL does. Any help is highly appreciated. -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395 -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395 -- Regards, Timu EREN (a.k.a selam)
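Tamar's prefix trick can be sketched without a client library: the end of the slice is the prefix with its last character incremented, so every column name starting with the prefix falls inside the range. A hypothetical illustration (plain Java; the TreeMap stands in for a row sorted by a UTF8 comparator, and the column names are invented):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixSlice {
    // end of a prefix slice: the same prefix with its last character incremented
    static String sliceEnd(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("name_abhijit", "abhijit");
        row.put("name_abhishek", "abhishek");
        row.put("name_atul", "atul");

        // LIKE 'abhi%': slice from "name_abhi" (inclusive) to "name_abhj" (exclusive)
        SortedMap<String, String> hits = row.subMap("name_abhi", sliceEnd("name_abhi"));
        System.out.println(hits.keySet()); // prints "[name_abhijit, name_abhishek]"
    }
}
```

This only handles prefix matching ('abhi%'); an arbitrary '%abhi%' contains-match still needs a full scan or a map-reduce job, as noted above.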
Re: Composite Column
I have not used CC, but yes you can. Below is not a composite column; it is a column with a JSON hash value. The column value can be anything you like; data inside the value is not indexed. On Tue, May 15, 2012 at 9:27 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Is it possible to create this data model with the help of composite columns? User_Keys_By_Last_Name = { Engineering : {anderson, 1 : ac1263, anderson, 2 : 724f02, ... }, Sales : { adams, 1 : b32704, alden, 1 : 1553bd, ... }, } I am using Astyanax. Please suggest... -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395
Re: Composite Column
It is just a column with a JSON value On Tue, May 15, 2012 at 4:00 PM, samal samalgo...@gmail.com wrote: I have not used CC, but yes you can. Below is not a composite column; it is a column with a JSON hash value. The column value can be anything you like; data inside the value is not indexed. On Tue, May 15, 2012 at 9:27 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Is it possible to create this data model with the help of composite columns? User_Keys_By_Last_Name = { Engineering : {anderson, 1 : ac1263, anderson, 2 : 724f02, ... }, Sales : { adams, 1 : b32704, alden, 1 : 1553bd, ... }, } I am using Astyanax. Please suggest... -- Abhijit Chanda Software Developer VeHere Interactive Pvt. Ltd. +91-974395
Re: timezone time series data model
This will work. I tried it; both gave a unique one-day bucket. I just realized that if I sync all clients to one zone, the date will remain the same for all. A single-zone date will give a materialized view of the row. On Mon, Apr 30, 2012 at 11:43 PM, samal samalgo...@gmail.com wrote: Hmm. I will try both. Thanks On Mon, Apr 30, 2012 at 11:29 PM, Tyler Hobbs ty...@datastax.com wrote: Err, sorry, I should have said ts - (ts % 86400). Integer division does something similar. On Mon, Apr 30, 2012 at 12:39 PM, samal samalgo...@gmail.com wrote: Thanks, I hadn't noticed. I ran the script for 5 minutes: divide seems to produce a stable result; modulo is still changing. If divide is OK it will do the trick. I will run this script on the Singapore, East coast, and New Delhi servers all night today. == unix = 1335806983422 unix /1000= 1335806983.422 Divid i/86400 = 15460.728969907408 Divid i/86400 INT = 15460 Modulo i%86400= 62983 == == unix = 1335806985421 unix /1000= 1335806985.421 Divid i/86400 = 15460.72899306 Divid i/86400 INT = 15460 Modulo i%86400= 62985 == == unix = 1335806987422 unix /1000= 1335806987.422 Divid i/86400 = 15460.729016203704 Divid i/86400 INT = 15460 Modulo i%86400= 62987 == == unix = 1335806989422 unix /1000= 1335806989.422 Divid i/86400 = 15460.729039351852 Divid i/86400 INT = 15460 Modulo i%86400= 62989 == == unix = 1335806991421 unix /1000= 1335806991.421 Divid i/86400 = 15460.7290625 Divid i/86400 INT = 15460 Modulo i%86400= 62991 == == unix = 1335806993422 unix /1000= 1335806993.422 Divid i/86400 = 15460.729085648149 Divid i/86400 INT = 15460 Modulo i%86400= 62993 == == unix = 1335806995422 unix /1000= 1335806995.422 Divid i/86400 = 15460.729108796297 Divid i/86400 INT = 15460 Modulo i%86400= 62995 == == unix = 1335806997421 unix /1000= 1335806997.421 Divid i/86400 = 15460.72913195 Divid i/86400 INT = 15460 Modulo i%86400= 62997 == == unix = 1335806999422 unix /1000= 1335806999.422 Divid i/86400 = 15460.729155092593 Divid i/86400 INT = 15460 Modulo i%86400= 62999 == On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.com wrote: getTime() returns the number of milliseconds since the epoch, not the number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp If you divide that number by 1000, it should work. On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote: I did it with node.js but it is changing after some interval. code setInterval(function(){ var d = new Date().getTime(); console.log("=="); console.log("unix = ", d); i = parseInt(d); console.log("Divid i/86400 = ", i / 86400); console.log("Modulo i%86400 = ", i % 86400); console.log("=="); }, 2000); /code Am I doing something wrong? On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote: Correct, that's exactly what I'm saying. On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote: Thanks Tyler for the reply. Are you saying user1uuid_*{ts%86400}* would lead to a unique day bucket which is timezone ({NZ to US}) independent? I will try. On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote: Don't use dates or datestamps as the buckets for your row keys; use a unix timestamp modulo whatever size you want your bucket to be instead. Timestamps don't involve time zones or any of that nonsense. So, instead of having keys like user1uuid_30042012, the second half would be replaced by the current unix timestamp mod 86400 (the number of seconds in a day). On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote: Hello List, I need a suggestion/recommendation on time series data. I have a requirement where users belong to different timezones and can subscribe to a global group. When a user in a specific timezone sends an update to the group, it is available to every user in the other timezones. I am using a GroupSubscribedUsers CF where all updates to a group are pushed to each user's timeline; the key is timelined by useruuid_date (one day of updates from all groups) and the columns are group updates.
GroupSubscribedUsers ={ user1uuid_30042012:{//this user belongs to same timezone timeuuid1:JSON[group1update1
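The stable value in the log output earlier in this thread is the integer division, not the modulo: epoch seconds divided by 86400 is a timezone-independent day index, while the modulo is the offset *within* the day and changes every second. A minimal sketch (plain Java; the millisecond timestamps are taken from the log output above):

```java
public class DayBucket {
    // timezone-independent day bucket from a millisecond unix timestamp
    static long dayBucket(long unixMillis) {
        return (unixMillis / 1000L) / 86400L; // epoch seconds / seconds per day
    }

    public static void main(String[] args) {
        // two timestamps a few seconds apart on the same UTC day
        System.out.println(dayBucket(1335806983422L)); // prints "15460"
        System.out.println(dayBucket(1335806999422L)); // prints "15460"
        // the modulo, by contrast, keeps changing within the day
        System.out.println(1335806983422L / 1000L % 86400L); // prints "62983"
    }
}
```

So a row key like user1uuid_15460 buckets one UTC day regardless of where the client runs, which is what the thread converges on with ts - (ts % 86400) / integer division.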
timezone time series data model
Hello List, I need a suggestion/recommendation on time series data. I have a requirement where users belong to different timezones and can subscribe to a global group. When a user in a specific timezone sends an update to the group, it is available to every user in the other timezones. I am using a GroupSubscribedUsers CF where all updates to a group are pushed to each user's timeline; the key is timelined by useruuid_date (one day of updates from all groups) and the columns are group updates. GroupSubscribedUsers = { user1uuid_30042012: { //this user belongs to the same timezone timeuuid1:JSON[group1update1] timeuuid2:JSON[group2update2] timeuuid3:JSON[group1update2] timeuuid4:JSON[group4update1] }, user2uuid_30042012: { //this user belongs to a different timezone where the date has already changed to 1 May but 30 April is still getting updates timeuuid1:JSON[group1update1] timeuuid2:JSON[group2update2] timeuuid3:JSON[group1update2] timeuuid4:JSON[group4update1] timeuuid5:JSON[groupNupdate1] }, } I have noticed this approach is good for a single timezone; when different timezones come into the picture it breaks. I am thinking of: when a user pushes an update to a group - get the users who are subscribed to the group - check each user's timezone - push the time series in the user's timezone. So for one user the update will be on 30 April whereas others may have it on 29 April or 1 May; using timestamps I can find out how many hours ago the update came. Is there any better approach? Thanks, Samal
Re: Data model question, storing Queue Message
On Mon, Apr 30, 2012 at 4:25 PM, Morgan Segalis msega...@gmail.com wrote: Hi Aaron, Thank you for your answer, I was beginning to think that my question would never be answered ;-) Actually, this is what I was going for, except one thing: instead of partitioning rows per month, I thought about partitioning per day, so that every day I launch the cleaning tool and it deletes the day from X months earlier. USE the TTL feature of columns, as it will remove a column after its TTL is over (no need for a manual job). I guess that will reduce the workload drastically; does it have any downside compared to month partitioning? A key belongs to a particular node, so depending on the size of your data, day- or month-wise partitioning matters. Otherwise it can lead to a fat row, which will cause the system problems. At one point I was going to do something like the twissandra example: having a CF per user's queue, and another CF per day storing every message ID of the day; that way if I want to delete them, I only look into this row and delete them using the IDs in the user's queue CF… Is that a good way to do it? Or should I stick with the first implementation? Best regards, Morgan. On 30 Apr 2012 at 05:52, aaron morton wrote: Message Queue is often not a great use case for Cassandra. For information on how to handle high delete workloads see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra It is hard to create a model without some idea of the data load, but I would suggest you start with: CF: UserMessages Key: ReceiverID Columns: column name = TimeUUID; column value = message ID and body That will order the messages by time. Depending on load (and to support deleting a previous month's messages) you may want to partition the rows by month: CF: UserMessagesMonth Key: ReceiverID+MM Columns: column name = TimeUUID; column value = message ID and body Everything is the same as before, but now a user has a row for each month, which you can delete as a whole.
This also helps avoid very big rows. I really don't think that storage will be an issue; I have 2TB per node, and messages are limited to 1KB. I would suggest you keep the per-node limit to 300 to 400 GB. It can take a long time to compact, repair and move the data when it gets above 400GB. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 27/04/2012, at 1:30 AM, Morgan Segalis wrote: Hi everyone! I'm fairly new to Cassandra and not quite yet familiar with the column-oriented NoSQL model. I have worked on it a while, but I can't seem to find the best model for what I'm looking for. I have Erlang software that lets users connect and communicate with each other; when a user (A) sends a message to a disconnected user (B), it stores the message in the database and waits for user (B) to connect, retrieve the message queue, and delete it. Here are some key points: - Users are identified by integer IDs - Each message is unique by the combination of: Sender ID - Receiver ID - Message ID - time I have a message queue, and here are the operations I would need to do as fast as possible: - Store from 1 to X messages per registered user - Get the number of stored messages per user (can be an incremental variable updated at each store // this is retrieved often) - Retrieve all messages for a user at once - Delete all messages for a user at once - Delete all messages that are older than Y months (from all users). I really don't think that storage will be an issue; I have 2TB per node, and messages are limited to 1KB. I'm really looking for speed rather than storage optimization. My configuration is 2 dedicated servers which are both: - 4 x Intel i7 2.66 GHz - 64 bits - 24 GB - 2 TB Thank you all.
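Aaron's ReceiverID+MM partitioning can be sketched as a small key-building helper. This is a hypothetical illustration (the rowKey helper, separator, and yyyyMM format are assumptions; any encoding that is unique per user per month works). Formatting in UTC keeps a given timestamp in the same bucket on every node:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class MonthlyRowKey {
    // hypothetical helper: row key = receiverId + "_" + yyyyMM, in UTC,
    // so a whole month of messages can be deleted as one row
    static String rowKey(long receiverId, long unixMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMM");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return receiverId + "_" + fmt.format(new Date(unixMillis));
    }

    public static void main(String[] args) {
        // 2012-04-30 and 2012-05-01 land in different rows for the same user
        System.out.println(rowKey(42L, 1335744000000L)); // 2012-04-30 00:00 UTC, prints "42_201204"
        System.out.println(rowKey(42L, 1335830400000L)); // 2012-05-01 00:00 UTC, prints "42_201205"
    }
}
```

"Delete all messages older than Y months" then becomes a row deletion per user per expired month, instead of scanning and deleting individual columns.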
Re: Data model question, storing Queue Message
On Mon, Apr 30, 2012 at 5:52 PM, Morgan Segalis msega...@gmail.com wrote: Hi Samal, Thanks for the TTL feature, I wasn't aware of its existence. Day partitioning will be less wide than month partitioning (about 30 times less, give or take ;-) ). Per day it should have something like 100,000 messages stored; most of them would be retrieved, and so deleted, before the TTL comes to do its work. The TTL is the last moment a column can exist in the Cassandra world; after that it is deleted. Deleting before the TTL is fine. Have you considered KAFKA http://incubator.apache.org/kafka/ ? On 30 Apr 2012 at 13:16, samal wrote: On Mon, Apr 30, 2012 at 4:25 PM, Morgan Segalis msega...@gmail.com wrote: Hi Aaron, Thank you for your answer, I was beginning to think that my question would never be answered ;-) Actually, this is what I was going for, except one thing: instead of partitioning rows per month, I thought about partitioning per day, so that every day I launch the cleaning tool and it deletes the day from X months earlier. USE the TTL feature of columns, as it will remove a column after its TTL is over (no need for a manual job). I guess that will reduce the workload drastically; does it have any downside compared to month partitioning? A key belongs to a particular node, so depending on the size of your data, day- or month-wise partitioning matters. Otherwise it can lead to a fat row, which will cause the system problems. At one point I was going to do something like the twissandra example: having a CF per user's queue, and another CF per day storing every message ID of the day; that way if I want to delete them, I only look into this row and delete them using the IDs in the user's queue CF… Is that a good way to do it? Or should I stick with the first implementation? Best regards, Morgan. On 30 Apr 2012 at 05:52, aaron morton wrote: Message Queue is often not a great use case for Cassandra.
For information on how to handle high delete workloads see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra It hard to create a model without some idea of the data load, but I would suggest you start with: CF: UserMessages Key: ReceiverID Columns : column name = TimeUUID ; column value = message ID and Body That will order the messages by time. Depending on load (and to support deleting a previous months messages) you may want to partition the rows by month: CF: UserMessagesMonth Key: ReceiverID+MM Columns : column name = TimeUUID ; column value = message ID and Body Everything the same as before. But now a user has a row for each month and which you can delete as a whole. This also helps avoid very big rows. I really don't think that storage will be an issue, I have 2TB per nodes, messages are 1KB limited. I would suggest you keep the per node limit to 300 to 400 GB. It can take a long time to compact, repair and move the data when it gets above 400GB. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 27/04/2012, at 1:30 AM, Morgan Segalis wrote: Hi everyone ! I'm fairly new to cassandra and I'm not quite yet familiarized with column oriented NoSQL model. I have worked a while on it, but I can't seems to find the best model for what I'm looking for. I have a Erlang software that let user connecting and communicate with each others, when an user (A) sends a message to a disconnected user (B), it stores it on the database and wait for the user (B) to connect and retrieve the message queue, and deletes it. 
Here's some key point : - Users are identified by integer IDs - Each message are unique by combination of : Sender ID - Receiver ID - Message ID - time I have a queue Message, and here's the operations I would need to do as fast as possible : - Store from 1 to X messages per registered user - Get the number of stored messages per user (Can be a incremental variable updated at each store // this is often retrieved) - retrieve all messages from an user at once. - delete all messages from an user at once. - delete all messages that are older than Y months (from all users). I really don't think that storage will be an issue, I have 2TB per nodes, messages are 1KB limited. I'm really looking for speed rather than storage optimization. My configuration is 2 dedicated server which are both : - 4 x Intel i7 2.66 Ghz - 64 bits - 24 Go - 2 TB Thank you all.
Re: timezone time series data model
Thanks Tyler for the reply. Are you saying user1uuid_*{ts % 86400}* would lead to a unique day bucket that is timezone-independent (NZ to US)? I will try it.

On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote: Don't use dates or datestamps as the buckets for your row keys; use a unix timestamp modulo whatever size you want your bucket to be instead. Timestamps don't involve time zones or any of that nonsense. So, instead of having keys like user1uuid_30042012, the second half would be replaced by the current unix timestamp mod 86400 (the number of seconds in a day).

On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote: Hello List, I need a suggestion/recommendation on time-series data. I have a requirement where users belong to different timezones and can subscribe to a global group. When a user in one timezone sends an update to a group, it becomes available to every user in the other timezones. I am using a GroupSubscribedUsers CF where all updates to a group are pushed to each user's timeline; the key is useruuid_date (one day of updates from all groups) and the columns are the group updates. GroupSubscribedUsers = { user1uuid_30042012: { // this user belongs to the same timezone timeuuid1: JSON[group1update1] timeuuid2: JSON[group2update2] timeuuid3: JSON[group1update2] timeuuid4: JSON[group4update1] }, user2uuid_30042012: { // this user belongs to a different timezone where the date has already changed to 1 May, but 30 April is still getting updates timeuuid1: JSON[group1update1] timeuuid2: JSON[group2update2] timeuuid3: JSON[group1update2] timeuuid4: JSON[group4update1] timeuuid5: JSON[groupNupdate1] }, } I have noticed this approach is good for a single timezone; when different timezones come into the picture it breaks. I am thinking that when a user pushes an update to a group: get the users subscribed to the group - check each user's timezone - push to the time series in that user's timezone.
So for one user the update will be on 30 April, whereas for another it may be on 29 April or 1 May; using timestamps I can work out how many hours ago the update came. Is there a better approach? Thanks, Samal -- Tyler Hobbs DataStax http://datastax.com/
Re: timezone time series data model
I tried it with node.js, but the value keeps changing after some interval:

setInterval(function(){
  var d = new Date().getTime();
  console.log('==');
  console.log('unix = ', d);
  i = parseInt(d);
  console.log('Divide i/86400 = ', i / 86400);
  console.log('Modulo i%86400 = ', i % 86400);
  console.log('==');
}, 2000);

Am I doing something wrong?

On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote: Correct, that's exactly what I'm saying. On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote: [earlier messages in this thread quoted; snipped] -- Tyler Hobbs DataStax http://datastax.com/
Re: timezone time series data model
Thanks, I hadn't noticed that. I ran the script for 5 minutes: divide seems to produce a stable result, while modulo keeps changing. If divide is OK it will do the trick. I will run this script on the Singapore, East coast and New Delhi servers all night today.

== unix = 1335806983422 unix/1000 = 1335806983.422 Divide i/86400 = 15460.728969907408 Divide i/86400 INT = 15460 Modulo i%86400 = 62983 ==
== unix = 1335806985421 unix/1000 = 1335806985.421 Divide i/86400 = 15460.72899306 Divide i/86400 INT = 15460 Modulo i%86400 = 62985 ==
[... seven more iterations, all with the same INT bucket 15460 while the modulo keeps increasing ...]
== unix = 1335806999422 unix/1000 = 1335806999.422 Divide i/86400 = 15460.729155092593 Divide i/86400 INT = 15460 Modulo i%86400 = 62999 ==

On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.com wrote: getTime() returns the number of milliseconds since the epoch, not the number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp If you divide that number by 1000, it should work.

On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote: [earlier messages in this thread quoted; snipped]
Re: timezone time series data model
Hmm, I will try both. Thanks.

On Mon, Apr 30, 2012 at 11:29 PM, Tyler Hobbs ty...@datastax.com wrote: Err, sorry, I should have said ts - (ts % 86400). Integer division does something similar.

On Mon, Apr 30, 2012 at 12:39 PM, samal samalgo...@gmail.com wrote: [earlier messages in this thread quoted; snipped]
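To summarize the fix this thread converges on: getTime() returns milliseconds, so divide by 1000 first, then truncate to the day with ts - (ts % 86400), or use integer division. A minimal sketch (function names are illustrative); the sample timestamp comes from samal's log output:

```javascript
// Both forms below are epoch-based, so the bucket is identical on every
// server regardless of its local timezone.
function dayBucket(millis) {
  const secs = Math.floor(millis / 1000); // getTime() is in milliseconds
  return secs - (secs % 86400);           // Tyler's ts - (ts % 86400)
}

function dayBucketByDivision(millis) {
  return Math.floor(millis / 1000 / 86400); // integer-division variant
}

const ms = 1335806983422;             // from the log output above
console.log(dayBucket(ms));           // 1335744000
console.log(dayBucketByDivision(ms)); // 15460
```

Either value is stable for a whole day and can be appended to the row key in place of a datestamp.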
Re: Cassandra and harddrives
Each node needs its own hard drive; a drive cannot be shared with other nodes. Cassandra itself takes care of keeping multiple copies of the data across the nodes.

On Thu, Apr 26, 2012 at 8:52 AM, Benny Rönnhager benny.ronnha...@thrutherockies.com wrote: Hi! I am building a database with several hundred thousand images. I have just learned that HAProxy is a very good frontend to a couple of Cassandra nodes. I understand how that works but... Must every single node (Mac mini) have its own external hard drive with the same data (images), or can I just use one hard drive that can be accessed by all nodes? What is the recommended way to do this? Thanks in advance. Benny
Re: How to store a list of values?
Yeah, agreed: it only matters for time-bucketed data.

On Tue, Mar 27, 2012 at 12:31 PM, R. Verlangen ro...@us2.nl wrote: That's true, but it does not sound like a real problem to me. Maybe someone else can shed some light upon this. 2012/3/27 samal samalgo...@gmail.com On Tue, Mar 27, 2012 at 1:47 AM, R. Verlangen ro...@us2.nl wrote: but any schema change will break it How do you mean? You don't have to specify the columns in Cassandra, so it should work perfectly. Except that the skill~ prefix is reserved for your list: if skill~ is later changed to skill::, it needs to be handled at the app level, or otherwise you have to update every row, reading it first, modifying it, inserting the new version and deleting the old version. -- With kind regards, Robin Verlangen www.robinverlangen.nl
Re: How to store a list of values?
I would take a simple approach: create another CF, UserSkill, with the same row key as the profile CF. In the UserSkill CF, add each skill as a column name with a null value. Columns can be added or removed freely.

UserProfile = { '*ben*' = { blah: blah blah: blah blah: blah } }

UserSkill = { '*ben*' = { 'java': '' 'cassandra': '' . . . 'linux': '' 'skill': 'infinity' } }

On Mon, Mar 26, 2012 at 12:34 PM, Ben McCann b...@benmccann.com wrote: I have a profile column family and want to store a list of skills in each profile. In BigTable I could store a Protocol Buffer http://code.google.com/apis/protocolbuffers/docs/overview.html with a repeated field, but I'm not sure how this is typically accomplished in Cassandra. One option would be to store a serialized Thrift http://thrift.apache.org/ or protobuf struct, but I'd prefer not to do this, as I believe Cassandra has no knowledge of these formats, so the data in the datastore would not be human-readable in CQL queries from the command line. The other solution I thought of would be to use a super column and put a random UUID as the key for each skill: skills: { '4b27c2b3ac48e8df': 'java', '84bf94ea7bc92018': 'c++', '9103b9a93ce9d18': 'cobol' } Is this a good way of handling lists in Cassandra? I imagine there's some idiom I'm not aware of. I'm using the Astyanax https://github.com/Netflix/astyanax/wiki client library, which only supports composite columns instead of super columns, so the solution I proposed above would seem quite awkward in that case. I'm still having some trouble understanding composite columns, as they seem not to be completely documented yet. Would this solution work with composite columns? Thanks, Ben
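The valueless-column idiom samal describes at the top of this message (one UserSkill row per user, one column per skill) can be mimicked with a plain in-memory object to show why adds and removes are single-column, write-only operations. This is an illustration, not a Cassandra client:

```javascript
// Skills live as column names under the same row key as the profile.
// Adding or removing a skill touches exactly one column: no
// read-modify-write cycle is needed.
const userSkill = { ben: { java: '', cassandra: '' } };

function addSkill(user, skill) {
  userSkill[user][skill] = ''; // insert one valueless column
}
function removeSkill(user, skill) {
  delete userSkill[user][skill]; // delete one column
}

addSkill('ben', 'linux');
removeSkill('ben', 'java');
console.log(Object.keys(userSkill.ben)); // [ 'cassandra', 'linux' ]
```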
Re: How to store a list of values?
Plus, it is fully compatible with CQL: SELECT * FROM UserSkill WHERE KEY='ben';

On Mon, Mar 26, 2012 at 9:13 PM, samal samalgo...@gmail.com wrote: [previous message quoted; snipped]
Re: How to store a list of values?
On Mon, Mar 26, 2012 at 9:20 PM, Ben McCann b...@benmccann.com wrote: Thanks for the reply Samal. I did not realize that you could store a column with null value.

Values can be null or any value, for example:

[default@node] set hus['test']['wowq']='\{de\'.de\;\}\+\^anything';
Value inserted. Elapsed time: 4 msec(s).
[default@node] get hus['test'];
= (column=wow, value={de.de;}, timestamp=133222503000)
= (column=wowq, value={de'.de;}+^anything, timestamp=133267425000)
Returned 2 results. Elapsed time: 65 msec(s).

Do you know if this solution would work with composite columns? It seems super columns are being phased out in favor of composites, but I do not understand composites very well yet. Personally, I phased out super columns a year back. I haven't dug much into composite columns, but I know that both the key and the column name can be composite: 'ben'+'task1' = { utf8+ascii: '' }

I'm trying to figure out if there's any way to accomplish what you've suggested using Astyanax https://github.com/Netflix/astyanax. This is the simplest approach and should work with every client available, since it is an independent CF; two calls are required here.

Thanks for the help, Ben

On Mon, Mar 26, 2012 at 8:46 AM, samal samalgo...@gmail.com wrote: [earlier messages in this thread quoted; snipped]
Re: How to store a list of values?
Save the skills in a single column in JSON format; job done. This is good if there is a fixed set of skills, but then any add or delete has to be handled in the app: read the column first, reformat the JSON, and update the column (2 thrift calls).

skill~Java: null, skill~Cassandra: null This is also a good option, but any schema change will break it.

On Mar 26, 2012 7:04 PM, Ben McCann b...@benmccann.com wrote: True. But I don't need the skills to be searchable, so I'd rather embed them in the user than add another top-level CF. I was thinking of doing something along the lines of adding a skills super column to the User table: skills: { 'java': null, 'c++': null, 'cobol': null } However, I'm still not sure yet how to accomplish this with Astyanax. I've only figured out how to make composite columns with predefined column names with it, not dynamic column names like this.

On Mon, Mar 26, 2012 at 9:08 AM, R. Verlangen ro...@us2.nl wrote: In this case you only need the columns for values. You don't need the column values to hold multiple columns (the super-column principle), so a normal CF would work.

2012/3/26 Ben McCann b...@benmccann.com [earlier messages in this thread quoted; snipped] -- With kind regards, Robin Verlangen www.robinverlangen.nl
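For contrast, the single-JSON-column option discussed above forces a read-modify-write cycle on every change. A minimal sketch, with a plain string standing in for the column value (no client library involved):

```javascript
// One column holds the whole skill list as a JSON string, so every
// change costs two calls: read the column, then write it back.
let skillsColumn = JSON.stringify(['java', 'cobol']);

function addSkillJson(skill) {
  const skills = JSON.parse(skillsColumn); // call 1: read the column
  skills.push(skill);                      // reformat the JSON
  skillsColumn = JSON.stringify(skills);   // call 2: write it back
}

addSkillJson('cassandra');
console.log(skillsColumn); // ["java","cobol","cassandra"]
```

In Cassandra, read-then-write like this also risks lost updates under concurrency, which is why the thread prefers per-field columns for frequently changing data.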
Re: How to store a list of values?
On Tue, Mar 27, 2012 at 1:47 AM, R. Verlangen ro...@us2.nl wrote: but any schema change will break it How do you mean? You don't have to specify the columns in Cassandra, so it should work perfectly. Except that the skill~ prefix is reserved for your list: if skill~ is later changed to skill::, it needs to be handled at the app level, or otherwise you have to update every row, reading it first, modifying it, inserting the new version and deleting the old version.
Re: Re: Cassandra DataModeling recommendations
On Mon, Dec 5, 2011 at 3:06 PM, pco...@cegetel.net wrote: Hi, thanks for the answer. When I read the book on Cassandra I was not aware of composite keys, which I recently discovered. *Composite types are useful for handling data versions.* *You mentioned a TTL and letting the database remove the data for me. I never read about that. Is it possible without an external batch?* *Yes. If a TTL is set on a column, Cassandra auto-deletes the column for you.* I will try to rephrase my goal in any case: Storage: - I would like to store several carts (BLOBs) for a user (identified by its id). - Associated with these carts, I would like to attach metadata like expiration date and possibly others. Queries/tasks: - I would like to be able to retrieve all the carts of a given userId. *I would use a timeline with TTL for carts as a separate CF, and a reverse index from cart_id in a userId CF with TTL set on the columns.* - I would like to have a means to remove expired carts. *Set a TTL on each column.* 1. cartCF { *cart1_uuidkey: { metadata_column: ttl } cart2_uuidkey: { metadata_column: ttl } . . . cartN_uuidkey: { metadata_column: ttl }* } 2. userIdCF: { *user1: { id: user1 // *hack: keep one column with no TTL to prevent unwanted behavior (the row vanishing once all columns expire)* cart1: cart1_uuidkey: ttl cart2: ttl cart3: ttl } user2: { id: user2 cart1: cartX_uuidkey: ttl cart2: cart4: ttl cart3: cartM: ttl }* } /Samal
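The TTL behavior samal describes can be stated precisely: a column written with a TTL stops being readable once its write time plus the TTL has passed, with no external batch job needed. A small illustrative sketch (the timestamps and the one-week TTL are made up for the example):

```javascript
// Models the expiry rule for a TTL'd column: expired once
// writeTime + ttl has passed. Times are unix seconds.
function isExpired(writeTimeSecs, ttlSecs, nowSecs) {
  return nowSecs >= writeTimeSecs + ttlSecs;
}

const cartWrittenAt = 1323000000;  // hypothetical write time
const ttl = 7 * 86400;             // cart expires after a week

console.log(isExpired(cartWrittenAt, ttl, cartWrittenAt + 86400));     // false
console.log(isExpired(cartWrittenAt, ttl, cartWrittenAt + 8 * 86400)); // true
```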
Re: node.js library?
On Mon, Dec 5, 2011 at 7:59 PM, Norman Maurer norman.mau...@googlemail.com wrote: As far as I know it's the library that was developed by Rackspace. See https://github.com/racker/node-cassandra-client *That repository is no longer maintained; the code has moved to a separate project on apache-extras.* 2011/12/5 Joe Stein crypt...@gmail.com Hey folks, so I have been noodling on using node.js as a new front end for the system I built for doing real-time aggregate metrics within our distributed systems. Does anyone have experience or background on this lib? http://code.google.com/a/apache-extras.org/p/cassandra-node/ It seems to be the most up-to-date one, supporting CQL only (which should not be an issue), but I was not sure whether it is maintained or what the background story on it is. Any other experiences/horror stories/over-the-rainbow stories with node.js and C* would be nice to hear. /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop */
Re: Setting Key Validation Class
key_validation_class is different from validation_class; both default to BytesType. key_validation_class validates the row key, default_validation_class validates column values, and the comparator applies to column names. default_validation_class is the CF-wide scope of validation_class; the per-column validation_class entries in column_metadata only cover the listed columns and do not affect the key, so yes, you need to define key_validation_class if you want typed row keys.

On Mon, Dec 5, 2011 at 10:10 PM, Dinusha Dilrukshi sdddilruk...@gmail.com wrote: Hi, I am using apache-cassandra-1.0.0 and I tried to insert/retrieve data in a column family using a cassandra-jdbc program. Here is how I created the 'USER' column family using cassandra-cli: create column family USER with comparator=UTF8Type and column_metadata=[{column_name: user_id, validation_class: UTF8Type, index_type: KEYS}, {column_name: username, validation_class: UTF8Type, index_type: KEYS}, {column_name: password, validation_class: UTF8Type}]; But when I try to insert data into the USER column family it gives the error java.sql.SQLException: Mismatched types: java.lang.String cannot be cast to java.nio.ByteBuffer. Since I have set user_id as a KEY and its validation_class as UTF8Type, I expected the Key Validation Class to be UTF8Type. But when I look at the metadata of the USER column family it shows Key Validation Class: org.apache.cassandra.db.marshal.BytesType, which is the cause of the above error. When I created the USER column family as follows, it solved the issue: create column family USER with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[{column_name: user_id, validation_class: UTF8Type, index_type: KEYS}, {column_name: username, validation_class: UTF8Type, index_type: KEYS}, {column_name: password, validation_class: UTF8Type}]; Do we always need to define *key_validation_class* as in the above query? Isn't it enough to add validation classes for each column? Regards, ~Dinusha~
Re: OutOfMemory Exception during bootstrap
Lower your heap size if you are testing multiple instances on a single machine: https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh#L64

On Sun, Dec 4, 2011 at 11:08 PM, Harald Falzberger h.falzber...@gmail.com wrote: Hi, I'm trying to set up a test environment with 2 nodes on one physical machine with two IPs. I configured both as advised in the documentation: cluster_name: 'MyDemoCluster' initial_token: 0 seed_provider: - seeds: IP1 listen_address: IP1 rpc_address: IP1 cluster_name: 'MyDemoCluster' initial_token: 85070591730234615865843651857942052864 seed_provider: - seeds: IP1 listen_address: IP2 rpc_address: IP2 Node1 uses 7199 as its JMX port, Node2 7198, because JMX by default listens on all interfaces. When I bootstrap node2, the following exception is thrown on node1 and node1 terminates. The same error occurs again if I try to restart node1 while node2 is still running. Does anyone have an idea why this happens? I'm starting each Cassandra instance with 16 GB RAM and my database is empty. Exception on Node1: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703) at java.util.concurrent.ThreadPoolExecutor.prestartAllCoreThreads(ThreadPoolExecutor.java:1384) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:77) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:65) at org.apache.cassandra.concurrent.StageManager.multiThreadedStage(StageManager.java:58) at org.apache.cassandra.concurrent.StageManager.clinit(StageManager.java:44) at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:512) at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:159)
Re: Seeking advice on Schema and Caching
Edanuff + Beautiful People. I think the row cache could be the best fit, but it can take resources depending on row size. It will only touch disk once (the first time, to read the SSTable); the rest of the requests for that row will be served from memory. Try increasing the row cache size and decreasing the save period to an appropriate value. *Row cache size / save period in seconds:* 200/30. One catch: this is only good for small rows. Since one row contains all entries sharing the same first 3 characters, one row could become very large while others remain very thin. E.g. many people are named Aditya: adi { {tya,1} . . } but only a few people have names starting with x or y.

On Thu, Nov 17, 2011 at 3:29 AM, Aditya ady...@gmail.com wrote: Thanks to samal who pointed me to composite columns. I am now using composite column names containing username+userId as valueless columns. Column names are now unique even for users with the same name, since the userId is attached to the composite column name, so the supercolumn issue is resolved. But I am still seeking advice on the caching strategy for these rows. While a user is doing a search, the DB will be queried multiple times, because I'm not keeping the retrieved columns in the application layer. I am therefore thinking of caching this row so that further queries are served from the cache. The important point here is that I want to use very few resources for this cache, so that rows remain cached only for a short time, just enough to serve a single search interval of at most 30 seconds. Is this approach correct? That way I won't be putting unnecessary data in the cache for a long time, saving resources for other needs.

On Wed, Nov 16, 2011 at 11:20 AM, samal samalgo...@gmail.com wrote: I think you can, but I am not sure; I haven't tried that yet. There is no harm in keeping the value as well, since it will be read in a single query anyway. In the 2nd case, yes, 2 or more queries are required to get specific user details.
As the username maps to the user_id key (unique, like a UUID), and the user_id key stores the actual details.

On Wed, Nov 16, 2011 at 11:10 AM, Aditya Narayan ady...@gmail.com wrote: Regarding the first option that you suggested, using composite columns: can I store both the username and the id in the column name and keep the column valueless? Will I be able to retrieve both the username and the id from the composite column name? Thanks a lot.

On Wed, Nov 16, 2011 at 10:56 AM, Aditya Narayan ady...@gmail.com wrote: Got the first option that you suggested. However, in the second one, are you suggesting to use, e.g., key='Marcos' and store columns, one per user of that name, containing the userId inside that row? That way it would have to read multiple rows while the user is doing a single search.

On Wed, Nov 16, 2011 at 10:47 AM, samal samalgo...@gmail.com wrote: I need to add 'search users' functionality to my application. (The trigger for fetching searched items, like Google instant search, fires once 3 letters have been typed.) For this, I make a CF with string-type keys. Each key is made of the first 3 letters of a user's name, so all names starting with 'Mar-' are stored in a single row (with key=Mar). The column names are the remaining letters of the names; thus a name 'Marcos' will be stored under row key Mar with column name cos. The id will be stored as the column value. Since there could be many users with the same name, I would have multiple userIds (of users named Marcos) to store inside column name cos under key Mar. Thus: 1. A supercolumn seems a better fit for my use case (so that the ids of users with the same name fit as sub-columns inside a supercolumn), but since supercolumns are discouraged, I want an alternative schema for this use case if possible. Could you suggest some ideas on this?

Aditya, have you given any thought to composite columns [1]? I think they can help you solve the problem of multiple users with the same name.
mar: {
    {cos, unique_user_id}: unique_user_id,
    {cos, 1}: 1,
    {cos, 2}: 2,
    {cos, 3}: 3,
    // {utf8, timeUUID}: timeUUID,
}

OR you can try wide-row indexing of user names to IDs:

marcos: {
    user1: '',
    user2: '',
    user3: ''
}

[1] http://www.slideshare.net/edanuff/indexing-in-cassandra
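As a minimal sketch of why the composite-column layout above works for exact-name lookup, here is an in-memory stand-in: a plain dict of sorted tuples plays the role of the CF, and all names and ids are made up for illustration. Composite columns sort by their first component, so every id for one name is a contiguous slice of the row.

```python
from bisect import bisect_left, bisect_right

# Stand-in for the CF: row key (first 3 letters) -> sorted list of
# composite column names (rest_of_name, user_id).
cf = {
    "mar": sorted([("cos", 1), ("cos", 2), ("cos", 3), ("tin", 7)]),
}

def lookup(name):
    """Return all user ids for an exact name: one row read, one slice."""
    row, rest = name[:3], name[3:]
    cols = cf.get(row, [])
    # All ids for this name sit between (rest,) and (rest, +infinity).
    lo = bisect_left(cols, (rest,))
    hi = bisect_right(cols, (rest, float("inf")))
    return [user_id for _, user_id in cols[lo:hi]]

print(lookup("marcos"))  # [1, 2, 3]
print(lookup("martin"))  # [7]
```

In the real CF the slice would be a single get_slice over the composite range, so duplicate names never collide and no supercolumn is needed.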
Re: Apache Cassandra Hangout in Mumbai-Pune area (India)
Let's catch up. I am available in Mumbai, using C* in a dev environment. I would love to share or hear experiences.

On Fri, Nov 11, 2011 at 10:25 PM, Adi adi.pan...@gmail.com wrote: Hey GeekTalks / any other Cassandra users around Mumbai/Pune, I will be around Mumbai from the last week of November through the third week of December. I have actively used/deployed a couple of Cassandra clusters and a bunch of Hadoop projects over the past year. I am keenly interested in meeting any Cassandra/Hadoop users and sharing my experience. Do get in touch with me if any of you would like to host a meetup/user group meeting. -Adi

On Mon, Mar 21, 2011 at 9:02 AM, Geek Talks geektalks@gmail.com wrote: Hi, anyone interested in joining an Apache Cassandra hangout/meetup near the Mumbai-Pune area? Share/teach your experience with Apache Cassandra and the problems/issues you faced during deployment. Excited about its buzz and want to learn more about NoSQL Cassandra? Regards, GeekTalks
Re: Cassandra Certification
Does it really make sense? If yes, I think the Apache Cassandra project (ASF) should offer an open certification. Other entities can offer courses and training materials.
Re: 5 node cluster - Recommended seed configuration.
It is recommended that the seed list be the same on all servers, so that all servers share the same view of the cluster. It should be a LAN IP, not a loopback IP. On seed nodes, auto-bootstrap should be false. 2 seeds should be enough. In your case it should look like:

node1: seeds: node1, autobootstrap=false
node2: seeds: node1, autobootstrap=true
node3: seeds: node1, autobootstrap=true
node4: seeds: node1, autobootstrap=true
node5: seeds: node1, autobootstrap=true

or

node1: seeds: node1,node2, autobootstrap=false
node2: seeds: node1,node2, autobootstrap=false (set it false after bootstrap)
node3: seeds: node1,node2, autobootstrap=true
node4: seeds: node1,node2, autobootstrap=true
node5: seeds: node1,node2, autobootstrap=true

/Samal

On Tue, Aug 9, 2011 at 9:16 AM, Selva Kumar wwgse...@yahoo.com wrote: We have a 5-node Cassandra cluster running version 0.7.4. What is the recommended seed configuration? Here are some configurations I have noticed.

Example 1: one node being seed to itself.
node1: seeds: node1, autobootstrap=false
node2: seeds: node1, node3, autobootstrap=true
node3: seeds: node2, node4, autobootstrap=true
node4: seeds: node3, node5, autobootstrap=true
node5: seeds: node1, node2, autobootstrap=true

Example 2:
node1: seeds: node5, node2, autobootstrap=true
node2: seeds: node1, node3, autobootstrap=true
node3: seeds: node2, node4, autobootstrap=true
node4: seeds: node3, node5, autobootstrap=true
node5: seeds: node1, node2, autobootstrap=true

Thanks, Selva
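For reference, a sketch of how the single-seed layout above might look in cassandra.yaml. This assumes 0.7-era syntax and placeholder hostnames; newer releases moved the seed list under a seed_provider section, so check the config file shipped with your version.

```yaml
# cassandra.yaml sketch (0.7-style; "node1" is a placeholder hostname).
# The same seed list goes on every node in the cluster:
seeds:
    - node1

# false on node1 itself (the seed); true on node2..node5 so they
# stream their data from the ring when they first join:
auto_bootstrap: true
```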
Re: Sample Cassandra project in Tomcat
I don't know much about this, but these may help you:
http://www.codefreun.de/apolloUI/
http://www.codefreun.de/apollo/

On Wed, Aug 3, 2011 at 3:36 PM, CASSANDRA learner cassandralear...@gmail.com wrote: Hi, can anyone please send me a sample application (.war) implemented in Java/JSP with Cassandra as the DB, deployable on Tomcat?
Re: Installation Exception
Did you compile the source code? :) You have downloaded the source distribution, not the binary. Try the binary instead.

On Wed, Aug 3, 2011 at 9:14 PM, Eldad Yamin elda...@gmail.com wrote: Hi, I'm trying to install Cassandra on Amazon EC2 without success. This is what I did:
1. Created a new small EC2 instance (just for testing) running Ubuntu, custom AMI (ami-596f3c1c) from: http://uec-images.ubuntu.com/releases/11.04/release/
2. Installed Java:
# sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
# sudo apt-get update
# sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts openjdk-6-jre
3. Upgraded:
# sudo apt-get upgrade
4. Downloaded Cassandra:
# cd /usr/src/
# sudo wget http://apache.mivzakim.net//cassandra/0.8.2/apache-cassandra-0.8.2-src.tar.gz
# sudo tar xvfz apache-cassandra-*
# cd apache-cassandra-*
5. Configured (according to README.txt):
# sudo mkdir -p /var/log/cassandra
# sudo chown -R `whoami` /var/log/cassandra
# sudo mkdir -p /var/lib/cassandra
# sudo chown -R `whoami` /var/lib/cassandra
6. Ran Cassandra:
# bin/cassandra -f
Then I got an exception:
ubuntu@ip-10-170-31-128:/usr/src/apache-cassandra-0.8.2-src$ bin/cassandra -f
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/thrift/CassandraDaemon
Caused by: java.lang.ClassNotFoundException: org.apache.cassandra.thrift.CassandraDaemon
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.cassandra.thrift.CassandraDaemon. Program will exit.
Any idea what is wrong? Thanks!
Re: Nodetool ring not showing all nodes in cluster
ERROR 08:53:47,678 Internal error processing batch_mutate java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (1)

You already answered it yourself: "It always keeps showing only one node and mentions that it is handling 100% of the load." The cluster thinks only one node is present in the ring, so it cannot satisfy RF=3; it effectively expects RF=1.

Original question: I am not exactly sure what the problem is, but does nodetool ring show all the hosts? What is your seed list? Does the bootstrapped node have its own IP as a seed? AFAIK gossip works even without actively joining a ring.

On Tue, Aug 2, 2011 at 7:21 AM, Aishwarya Venkataraman cyberai...@gmail.com wrote: Replies inline. Thanks, Aishwarya

On Tue, Aug 2, 2011 at 7:12 AM, Sorin Julean sorin.jul...@gmail.com wrote: Hi, until someone answers with more details, a few questions:
1. Did you move the system keyspace as well? Yes, but I deleted the LocationInfo* files under the system folder. Shall I go ahead and delete the entire system folder?
2. Are the gossip IPs of the new nodes the same as the old ones? No, the IPs are different.
3. Which Cassandra version are you running? I am using 0.8.1.
If 1 is yes and 2 is no, for a quick fix: take down the cluster, remove the system keyspace, bring the cluster up, and bootstrap the nodes. Kind regards, Sorin

On Tue, Aug 2, 2011 at 2:53 PM, Aishwarya Venkataraman cyberai...@gmail.com wrote: Hello, I recently migrated 400 GB of data from a different Cassandra cluster (3 nodes with RF=3) to a new cluster. I have a 3-node cluster with replication factor set to three. When I run nodetool ring, it does not show me all the nodes in the cluster. It always keeps showing only one node and mentions that it is handling 100% of the load. But when I look at the logs, the nodes are able to talk to each other via the gossip protocol. Why does this happen? Can you tell me what I am doing wrong? Thanks, Aishwarya
Re: Read process
From the ROW CACHE (if enabled), then the KEY CACHE, then the MEMTABLE, then the SSTABLEs.

On Wed, Jul 27, 2011 at 1:19 PM, CASSANDRA learner cassandralear...@gmail.com wrote: Hi, I have a doubt regarding reads. Data is stored in the commitlog, memtables, and SSTables, right? While reading, the data may be available in all three, so where does the read happen: the commit log, the memtable, or the SSTables? Please explain, friends. Thanks.
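A toy model of that lookup order, with plain dicts standing in for the real structures. This is a simplification: real Cassandra merges the memtable with the SSTables column by column, and the key cache holds index positions rather than values, so it only shortcuts the disk lookup.

```python
# Toy read path: row cache first, then memtable, then SSTables.
row_cache = {}                                  # key -> cached row
memtable  = {"k1": "fresh-value"}               # most recent writes
sstables  = [{"k1": "old-value", "k2": "v2"}]   # on-disk tables, newest first

def read(key):
    if key in row_cache:        # 1. row cache: whole row already in memory
        return row_cache[key]
    if key in memtable:         # 2. memtable: newest data wins
        value = memtable[key]
    else:                       # 3. scan SSTables, newest first
        value = next((t[key] for t in sstables if key in t), None)
    if value is not None:
        row_cache[key] = value  # populate the cache for the next read
    return value

print(read("k1"))  # 'fresh-value' (memtable shadows the older SSTable copy)
print(read("k2"))  # 'v2' (served from the SSTable, now cached)
```

The commitlog never serves reads; it exists only to rebuild memtables after a crash.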
Re: Cassandra training in Bangalore, India
As far as I know, there is no such expert training available in India as of now. As Sameer said, there is enough online material available from which you can learn. I have been playing with Cassandra since the beginning. We can plan a meetup/learning session near the Mumbai/Pune region.
Re: Memtables stored in which location
SSTables are stored on disk, not memtables. A memtable is the in-memory representation of data, which is flushed to create an SSTable on disk. This is the location where SSTables are stored: https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L71. The commitlog, which is the backup (log) used to replay memtables, is stored in the location at https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L75. Once all memtables are flushed to disk, a new commitlog segment is created.

On Thu, Jul 21, 2011 at 1:12 PM, Abdul Haq Shaik abdulsk.cassan...@gmail.com wrote: Hi, can you please let me know where exactly the memtables are stored? I wanted to know the physical location.
Re: Memtables stored in which location
Anyway, somewhere the memtable has to be stored, right? Like we say memtable data is flushed to create an SSTable on disk. From exactly which location or memory does it come? Is it object streams, or is it storing the values in the commitlog?

A memtable is Cassandra's in-memory representation of key/value pairs.

My next question is: data is written to the commit log, all the data is available there, and the SSTables are created on disk, so where and when do these memtables come into the picture?

The commitlog is an append-only file that records writes sequentially (more in [2]); it can be thought of as a recovery log used to rebuild the memtables in case of a crash. A write first hits the commitlog, then Cassandra stores/writes values to in-memory data structures called memtables. The memtables are flushed to disk whenever one of the configurable thresholds is met [3]. For each column family there is a corresponding memtable; there is generally one commitlog for all CFs. SSTables are immutable: once written to disk they cannot be modified, and an SSTable is only replaced by a new one after compaction.

[1] http://wiki.apache.org/cassandra/ArchitectureOverview
[2] http://wiki.apache.org/cassandra/ArchitectureCommitLog
[3] http://wiki.apache.org/cassandra/MemtableThresholds
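A toy sketch of the write path described above, with plain Python structures standing in for the real ones. The flush threshold here is a made-up number; real thresholds are based on memtable size, op counts, and elapsed time.

```python
# Toy write path: commitlog append -> per-CF memtable -> flush to SSTable.
FLUSH_THRESHOLD = 3   # made-up: flush after this many keys in a memtable

commitlog = []        # one sequential log shared by all column families
memtables = {}        # cf_name -> {key: value}, one memtable per CF
sstables  = {}        # cf_name -> list of immutable flushed tables

def write(cf, key, value):
    commitlog.append((cf, key, value))          # 1. durable sequential append
    memtables.setdefault(cf, {})[key] = value   # 2. update the in-memory table
    if len(memtables[cf]) >= FLUSH_THRESHOLD:   # 3. threshold hit: flush
        sstables.setdefault(cf, []).append(dict(memtables[cf]))
        memtables[cf] = {}                      # fresh memtable after flush

for i in range(4):
    write("users", f"k{i}", i)

print(len(sstables["users"]))   # 1 SSTable flushed (k0..k2)
print(memtables["users"])       # {'k3': 3} still only in memory
```

Once every memtable covered by a commitlog segment has been flushed, that segment can be discarded, which is why the log stays small in steady state.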
Re: Range query ordering with CQL JDBC
I haven't used the CQL functionality much, but with a Thrift client I think I encountered exactly this problem. If you want to query over keys in order, you can index the keys in another CF, get the column names (which are the keys of the first CF), and then query the actual CF with those keys.

"switch away from the random partitioner": switching away is not a good choice; RP is very good for load distribution.
Re: gossiper problem
Well, I am not a JVM guru, but it seems the server has a memory problem.

13 10:44:57,748 Gossiper.java (line 579) InetAddress /10.63.61.74 is now UP
INFO [Timer-0] 2011-07-13 15:56:44,630 Gossiper.java (line 181) InetAddress /10.63.61.74 is now dead.
INFO [GMFD:1] 2011-07-13 15:56:44,653 Gossiper.java (line 579) InetAddress /10.63.61.74 is now UP
INFO [Timer-0] 2011-07-13 16:03:24,391 Gossiper.java (line 181) InetAddress /10.63.61.72 is now dead.

It is swapping due to memory pressure. Recommended: disable swap; it is better to die with an OOM than to keep swapping.

INFO [GC inspection] 2011-07-13 03:12:06,153 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 1097 ms, 371528920 reclaimed leaving 17677528 used; max is 118784
INFO [GC inspection] 2011-07-13 03:12:07,351 GCInspector.java (line 110) GC for ParNew: 466 ms, 20619976 reclaimed leaving 157240232 used; max is 118784
INFO [GC inspection] 2011-07-13 03:25:54,378 GCInspector.java (line 110) GC for ParNew: 283 ms, 26850072 reclaimed leaving 154180424 used; max is 118784
INFO [GC inspection] 2011-07-13 06:29:58,092 GCInspector.java (line 110) GC for ParNew: 538 ms, 17358792 reclaimed leaving

My Cassandra version is 0.6.3, and the GC configuration in storage-conf.xml is <GCGraceSeconds>864000</GCGraceSeconds>. The JVM configuration is as follows:
JVM_OPTS=" \
  -ea \
  -Xms256M \
  -Xmx1G \
  -XX:+UseParNewGC \
Can I decrease the JVM_OPTS to -Xms128M -Xmx512M to avoid swap? The data saved in Cassandra is small; I do not need so much memory.

Reducing the max heap size won't solve the problem; I think it will cause more swapping. Data size alone does not determine the memory requirement: the number of memtables matters (each CF has a separate memtable with its own size), plus compaction, caching, and reads. You should upgrade to 0.7 or later. /samal
Re: Key_Cache @ Row_Cache
Can you give me an idea of how key_cache and row_cache affect Cassandra's performance? How do these work in different scenarios depending on the data size?

While reading, if the row cache is enabled, Cassandra checks the row cache first, then the key cache, then the memtable and disk. The row cache stores whole rows in memory and needs tuning; a lower value is generally preferred. The key cache stores only the key and the location of the row on disk; a higher value is preferred. If a row is frequently read, it is good to cache it, but row size matters: large rows can eat too much memory.

Also this may help: http://www.datastax.com/docs/0.8/operations/cache_tuning#configuring-key-and-row-caches

/Samal
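For example, in the 0.7/0.8-era cassandra-cli, the per-CF caches are sized roughly like this. The CF name and the values are made up for illustration; an absolute number caches that many entries, while a value of 1.0 or less is interpreted as a fraction of the CF.

```
update column family Users with keys_cached = 200000 and rows_cached = 1000;
```

Start with a small rows_cached value and watch the cache hit rate before growing it, since every cached row costs heap in proportion to its size.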
Re: One node down but it thinks its fine...
Check that the seed IP is the same on all nodes and that it is not a loopback IP anywhere in the cluster.

On Wed, Jul 13, 2011 at 8:40 PM, Ray Slakinski ray.slakin...@gmail.com wrote: One of our nodes, which happens to be the seed, thinks it's up and all the other nodes are down. However, all the other nodes think the seed is down instead. The logs for the seed node show everything running as it should. I've tried restarting the node and turning gossip and thrift on/off, and nothing gets the node to see the rest of its ring as up and running. I have also tried restarting one of the other nodes, which had no effect on the situation. Below are the ring outputs for the seed and one other node in the ring, plus a ping to show that the seed can reach the other node.

# bin/nodetool -h 0.0.0.0 ring
Address         Status  State   Load      Owns    Token
                                                  141784319550391026443072753096570088105
127.0.0.1       Up      Normal  4.61 GB   16.67%  0
xx.xxx.30.210   Down    Normal  ?         16.67%  28356863910078205288614550619314017621
xx.xx.90.87     Down    Normal  ?         16.67%  56713727820156410577229101238628035242
xx.xx.22.236    Down    Normal  ?         16.67%  85070591730234615865843651857942052863
xx.xx.97.96     Down    Normal  ?         16.67%  113427455640312821154458202477256070484
xx.xxx.17.122   Down    Normal  ?         16.67%  141784319550391026443072753096570088105

# ping xx.xxx.30.210
PING xx.xxx.30.210 (xx.xxx.30.210) 56(84) bytes of data.
64 bytes from xx.xxx.30.210: icmp_req=1 ttl=61 time=0.299 ms
64 bytes from xx.xxx.30.210: icmp_req=2 ttl=61 time=0.287 ms
^C
--- xx.xxx.30.210 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.287/0.293/0.299/0.006 ms

# bin/nodetool -h xx.xxx.30.210 ring
Address         Status  State   Load      Owns    Token
                                                  141784319550391026443072753096570088105
xx.xxx.23.40    Down    Normal  ?         16.67%  0
xx.xxx.30.210   Up      Normal  10.58 GB  16.67%  28356863910078205288614550619314017621
xx.xx.90.87     Up      Normal  10.47 GB  16.67%  56713727820156410577229101238628035242
xx.xx.22.236    Up      Normal  9.63 GB   16.67%  85070591730234615865843651857942052863
xx.xx.97.96     Up      Normal  10.68 GB  16.67%  113427455640312821154458202477256070484
xx.xxx.17.122   Up      Normal  10.18 GB  16.67%  141784319550391026443072753096570088105

-- Ray Slakinski
Re: CQL + Counters = bad request
cqlsh> UPDATE RouterAggWeekly SET 1310367600 = 1310367600 + 17 WHERE KEY = '1_20110728_ifoutmulticastpkts';
Bad Request: line 1:51 no viable alternative at character '+'

I am able to insert it:

cqlsh> UPDATE counts SET 1310367600 = 1310367600 + 17 WHERE KEY = '1_20110728_ifoutmulticastpkts';
cqlsh> UPDATE counts SET 1310367600 = 1310367600 + 17 WHERE KEY = '1_20110728_ifoutmulticastpkts';

[default@test] list counts;
Using default limit of 100
-------------------
RowKey: 1_20110728_ifoutmulticastpkts
=> (counter=12, value=16)
=> (counter=1310367600, value=34)
-------------------
RowKey: 1
=> (counter=1, value=10)

2 Rows Returned.
[default@test]
Re: Storing counters in the standard column families along with non-counter columns ?
Yes, maybe. The current version, 0.8.2, needs the specific validation class CounterColumn for a counter CF, which only counts [+, -, never replaces], whereas a normal CF simply adds or replaces.

On Sun, Jul 10, 2011 at 10:39 PM, Aditya Narayan ady...@gmail.com wrote: Thanks for the info. Is there any target version in the near future for which this has been promised?

On Sun, Jul 10, 2011 at 9:12 PM, Sasha Dolgy sdo...@gmail.com wrote: No, it's not possible. To achieve it, there are two options: contribute to the issue, or wait for it to be resolved: https://issues.apache.org/jira/browse/CASSANDRA-2614 -sd

On Sun, Jul 10, 2011 at 5:04 PM, Aditya Narayan ady...@gmail.com wrote: Is it now possible to store counters in standard column families along with non-counter columns? How can this be achieved?
Re: 4k keyspaces... Maybe we're doing it wrong?
Lots of memtables means lots of SSTables, which means lots of disk I/O.

On 9/7/10, Benjamin Black b...@b3k.us wrote: On Mon, Sep 6, 2010 at 12:41 AM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: So if I read this right, using lots of CFs is also a Bad Idea(tm)? Yes: lots of memtables is bad, which means lots of CFs is also bad. -- Sent from my mobile device
Re: servers for cassandra
As of now, I think only rackspace.com supports Cassandra in their cloud web hosting, which will cost around $150 to $200 a month. There is no cheap option with Cassandra because data is distributed across multiple servers. I advise you to test on your LAN only; you can do benchmark testing to simulate real conditions. I use 64-bit Linux (Ubuntu) with 4 GB RAM, which is more than sufficient to play around.

Samal Gorai

On Sat, Sep 4, 2010 at 12:05 PM, vineet daniel vineetdan...@gmail.com wrote: Hi, I am just curious to know if there is any hosting company that provides servers at a very low cost, where I can install Cassandra on a WAN. I have a Cassandra setup on my LAN and want to test it in real conditions; taking dedicated servers just for testing purposes is not at all feasible for me, not even the pay-as-you-go type. I'd really appreciate it if anybody could share information on such hosting providers. Vineet Daniel, Cell: +918106217121, Websites: Blog http://vinetedaniel.blogspot.com | Linkedin http://in.linkedin.com/in/vineetdaniel | Twitter https://twitter.com/vineetdaniel
Re: Riptano Cassandra training in Denver
That would be great. Samal Gorai

On Thu, Sep 2, 2010 at 10:46 AM, vineet daniel vineetdan...@gmail.com wrote: Hi Jonathan, any plans of coming to India in the future? Regards, Vineet Daniel +918106217121

On Thu, Sep 2, 2010 at 1:52 AM, Jonathan Ellis jbel...@gmail.com wrote: Riptano is going to be in Denver next Friday (Sept 10) for a full-day Cassandra training (taught by yours truly). The training is broken into two parts: the first covers application design and modeling in Cassandra, with exercises using the Pycassa library; the second covers operations, troubleshooting, and performance tuning. For more details or to register for the training, see http://www.eventbrite.com/event/756085472 -- Jonathan Ellis, Project Chair, Apache Cassandra; co-founder of Riptano, the source for professional Cassandra support; http://riptano.com