RE: Read latency issue
Hi Arindam,

There were some changes in CQL3 to how composite keys are stored, and you may be using CQL2 by default. You could try a non-composite key, or supply all the components of the key in the search, and see if you get different results.

Regards,
roshni

From: aba...@247-inc.com
To: user@cassandra.apache.org
Subject: RE: Read latency issue
Date: Wed, 3 Oct 2012 17:53:46 +

Thanks for your responses. Just to be clear, our table declaration looks something like this:

CREATE TABLE sessionevents (
    atag text,
    col2 uuid,
    col3 text,
    col4 uuid,
    col5 text,
    col6 text,
    col7 blob,
    col8 text,
    col9 timestamp,
    col10 uuid,
    col11 int,
    col12 uuid,
    PRIMARY KEY (atag, col2, col3, col4)
)

My understanding was that the (full) row key in this case would be the 'atag' value. The column names would then be composites like (col2_value:col3_value:col4_value:col5), (col2_value:col3_value:col4_value:col6), (col2_value:col3_value:col4_value:col7) ... (col2_value:col3_value:col4_value:col12). The columns would be sorted first by col2 values, then by col3 values, etc. Hence with a query like select * from sessionevents where atag=foo, we are specifying the entire row key, and Cassandra would return all the columns for that row.

>> Using read consistency of ONE reduces the read latency by ~20ms, compared to using QUORUM.
> It would only have read from the local node. (I think, may be confusing secondary index reads here).

For read consistency ONE, reading only from one node is my expectation as well, and hence I'm seeing the reduced read latency compared to read consistency QUORUM. Does that not sound right?

Btw, with read consistency ONE, we found the read only happens from one node, but not necessarily the local node, even if the data is present on the local node. To check this, we turned on DEBUG logs on all the Cassandra hosts in the ring. We are using replication factor=3 on a 4 node ring, hence the data is mostly present locally.
However, we noticed that the coordinator host, on receiving the same request multiple times (i.e. with the same row key), would sometimes return the data locally, but sometimes would contact another host in the ring to fetch the data.

Thanks,
Arindam

-----Original Message-----
From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, October 03, 2012 12:32 AM
To: user@cassandra.apache.org
Subject: Re: Read latency issue

> Running a query like select * from table_name where atag=foo, where 'atag' is the first column of the composite key, from either JDBC or Hector (equivalent code), results in read times of 200-300ms from a remote host on the same network.

If you send a query to select columns from a row and do not fully specify the row key, Cassandra has to do a row scan. If you want fast performance, specify the full row key.

> Using read consistency of ONE reduces the read latency by ~20ms, compared to using QUORUM.

It would only have read from the local node. (I think, may be confusing secondary index reads here).

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/10/2012, at 2:17 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Arindam,

Did you also try the cassandra stress tool and compare results? I haven't done a performance test as yet. The only ones published on the internet are of YCSB on an older version of Apache Cassandra, and it doesn't seem to be actively supported or updated: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf. The numbers you have sound very low for a read of a row by key, which should have been the fastest. I hope someone can help investigate or share numbers from their tests.

Regards,
Roshni

From: dean.hil...@nrel.gov
To: user@cassandra.apache.org
Date: Tue, 2 Oct 2012 06:41:09 -0600
Subject: Re: Read latency issue

Interesting results. With PlayOrm, we did a 6 node test of reading 100 rows from 1,000,000 using PlayOrm Scalable SQL. It only took 60ms. Maybe we have better hardware though???
We are using 7200 RPM drives, so nothing fancy on the disk side of things. More nodes give you higher throughput, though, as reading from more disks will be faster. Anyways, you may want to play with more nodes and re-run. If you run a test with PlayOrm, I would love to know the results there as well.

Later,
Dean

From: Arindam Barua aba...@247-inc.com
Reply-To: user@cassandra.apache.org
Date: Monday, October 1, 2012 4:57 PM
To: user@cassandra.apache.org
Subject: Read latency issue

Running a query like
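
For readers following the thread, Arindam's description of how the sessionevents table maps to storage can be sketched in a few lines of Python. This is only an in-memory model of the idea (row key `atag`, composite column names built from the col2/col3/col4 values plus the CQL column name). The helper name `to_storage_cells` and the sample values are made up for illustration, and nothing here is taken from Cassandra's own code.

```python
from collections import defaultdict

# Primary key from the thread: PRIMARY KEY (atag, col2, col3, col4)
PARTITION_KEY = "atag"
CLUSTERING = ("col2", "col3", "col4")


def to_storage_cells(cql_rows):
    """Map CQL rows to {row_key: sorted [(composite_name, value), ...]}.

    Hypothetical helper: each non-key column becomes one cell whose name is
    (col2_value, col3_value, col4_value, cql_column_name).
    """
    partitions = defaultdict(list)
    for row in cql_rows:
        prefix = tuple(row[c] for c in CLUSTERING)
        for name, value in row.items():
            if name == PARTITION_KEY or name in CLUSTERING:
                continue
            partitions[row[PARTITION_KEY]].append((prefix + (name,), value))
    # Cells within a row are kept sorted by their composite name.
    return {key: sorted(cells) for key, cells in partitions.items()}


# Made-up sample data; only two value columns are shown to keep it short.
rows = [
    {"atag": "foo", "col2": "u2", "col3": "b", "col4": "x", "col5": "v1", "col6": "v2"},
    {"atag": "foo", "col2": "u1", "col3": "a", "col4": "y", "col5": "v3", "col6": "v4"},
    {"atag": "bar", "col2": "u9", "col3": "z", "col4": "w", "col5": "v5", "col6": "v6"},
]

cells = to_storage_cells(rows)

# "select * from sessionevents where atag=foo" names the whole row key, so in
# this model it is a lookup of one wide row rather than a scan across rows.
for name, value in cells["foo"]:
    print(":".join(name), "=", value)
```

In this model both CQL rows with atag=foo end up in the same wide row, ordered by col2, then col3, then col4, which matches the sort order Arindam describes.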
RE: Data Modeling: Comments with Voting
Hi,

To explain my suggestions, my thoughts were:

a) You need to store entity-type information about a comment, like date created, comment text, commented by, etc. I can't think of any other master information for a comment, but in general one starts with entities in a standard static column family. If you store an entity in a dynamic denormalized form, then if any master data changes you would need to iterate across all rows and update it, which is expensive in Cassandra. Here the comment text is editable.

b) So when a comment is created it goes to the static column family. An entry is also made in the dynamic sort_by_time_list column family, with the time created as the column name. I didn't suggest that a and c be clubbed, so that master information remains in one place. The other approach would be to store a comment as JSON in the column value. However, if you need to update the comment text, it would be hard to identify the comment column and update it.

c) When a comment gets a vote, the counter column family is incremented to track the number of votes for the comment. Also, to sort by number of votes, after incrementing the counter you need to write the current number of votes and the comment id to column family d. But I see now that you also need to delete the old (number of votes, comment id) column and add a new column with the current number of votes and comment id. It would be sorted by number of votes.

If there are many ways to sort, it's better to do it in the application, to avoid having a new column family for each type of sort. However, I'm not certain which approach would perform better over time and volume. Sorting can be complex; see Aaron's blog post http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans/

I welcome any feedback on my suggestions.

From: aa...@thelastpickle.com
Subject: Re: Data Modeling: Comments with Voting
Date: Tue, 2 Oct 2012 10:39:42 +1300
To: user@cassandra.apache.org

You cannot (and probably do not want to) sort continually when the voting is going on.
You can store the votes using CounterColumnTypes in column values. When someone votes, you then (somehow) queue a job that will read the vote counts for the post / comment, pivot and sort on the vote count, and then write the updated leader board to Cassandra. Alternatively, if you have a small number of comments for a post, just read all the votes and sort them as part of the read.

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 30/09/2012, at 8:25 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks Roshni,

I'm not sure how #d will work when users are actually voting on a comment. What happens when two users vote on the same comment simultaneously? How do you update the entries in the #d column family to prevent duplicates? Also, #a and #c can be combined by using TimeUUIDs as comment ids.

- Drew

On Sep 27, 2012, at 2:13 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Hi Drew,

I think you have 4 requirements. Here are my suggestions.

a) Store comments: have a static column family for comments with master data like created date, created by, length, etc.
b) When a person votes for a comment, increment a vote counter: have a counter column family for incrementing the votes for each comment.
c) Display comments sorted by date created: have a column family with a dummy row id 'sort_by_time_list'; column names can be the date created (TimeUUID), and the column value can be the comment id.
d) Display comments sorted by number of votes: have a column family with a dummy row id 'sort_by_votes_list'; column names can be a composite of number of votes and comment id (as more than 1 comment can have the same number of votes).

Regards,
Roshni

Date: Wed, 26 Sep 2012 17:36:13 -0700
From: k...@mustardgrain.com
To: user@cassandra.apache.org
CC: d...@venarc.com
Subject: Re: Data Modeling: Comments with Voting

Depending on your needs, you could simply duplicate the comments in two separate CFs, with the column names including the time in one and the vote in the other.
If you allow for updates to the comments, that would pose some issues you'd need to solve at the app level.

On 9/26/12 4:28 PM, Drew Kutcharian wrote:

Hi Guys,

Wondering what would be the best way to model a flat (no sub-comments, i.e. Twitter) comments list with support for voting (where I can sort by create time or votes) in Cassandra?

To demonstrate:

Sorted by create time:
- comment 1 (5 votes)
- comment 2 (1 votes)
- comment 3 (no votes)
- comment 4 (10 votes)

Sorted by votes:
- comment 4 (10 votes)
- comment 1 (5 votes)
- comment 2 (1 votes)
- comment 3 (no votes)

It's the sorted-by-votes that I'm having a bit of trouble with. I'm looking for a roll-your-own approach and prefer not to use secondary indexes and CQL sorting.

Thanks,
Drew
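
The four column families Roshni lists (a to d), and the "delete the old votes column, add the new one" step from her follow-up, can be sketched with a small in-memory model. This is an illustration only: the class and method names are invented here, plain Python containers stand in for column families, and integers stand in for TimeUUIDs. It ignores concurrency entirely, which is exactly the gap Drew points out.

```python
import bisect


class CommentStore:
    """In-memory stand-in for the four column families (a-d) in the thread."""

    def __init__(self):
        self.comments = {}      # a) static CF: comment_id -> master data
        self.vote_counts = {}   # b) counter CF: comment_id -> votes
        self.by_time = []       # c) 'sort_by_time_list': sorted (created, comment_id)
        self.by_votes = []      # d) 'sort_by_votes_list': sorted (votes, comment_id)

    def add_comment(self, comment_id, created, text):
        self.comments[comment_id] = {"created": created, "text": text}
        self.vote_counts[comment_id] = 0
        bisect.insort(self.by_time, (created, comment_id))
        bisect.insort(self.by_votes, (0, comment_id))

    def vote(self, comment_id):
        old = self.vote_counts[comment_id]
        new = old + 1
        self.vote_counts[comment_id] = new
        # Delete the old (votes, id) column, then add the new one. In a real
        # cluster these are separate writes, so two simultaneous voters could
        # interleave here and leave duplicate or stale index columns.
        self.by_votes.remove((old, comment_id))
        bisect.insort(self.by_votes, (new, comment_id))

    def sorted_by_time(self):
        return [cid for _, cid in self.by_time]

    def sorted_by_votes(self):
        return [cid for _, cid in reversed(self.by_votes)]


# Drew's example: comments 1-4 with 5, 1, 0 and 10 votes.
store = CommentStore()
for i, votes in enumerate([5, 1, 0, 10], start=1):
    cid = f"comment {i}"
    store.add_comment(cid, created=i, text="...")
    for _ in range(votes):
        store.vote(cid)

print(store.sorted_by_time())
print(store.sorted_by_votes())
```

Aaron's alternative (keep only the counters and rebuild the leader board in a queued job, or sort at read time when the list is small) would drop the `by_votes` maintenance from `vote` altogether.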
RE: 1.1.5 Missing Insert! Strange Problem
By any chance is a TTL (time to live) set on the columns?

Date: Tue, 25 Sep 2012 19:56:19 -0700
Subject: 1.1.5 Missing Insert! Strange Problem
From: gouda...@gmail.com
To: user@cassandra.apache.org

Hi All,

I have a 4 node cluster set up in 2 zones with NetworkTopologyStrategy and strategy options for writing a copy to each zone, so the effective load on each machine is 50%.

Symptom:
I have a column family that has gc grace seconds of 10 days (the default). On the 17th there was an insert done to this column family, and from our application logs I can see that the client got a successful response back with write consistency of ONE. I can verify the existence of the key that was inserted in the commit logs of both replicas; however, it seems that this record was never inserted. I used list to get all the column family rows, which were about 800ish, and examined them to see if it could possibly have been deleted by our application. List should have shown them to me, since I have not gone beyond gc grace seconds if this record was deleted during the past days. I could not find it.

Things that happened:
During the same time as this insert was happening, I was performing a rolling upgrade of Cassandra from 1.1.3 to 1.1.5 by taking one node down at a time, performing the package upgrade, restarting the service and going to the next node. I could see from system.log that some mutations were replayed during those restarts, so I suppose the memtables were not flushed before restart. Could this procedure cause the row insert to disappear? How could I troubleshoot, as I am running out of ideas?

Your help is greatly appreciated.

Cheers,
Arya
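
Arya's reasoning about gc grace seconds is date arithmetic, which can be checked with a short sketch. It assumes that "the 17th" means 17 Sep 2012 (the post is dated 25 Sep 2012) and uses the 10 day gc grace period quoted in the message. The function name is made up for illustration, and the sketch says nothing about what `list` would actually display for a deleted row.

```python
from datetime import datetime, timedelta

GC_GRACE = timedelta(days=10)  # gc grace period quoted in the post


def tombstone_may_be_purged(deleted_at, now, gc_grace=GC_GRACE):
    """True once a tombstone written at `deleted_at` is older than gc_grace."""
    return now - deleted_at > gc_grace


# Assumption: "the 17th" is 17 Sep 2012; the post itself is dated 25 Sep 2012.
insert_time = datetime(2012, 9, 17)
checked_at = datetime(2012, 9, 25)

# Even a delete issued immediately after the insert would be only 8 days old,
# so under this assumption its tombstone would not yet be eligible for removal.
print(checked_at - insert_time)
print(tombstone_may_be_purged(insert_time, checked_at))
```

Roshni's TTL question is a different mechanism: a column written with a TTL expires on its own schedule regardless of gc grace, which is why it is worth ruling out first.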
RE: Cassandra Counters
Thanks for the reply, and sorry for being bull-headed.

Once you're past the stage where you've decided it's distributed, and NoSQL, and Cassandra out of all the NoSQL options, you can count something in different ways in Cassandra. In all of them you want to use Cassandra's best features: availability, tunable consistency, partition tolerance, etc.

Given this, what are the performance tradeoffs of using counters vs a standard column family for counting? Because as I see it, if the counter number in a counter column family becomes wrong, it will not be 'eventually consistent'; you will need intervention to correct it. So the key aspect is how much faster a counter column family would be, and at what numbers we start seeing a difference.

Date: Tue, 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.pet...@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be a little overkill.

I assume that distributed counting here was more of a map/reduce approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're doing some more complex counting (e.g. based on sets of rules) like that. Of course, that would perform _way_ slower than counting beforehand. On the other side, you will always have a consistent result for a consistent dataset.

On the other hand, if you use things like AMQP or Storm (sorry to put my sentence together like that, as the tools are mostly either orthogonal or complementary, but I hope you get my point), you could build a topology that makes fault-tolerant writes independently of your original write. Of course, it would still have a consistency tradeoff, mostly because of race conditions, different network latencies, etc.

So I would say that building a data model in a distributed system often depends more on your problem than on the common patterns, because everything has a tradeoff.

Want to have an immediate result?
Modify your counter while writing the row.
Can sacrifice speed, but want more counting opportunities? Go with offline distributed counting.
Want to have kind of both? Dispatch a message and react upon it, having the processing logic and writes decoupled from the main application, allowing you to care less about speed.

However, I may have missed the point somewhere (early morning, you know), so I may be wrong in any given statement.

Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Thanks Milind,

Has anyone implemented counting in a standard column family in Cassandra, where you can have increments and decrements to the count? Are there any comparisons in performance to using counter column families?

Regards,
Roshni

Date: Mon, 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindpar...@gmail.com
To: user@cassandra.apache.org

IMO you would use Cassandra counters (or another variation of distributed counting) once you have determined that a centralized version of counting is not going to work. You'd determine the non-feasibility of centralized counting by figuring the speed at which you need to sustain writes and reads, and reconciling that with your hard disk seek times (essentially).

Once you have proved that you can't do centralized counting, the second layer of the arsenal comes into play, which is distributed counting. In distributed counting, the CAP theorem comes into play. In Cassandra, Availability and Partition tolerance trump Consistency. So yes, you sacrifice strong consistency for availability and partition tolerance, and get eventual consistency.

On Sep 24, 2012 10:28 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Hi folks,

I looked at my mail below, and I'm rambling a bit, so I'll try to re-state my queries pointwise.

a) What are the performance tradeoffs on reads and writes between creating a standard column family and manually doing the counts by a lookup on a key, versus using counters?
b) What's the current state of counters' limitations in the latest version of Apache Cassandra?
c) With there being a possibility of counter values getting out of sync, would counters not be recommended where strong consistency is desired? The normal benefits of Cassandra's tunable consistency would not be applicable, as retries may cause overstating. So the normal use case is high performance, where consistency is not paramount.

Regards,
roshni

From: roshni_rajago...@hotmail.com
To: user@cassandra.apache.org
Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530

Hi,

I'm trying to understand if counters are a good fit for my use case. I've watched http://blip.tv/datastax/counters-in-cassandra-5497678 many times over now... and still need help!

Suppose I have a list of items, to which I can add or delete a set of items at a time, and I want a count of the items, without considering changing the database
RE: Cassandra Counters
Hi folks,

I looked at my mail below, and I'm rambling a bit, so I'll try to re-state my queries pointwise.

a) What are the performance tradeoffs on reads and writes between creating a standard column family and manually doing the counts by a lookup on a key, versus using counters?
b) What's the current state of counters' limitations in the latest version of Apache Cassandra?
c) With there being a possibility of counter values getting out of sync, would counters not be recommended where strong consistency is desired? The normal benefits of Cassandra's tunable consistency would not be applicable, as retries may cause overstating. So the normal use case is high performance, where consistency is not paramount.

Regards,
roshni

From: roshni_rajago...@hotmail.com
To: user@cassandra.apache.org
Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530

Hi,

I'm trying to understand if counters are a good fit for my use case. I've watched http://blip.tv/datastax/counters-in-cassandra-5497678 many times over now... and still need help!

Suppose I have a list of items, to which I can add or delete a set of items at a time, and I want a count of the items. Without considering changing the database or adding components like ZooKeeper, I have 2 options: the first is a counter column family, and the second is a standard one.

1. List_Counter_CF

             TotalItems
   ListId    50

2. List_Std_CF

             TimeUUID1  TimeUUID2  TimeUUID3  TimeUUID4  TimeUUID5
   ListId    3          70         -20        3          -6

In the second I can add a new column with every set of items added or deleted. Over time this row may grow wide. To display the final count, I'd need to read the row, slice through all the columns and add them up.

In both cases the writes should be fast; in fact the standard column family should be faster, as there's no read before write. And for a CL ONE write the latency should be the same.
For reads, the first option is very good: just read one column for a key. For the second, the read involves reading the row and adding each column value in application code. I don't think there's a way to do math via CQL yet. There should be no hot spotting if the key is sharded well. I could even maintain the count derived from List_Std_CF in a separate standard column family holding the final number, and do that as a separate process immediately after the write to List_Std_CF completes, so that it's not blocking. I understand Cassandra is faster for writes than reads, but how slow would reading by row key be? Is there any number around for how many columns it takes before performance starts deteriorating, or how much worse the performance would be?

The advantage I see is that I can use the same consistency rules as for the rest of the column families. If you use QUORUM for reads and writes, you get strongly consistent values. With counters, I see that when there are timeout exceptions because the first replica is down or not responding, there's a chance of the values getting messed up, and retrying can mess them up further. It's not idempotent like a standard column family design can be. If it gets messed up, it would need an administrator's help (is there a document on how we could resolve counter values going wrong?).

I believe the rest of the limitations still hold good. Has anything changed in recent versions?
In my opinion, they are not as major as the consistency question.
- removing a counter then modifying its value: behaviour is undetermined
- special process for counter column family sstable loss (need to remove all files)
- no TTL support
- no secondary indexes

In short, I can recommend counters for analytics, or when dealing with data where the exact numbers are not important, or when it's OK to take some time to fix the mismatch and the performance requirements are most important. However, where the numbers should match, it's better to use a standard column family and a manual implementation.

Please share your thoughts on this.

Regards,
roshni
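
The idempotency argument in the message above (a retried counter increment may overstate, while a retried write to a uniquely named column does not) can be sketched with two toy classes. Everything here is a simplified model: the class names and the `write_with_retry` helper are invented for illustration, `uuid.uuid1()` stands in for a TimeUUID column name, and the simulation assumes the worst case in which every timed-out write had in fact been applied.

```python
import uuid


class CounterRow:
    """Option 1: a single counter column; an increment carries no identity."""

    def __init__(self):
        self.total = 0

    def apply(self, delta):
        self.total += delta


class DeltaRow:
    """Option 2: one column per change, named by a client-generated id."""

    def __init__(self):
        self.columns = {}

    def apply(self, column_name, delta):
        # Writing the same column name twice just overwrites the same value.
        self.columns[column_name] = delta

    def total(self):
        return sum(self.columns.values())


def write_with_retry(apply_once):
    """Simulate a write that is applied, times out at the client, and is retried."""
    apply_once()  # first attempt reached the replica, but the ack was lost
    apply_once()  # client retries after the timeout


deltas = [3, 70, -20, 3, -6]  # the List_Std_CF example from the message

counter = CounterRow()
std = DeltaRow()
for delta in deltas:
    name = uuid.uuid1()  # stands in for the TimeUUID column name
    write_with_retry(lambda: counter.apply(delta))
    write_with_retry(lambda: std.apply(name, delta))

print(counter.total)   # every retried increment was applied twice
print(std.total())     # retries overwrote the same columns
```

The deltas sum to 50, matching TotalItems in the List_Counter_CF example. The cost of the second option is the one the message already names: the read has to slice the whole row and add the values in application code.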
Data Model - Consistency question
Hi Folks,

In the relational world, if I needed to model a students-courses relationship, I might have created:
- a students master table
- a courses master table
- a bridge table, students-courses, which gives me the ids of students and the courses they are taking

This can answer both 'which students take course A' and 'which courses are taken by student B'.

In the Cassandra world, I might design it like this:
- a static student column family
- a static course column family
- a student-course column family with student id as key and a dynamic list of course ids, to answer 'which courses are taken by student B'
- a course-student column family with course id as key and a dynamic list of student ids, to answer 'which students take course A'

A screen which displays some student entity details as well as all the courses she is taking will need to refer to 2 column families.

Suppose an application inserts a new row in the student column family and a new row in the student-course column family. As transactions or consistency across column families are not guaranteed, there is a chance that the client learns from the student-course column family that a student is attending a course, while that student does not yet exist in the student column family.

If we use strong consistency from the reads + writes combination, will this scenario not occur? And if we don't, can this scenario occur?

Regards,
Roshni
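
The "strong consistency from the reads + writes combination" mentioned above is usually stated as R + W > RF: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. A minimal sketch of that rule follows; the function names are made up for illustration, and RF=3 is just an example value.

```python
def quorum(rf):
    """Number of replicas that make up a quorum for replication factor rf."""
    return rf // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, rf):
    """True when every read set must overlap every acknowledged write set."""
    return read_replicas + write_replicas > rf


RF = 3
print(quorum(RF))
print(read_sees_latest_write(quorum(RF), quorum(RF), RF))  # QUORUM + QUORUM
print(read_sees_latest_write(1, 1, RF))                    # ONE + ONE
print(read_sees_latest_write(1, RF, RF))                   # read ONE, write ALL
```

Note what the rule covers: one acknowledged write being visible to a later read of the same data. It does not by itself tie the student insert and the student-course insert together, so a reader arriving between the two writes could still see one without the other.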
Solr Use Cases
Hi,

I'm new to Solr, and I hear that Solr is a great tool for improving search performance. I'm unsure whether Solr or DSE Search is a must for all Cassandra deployments.

1. For performance: I thought Cassandra had great read and write performance. When should Solr be used? Taking the following use cases for Cassandra from the DataStax FAQ page, in which cases would Solr be useful, or would it be useful for all of them?
- Time series data management
- High-velocity device data ingestion and analysis
- Media streaming (e.g., music, movies)
- Social media input and analysis
- Online web retail (e.g., shopping carts, user transactions)
- Web log management / analysis
- Web click-stream analysis
- Real-time data analytics
- Online gaming (e.g., real-time messaging)
- Write-intensive transaction systems
- Buyer event analytics
- Risk analysis and management

2. What changes to Cassandra data modeling does Solr bring? We have some guidelines and best practices around Cassandra data modeling. Is Solr so powerful that it does not matter how data is modelled in Cassandra? Are there different best practices for Cassandra data modeling when Solr is in the picture? Is this something we should keep in mind while modeling for Cassandra today, so that the model works well with Solr in future?

3. Does Solr come with any drawbacks, such as not being real time?

I should read the manual, but it would be great if someone could explain at a high level.

Thank you!

Regards,
Roshni
Data Model
I want to learn how we can model a mix of static and dynamic columns in a column family.

Consider a course_students column family which gives a list of students for a course, with:

Row key - Course Id
Columns - Name, Teach_Nm, StudID1, StudID2, StudID3
Values  - Maths, Prof. Abc, 20, 21, 25

where 20, 21, 25 are the IDs of students.

We have fixed columns like Course Name and Teacher Name, and a dynamic number of columns like 'StudID1', 'StudID2', etc. My thought was that we could look for 'StudID' and get all the columns with the student ids in Hector. But the question was how we would determine the number for the column: to add StudID3 we need to read the row and identify that 2 students are there and this is the third one. So we can remove the number from the column name altogether and keep columns like Course Name, Teacher Name, Student:20, Student:21, Student:25, where the second part is the actual student id.

However, here we run into a second issue: we cannot have some columns in a composite format and some in another format when we use static column families; all columns would need to be in the format UTF8:integer. We may want to treat it as a composite column key rather than use a delimiter, to get sorting, to validate the types of the parts of the key, to avoid having to search for the delimiter and separate the 2 components manually, etc.

A third option is to put only data in the column names for students, like Course Name, Teacher Name, 20, 21, 25. It would then be difficult to tell that the columns named 20, 21, 25 actually stand for students; it's a bit unreadable.

I hope this is not confusing, and I would like to hear your thoughts on this. The question is: when you de-normalize and want to have some static info like a name together with a dynamic list, what's the best way to model this?

Regards,
Roshni
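
The second option in the message (column names like Student:20, fetched by looking for the 'Student:' prefix) can be sketched with a row whose column names are kept sorted, which is how a prefix lookup turns into a range slice. The `WideRow` class is an in-memory illustration written for this note, not Hector's API, and it uses plain string column names with a delimiter rather than a typed composite.

```python
import bisect


class WideRow:
    """A row whose columns are kept sorted by name, like a column family row."""

    def __init__(self):
        self._names = []
        self._values = {}

    def insert(self, name, value=""):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name < finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


course = WideRow()
course.insert("Name", "Maths")
course.insert("Teach_Nm", "Prof. Abc")
for student_id in (20, 21, 25):
    # No read is needed to pick a column name: the id is part of the name.
    course.insert(f"Student:{student_id}")

# ';' is the character after ':' in ASCII, so this range covers every name
# that starts with "Student:".
students = [name.split(":", 1)[1] for name, _ in course.slice("Student:", "Student;")]
print(students)
```

With string names the ids sort lexically (a hypothetical Student:100 would come before Student:20), and the delimiter has to be split by hand. Those are the drawbacks the message gives as reasons to prefer a typed composite column name.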
Re: Data Modelling Suggestions
Thank you Aaron and Guillermo,

I find composite columns very confusing :(

To reconfirm:
1. We can only search for a column range using the first component of the composite column.
2. After specifying a range for the first component, we cannot further filter on the second component.

I found this link http://doanduyhai.wordpress.com/2012/07/05/apache-cassandra-tricks-and-traps/ which seems to suggest filtering is possible by the second component in addition to the first, and I tried the same example but I couldn't get it to work.

Does anyone have an example? Suppose I have data like this in my column names:

Timestamp1:123, Timestamp2:456, Timestamp3:777, Timestamp4:654

Get the range of columns from (start) component1=timestamp1, component2=123 to (end) component1=timestamp3, component2=123. This should give me only one column.

I'm finding that only the first component is used. Is this understanding correct?

We see a lot of examples of time series modelling with TimeUUIDs as column names. But how does the updating or deletion of columns happen here; how are the columns found to know which ones to delete or modify? Does one always need a separate column family to handle updating/deletion for time series, or is it usually handled by setting a TTL for data outside the archival period, or does time series modelling usually not involve any manipulation of past records?

Regards,
Roshni

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Data Modelling Suggestions

> I was trying to find Hector examples where we search on the second component of a composite column, but I couldn't find any good one. I'm not sure if it's possible. If you do have any example, please share.

It's not.
When slicing columns you can only return one contiguous range. Anyway I would prefer storing the item-ids as column names in the main column family and having a second CF for the order-by-date query only with the pair timestamp_itemid. That way you can add later other query strategies without messing with how you store the item +1 Have the orders somewhere, and build a time ordered custom index to show them in order. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/08/2012, at 6:28 AM, Guillermo Winkler gwink...@inconcertcc.commailto:gwink...@inconcertcc.com wrote: I think you need another CF as index. user_itemid - timestamped column_name Otherwise you can't guess what's the timestamp to use in the column name. Anyway I would prefer storing the item-ids as column names in the main column family and having a second CF for the order-by-date query only with the pair timestamp_itemid. That way you can add later other query strategies without messing with how you store the item information. Maybe you can solve it with a secondary index by timestamp too. Guille On Thu, Aug 23, 2012 at 7:26 AM, Roshni Rajagopal roshni.rajago...@wal-mart.commailto:roshni.rajago...@wal-mart.com wrote: Hi, Need some help on a data modelling question. We're using Hector Datastax Enterprise 2.1. I want to associate a list of items for a user. It should be sorted on the time added. And items can be updated (quantity of the item can be changed), and items can be deleted. I can model it like this so that its denormalized and I get all my information in one go from one row, sorted by time added. I can use composite columns. Row key: User Id Column Name: TimeUUID:item ID: Item Name: Item Description: Item Price: Item Qty Column Value : Null Now, how do I handle manipulations 1. Add new item :Easy , just a new column 2. Add exiting item or modify qty: I want to get to the correct column to update . 
Can I search by second column in the composite column (equals condition) update the column name itself to reflect new TimeUUID and qty? Or would it be better to just add it as a new column and always use the latest column for an item in the application code and delete duplicates in the background. 3. Delete item: Can I search by second column in the composite column to find the correct column to delete? I was trying to find hector examples where we search for second column in a composite column, but I couldn't find any good one. Im not sure if its possible.…if you have any do have any example please share. Regards, Roshni This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential *** This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom
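The "one contiguous range" rule from this thread can be illustrated with a small sketch. This is plain Python over a sorted list of tuples, not Hector or Cassandra code; the integer timestamps 1 to 4 stand in for Timestamp1 to Timestamp4, and slice_columns is a hypothetical helper that assumes composites sort component by component.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite column names (timestamp, value), kept in the order
# a composite comparator would sort them within one row.
columns = sorted([
    (1, 123),
    (2, 456),
    (3, 777),
    (4, 654),
])

def slice_columns(cols, start, end):
    """Return the single contiguous run of columns with start <= name <= end."""
    lo = bisect_left(cols, start)
    hi = bisect_right(cols, end)
    return cols[lo:hi]

# Slicing from (1, 123) to (3, 123) returns everything that sorts between the
# two bounds, not only the columns whose second component equals 123.
result = slice_columns(columns, (1, 123), (3, 123))
print(result)  # [(1, 123), (2, 456)]

# An equality test on the second component has to be applied afterwards.
matching = [name for name in result if name[1] == 123]
print(matching)  # [(1, 123)]
```

Under these assumptions the slice returns two columns rather than one, which matches the observation that the second component only acts as a tie-breaker at the range boundaries.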
Data Modeling- another question
Hi,

Suppose I have a column family to associate a user with a dynamic list of items. I want to store 5-10 key pieces of information about each item; there are no specific sorting requirements.

I have two options:

A) Use composite columns

UserId1 : {
  itemid1:Name = Betty Crocker,
  itemid1:Descr = Cake,
  itemid1:Qty = 5,
  itemid2:Name = Nutella,
  itemid2:Descr = Choc spread,
  itemid2:Qty = 15
}

B) Use a JSON with the data

UserId1 : {
  itemid1 = {name: Betty Crocker, descr: Cake, Qty: 5},
  itemid2 = {name: Nutella, descr: Choc spread, Qty: 15}
}

Which do you suggest would be better?

Regards,
Roshni
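One way to compare the two options is to look at what a quantity change involves in each. The sketch below uses plain Python dictionaries as stand-ins for a row; it is not Cassandra or Hector code, and the names row_composite and row_json are made up for illustration.

```python
import json

# Option A: one column per (item id, attribute) composite name.
row_composite = {
    ("itemid1", "Name"): "Betty Crocker",
    ("itemid1", "Descr"): "Cake",
    ("itemid1", "Qty"): 5,
    ("itemid2", "Name"): "Nutella",
    ("itemid2", "Descr"): "Choc spread",
    ("itemid2", "Qty"): 15,
}

# Option B: one column per item, whose value is a JSON document.
row_json = {
    "itemid1": json.dumps({"name": "Betty Crocker", "descr": "Cake", "Qty": 5}),
    "itemid2": json.dumps({"name": "Nutella", "descr": "Choc spread", "Qty": 15}),
}

# Under option A, changing a quantity overwrites one small column.
row_composite[("itemid1", "Qty")] = 6

# Under option B, the whole document is read, modified and written back.
item = json.loads(row_json["itemid1"])
item["Qty"] = 6
row_json["itemid1"] = json.dumps(item)
```

In this simplified picture, option A lets each attribute be written independently, while option B keeps an item together as one value at the cost of a read before every partial update.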
Re: Secondary index partially created
What does List my_column_family in the CLI show on all the nodes? Perhaps the syntax you're using isn't correct?

You should be getting the same data on all the nodes irrespective of which node's CLI you use. The replication factor is for redundancy, to have copies of the data on different nodes to help if nodes go down. Even if you had a replication factor of 1 you should still get the same data on all nodes.

On 24/08/12 11:05 PM, Richard Crowley r...@rcrowley.org wrote:

On Thu, Aug 23, 2012 at 6:54 PM, Richard Crowley r...@rcrowley.org wrote:

I have a three-node cluster running Cassandra 1.0.10. In this cluster is a keyspace with RF=3. I *updated* a column family via Astyanax to add a column definition with an index on that column. Then I ran a backfill to populate the column in every row. Then I tried to query the index from Java and it failed, but so did cassandra-cli:

get my_column_family where my_column = 'my_value';

Two out of the three nodes are unable to query the new index and throw this error:

InvalidRequestException(why:No indexed columns present in index clause with operator EQ)

The third is able to query the new index happily but doesn't find any results, even when I expect it to.

This morning the one node that's able to query the index is also able to produce the expected results. I'm a dummy and didn't use science so I don't know if the `nodetool compact` I ran across the cluster had anything to do with it. Regardless, it did not change the situation in any other way.

`describe cluster;` in cassandra-cli confirms that all three nodes have the same schema and `show schema;` confirms that schema includes the new column definition and its index.

The my_column_family.my_index-hd-* files only exist on that one node that can query the index.

I ran `nodetool repair` on each node and waited for `nodetool compactionstats` to report zero pending tasks. Ditto for `nodetool compact`. The nodes that failed still fail. The node that succeeded still succeeds.

Can anyone shed some light? How do I convince it to let me query the index from any node? How do I get it to find results?

Thanks,
Richard
Data Modelling Suggestions
Hi,

Need some help on a data modelling question. We're using Hector and Datastax Enterprise 2.1.

I want to associate a list of items with a user. It should be sorted on the time added. Items can be updated (the quantity of an item can be changed), and items can be deleted. I can model it like this so that it's denormalized and I get all my information in one go from one row, sorted by time added. I can use composite columns.

Row key: User Id
Column Name: TimeUUID:Item ID:Item Name:Item Description:Item Price:Item Qty
Column Value: Null

Now, how do I handle manipulations?

1. Add new item: Easy, just a new column.
2. Add existing item or modify qty: I want to get to the correct column to update. Can I search by the second component of the composite column (equals condition) and update the column name itself to reflect the new TimeUUID and qty? Or would it be better to just add it as a new column, always use the latest column for an item in the application code, and delete duplicates in the background?
3. Delete item: Can I search by the second component of the composite column to find the correct column to delete?

I was trying to find Hector examples where we search for the second column in a composite column, but I couldn't find any good one. I'm not sure if it's possible. If you do have any example, please share.

Regards,
Roshni
Re: Decision Making- YCSB
Thanks Edward and Mohit.

We do have an in-house tool, but it tests pretty much the same thing as YCSB: read and write performance, given a number of threads and types of operations as input. The good thing here is that we own the code and can modify it easily. YCSB does not seem to be very well supported.

When you say you modify the tests for your use case, what exactly do you modify? Could you give me an example of a use-case-driven approach?

Regards,
Roshni

From: Mohit Anchlia mohitanch...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Decision Making- YCSB

I agree with Edward. We always develop our own stress tool that tests each use case of interest. Every use case is different in certain ways that can only be tested using a custom stress tool.

On Fri, Aug 10, 2012 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

There are many YCSB forks on GitHub that get optimized for specific databases, but the default one is decent across the defaults. Cassandra has its own internal stress tool that we like better.

The shortcomings are that generic tools and generic workloads are generic and thus not real-world. But other than that, being able to tweak the workload percentages and change the read patterns from latest/random/etc. does a decent job of stressing normal and worst-case scenarios on the read path. Still, I would try to build my own real-world use case as a tool to evaluate a solution before making a choice.

Edward

On Thu, Aug 9, 2012 at 8:58 PM, Roshni Rajagopal roshni.rajago...@wal-mart.com wrote:

Hi Folks,

I'm coming up with a set of decision criteria on when to choose a traditional RDBMS vs various NoSQL options. So one aspect is the application requirements around Consistency, Availability, Partition Tolerance, Scalability, Data Modeling etc. These can be decided at a theoretical level.

Once we are sure we need NoSQL, to effectively benchmark the performance around use cases or application workloads, we need a standard method. Some tools are specific to a database, like Cassandra's stress tool. The only tool I could find which seems to compare across NoSQL databases, can be extended, and is freely available is YCSB.

Is YCSB updated for the latest versions of Cassandra and HBase? Does it work for Datastax Enterprise? Is it regularly updated for new versions of NoSQL databases, or is this something we would need to take up as a development effort?

Are there any shortcomings to using YCSB, and would it be preferable to develop our own tool for performance benchmarking of NoSQL systems?

Do share your thoughts.

Regards,
Roshni
Decision Making- YCSB
Hi Folks,

I'm coming up with a set of decision criteria on when to choose a traditional RDBMS vs various NoSQL options. So one aspect is the application requirements around Consistency, Availability, Partition Tolerance, Scalability, Data Modeling etc. These can be decided at a theoretical level.

Once we are sure we need NoSQL, to effectively benchmark the performance around use cases or application workloads, we need a standard method. Some tools are specific to a database, like Cassandra's stress tool. The only tool I could find which seems to compare across NoSQL databases, can be extended, and is freely available is YCSB.

Is YCSB updated for the latest versions of Cassandra and HBase? Does it work for Datastax Enterprise? Is it regularly updated for new versions of NoSQL databases, or is this something we would need to take up as a development effort?

Are there any shortcomings to using YCSB, and would it be preferable to develop our own tool for performance benchmarking of NoSQL systems?

Do share your thoughts.

Regards,
Roshni
Re: Project Management
Hi Baskar,

The key aspect here is that you have to think of your queries and denormalize. Here are my suggestions based on my understanding so far.

You seem to have 2 queries:
A) What users do I have?
B) What organizations do the users belong to?

The first can be a static column family. These are similar to RDBMS 'master data', or 'dimensions' in the DWH world. So you can have a users_CF column family where the row key is the primary key, for example userid. If you use email id as the primary key, choose something which will never change (the natural key vs surrogate key debate).

The second query is where the real power of the data model comes in. You would not have a separate organizations table with a foreign key to the users table. You would have a column family, say Organizations_Users_CF, with a row key corresponding to your 'where clause' needs, here the organization name. You can then have a dynamic list of user names corresponding to each organization as column names. One organization can have 3 users (3 columns), another can have 10 (10 columns).

Note that a row would automatically be sorted by username when you retrieve it, because the comparator is BytesType by default, which works for text sorting. If you want some other sort criterion, such as the last time logged in, keep that as the column name and the username as the column value. Column names can also store useful information, like a value in itself. Sorting is a design-time decision.

I think there have been numerous posts advising against using secondary indexes, so try to keep the key of the column family as what you would be searching for, as far as possible. If you have a different query, you can create a new column family. It's OK to denormalize and have a separate column family per query.

Regards,
Roshni

On 06/08/12 9:42 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Cassandra modeling is well documented on the web and a bit too complex to be explained in one mail. I advise you to read a lot before you make modeling choices.

You may start with these links:

http://www.datastax.com/docs/1.1/ddl/about-data-model#comparing-the-cassandra-data-model-to-a-relational-database
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/

and this link seems interesting, but I haven't read it yet (about indexes):

http://www.anuff.com/2011/02/indexing-in-cassandra.html

I hope you'll find your answers within this documentation.

Alain

2012/8/6 Baskar Sikkayan baskar@gmail.com:

Hi,

Just wanted to learn Cassandra and trying to convert an RDBMS design to Cassandra. Consider that my app is being deployed in multiple data centers.

DB Design:

A) CF : USER
1) email_id - primary key
2) fullname
3) organization - (I didn't create a separate table for organization)

B) CF : ORG_USER
1) organization - primary key
2) email_id

Here, my intention is to get the users belonging to an organization. I could make the organization column in the user table a secondary index, but I heard that this may hurt performance. Could you please clarify which is the better approach?

Thanks,
Baskar.S
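The "one column family per query" advice in this thread can be sketched as follows. This is a plain Python model with dictionaries standing in for column families; it makes no Cassandra calls, and the function names and sample email addresses are made up for illustration.

```python
# Hypothetical in-memory stand-ins for the two column families.
users_cf = {}                # row key: email_id -> {column name: value}
organizations_users_cf = {}  # row key: organization -> {email_id: ""}

def add_user(email_id, fullname, organization):
    """Write the user row and the per-organization row in the same operation."""
    users_cf[email_id] = {"fullname": fullname, "organization": organization}
    organizations_users_cf.setdefault(organization, {})[email_id] = ""

def users_in(organization):
    """Answer 'which users belong to this organization' from a single row."""
    return sorted(organizations_users_cf.get(organization, {}))

add_user("bob@example.com", "Bob", "Acme")
add_user("alice@example.com", "Alice", "Acme")
add_user("carol@example.com", "Carol", "Initech")

print(users_in("Acme"))  # ['alice@example.com', 'bob@example.com']
```

The design choice shown here is that the application writes the same fact twice, once per query, so the organization lookup reads one row and needs no secondary index. Keeping the two structures in sync is then the application's job.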
Re: Changing comparator
Christof,

I am not convinced you need to change your comparator. BytesType works for most sorting, even of text. Did you mean the validator, which applies to a column's value? The comparator is for column ordering (ORDER BY in SQL).

I believe you can just convert the text you want to search for to bytes and then put it in the where clause:

"Bytes type should just be done via BytesTypeSerializer (a no-op really) as a value ... where value=raw bytes here"

Quoted from https://groups.google.com/forum/?fromgroups#!topic/hector-users/BpaemK95sPo

Disclaimer - I have not used this. It just seems unnecessary to convert your validator/comparator from the default BytesType. My understanding is that the comparator for the column family can be BytesType unless you want a specific ORDER BY, like TimeUUID. And validators can be left as BytesType unless you want some specific validation that the value you are storing is a number, a time, etc.

Regards,
Roshni

On 03/08/12 5:36 PM, Christof Roduner chris...@scandit.com wrote:

Hi Roshni,

Thanks for your reply. As far as I know, ASSUME is only for cqlsh and not for CQL in general. (We can of course achieve the same by programmatically setting the encoding. It would just be simpler to let the CQL driver take care of it...)

Regards,
Christof

On 8/3/2012 11:31 AM, Roshni Rajagopal wrote:

Christof, can't you just use ASSUME for the CQL session?

http://www.datastax.com/docs/1.0/references/cql/ASSUME

Regards,
Roshni

On 03/08/12 2:26 PM, Christof Roduner chris...@scandit.com wrote:

Hi,

I know that changing a CF's comparator is not officially supported. However, there is a post by Jonathan Ellis that implies it can be done (www.mail-archive.com/user@cassandra.apache.org/msg09502.html). I assume that we'd have to change entries in the system.schema_* column families. Has anyone successfully done this?

We want to change the comparator from BytesType to UTF8Type to make the move to CQL easier (cannot parse 'foo' as hex bytes). Our CFs were created back in the Cassandra 0.6.x days and are too large to be easily copied to new CFs with a new schema.

Many thanks in advance.

Christof
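The "cannot parse 'foo' as hex bytes" error mentioned in this thread suggests that a value typed as bytes is expected in hexadecimal form. As a sketch of the "programmatically setting the encoding" workaround, the helpers below convert between text and hex in plain Python. The function names are made up, and whether a given CQL driver accepts the hex string unquoted or needs further wrapping is an assumption to check against that driver.

```python
def text_to_hex(text):
    """Encode text as UTF-8 and render the resulting bytes as hex digits."""
    return text.encode("utf-8").hex()

def hex_to_text(hex_string):
    """Reverse of text_to_hex: parse the hex digits and decode them as UTF-8."""
    return bytes.fromhex(hex_string).decode("utf-8")

print(text_to_hex("foo"))     # 666f6f
print(hex_to_text("666f6f"))  # foo
```

Passing the literal string foo to a hex parser fails because o is not a hex digit, which is the same kind of failure the error message describes.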
Re: Unsuccessful attempt to add a second node to a ring.
Hi Jakub,

Were you able to resolve the issue? For a multi data center setup I do believe some steps are different. You may need to set NetworkTopologyStrategy as your replication strategy rather than SimpleStrategy, set up a snitch, and specify the rack/DC configuration in a config file. You can refer to the steps for a multi data center installation.

Regards,
Roshni

From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

I found a similar thread from March: http://www.mail-archive.com/user@cassandra.apache.org/msg21007.html

For me clearing the data and starting from the beginning didn't help. It's interesting, because on my dev environment I was able to add another node without any problems. The only difference is that the second node is now in a different data center (but I'm not using any different settings; SimpleSnitch). Ports 7000, 9160 and 7199 were open between those 2 nodes.

How else can I check if the communication between those 2 nodes is working? In the logs I see that:

DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java (line 206) attempting to connect to NODE1/node1.ip

So I assume that the communication is somehow established?

--
regards,
Jakub Glapa

On Wed, Aug 1, 2012 at 11:36 AM, Jakub Glapa jakub.gl...@gmail.com wrote:

yes it's the same

--
regards, pozdrawiam,
Jakub Glapa

On Wed, Aug 1, 2012 at 11:24 AM, Roshni Rajagopal roshni.rajago...@wal-mart.com wrote:

Ok, sorry, it may not be required. I was thinking of a configuration I had done on my local laptop, where I had aliased my IP address. In that case the directories and JMX port needed to be different. The cluster name is the same, right?

From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

Hi Roshni,

no, they are the same. My changes in cassandra.yaml were only in the listen_address, rpc_address, seeds and initial_token fields. The rest is exactly the same as on node1. That's how the file looks on node2:

cluster_name: 'Test Cluster'
initial_token: 85070591730234615865843651857942052864
hinted_handoff_enabled: true
hinted_handoff_throttle_delay_in_ms: 1
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authority: org.apache.cassandra.auth.AllowAllAuthority
partitioner: org.apache.cassandra.dht.RandomPartitioner
data_file_directories:
    - /data/servers/cassandra_sbe_edtool/cassandra_data/data
commitlog_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/commitlog
saved_caches_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: NODE1
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 32
memtable_flush_queue_size: 4
sliced_buffer_size_in_kb: 64
storage_port: 7000
ssl_storage_port: 7001
listen_address: NODE2
rpc_address: NODE2
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: false
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 1
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

--
regards, pozdrawiam,
Jakub Glapa

On Wed, Aug 1, 2012 at 10:29 AM, Roshni Rajagopal roshni.rajago...@wal
Re: Changing comparator
Christof, can't you just use ASSUME for the CQL session?

http://www.datastax.com/docs/1.0/references/cql/ASSUME

Regards,
Roshni

On 03/08/12 2:26 PM, Christof Roduner chris...@scandit.com wrote:

Hi,

I know that changing a CF's comparator is not officially supported. However, there is a post by Jonathan Ellis that implies it can be done (www.mail-archive.com/user@cassandra.apache.org/msg09502.html). I assume that we'd have to change entries in the system.schema_* column families. Has anyone successfully done this?

We want to change the comparator from BytesType to UTF8Type to make the move to CQL easier (cannot parse 'foo' as hex bytes). Our CFs were created back in the Cassandra 0.6.x days and are too large to be easily copied to new CFs with a new schema.

Many thanks in advance.

Christof
Re: Unsuccessful attempt to add a second node to a ring.
Jakub,

Have you set the data, commitlog and saved caches directories to different ones in each yaml file for each node?

Regards,
Roshni

From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Unsuccessful attempt to add a second node to a ring.

Hi Everybody!

I'm trying to add a second node to an already operating one-node cluster.

Some specs:
- cassandra 1.0.7
- both nodes have a routable listen_address and rpc_address.
- Ports are open: (from node2) telnet node1 7000 is successful
- Seeds parameter on node2 points to node 1.

[node1] nodetool -h localhost ring
Address   DC           Rack   Status  State   Load      Owns     Token
node1.ip  datacenter1  rack1  Up      Normal  74.33 KB  100.00%  0

- initial token on node2 was specified

I see something like this in the logs on node2:

DEBUG [main] 2012-07-31 13:50:38,640 CollationController.java (line 76) collectTimeOrderedData
INFO [main] 2012-07-31 13:50:38,641 StorageService.java (line 667) JOINING: waiting for ring and schema information
DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java (line 206) attempting to connect to NODE1/node1.ip
DEBUG [ScheduledTasks:1] 2012-07-31 13:50:40,639 LoadBroadcaster.java (line 86) Disseminating load info ...
INFO [main] 2012-07-31 13:51:08,641 StorageService.java (line 667) JOINING: schema complete, ready to bootstrap
DEBUG [main] 2012-07-31 13:51:08,642 StorageService.java (line 554) ... got ring + schema info
INFO [main] 2012-07-31 13:51:08,642 StorageService.java (line 667) JOINING: getting bootstrap token
DEBUG [main] 2012-07-31 13:51:08,644 BootStrapper.java (line 138) token manually specified as 85070591730234615865843651857942052864
DEBUG [main] 2012-07-31 13:51:08,645 Table.java (line 387) applying mutation of row 4c

but it doesn't join the ring:

[node2] nodetool -h localhost ring
Address   DC           Rack   Status  State   Load      Owns     Token
node2.ip  datacenter1  rack1  Up      Normal  13.49 KB  100.00%  85070591730234615865843651857942052864

I'm attaching the full log from node2 startup in debug mode.

PS. When I didn't specify the initial token on node2 I ended up with an exception like this:

Exception encountered during startup: No other nodes seen! Unable to bootstrap.If you intended to start a single-node cluster, you should make sure your broadcast_address (or listen_address) is listed as a seed. Otherwise, you need to determine why the seed being contacted has no knowledge of the rest of the cluster. Usually, this can be solved by giving all nodes the same seed list.

I'm not sure how to proceed now. I found a couple of posts with problems like that but they weren't very useful.

--
regards,
Jakub Glapa
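For reference, the initial_token used for node2 in this thread is what you get by splitting the RandomPartitioner token space (0 to 2**127) evenly between two nodes. A minimal sketch of that arithmetic follows; initial_tokens is a hypothetical helper name, not part of any Cassandra tool.

```python
def initial_tokens(node_count):
    """Evenly spaced RandomPartitioner tokens: i * 2**127 // node_count."""
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]

tokens = initial_tokens(2)
print(tokens[0])  # 0
print(tokens[1])  # 85070591730234615865843651857942052864
```

So the token values for node1 (0) and node2 are consistent with a balanced two-node ring; the computation says nothing about why node2 does not see node1.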
Re: Unsuccessful attempt to add a second node to a ring.
Ok, sorry, it may not be required. I was thinking of a configuration I had done on my local laptop, where I had aliased my IP address. In that case the directories and JMX port needed to be different. The cluster name is the same, right?

From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

Hi Roshni,

no, they are the same. My changes in cassandra.yaml were only in the listen_address, rpc_address, seeds and initial_token fields. The rest is exactly the same as on node1. That's how the file looks on node2:

cluster_name: 'Test Cluster'
initial_token: 85070591730234615865843651857942052864
hinted_handoff_enabled: true
hinted_handoff_throttle_delay_in_ms: 1
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authority: org.apache.cassandra.auth.AllowAllAuthority
partitioner: org.apache.cassandra.dht.RandomPartitioner
data_file_directories:
    - /data/servers/cassandra_sbe_edtool/cassandra_data/data
commitlog_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/commitlog
saved_caches_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: NODE1
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 32
memtable_flush_queue_size: 4
sliced_buffer_size_in_kb: 64
storage_port: 7000
ssl_storage_port: 7001
listen_address: NODE2
rpc_address: NODE2
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: false
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 1
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

--
regards, pozdrawiam,
Jakub Glapa

On Wed, Aug 1, 2012 at 10:29 AM, Roshni Rajagopal roshni.rajago...@wal-mart.com wrote:

Jakub,

Have you set the data, commitlog and saved caches directories to different ones in each yaml file for each node?

Regards,
Roshni

From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Unsuccessful attempt to add a second node to a ring.

Hi Everybody!

I'm trying to add a second node to an already operating one-node cluster.

Some specs:
- cassandra 1.0.7
- both nodes have a routable listen_address and rpc_address.
- Ports are open: (from node2) telnet node1 7000 is successful
- Seeds parameter on node2 points to node 1.

[node1] nodetool -h localhost ring
Address   DC           Rack   Status  State   Load      Owns     Token
node1.ip  datacenter1  rack1  Up      Normal  74.33 KB  100.00%  0

- initial token on node2 was specified

I see something like this in the logs on node2:

DEBUG [main] 2012-07-31 13:50:38,640 CollationController.java (line 76) collectTimeOrderedData
INFO [main] 2012-07-31 13:50:38,641 StorageService.java (line 667) JOINING: waiting for ring and schema information
DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java (line 206) attempting to connect to NODE1/node1.ip
DEBUG [ScheduledTasks:1] 2012-07-31 13:50:40,639 LoadBroadcaster.java (line 86) Disseminating load info ...
INFO [main] 2012-07-31 13:51:08,641 StorageService.java (line 667) JOINING: schema complete, ready to bootstrap
DEBUG [main] 2012-07-31 13:51:08,642 StorageService.java (line 554) ... got ring + schema info
INFO [main] 2012-07-31 13:51:08,642
Re: Does Cassandra support operations in a transaction?
Hi Ivan, Cassandra supports 'tunable consistency'. If you always read and write at quorum (or local quorum for multiple data centers), you can guarantee that the results will be consistent: the replies from a quorum of replicas are compared and the latest is returned, so no read returns out-of-date data. This comes at a cost in performance: it is fastest to read from and write to a single replica rather than check a quorum of nodes. What you choose depends on your application's needs. Is it OK if some users receive out-of-date data (it isn't earth-shattering if someone doesn't know what you're eating right now), or is it a banking transaction system where all entities must be consistently updated? Designing in Cassandra also prioritizes de-normalization. You cannot rely on referential integrity, where the database keeps two tables (column families in Cassandra) in sync through foreign keys. The application needs to ensure that the data in all column families is accurate and in sync, because data elements may be duplicated in different column families. You cannot update two different entities and have both changes become visible to others only once both are done. Regards,

From: Jeffrey Kesselman jef...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Does Cassandra support operations in a transaction?

Short story is that few if any of the NoSQL systems support transactions natively. That's one of the big compromises they make. What they call eventual consistency is actually eventual durability in ACID terms. Consistency, as meant by the C in ACID, is not guaranteed at all.
On Wed, Aug 1, 2012 at 6:21 AM, Ivan Jiang wiwi1...@gmail.com wrote: Hi, I am new to Cassandra. I wonder whether it is possible to call Cassandra within one transaction, as in a relational DB. Thanks in advance. Best Regards, Ivan Jiang

-- It's always darkest just before you are eaten by a grue.

This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential ***
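The quorum guarantee mentioned in this thread comes down to simple arithmetic: a read is certain to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The following is a minimal sketch of that rule in plain Python; the function names are made up for illustration and are not part of any Cassandra client.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("RF =", rf, "-> QUORUM is", q, "replicas")
    print("QUORUM read + QUORUM write overlap:", reads_see_latest_write(q, q, rf))
    print("ONE read + ONE write overlap:", reads_see_latest_write(1, 1, rf))
```

With a replication factor of 3, quorum reads and writes touch 2 replicas each, so at least one replica is common to both. Reading and writing at ONE is faster but gives no such overlap, which is the trade-off described above. Note that this only concerns how fresh a single read is; it does not provide multi-row transactions.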
Re: Schema question : Query to support Find which all of these 500 email ids have been registered
In general I believe wide rows (many columns) are preferable to skinny rows (many rows), so that you can get all the information in one go. One can store 2 billion columns in a row. However, on what basis would you store the 500 email ids in one row? What can be the row key? For example, if the query you want to answer with this column family is 'how many email addresses are registered in this application?', then the application id can be the row key, and the 500 email ids can be stored as columns. Each other application would be another row. Since you want to search by application this may be the best approach. If your information doesn't fit neatly into the model above, you can go for an email id as the row key, and the list of applications as columns. Reading 500 rows does not seem a big task; I doubt it would be a performance issue given Cassandra's powers.

On 27/07/12 11:12 AM, Aklin_81 asdk...@gmail.com wrote: I need to find out which email ids, among a list of 500 ids passed in a single query, have been registered on my app. (Total registered email ids may be in millions). What is the best way to store this kind of data? Should I store each email id in a separate row? But then I would have to read 500 rows at a single time! Or if I use a single row or a smaller number of rows then they would get too heavy. Btw, would it be really bad if I read 500 rows at a single time? They'll be just 1-column rows whose columns are never modified once written.
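To make the two layouts discussed in this thread concrete, here is a small sketch that uses in-memory dictionaries as stand-ins for column families. Nothing here talks to Cassandra; the row keys, the `registered` column name, and the function names are all illustrative assumptions.

```python
# Layout A: one skinny row per email id (row key = email id).
rows_by_email = {
    "a@example.com": {"registered": "1"},
    "c@example.com": {"registered": "1"},
}

# Layout B: one wide row per application
# (row key = application id, column name = email id).
rows_by_app = {
    "app1": {"a@example.com": "", "c@example.com": ""},
}


def registered_skinny(candidates):
    """Stand-in for a multiget over row keys: keep the ids whose row exists."""
    return [email for email in candidates if email in rows_by_email]


def registered_wide(app_id, candidates):
    """Stand-in for a by-name column slice on one row: keep ids present as columns."""
    columns = rows_by_app.get(app_id, {})
    return [email for email in candidates if email in columns]


if __name__ == "__main__":
    wanted = ["a@example.com", "b@example.com", "c@example.com"]
    print(registered_skinny(wanted))
    print(registered_wide("app1", wanted))
```

Both layouts answer the "which of these ids are registered" question. The skinny layout spreads the lookups over many row keys, while the wide layout concentrates them on one row, which is why the choice of row key matters.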
Re: Cassandra and Tableau
Hi Robin, I'm from an analytics background and worked with traditional BI tools like OBIEE and Business Objects, so I am very interested in your evaluation of a good analytics toolset combination. Do share what you learn. At a high level, as I understand it, Cassandra can be used as the backend for transactional systems (with the tunable consistency adjusted according to requirements), because it is real time. Hadoop, however, is not for real-time scenarios. It's primarily for analytics: non-real-time processing on huge datasets. The actual information is stored in the HDFS file system. You can even use Hadoop as a replacement for 'ETL' processing. With a Hadoop cluster you can directly use a statistical programming language like 'R' to extract information. For traditional BI folks like me that's a new piece to learn, with no user-friendly GUI! For a better GUI, we have a new breed of tools like Pentaho, Jaspersoft, Karmasphere, Datasphere. I'm not exactly sure how all of these work; I know Pentaho works with a Hadoop + Hive combination. Here's a nice presentation from their 'Chief Geek' which explains it; I found the whole series of presentations good. Pentaho: Hadoop knowledge series http://www.pentaho.com/resources/videos/29/Hadoop-and-Business-Intelligence/ Regards, Roshni

From: Robin Verlangen ro...@us2.nl
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra and Tableau

Thank you Aaron and Brian. We're currently investigating several options. Hadoop + Hive combo also seems a good choice as our input files are flat. I'll keep you up-to-date about our final decision. - Robin

2012/7/6 aaron morton aa...@thelastpickle.com Here are two links I've noticed in my travels, have not looked into what they offer.
http://www.pentaho.com/big-data/nosql/cassandra/ http://www.jaspersoft.com/bigdata Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 7/07/2012, at 3:03 AM, Brian O'Neill wrote: Robin, We have the same issue right now. We use Tableau for all of our reporting needs, but we couldn't find any acceptable bridge between it and Cassandra. We ended up using cassandra-triggers to replicate the data to Oracle. https://github.com/hmsonline/cassandra-triggers/ Let us know if you get things set up with a direct connection. We'd be *very* interested in helping out if you find a way to do it. -brian

On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen ro...@us2.nl wrote: Hi there, Is there anyone out there who's using Tableau in combination with a Cassandra cluster? There seems to be no standard solution to connect, at least I couldn't find one. Does anyone know how to tackle this problem? With kind regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
-- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/

-- With kind regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl
Starting cassandra with -D option
Hi Folks, We wanted to have a single cassandra installation, and use it to start cassandra in other nodes by passing it the cassandra configuration directories as a parameter. Idea is to avoid having the copies of cassandra code in each node, and starting each node by getting into bin/cassandra of that node. As per http://www.datastax.com/docs/1.0/references/cassandra, We have an option –D where we can supply some parameters to cassandra. Has anyone tried this? Im getting an error as below. walmarts-MacBook-Pro-2:Node1-Cassandra1.1.0 walmart$ bin/cassandra -Dcassandra.config=file:///Users/walmart/Downloads/Cassandra/Node2-Cassandra1.1.0/conf walmarts-MacBook-Pro-2:Node1-Cassandra1.1.0 walmart$ INFO 15:38:01,763 Logging initialized INFO 15:38:01,766 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_31 INFO 15:38:01,766 Heap size: 1052770304/1052770304 INFO 15:38:01,766 Classpath: bin/../conf:bin/../build/classes/main:bin/../build/classes/thrift:bin/../lib/antlr-3.2.jar:bin/../lib/apache-cassandra-1.1.0.jar:bin/../lib/apache-cassandra-clientutil-1.1.0.jar:bin/../lib/apache-cassandra-thrift-1.1.0.jar:bin/../lib/avro-1.4.0-fixes.jar:bin/../lib/avro-1.4.0-sources-fixes.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/compress-lzf-0.8.4.jar:bin/../lib/concurrentlinkedhashmap-lru-1.2.jar:bin/../lib/guava-r08.jar:bin/../lib/high-scale-lib-1.1.2.jar:bin/../lib/jackson-core-asl-1.9.2.jar:bin/../lib/jackson-mapper-asl-1.9.2.jar:bin/../lib/jamm-0.2.5.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-0.7.0.jar:bin/../lib/log4j-1.2.16.jar:bin/../lib/metrics-core-2.0.3.jar:bin/../lib/mx4j-tools-3.0.1.jar:bin/../lib/servlet-api-2.5-20081211.jar:bin/../lib/slf4j-api-1.6.1.jar:bin/../lib/slf4j-log4j12-1.6.1.jar:bin/../lib/snakeyaml-1.6.jar:bin/../lib/snappy-java-1.0.4.1.jar:bin/../lib/snaptree-0.1.jar:bin/../lib/jamm-0.2.5.jar INFO 15:38:01,768 JNA not found. 
Native methods will be disabled.
INFO 15:38:01,826 Loading settings from file:/Users/walmart/Downloads/Cassandra/Node2-Cassandra1.1.0/conf
ERROR 15:38:01,873 Fatal configuration error error Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=No single argument constructor found for class org.apache.cassandra.config.Config in reader, line 1, column 1: cassandra.yaml

The other option would be to modify cassandra.in.sh. Has anyone tried this? Regards, Roshni
Re: Setting column to null
Leonid, Are you using some client for doing these operations? Hector is a Java client which provides APIs for adding/deleting columns to a column family in Cassandra. I don't think you really need to write your wrapper in this format; you are restricting the number of columns it can use, etc. I suggest your code accept user input to get the column family name, operation, and keys, and accordingly call the appropriate Hector API for adding/deleting data. Regards, Roshni

On 11/06/12 7:20 PM, Leonid Ilyevsky lilyev...@mooncapital.com wrote: Thanks, I understand what you are telling me. Obviously deleting the column is the proper way to do this in Cassandra. What I was looking for is some convenient wrapper on top of that which will do it for me. Here is my scenario. I have a function that takes a record to be saved in Cassandra (an array of objects, or a Map<String, Object>). Let's say it can have up to 8 columns. I prepare a statement like this: Insert into table values(?, ?, ?, ?, ?, ?, ?, ?) If I could somehow put null when I execute it, it would be enough to prepare that statement once and execute it multiple times. I would then expect that when some element is null, the corresponding column is not inserted (for a new key) or is deleted (for an existing key). The way it is now, in my code I have to examine which columns are present and which are not; depending on that I have to generate a customized statement, and it is going to be different for the case of an existing key versus the case of a new key. Isn't this too much hassle? Related question. I assumed that the prepared statement in Cassandra is there for the same reason as in an RDBMS, that is, for efficiency. In the above scenario, how expensive is it to execute a specialized statement for every record compared to a prepared statement executed multiple times? If I need to execute those specialized statements, should I still use prepared statements or should I just generate a string with everything in ASCII format?
-Original Message-
From: Roshni Rajagopal [mailto:roshni.rajago...@wal-mart.com]
Sent: Monday, June 11, 2012 12:58 AM
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Would you want to view data like this: 'there was a key which had this column, but now it does not have any value as of this time'? Unless you specifically want this information, I believe you should just delete the column, rather than have an alternate value for NULL or create a composite column. In Cassandra that's the way deletion is dealt with. Putting NULLs is the way we deal with it in an RDBMS, because there we have a fixed number of columns which always have to have some value, even if it's NULL, and we have to have the same set of columns for every row. In Cassandra, we can delete the column, and in most scenarios that's what we should do, unless we specifically want to preserve some history that this column was turned null at this time. Each row can have different columns. Regards, Roshni

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Your best bet is to define the column as a composite column where one part represents is-null and the other part is the data.

On Friday, June 8, 2012, shashwat shriparv dwivedishash...@gmail.com wrote: What you can do is define some specific value, something like NULLDATA, to write into columns that do not have a value.

On Fri, Jun 8, 2012 at 11:58 PM, aaron morton aa...@thelastpickle.com wrote: You don't need to set columns to null, delete the column instead.
Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 8/06/2012, at 9:34 AM, Leonid Ilyevsky wrote: Is it possible to explicitly set a column value to null? I see that if an insert statement does not include a specific column, that column comes up as null (assuming we are creating a record with a new unique key). But if we want to update a record, how do we set it to null? Another situation is when I use a prepared CQL3 statement (in Java) and send parameters when I execute it. If I want to leave some column unassigned, I need a special statement without that column. What I would like is to prepare one statement including all columns, and then be able to set some of them to null. I tried to set the corresponding ByteBuffer parameter to null and, obviously, got an exception. This email, along with any attachments, is confidential and may be legally privileged or otherwise protected from disclosure. Any unauthorized dissemination, copying or use of the contents of this email is strictly prohibited
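One way to picture the "delete instead of null" advice in this thread is a small helper that turns a record into an INSERT covering only the columns that have values, plus a DELETE for the columns that are null. The sketch below only builds CQL strings with `?` placeholders and the matching parameter lists; it does not use any driver, and the function name, table name, and column names are hypothetical.

```python
def build_statements(table, key_column, record):
    """Return (cql, params) pairs for one record; None values become deletes.

    `record` maps column name -> value and must contain `key_column`.
    """
    key = record[key_column]
    present = {c: v for c, v in record.items()
               if c != key_column and v is not None}
    missing = [c for c, v in record.items()
               if c != key_column and v is None]

    statements = []
    if present:
        columns = [key_column] + list(present)
        placeholders = ", ".join("?" for _ in columns)
        cql = "INSERT INTO {} ({}) VALUES ({})".format(
            table, ", ".join(columns), placeholders)
        statements.append((cql, [key] + list(present.values())))
    if missing:
        cql = "DELETE {} FROM {} WHERE {} = ?".format(
            ", ".join(missing), table, key_column)
        statements.append((cql, [key]))
    return statements


if __name__ == "__main__":
    record = {"id": 7, "price": 10.5, "qty": None, "venue": "X"}
    for cql, params in build_statements("trades", "id", record):
        print(cql, params)
```

Because the same statements are produced whether the key is new or already exists, the caller does not need two code paths; deleting a column that was never written is harmless apart from the tombstone it leaves. The generated strings could also serve as cache keys for prepared statements, since records with the same set of null columns yield identical CQL text. Whether that is cheaper than building plain strings is the open question raised in the thread and is not settled by this sketch.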
Re: Setting column to null
Would you want to view data like this: 'there was a key which had this column, but now it does not have any value as of this time'? Unless you specifically want this information, I believe you should just delete the column, rather than have an alternate value for NULL or create a composite column. In Cassandra that's the way deletion is dealt with. Putting NULLs is the way we deal with it in an RDBMS, because there we have a fixed number of columns which always have to have some value, even if it's NULL, and we have to have the same set of columns for every row. In Cassandra, we can delete the column, and in most scenarios that's what we should do, unless we specifically want to preserve some history that this column was turned null at this time. Each row can have different columns. Regards, Roshni

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Your best bet is to define the column as a composite column where one part represents is-null and the other part is the data.

On Friday, June 8, 2012, shashwat shriparv dwivedishash...@gmail.com wrote: What you can do is define some specific value, something like NULLDATA, to write into columns that do not have a value.

On Fri, Jun 8, 2012 at 11:58 PM, aaron morton aa...@thelastpickle.com wrote: You don't need to set columns to null, delete the column instead. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 8/06/2012, at 9:34 AM, Leonid Ilyevsky wrote: Is it possible to explicitly set a column value to null?
I see that if an insert statement does not include a specific column, that column comes up as null (assuming we are creating a record with a new unique key). But if we want to update a record, how do we set it to null? Another situation is when I use a prepared CQL3 statement (in Java) and send parameters when I execute it. If I want to leave some column unassigned, I need a special statement without that column. What I would like is to prepare one statement including all columns, and then be able to set some of them to null. I tried to set the corresponding ByteBuffer parameter to null and, obviously, got an exception.

-- ∞ Shashwat Shriparv
Re: Problem in getting data from a 2 node cluster of Cassandra
Hi Prakrati, In an ideal situation, no data should be lost when a node is added. How are you getting the statistics below? The output looks like it's from some code using Hector or Thrift. Is the code that gets the statistics exactly the same for the 1 node cluster and the 2 node cluster, with the only change being a node added or removed? Could you verify the number of rows and columns in the column family using the CLI or CQL? Regards, Roshni

From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Friday 8 June 2012 11:50 AM
To: user@cassandra.apache.org
Subject: Problem in getting data from a 2 node cluster of Cassandra

Dear all I originally had a 1 node cluster. Then I added one more node to it with the initial token configured appropriately. Now when I run my queries I am not getting all my data, i.e. all columns.

Output on 2 nodes
Time taken to retrieve columns 43707 of key range is 1276
Time taken to retrieve columns 2084199 of all tickers is 54334
Time taken to count is 230776
Total number of rows in the database are 183
Total number of columns in the database are 7903753

Output on 1 node
Time taken to retrieve columns 43707 of key range is 767
Time taken to retrieve columns 382 of all tickers is 52793
Time taken to count is 268135
Total number of rows in the database are 396
Total number of columns in the database are 16316426

Please help me. Where is my data going, or how should I retrieve it? I have consistency level specified as ONE and I did not specify any replication factor. Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com This email message may contain proprietary, private and confidential information.
The information transmitted is intended only for the person(s) or entities to which it is addressed. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited and may be illegal. If you received this in error, please contact the sender and delete the message from your system. Mu Sigma takes all reasonable steps to ensure that its electronic communications are free from viruses. However, given Internet accessibility, the Company cannot accept liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software.
Re: How to include two nodes in Java code using Hector
In Hector, when you create a cluster using the API, you specify an IP address and a cluster name. Thereafter, which node serves the request, or how many nodes need to be contacted to read/write data, depends on the cluster configuration, i.e. what your replication strategy, replication factor, and consistency levels for the column family are, how many nodes there are in the ring, etc. So you don't need to connect to each node individually via the Hector client. Once you connect to the cluster keyspace via the IP address of any node in the cluster, when you make Hector calls to read/write data, it automatically figures out the node-level details and carries out the task. You won't get 50% of the data, you will get all data. Also, when you remove a node, your data will be unavailable ONLY if you don't have it available on some other node as a replica. Regards,

From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Tue, 5 Jun 2012 20:05:21 -0700
To: user@cassandra.apache.org
Subject: RE: How to include two nodes in Java code using Hector

But the data is distributed across the nodes (meaning 50% of the data is on one node and 50% is on the other node), so I need to specify the node IP address somewhere in the code. But where to specify it is what I am clueless about. Please help me Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com

From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com]
Sent: Tuesday, June 05, 2012 5:51 PM
To: user@cassandra.apache.org
Subject: RE: How to include two nodes in Java code using Hector

Use Consistency Level =2.
Regards Harsh

From: Prakrati Agrawal [mailto:prakrati.agra...@mu-sigma.com]
Sent: Tuesday, June 05, 2012 4:08 PM
To: user@cassandra.apache.org
Subject: How to include two nodes in Java code using Hector

Dear all I am using a two node Cassandra cluster. How do I code in Java using Hector to get data from both the nodes? Please help Thanks and Regards Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com
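The point made in this thread, that a client connected to any one node still sees all the data, follows from how keys are placed on the ring. The sketch below is a simplified simulation in plain Python, not Hector or Cassandra code: it assumes an MD5-based token in the spirit of RandomPartitioner and SimpleStrategy-style placement (the first node at or after the key's token, then the next nodes clockwise). Node names, tokens, and function names are illustrative.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127  # token space assumed for this sketch


def token_for(key):
    """Hash a row key onto the ring (MD5-based, simplified)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE


def replicas_for(key, ring, replication_factor):
    """Walk clockwise from the key's token and pick the replica nodes.

    `ring` maps node token -> node name.
    """
    tokens = sorted(ring)
    start = bisect_left(tokens, token_for(key)) % len(tokens)
    count = min(replication_factor, len(tokens))
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


def write(key, value, ring, replication_factor, storage):
    """Store the value on every replica responsible for the key."""
    for node in replicas_for(key, ring, replication_factor):
        storage[node][key] = value


def read(key, ring, replication_factor, storage):
    """Look the key up on its replicas; the result does not depend on
    which node the client happened to connect to."""
    for node in replicas_for(key, ring, replication_factor):
        if key in storage[node]:
            return storage[node][key]
    return None


if __name__ == "__main__":
    ring = {0: "node1", 2 ** 126: "node2"}
    storage = {"node1": {}, "node2": {}}
    for i in range(100):
        write("key%d" % i, i, ring, 1, storage)
    print({node: len(rows) for node, rows in storage.items()})
    print(all(read("key%d" % i, ring, 1, storage) == i for i in range(100)))
```

With a replication factor of 1 each node holds only part of the keys, yet every read succeeds because the lookup is routed by token rather than by the node the client named. It also shows why removing a node makes data unavailable only when no other replica holds it.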
Re: Can not find auto bootstrap property in cassandra.yaml for Cassandra 1.1.0
Hi Prakrati, In 1.1.0 you don't need to set this; it's on by default. I'm also on 1.1.0 and I didn't need to set this. Regards, Roshni

From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Sun, 3 Jun 2012 22:58:24 -0700
To: user@cassandra.apache.org
Subject: Can not find auto bootstrap property in cassandra.yaml for Cassandra 1.1.0

Dear all I am trying to add a new node to the Cassandra cluster. All the documentation available on the net says to set the auto bootstrap property in cassandra.yaml to true, but I cannot find the property in the file. Please help me Thanks and Regards Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com
Re: Adding a new node to Cassandra cluster
Prakrati, I believe that even though you specify one node in your code, internally the request may go to any node, perhaps more than one, based on your replication factor and consistency level settings. You can try this by connecting to one node and writing to it and then reading the same data from another node. You can see this replication happening via the CLI as well. Regards, Roshni

From: R. Verlangen ro...@us2.nl
Reply-To: user@cassandra.apache.org
Date: Mon, 4 Jun 2012 02:30:40 -0700
To: user@cassandra.apache.org
Subject: Re: Adding a new node to Cassandra cluster

You might consider using a higher level client (like Hector indeed). If you don't want this you will have to write your own connection pool. For a start, take a look at Hector. But keep in mind that you might be reinventing the wheel.

2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.com Hi, I am using the Thrift API and I am not able to find anything on the internet about how to configure it for multiple nodes. I am not using any proper client like Hector. Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com

From: R. Verlangen [mailto:ro...@us2.nl]
Sent: Monday, June 04, 2012 2:44 PM
To: user@cassandra.apache.org
Subject: Re: Adding a new node to Cassandra cluster

Hi there, When you speak to one node it will internally redirect the request to the proper node (local / external), but you won't be able to fail over on a crash of the localhost. For adding another node to the connection pool you should take a look at the documentation of your Java client. Good luck!
2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.commailto:prakrati.agra...@mu-sigma.com Dear all I successfully added a new node to my cluster so now it’s a 2 node cluster. But how do I mention it in my Java code as when I am retrieving data its retrieving only for one node that I am specifying in the localhost. How do I specify more than one node in the localhost. Please help me Thanks and Regards Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.comhttp://www.mu-sigma.com This email message may contain proprietary, private and confidential information. The information transmitted is intended only for the person(s) or entities to which it is addressed. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited and may be illegal. If you received this in error, please contact the sender and delete the message from your system. Mu Sigma takes all reasonable steps to ensure that its electronic communications are free from viruses. However, given Internet accessibility, the Company cannot accept liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software. -- With kind regards, Robin Verlangen Software engineer W www.robinverlangen.nlhttp://www.robinverlangen.nl E ro...@us2.nlmailto:ro...@us2.nl Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies. 
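The thread notes that talking to a single node leaves no failover if that node crashes, and that without a client like Hector you have to manage the host list yourself. A minimal sketch of that idea follows, assuming a hypothetical helper named HostRotation and made-up host addresses. It only decides which host to try first and which to fall back to; opening the Thrift connection itself is left out, and this is not how Hector or any other client implements its pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: rotates over a fixed list of Cassandra hosts. */
final class HostRotation {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger(0);

    HostRotation(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        // Defensive copy so later changes to the caller's list do not affect us.
        this.hosts = new ArrayList<>(hosts);
    }

    /**
     * Returns the hosts to try for one request: the round-robin pick first,
     * then the remaining hosts in order as fallbacks.
     */
    List<String> attemptOrder() {
        // floorMod keeps the index non-negative even if the counter overflows.
        int start = Math.floorMod(next.getAndIncrement(), hosts.size());
        List<String> order = new ArrayList<>(hosts.size());
        for (int k = 0; k < hosts.size(); k++) {
            order.add(hosts.get((start + k) % hosts.size()));
        }
        return order;
    }
}
```

A caller would walk the list returned by attemptOrder(), try to connect to each host in turn, and stop at the first one that answers. With two hosts, successive requests alternate which node is contacted first, so a crash of one node only costs a failed first attempt.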
Replication factor via hector
Hi, I'm trying to see the effect of different replication factors and consistency levels for a keyspace on a 4 node Cassandra cluster. I'm doing this using the Hector client. I could not find an API to set the replication factor for a keyspace, though I could find ways to modify the consistency level. Is it possible to change the replication factor using Hector, or does it have to be done using the CLI? Regards, Roshni This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential ***
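When comparing replication factors and consistency levels, it can help to work out the replica counts first. The sketch below is an illustration only, assuming a hypothetical helper class named QuorumMath. It uses the standard rules that QUORUM needs floor(RF / 2) + 1 replicas, and that a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts add up to more than RF. It does not touch Hector or the cluster.

```java
/** Hypothetical helper for reasoning about replication factor and consistency level. */
final class QuorumMath {

    /** Replicas that must respond at QUORUM: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must overlap every write set, i.e. R + W > RF. */
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        // Print the quorum size and overlap guarantees for a few replication factors.
        for (int rf = 1; rf <= 4; rf++) {
            int q = quorum(rf);
            System.out.printf("RF=%d quorum=%d ONE/ONE overlap=%b QUORUM/QUORUM overlap=%b%n",
                    rf, q, readsSeeLatestWrite(1, 1, rf), readsSeeLatestWrite(q, q, rf));
        }
    }
}
```

For example, with a replication factor of 3, QUORUM means 2 replicas, so QUORUM reads combined with QUORUM writes overlap, while ONE reads combined with ONE writes do not.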
Re: no snappyjava in java.library.path (JDK 1.7 issue?)
Hi Stephen, Cassandra's wiki says Cassandra requires the most stable version of Java 1.6 you can deploy. http://wiki.apache.org/cassandra/GettingStarted Regards, Roshni

From: Stephen McKamey step...@mckamey.com Reply-To: user@cassandra.apache.org Date: Tue, 15 May 2012 13:40:43 -0700 To: Cassandra user@cassandra.apache.org Subject: Re: no snappyjava in java.library.path (JDK 1.7 issue?)

Reverting to JDK 1.6 appears to fix the issue. Is JDK 1.7 not yet supported by Cassandra? java version 1.6.0_31 Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635) Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

On Tue, May 15, 2012 at 12:55 PM, Stephen McKamey step...@mckamey.com wrote: Worth noting is I'm on Mac OS X 10.7.4 and I recently upgraded to the latest JDK (really hoping this isn't the issue): java version 1.7.0_04 Java(TM) SE Runtime Environment (build 1.7.0_04-b21) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
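To confirm which JVM a process is actually running on, and which native library path it sees, a small check like the following can be run with the same java binary that launches Cassandra. This is a sketch only, assuming a hypothetical class named JavaVersionCheck; the parsing covers the legacy "1.x.y_z" version strings quoted in this thread and returns -1 for anything else.

```java
/** Hypothetical diagnostic: reports the running Java version and native library path. */
final class JavaVersionCheck {

    /** Returns x from a legacy "1.x.y_z" version string, or -1 if the string has another shape. */
    static int legacyMinor(String version) {
        String[] parts = version.split("\\.");
        if (parts.length < 2 || !parts[0].equals("1")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version      = " + version);
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        if (legacyMinor(version) != 6) {
            System.out.println("Note: this JVM is not a 1.6 release.");
        }
    }
}
```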