Data Model Question
Hi! I am a newbie to Cassandra, seeking some advice on the data model that best addresses my needs. For simplicity, what I want to accomplish is this: I have a system with users (potentially ~10,000 per day) who perform actions in the system (~50,000 in total per day). Each user's action takes place at a certain point in time, is classified into 1 to 5 categories, and is tagged with 1-30 tags. Each of an action's categories and tags has a score associated with it, between 0 and 1 (let's assume a precision of 0.0001). I want to be able to identify similar actions in the system (usually performed by more than one user). Similarity of actions is calculated from their common categories and tags, taking the scores into account.

I need the system to store:
- The list of my users, with attributes like name, age, etc.
- For each action: the categories and tags associated with it and their scores, the time of the action, and the user who performed it.
- Groups of similar actions (ActionGroups): the IDs of the actions in the group, plus the categories and tags describing the group, with their scores. These are calculated by an algorithm that takes into account the categories and tags of the actions in the group.

When a user performs a new action in the system, I want to add it to a fitting ActionGroup (one with similar categories and tags). For this I need to be able to find all the recent ActionGroups (those updated with actions performed during the last T minutes) that have at least one of the new action's categories AND at least one of the new action's tags.

I thought of two ways to address this and would appreciate your insights.

First solution, using secondary indexes:

Column Family: Users
Key: userId
Compare with: BytesType
Columns: name, age, etc.

Column Family: Actions
Key: actionId
Compare with: BytesType
Columns: Category1:Score ... CategoryN:Score, Tag1:Score ...
TagK:Score, Time: timestamp, user: userId

Column Family: ActionGroups
Key: actionGroupId
Compare with: BytesType
Columns: Category1:Score ... CategoryN:Score, Tag1:Score ... TagK:Score, lastUpdateTime: timestamp, actionId1: null, ..., actionIdM: null

I will then define a secondary index on each tag column, each category column, and the update-time column. Assume the new action I want to add to an ActionGroup has NewActionCategory1 - NewActionCategoryK and NewActionTag1 - NewActionTagN. I will perform the following query:

Select * From ActionGroups where (NewActionCategory1 > 0 or ... or NewActionCategoryK > 0) and (NewActionTag1 > 0 or ... or NewActionTagN > 0) and lastUpdateTime > T;

Second solution: the same CFs as in the first solution, without the secondary indexes, plus two additional CFs:

Column Family: CategoriesToActionGroupId
Key: categoryId
Compare with: BytesType
Columns: {Timestamp, ActionGroupId1}: null, {Timestamp, ActionGroupId2}: null, ...
(the timestamp is the update time of the ActionGroup)

A similar CF will be defined for tags. I will then be able to run several queries on CategoriesToActionGroupId (one for each of the new action's categories), with a column slice for the right ActionGroup update times, and do the same on TagsToActionGroupId. I will then use my client code to remove duplicates (ActionGroups that are associated with more than one tag or category).

My questions are:
1. Are the two solutions viable? If yes, which is better?
2. Is there any better way of doing this?
3. Can I use JDBC and CQL with both methods, or do I have to use Hector (I am using Java)?

Thanks
Tamar
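For what it's worth, the client-side de-duplication step in the second solution can be sketched in plain Java (an illustrative sketch with made-up names, not tied to any particular client library): union the ActionGroup ids returned by the per-category queries, union the ids returned by the per-tag queries, and keep only the ids present in both unions, i.e. groups with at least one matching category AND at least one matching tag.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ActionGroupMerge {
    // byCategory/byTag: one set of ActionGroup ids per category/tag query result.
    // Returns the ids that matched at least one category AND at least one tag.
    static Set<String> merge(List<Set<String>> byCategory, List<Set<String>> byTag) {
        Set<String> categoryHits = new HashSet<>();
        for (Set<String> s : byCategory) categoryHits.addAll(s); // union over categories
        Set<String> tagHits = new HashSet<>();
        for (Set<String> s : byTag) tagHits.addAll(s);           // union over tags
        categoryHits.retainAll(tagHits);                         // keep ids present in both unions
        return categoryHits;
    }
}
```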
Re: Unbalanced cluster with RandomPartitioner
On 19.01.2012, at 20:15, Narendra Sharma wrote:
I believe you need to move the nodes on the ring. What was the load on the nodes before you added 5 new nodes? It's just that you are getting more data in certain token ranges than in others.

With three nodes, it was also imbalanced. What I don't understand is why the md5 sums would generate such massive hot spots. Most of our keys look like this: 00013270494972450001234567, with the first 16 digits being a timestamp of one of our application servers' startup times, and the last 10 digits being sequentially generated per user. There may be a lot of keys that start with e.g. 0001327049497245 (or some other timestamp). But I was under the impression that md5 doesn't care and generates a uniform distribution? Then again, I know next to nothing about md5. Maybe someone else has better insight into the algorithm?

However, we also use cfs with a date (mmdd) as key, as well as cfs with uuids as keys. And those cfs are not balanced in themselves either. E.g. node 5 has 12 GB live space used in the cf with the uuid as key, and node 8 only 428 MB.

Cheers,
Marcel

On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach marcel.steinb...@chors.de wrote:
On 18.01.2012, at 02:19, Maki Watanabe wrote:
Are there any significant differences in the number of sstables on each node?

No, no significant difference there. Actually, node 8 is among those with more sstables but with the least load (20GB).

On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
Are you deleting data or using TTLs? Expired/deleted data won't go away until the sstable holding it is compacted. So if compaction has happened on some nodes, but not on others, you will see this. The disparity is pretty big, 400GB to 20GB, so this probably isn't the issue, but with our data using TTLs, if I run major compactions a couple of times on that column family it can shrink ~30%-40%.

Yes, we do delete data. But I agree, the disparity is too big to blame only the deletions. Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks ago. After adding the nodes, we did compactions and cleanups and didn't have a balanced cluster. So that should have removed outdated data, right?

2012/1/18 Marcel Steinbach marcel.steinb...@chors.de:
We are running regular repairs, so I don't think that's the problem. And the data dir sizes match approximately the load from nodetool. Thanks for the advice, though. Our keys are digits only, and all contain a few zeros at the same offsets. I'm not that familiar with the md5 algorithm, but I doubt that it would generate 'hotspots' for those kinds of keys, right?

On 17.01.2012, at 17:34, Mohit Anchlia wrote:
Have you tried running repair first on each node? Also, verify using df -h on the data dirs.

On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach marcel.steinb...@chors.de wrote:
Hi, we're using RP and have each node assigned the same amount of the token space. The cluster looks like this:

Address  Status  State   Load       Owns    Token
                                            205648943402372032879374446248852460236
1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236

I was under the impression the RP would distribute the load more evenly. Our row sizes are 0.5-1 KB, so we don't store huge rows on a single node. Should we just move the nodes so that the load is more evenly distributed, or is there something off that needs to be fixed first?
Thanks
Marcel

chors GmbH - specialists in digital and direct marketing solutions
Haid-und-Neu-Straße 7, 76131 Karlsruhe, Germany, www.chors.com
Managing Directors: Dr. Volker Hatz, Markus Plattner. Amtsgericht Montabaur, HRB 15029
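On the md5 question in the thread above: the uniformity claim is easy to check empirically. The sketch below is a simplified model of RandomPartitioner's token computation (md5 of the key reduced into the 0..2**127 ring; not Cassandra's exact code), applied to sequential keys with a fixed timestamp prefix like the ones described, counting how many tokens land in each of several equal ranges.

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5SpreadDemo {
    static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    // Simplified model of RandomPartitioner: md5(key) reduced into 0..2**127.
    static BigInteger token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new BigInteger(1, digest).mod(RING);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Hash n sequential keys (fixed timestamp prefix + sequential 10-digit suffix)
    // and count how many tokens fall into each of `buckets` equal token ranges.
    static int[] spread(int n, int buckets) {
        BigInteger width = RING.divide(BigInteger.valueOf(buckets));
        int[] counts = new int[buckets];
        for (int i = 0; i < n; i++) {
            String key = "0001327049497245" + String.format("%010d", i);
            counts[token(key).divide(width).intValue()]++;
        }
        return counts;
    }
}
```

Even with near-identical keys, the buckets come out essentially even, which supports the intuition that sequential keys are not what causes this kind of imbalance under RP.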
delay in data deleting in Cassandra
Hi, I am experiencing a delay in delete operations in Cassandra. It's as follows: I am running a thread which performs the following three steps.

Step 01: Read data from column family foo [1]
Step 02: Process the received data, e.g. bar1, bar2, bar3, bar4, bar5
Step 03: Remove the processed data from foo [2]

The problem occurs when this thread is invoked for the second time. In the first step, it returns some of the data that I already deleted in the third step of the previous cycle. E.g. it returns bar2, bar3, bar4, bar5. It seems that though I called the remove operation as in [2], it takes time for the deletion to reach the file system. If I add a thread sleep of 5 secs between the thread cycles, it does not give me any data that I deleted in the third step.

[1]
SliceQuery<String, String, byte[]> sliceQuery = HFactory.createSliceQuery(keyspace, stringSerializer, stringSerializer, bs);
sliceQuery.setKey(queueName);
sliceQuery.setRange("", "", false, messageCount);
sliceQuery.setColumnFamily(USER_QUEUES_COLUMN_FAMILY);

[2]
Mutator<String> mutator = HFactory.createMutator(keyspace, stringSerializer);
mutator.addDeletion(queueName, USER_QUEUES_COLUMN_FAMILY, messageId, stringSerializer);
mutator.execute();

Is there a solution for this?

Cassandra version: 0.8.0
Libthrift version: 0.6.1

Thanks
Shammi

--
Best Regards, *Shammi Jayasinghe*
Senior Software Engineer; WSO2, Inc.; http://wso2.com, mobile: +94 71 4493085
RE: Garbage collection freezes cassandra node
Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results.

-----Original Message-----
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: vrijdag 20 januari 2012 4:23
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

On node 172.16.107.46, I see the following:

21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]
21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528

I have not yet tested the -XX:+DisableExplicitGC switch. Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting?

* Increasing the total heap size can definitely help; the only kink is that if you need to increase the heap size unacceptably much, it is not helpful.
* Decreasing the occupancy trigger can help, yes, but you will get very much diminishing returns as your trigger fraction approaches the actual live size of data on the heap.
* I just re-checked your original message - you're on Cassandra 0.7? I *strongly* suggest upgrading to 1.x. In general that holds true, but also specifically relating to this, there are significant improvements in memory allocation behavior that significantly reduce the probability and/or frequency of promotion failures and full gcs.
* Increasing the size of the young generation can help by causing less promotion to old-gen (see the cassandra.in.sh script or the equivalent for Windows).
* Increasing the number of parallel threads used by CMS can help CMS complete its marking phase quicker, but at the cost of a greater impact on the mutator (cassandra).
I think the most important thing is - upgrade to 1.x before you run these benchmarks. Particularly detailed tuning of GC issues is pretty useless on 0.7 given the significant changes in 1.0. Don't even bother spending time on this until you're on 1.0, unless this is about a production cluster that you cannot upgrade for some reason. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Unbalanced cluster with RandomPartitioner
Thanks for all the responses! I found our problem: with the RandomPartitioner, the key range is 0..2**127. When we added nodes, we generated the tokens and, out of convenience, added an offset to them because the move was easier that way. However, we did not take the result modulo 2**127 for the last two tokens, so they were outside the RP's key range. Moving the last two tokens to their value mod 2**127 will resolve the problem.

Cheers,
Marcel

On 20.01.2012, at 10:32, Marcel Steinbach wrote:
[...]
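The fix described above is just a modular reduction; for illustration (plain Java BigInteger arithmetic, not a Cassandra API):

```java
import java.math.BigInteger;

public class TokenWrap {
    // RandomPartitioner's key range: 0..2**127.
    static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    // Bring a token that was pushed out of range by an added offset back onto the ring.
    static BigInteger normalize(BigInteger token) {
        return token.mod(RING);
    }
}
```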
two dimensional slicing
I'm storing very large versioned lists of names, and I'd like to query a range of names within a given range of versions, which is a two-dimensional slice, in a single query. This is easy to do using ByteOrderedPartitioner, but seems to require multiple (non-parallel) queries and extra CFs when using RandomPartitioner. I see two approaches when using RP:

1) The data is stored in a super column family, with one dimension being the super column names and the other the sub column names. Since slicing on sub columns requires a list of super column names, a second standard CF is needed to get a range of names before querying the main super CF. With CASSANDRA-2710, the same is possible using a standard CF with composite types instead of a super CF.

2) If one of the dimensions is small, a two-dimensional slice isn't required. The data can be stored in a standard CF with linear ordering on a composite type (large_dimension, small_dimension). Data is queried based on the large dimension, and the client throws out the extra data in the other dimension.

Neither of the above solutions is ideal. Does anyone else have a use case where two-dimensional slicing is useful? Given the disadvantages of BOP, is it practical to make the composite column query model richer to support this sort of use case?

Thanks,
Bryce
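Approach 2 can be modeled in a few lines of plain Java (a toy in-memory model, not a Cassandra API): composite column names sort by the large dimension first, the slice selects on the large dimension, and the client filters out entries whose small dimension falls outside the requested range.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;

public class CompositeSliceDemo {
    // Composite column name: (large_dimension, small_dimension).
    record Name(String large, int small) {}

    // Linear ordering on the composite type: large dimension first, then small.
    static final Comparator<Name> ORDER =
            Comparator.comparing(Name::large).thenComparingInt(Name::small);

    // Slice on the large dimension [from, to]; filter the small dimension
    // [lo, hi] client-side, throwing away the extra data.
    static List<Name> query(NavigableSet<Name> row, String from, String to, int lo, int hi) {
        List<Name> result = new ArrayList<>();
        for (Name n : row.subSet(new Name(from, Integer.MIN_VALUE), true,
                                 new Name(to, Integer.MAX_VALUE), true)) {
            if (n.small() >= lo && n.small() <= hi) result.add(n);
        }
        return result;
    }
}
```

The wasted work is exactly the entries the loop skips, which is why this only pays off when the small dimension really is small.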
Cassandra to Oracle?
I can't remember if I asked this question before, but... We're using Cassandra as our transactional system and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (>100 jobs now). But we are still struggling to provide an ad-hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS.

Anyone have any ideas? Better ways to support ad-hoc queries? Effectively, our users want to be able to run: select count(distinct Y) from X group by Z, where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time.

Are people doing bulk exports? Anyone trying to keep an RDBMS in sync in real time?

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
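For reference, the query shape users are asking for - select count(distinct Y) from X group by Z - looks like this over in-memory rows (an illustration in plain Java, just to pin down the semantics that any map/reduce job or materialized CF would have to reproduce):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AdHocCount {
    // rows: pairs (Z, Y). Returns, per distinct Z, the number of distinct Y values.
    static Map<String, Long> countDistinctYByZ(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r[0],                                   // group by Z
                Collectors.mapping(r -> r[1],                // project Y
                        Collectors.collectingAndThen(
                                Collectors.toSet(),          // distinct Y
                                s -> (long) s.size()))));    // count
    }
}
```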
Re: Garbage collection freezes cassandra node
Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results.

*IF* your data size, or at least the hot set, is small enough that you're not extremely reliant on the current size of the page cache, then in terms of short-term relief, I recommend:

* Significantly increasing the heap size. Like double it or more.
* Decreasing the occupancy trigger such that it kicks in around the time it already does (in terms of the amount of heap used).
* Increasing the young generation size (to lessen promotion into old-gen).

Experiment on a single node, making sure you're not causing too much disk I/O by stealing memory otherwise used by the page cache. Once you have something that works, you might try slowly going back down.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
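The three knobs above map roughly to JVM options along these lines. The values below are purely illustrative assumptions to be tuned per node (on 0.7 they would go in cassandra.in.sh or its Windows equivalent), not recommended settings:

```shell
# Illustrative starting points only -- tune against your own heap usage and page cache needs.
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"                        # larger total heap (e.g. doubled)
JVM_OPTS="$JVM_OPTS -Xmn3G"                                 # bigger young gen, less promotion to old-gen
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"  # start CMS earlier (lower occupancy trigger)
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"     # honor the trigger instead of JVM heuristics
```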
Re: delay in data deleting in Cassandra
The problem occurs when this thread is invoked for the second time. In that step , it returns some of data that i already deleted in the third step of the previous cycle. In order to get a guarantee about a subsequent read seeing a write, you must read and write at QUORUM (or LOCAL_QUORUM if it's only within a DC). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
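The reason a QUORUM read is guaranteed to see a QUORUM write is the overlap condition R + W > N (replicas contacted on read plus replicas acknowledged on write must exceed the replication factor), which forces at least one replica in the read set to hold the latest write. A one-line check, for illustration:

```java
public class QuorumCheck {
    // A read is guaranteed to see a prior write iff the read and write
    // replica sets must overlap: r + w > n (n = replication factor).
    static boolean readSeesWrite(int r, int w, int n) {
        return r + w > n;
    }
}
```

With RF=3, QUORUM/QUORUM gives 2+2 > 3, while ONE/ONE gives 1+1 > 3 = false, which matches the delayed-delete behavior described in the original message.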
Re: Cassandra to Oracle?
How much data do you think you will need ad-hoc query ability for?

On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.edu wrote:
[...]
Re: Cassandra to Oracle?
Not terribly large: ~50 million rows, each row with ~100-300 columns. But big enough that a map/reduce job takes longer than users would like. Actually, maybe that is another question... Does anyone have any benchmarks running map/reduce against Cassandra? (Even a simple count or copy-CF benchmark would be helpful.)

-brian

On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson j.zach.richard...@gmail.com wrote:
[...]
Ad Hoc Queries
Interesting articles... (changing the subject line to broaden the scope)

http://codemonkeyism.com/dark-side-nosql/
http://www.reportsanywhere.com/pebble/2010/04/16/127143774.html

These articulate the exact challenge we're trying to overcome.

-brian

On Fri, Jan 20, 2012 at 12:57 PM, Brian O'Neill b...@alumni.brown.edu wrote:
[...]

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Encryption related question
Hello, I am trying to use internode encryption in Cassandra (1.0.6) for the first time.

1. I followed steps 1 to 5 at http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore

Q. In cassandra.yaml, what value goes for keystore? I exported the certificate per step #3 above into duke.cer. Do I put the location and name of that file for this parameter? Similarly, what value goes for truststore? Steps 1-5 don't indicate any other file to be exported that would possibly go here. Also, do I need to follow these steps on each of the nodes?

Thanks
AJ
Triggers?
Anyone know if there is any activity to deliver triggers? I saw this quote: http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php Ellis says that he's just starting to think about the post-1.0 world for Cassandra. Two features do come to mind, though, that missed the boat for 1.0 and that were on a lot of wishlists. The first is triggers. Database triggers let you define rules in the database, such as updating table X when table Y is updated. Ellis says that triggers will be necessary for Cassandra as it grows in popularity. As more tools use it, that's something more users are going to be asking for. But grepping the trunk code, I don't see any work on triggers. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Encryption related question
I had the following writeup from when I did the KS and TS creation. Hope this helps.

Step 1: Download your organisation's cert/cert chain, or generate one.

Step 2: Log in to any one machine and do the following to create the p12:
# openssl pkcs12 -export -in cassandra-app.cert -inkey cassandra-app.key -certfile cassandra-app.cert -name cassandra-app -out cassandra-app.p12

Step 3: Now you can create the keystore (you might need the password at this stage):
# keytool -importkeystore -srckeystore cassandra-app.p12 -srcstoretype pkcs12 -destkeystore cassandra-app.jks -deststoretype JKS

Step 4: List to make sure you have the right one:
# keytool -list -v -keystore cassandra-app.jks -storepass <password>

TrustStore:

Step 1: Download the certificate chain from perforce. Do all the steps as above and you have a trust store (name it sensibly to differentiate it in the future):
# keytool -import -keystore cassandra-app.truststore -file ca.pem -alias cassandra-app -storepass <different password>

Finally: check the files into the conf dir in Perforce.

Open the yaml file and add:

encryption_options:
    internode_encryption: dc
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

Regards,
/VJ

On Fri, Jan 20, 2012 at 11:16 AM, A J s5a...@gmail.com wrote:
[...]
Re: delay in data deleting in Cassandra
Did you run repairs within GC_GRACE all the time?

On 1/20/2012 3:42 AM, Shammi Jayasinghe wrote:
[...]
Re: Cassandra to Oracle?
What makes you think that an RDBMS will give you acceptable performance? I guess you will try to index it to death (because otherwise the ad-hoc queries won't work well, if at all), and at that point you may be hit with a performance penalty. It may be a good idea to interview users and build denormalized views in Cassandra, maybe on a separate look-up cluster. A few percent of users will be unhappy, but you'll find it hard to do better. I'm talking from my experience with an industrial-strength RDBMS, which doesn't scale very well for what you call ad-hoc queries.

Regards,
Maxim

On 1/20/2012 9:28 AM, Brian O'Neill wrote:
[...]
Re: Cassandra to Oracle?
I think the problem stems from having data in a column on which you need to run an ad hoc query and which is not denormalized. In most cases it's difficult to predict the type of query that would be required. Another way of solving this could be to index the fields in a search engine. On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote: [...]
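Mohit's search-engine suggestion can be prototyped without committing to a particular engine. A toy in-memory inverted index over exported rows, purely illustrative (the class is mine; a real deployment would use something like Solr or Elasticsearch, which maintain exactly this kind of field/value-to-document mapping):

```java
import java.util.*;

public class TinyInvertedIndex {
    // field -> value -> set of row ids: the core structure a search
    // engine maintains so arbitrary-column lookups stay cheap.
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    public void add(String rowId, Map<String, String> columns) {
        columns.forEach((field, value) ->
            index.computeIfAbsent(field, f -> new HashMap<>())
                 .computeIfAbsent(value, v -> new HashSet<>())
                 .add(rowId));
    }

    public Set<String> query(String field, String value) {
        return index.getOrDefault(field, Map.of())
                    .getOrDefault(value, Set.of());
    }
}
```

Because the index is keyed by whatever field names arrive at write time, it handles the "column names we can't predict ahead of time" case that makes fixed RDBMS indexing awkward.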
Re: Cassandra to Oracle?
I certainly agree with "difficult to predict". There is a Danish proverb which goes, "It is difficult to make predictions, especially about the future." My point was that it's equally difficult with NoSQL and an RDBMS. The latter requires indexing to operate well, and that's a potential performance problem. On 1/20/2012 7:55 PM, Mohit Anchlia wrote: [...]
Re: ideal cluster size
You can also scale not horizontally but diagonally, i.e. RAID SSDs and use multicore CPUs. This means you'll get the same performance with fewer nodes, making the cluster far easier to manage. SSDs by themselves will give you an order of magnitude improvement in I/O. On 1/19/2012 9:17 PM, Thorsten von Eicken wrote: We're embarking on a project where we estimate we will need on the order of 100 Cassandra nodes. The data set is perfectly partitionable, meaning we have no queries that need access to all the data at once. We expect to run with RF=2 or 3. Is there some notion of an ideal cluster size? Or, asked differently, would it be easier to run one large cluster or a bunch of, say, 16-node clusters? Everything we've done to date has fit into 4-5 node clusters.