Data Model Question

2012-01-20 Thread Tamar Fraenkel
Hi!
I am a newbie to Cassandra and seeking some advice regarding the data model I
should use to best address my needs.
For simplicity, what I want to accomplish is:
I have a system that has users (potentially ~10,000 per day) and they perform
actions in the system (total of ~50,000 a day).
Each user’s action takes place at a certain point in time, and is also
classified into categories (1 to 5) and tagged with 1-30 tags. Each of an action’s
Categories and Tags has a score associated with it; the score is between 0 and 1
(let’s assume a precision of 0.0001).
I want to be able to identify similar actions in the system (performed usually
by more than one user). Similarity of actions is calculated based on their
common Categories and Tags taking scores into account.
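
For illustration only (the exact similarity formula is left open here), one possible
measure is a weighted cosine over the score maps of two actions; a minimal Java
sketch, assuming scores are kept in maps keyed by category/tag name:

import java.util.Map;

public class ActionSimilarity {
    // Cosine similarity between two score maps (category/tag name -> score in [0,1]).
    // Only categories/tags present in both actions contribute to the dot product.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
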
I need the system to store:

The list of my users, with attributes like name, age, etc.
For each action – the categories and tags associated with it and their score,
the time of the action, and the user who performed it.
Groups of similar actions (ActionGroups) – the IDs of the actions in the group, the
categories and tags describing the group, with their scores. Those are
calculated using an algorithm that takes into account the categories and tags of
the actions in the group.

When a user performs a new action in the system, I want to add it to fitting
ActionGroups (those with similar categories and tags).
For this I need to be able to perform the following:
Find all the recent ActionGroups (those that were updated with actions performed
during the last T minutes) that have at least one of the new action’s categories
AND at least one of the new action’s tags.
 
I thought of two ways to address the issue and I would appreciate your insights.
 
First one using secondary indexes
Column Family:Users
Key: userId
Compare with Bytes Type
Columns: name: , age:  etc…
 
Column Family:Actions
Key: actionId
Compare with Bytes Type
Columns:  Category1 : Score ….
          CategoryN: Score,
          Tag1 : Score, ….
          TagK:Score
          Time: timestamp
          user: userId
 
Column Family:ActionGroups
Key: actionGroupId
Compare with Bytes Type
Columns: Category1 : Score ….
         CategoryN: Score,
         Tag1 : Score ….
         TagK:Score
         lastUpdateTime: timestamp
         actionId1: null, … ,
         actionIdM: null
 
I will then define a secondary index on each tag column, each category column, and
the update time column.
Let’s assume the new action I want to add to ActionGroup has NewActionCategory1
- NewActionCategoryK, and has NewActionTag1 – NewActionTagN. I will perform the
following query:
Select * From ActionGroups where
   (NewActionCategory1 > 0 … or NewActionCategoryK > 0) and
   (NewActionTag1 > 0 … or NewActionTagN > 0) and
   lastUpdateTime > T;
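
One caveat (not from the original mail): Cassandra's secondary-index reads
(get_indexed_slices) require at least one equality expression and have no OR, so the
pseudo-query above cannot be issued as a single call; in practice it becomes one
indexed query per category (or per tag), merged and de-duplicated client-side. A
rough Hector sketch of one such query, assuming a hypothetical indexed per-category
flag column and Long values (serializer and query classes from the Hector API):

StringSerializer se = StringSerializer.get();
LongSerializer le = LongSerializer.get();

IndexedSlicesQuery<String, String, Long> query =
        HFactory.createIndexedSlicesQuery(keyspace, se, se, le);
query.setColumnFamily("ActionGroups");
// hasCategory_<id> is a hypothetical indexed flag column, not part of the model above
query.addEqualsExpression("hasCategory_" + categoryId, 1L);
// only ActionGroups updated during the last T minutes
query.addGteExpression("lastUpdateTime", cutoffMillis);
query.setRange("", "", false, 1000);
QueryResult<OrderedRows<String, String, Long>> result = query.execute();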
 
Second solution
Have the same CFs as in the first solution, without the secondary indexes, and have
two additional CFs:
Column Family:CategoriesToActionGroupId
Key: categoryId
Compare with ByteType
Columns: {Timestamp, ActionGroupsId1 } : null
         {Timestamp, ActionGroupsId2} : null
         ...
*timestamp is the update time for the ActionGroup
 
A similar CF will be defined for tags.
 
I will then be able to run several queries on CategoriesToActionGroupId (one for
each of the new action's categories), with a column slice for the right update time
of the ActionGroup.
I will do the same for the TagsToActionGroupId.
I will then use my client code to remove duplicates (ActionGroups that are
associated with more than one Tag or Category).
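
A rough Hector sketch of one such slice, assuming Hector's Composite support and a
CompositeType(LongType, UTF8Type) comparator on CategoriesToActionGroupId, with
columns of the form {updateTimestamp, actionGroupId}; all identifiers below are
assumptions:

Composite start = new Composite();
start.addComponent(cutoffMillis, LongSerializer.get());   // now minus T minutes
Composite finish = new Composite();
finish.addComponent(Long.MAX_VALUE, LongSerializer.get());

SliceQuery<String, Composite, byte[]> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), new CompositeSerializer(), BytesArraySerializer.get());
query.setColumnFamily("CategoriesToActionGroupId");
query.setKey(categoryId);
query.setRange(start, finish, false, Integer.MAX_VALUE);

for (HColumn<Composite, byte[]> col : query.execute().get().getColumns()) {
    // component 0 is the update timestamp, component 1 the ActionGroup id
    String actionGroupId = col.getName().get(1, StringSerializer.get());
    // collect the ids; repeat per category and per tag, then de-duplicate client-side
}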
 
My questions are:

Are the two solutions viable? If yes, which is better?
Is there any better way of doing this?
Can I use JDBC and CQL with both methods, or do I have to use Hector (I am using
Java)?

Thanks
Tamar
 
 

Re: Unbalanced cluster with RandomPartitioner

2012-01-20 Thread Marcel Steinbach
On 19.01.2012, at 20:15, Narendra Sharma wrote:
 I believe you need to move the nodes on the ring. What was the load on the 
 nodes before you added 5 new nodes? It's just that you are getting data in 
 certain token ranges more than others.
With three nodes, it was also imbalanced. 

What I don't understand is, why the md5 sums would generate such massive hot 
spots. 

Most of our keys look like that: 
00013270494972450001234567
with the first 16 digits being a timestamp of one of our application server's 
startup times, and the last 10 digits being sequentially generated per user. 

There may be a lot of keys that start with e.g. 0001327049497245  (or some 
other time stamp). But I was under the impression that md5 doesn't care and 
generates a uniform distribution?
But then again, I know next to nothing about md5. Maybe someone else has a 
better insight to the algorithm?
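
For what it is worth, here is a quick stand-alone check (illustration only, not from
the thread) of how keys sharing a long prefix map to RandomPartitioner-style tokens,
i.e. abs(md5(key)) taken as a BigInteger in the 0..2**127 range:

import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenCheck {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (long i = 0; i < 5; i++) {
            // shared 16-digit prefix plus a sequential 10-digit suffix, as described above
            String key = "0001327049497245" + String.format("%010d", i);
            BigInteger token = new BigInteger(md5.digest(key.getBytes("UTF-8"))).abs();
            System.out.println(key + " -> " + token);
        }
        // the tokens come out spread across the whole range despite the common prefix
    }
}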

However, we also use cfs with a date (mmdd) as key, as well as cfs with 
uuids as keys. And those cfs in themselves are not balanced either. E.g. node 5 has 
12 GB live space used in the cf with the uuid as key, and node 8 only 428 MB. 

Cheers,
Marcel

 
 On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach marcel.steinb...@chors.de 
 wrote:
 On 18.01.2012, at 02:19, Maki Watanabe wrote:
 Is there any significant difference in the number of sstables on each node?
 No, no significant difference there. Actually, node 8 is among those with 
 more sstables but with the least load (20GB)
 
 On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
 Are you deleting data or using TTL's?  Expired/deleted data won't go away 
 until the sstable holding it is compacted.  So if compaction has happened on 
 some nodes, but not on others, you will see this.  The disparity is pretty 
 big 400Gb to 20GB, so this probably isn't the issue, but with our data using 
 TTL's if I run major compactions a couple times on that column family it can 
 shrink ~30%-40%.
 Yes, we do delete data. But I agree, the disparity is too big to blame only 
 the deletions. 
 
 Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks 
 ago. After adding the node, we did
 compactions and cleanups and didn't have a balanced cluster. So that should 
 have removed outdated data, right?
 
 2012/1/18 Marcel Steinbach marcel.steinb...@chors.de:
 We are running regular repairs, so I don't think that's the problem.
 And the data dir sizes match approx. the load from the nodetool.
 Thanks for the advice, though.
 
 Our keys are digits only, and all contain a few zeros at the same
 offsets. I'm not that familiar with the md5 algorithm, but I doubt that it
 would generate 'hotspots' for those kind of keys, right?
 
 On 17.01.2012, at 17:34, Mohit Anchlia wrote:
 
 Have you tried running repair first on each node? Also, verify using
 df -h on the data dirs
 
 On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
 marcel.steinb...@chors.de wrote:
 
 Hi,
 
 
 we're using RP and have each node assigned the same amount of the token
 space. The cluster looks like that:
 
 
 Address Status State   Load        Owns    Token
                                            205648943402372032879374446248852460236
 1       Up     Normal  310.83 GB   12.50%  56775407874461455114148055497453867724
 2       Up     Normal  470.24 GB   12.50%  78043055807020109080608968461939380940
 3       Up     Normal  271.57 GB   12.50%  99310703739578763047069881426424894156
 4       Up     Normal  282.61 GB   12.50%  120578351672137417013530794390910407372
 5       Up     Normal  248.76 GB   12.50%  141845999604696070979991707355395920588
 6       Up     Normal  164.12 GB   12.50%  163113647537254724946452620319881433804
 7       Up     Normal  76.23 GB    12.50%  184381295469813378912913533284366947020
 8       Up     Normal  19.79 GB    12.50%  205648943402372032879374446248852460236
 
 
 I was under the impression, the RP would distribute the load more evenly.
 
 Our row sizes are 0.5-1 KB, hence, we don't store huge rows on a single
 node. Should we just move the nodes so that the load is more evenly
 distributed, or is there something off that needs to be fixed first?
 
 
 Thanks
 
 Marcel
 

delay in data deleting in Cassandra

2012-01-20 Thread Shammi Jayasinghe
Hi,
  I am experiencing a delay in delete operations in Cassandra. It's as
follows. I am running a thread which contains the following three steps.

Step 01: Read data from column family foo[1]
Step 02: Process received data eg: bar1,bar2,bar3,bar4,bar5
Step 03: Remove those processed data from foo.[2]

 The problem occurs when this thread is invoked for the second time.
In that run, it returns some of the data that I already deleted in the third
step of the previous cycle.

Eg: it returns bar2,bar3,bar4,bar5

It seems that although I called the remove operation as in [2], it takes time
for the delete to propagate. If I make the thread sleep for 5 secs between
the cycles, it does not give me any data that I deleted in the third step.

 [1]   SliceQuery<String, String, byte[]> sliceQuery =
            HFactory.createSliceQuery(keyspace, stringSerializer,
                stringSerializer, bs);
        sliceQuery.setKey(queueName);
        sliceQuery.setRange("", "", false, messageCount);
        sliceQuery.setColumnFamily(USER_QUEUES_COLUMN_FAMILY);

 [2]   Mutator<String> mutator = HFactory.createMutator(keyspace,
            stringSerializer);
        mutator.addDeletion(queueName, USER_QUEUES_COLUMN_FAMILY,
            messageId, stringSerializer);
        mutator.execute();



Is there a solution for this?

Cassandra version: 0.8.0
Libthrift version: 0.6.1


Thanks
Shammi
-- 
Best Regards,

Shammi Jayasinghe
Senior Software Engineer; WSO2, Inc.; http://wso2.com,
mobile: +94 71 4493085


RE: Garbage collection freezes cassandra node

2012-01-20 Thread Rene Kochen
Thanks for this very helpful info. It is indeed a production site which I 
cannot easily upgrade. I will try the various gc knobs and post any positive 
results.

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Friday, 20 January 2012 4:23
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

 On node 172.16.107.46, I see the following:

 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 
 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 
 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), 
 [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 
 sys=0.03, real=10.98 secs]
 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 
 3389079904 used; max is 8550678528

 I have not yet tested the XX:+DisableExplicitGC switch.

 Is the right thing to do to decrease the CMSInitiatingOccupancyFraction 
 setting?

* Increasing the total heap size can definitely help; the only kink is
that if you need to increase the heap size unacceptably much, it is
not helpful.
* Decreasing the occupancy trigger can help yes, but you will get very
much diminishing returns as your trigger fraction approaches the
actual live size of data on the heap.
* I just re-checked your original message - you're on Cassandra 0.7? I
*strongly* suggest upgrading to 1.x. In general that holds true, but
also specifically relating to this are significant improvements in
memory allocation behavior that significantly reduces the probability
and/or frequency of promotion failures and full gcs.
* Increasing the size of the young generation can help by causing less
promotion to old-gen (see the cassandra.in.sh script or the equivalent
for Windows).
* Increasing the number of parallel threads used by CMS can help CMS
complete its marking phase quicker, but at the cost of a greater
impact on the mutator (cassandra).

I think the most important thing is - upgrade to 1.x before you run
these benchmarks. Particularly detailed tuning of GC issues is pretty
useless on 0.7 given the significant changes in 1.0. Don't even bother
spending time on this until you're on 1.0, unless this is about a
production cluster that you cannot upgrade for some reason.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Unbalanced cluster with RandomPartitioner

2012-01-20 Thread Marcel Steinbach
Thanks for all the responses!

I found our problem:
Using the Random Partitioner, the key range is from 0..2**127. When we added
nodes, we generated the tokens and, out of convenience, added an offset to them
because the move was easier like that.

However, we did not apply the modulo 2**127 to the last two tokens, so they
were outside the RP's key range.
Moving the last two tokens to their value mod 2**127 will resolve the problem.
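
For reference, the wrap-around is just a modulo over the token space; a stand-alone
sketch (class name assumed):

import java.math.BigInteger;

public class TokenFix {
    public static void main(String[] args) {
        BigInteger range = BigInteger.ONE.shiftLeft(127);   // 2**127
        BigInteger token = new BigInteger(args[0]);         // the out-of-range token
        System.out.println(token.mod(range));               // value to use with nodetool move
    }
}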

Cheers,
Marcel

On 20.01.2012, at 10:32, Marcel Steinbach wrote:

 On 19.01.2012, at 20:15, Narendra Sharma wrote:
 I believe you need to move the nodes on the ring. What was the load on the 
 nodes before you added 5 new nodes? Its just that you are getting data in 
 certain token range more than others.
 With three nodes, it was also imbalanced. 
 
 What I don't understand is, why the md5 sums would generate such massive hot 
 spots. 
 
 Most of our keys look like that: 
 00013270494972450001234567
 with the first 16 digits being a timestamp of one of our application server's 
 startup times, and the last 10 digits being sequentially generated per user. 
 
 There may be a lot of keys that start with e.g. 0001327049497245  (or some 
 other time stamp). But I was under the impression that md5 doesn't bother and 
 generates uniform distribution?
 But then again, I know next to nothing about md5. Maybe someone else has a 
 better insight to the algorithm?
 
 However, we also use cfs with a date (mmdd) as key, as well as cfs with 
 uuids as keys. And those cfs in itself are not balanced either. E.g. node 5 
 has 12 GB live space used in the cf the uuid as key, and node 8 only 428MB. 
 
 Cheers,
 Marcel
 
 
 On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach 
 marcel.steinb...@chors.de wrote:
 On 18.01.2012, at 02:19, Maki Watanabe wrote:
 Are there any significant difference of number of sstables on each nodes?
 No, no significant difference there. Actually, node 8 is among those with 
 more sstables but with the least load (20GB)
 
 On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
 Are you deleting data or using TTL's?  Expired/deleted data won't go away 
 until the sstable holding it is compacted.  So if compaction has happened 
 on some nodes, but not on others, you will see this.  The disparity is 
 pretty big 400Gb to 20GB, so this probably isn't the issue, but with our 
 data using TTL's if I run major compactions a couple times on that column 
 family it can shrink ~30%-40%.
 Yes, we do delete data. But I agree, the disparity is too big to blame only 
 the deletions. 
 
 Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks 
 ago. After adding the node, we did
 compactions and cleanups and didn't have a balanced cluster. So that should 
 have removed outdated data, right?
 
 2012/1/18 Marcel Steinbach marcel.steinb...@chors.de:
 We are running regular repairs, so I don't think that's the problem.
 And the data dir sizes match approx. the load from the nodetool.
 Thanks for the advise, though.
 
 Our keys are digits only, and all contain a few zeros at the same
 offsets. I'm not that familiar with the md5 algorithm, but I doubt that it
 would generate 'hotspots' for those kind of keys, right?
 
 On 17.01.2012, at 17:34, Mohit Anchlia wrote:
 
 Have you tried running repair first on each node? Also, verify using
 df -h on the data dirs
 
 On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
 marcel.steinb...@chors.de wrote:
 
 Hi,
 
 
 we're using RP and have each node assigned the same amount of the token
 space. The cluster looks like that:
 
 
 Address Status State   Load        Owns    Token
                                            205648943402372032879374446248852460236
 1       Up     Normal  310.83 GB   12.50%  56775407874461455114148055497453867724
 2       Up     Normal  470.24 GB   12.50%  78043055807020109080608968461939380940
 3       Up     Normal  271.57 GB   12.50%  99310703739578763047069881426424894156
 4       Up     Normal  282.61 GB   12.50%  120578351672137417013530794390910407372
 5       Up     Normal  248.76 GB   12.50%  141845999604696070979991707355395920588
 6       Up     Normal  164.12 GB   12.50%  163113647537254724946452620319881433804
 7       Up     Normal  76.23 GB    12.50%  184381295469813378912913533284366947020
 8       Up     Normal  19.79 GB    12.50%  205648943402372032879374446248852460236
 
 
 I was under the impression, the RP would distribute the load more evenly.
 
 Our row sizes are 0,5-1 KB, hence, we don't store huge rows on a single
 node. Should we just move the nodes so that the load is more even
 distributed, or is there something off that needs to be fixed first?
 
 
 Thanks
 
 Marcel
 

two dimensional slicing

2012-01-20 Thread Bryce Allen
I'm storing very large versioned lists of names, and I'd like to
query a range of names within a given range of versions, which is a two
dimensional slice, in a single query. This is easy to do using
ByteOrderedPartitioner, but seems to require multiple (non parallel)
queries and extra CFs when using RandomPartitioner.

I see two approaches when using RP:

1) Data is stored in a super column family, with one dimension being
the super column names and the other the sub column names. Since
slicing on sub columns requires a list of super column names, a
second standard CF is needed to get a range of names before doing a
query on the main super CF. With CASSANDRA-2710, the same is possible
using a standard CF with composite types instead of a super CF.

2) If one of the dimensions is small, a two dimensional slice isn't
required. The data can be stored in a standard CF with linear ordering
on a composite type (large_dimension, small_dimension). Data is queried
based on the large dimension, and the client throws out the extra data
in the other dimension.
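
A rough Hector sketch of approach 2 (not from the original mail), assuming a
CompositeType(UTF8Type, LongType) comparator with columns of the form
(name, version): the slice constrains only the name range, and the client discards
versions outside the wanted window. Identifiers are assumptions:

Composite start = new Composite();
start.addComponent(0, nameStart, Composite.ComponentEquality.EQUAL);
Composite finish = new Composite();
finish.addComponent(0, nameEnd, Composite.ComponentEquality.GREATER_THAN_EQUAL);

SliceQuery<String, Composite, byte[]> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), new CompositeSerializer(), BytesArraySerializer.get());
query.setColumnFamily("VersionedNames");
query.setKey(listKey);
query.setRange(start, finish, false, Integer.MAX_VALUE);

for (HColumn<Composite, byte[]> col : query.execute().get().getColumns()) {
    long version = col.getName().get(1, LongSerializer.get());
    if (version >= versionStart && version <= versionEnd) {
        // keep this (name, version) pair; everything else is thrown away client-side
    }
}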

Neither of the above solutions is ideal. Does anyone else have a use
case where two dimensional slicing is useful? Given the disadvantages of
BOP, is it practical to make the composite column query model richer to
support this sort of use case?

Thanks,
Bryce




Cassandra to Oracle?

2012-01-20 Thread Brian O'Neill
I can't remember if I asked this question before, but

We're using Cassandra as our transactional system, and building up quite a
library of map/reduce jobs that perform data quality analysis, statistics,
etc.
(> 100 jobs now)

But... we are still struggling to provide an ad-hoc query mechanism for
our users.

To fill that gap, I believe we still need to materialize our data in an
RDBMS.

Anyone have any ideas?  Better ways to support ad-hoc queries?

Effectively, our users want to be able to select count(distinct Y) from X
group by Z.
Where Y and Z are arbitrary columns of rows in X.

We believe we can create column families with different key structures
(using Y and Z as row keys), but some column names we don't know / can't
predict ahead of time.

Are people doing bulk exports?
Anyone trying to keep an RDBMS in synch in real-time?

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Garbage collection freezes cassandra node

2012-01-20 Thread Peter Schuller
 Thanks for this very helpful info. It is indeed a production site which I 
 cannot easily upgrade. I will try the various gc knobs and post any positive 
 results.

*IF* your data size, or at least hot set, is small enough that you're
not extremely reliant on the current size of page cache, and in terms
of short-term relief, I recommend:

* Significantly increasing the heap size. Like double it or more.
* Decrease the occupancy trigger such that it kicks in around the point
it already does (in terms of the absolute amount of heap used).
* Increase the young generation size (to lessen promotion into old-gen).

Experiment on a single node, making sure you're not causing too much
disk I/O by stealing memory otherwise used by page cache. Once you
have something that works you might try slowly going back down.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: delay in data deleting in Cassandra

2012-01-20 Thread Peter Schuller
  The problem occurs when this thread is invoked for the second time.
 In that step , it returns some of data that i already deleted in the third
 step of the previous cycle.

In order to get a guarantee about a subsequent read seeing a write,
you must read and write at QUORUM (or LOCAL_QUORUM if it's only within
a DC).
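
With Hector this can be done by attaching a consistency level policy to the
keyspace; a minimal sketch (keyspace and cluster names assumed), after which the
existing SliceQuery/Mutator code from [1] and [2] runs at QUORUM:

ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
policy.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
// reads and deletions issued through this keyspace now use QUORUM
Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, policy);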

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Cassandra to Oracle?

2012-01-20 Thread Zach Richardson
How much data do you think you will need ad hoc query ability for?

On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.eduwrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite a
 library of map/reduce jobs that perform data quality analysis, statistics,
 etc.
 ( 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y an Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/




Re: Cassandra to Oracle?

2012-01-20 Thread Brian O'Neill
Not terribly large
~50 million rows, each row has ~100-300 columns.

But big enough that a map/reduce job takes longer than users would like.

Actually maybe that is another question...
Does anyone have any benchmarks running map/reduce against Cassandra?
(even a simple count / or copy CF benchmark would be helpful)

-brian

On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson 
j.zach.richard...@gmail.com wrote:

 How much data do you think you will need ad hoc query ability for?


 On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.eduwrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite
 a library of map/reduce jobs that perform data quality analysis,
 statistics, etc.
 ( 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y an Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/





-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Ad Hoc Queries

2012-01-20 Thread Brian O'Neill
Interesting articles... (changing the subject line to broaden the scope)
http://codemonkeyism.com/dark-side-nosql/
http://www.reportsanywhere.com/pebble/2010/04/16/127143774.html

These articulate the exact challenge we're trying to overcome.

-brian



On Fri, Jan 20, 2012 at 12:57 PM, Brian O'Neill b...@alumni.brown.eduwrote:

 Not terribly large
 ~50 million rows, each row has ~100-300 columns.

 But big enough that a map/reduce job takes longer than users would like.

 Actually maybe that is another question...
 Does anyone have any benchmarks running map/reduce against Cassandra?
 (even a simple count / or copy CF benchmark would be helpful)

 -brian

 On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson 
 j.zach.richard...@gmail.com wrote:

 How much data do you think you will need ad hoc query ability for?


 On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.eduwrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite
 a library of map/reduce jobs that perform data quality analysis,
 statistics, etc.
 ( 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism
 for our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from
 X group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y an Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/





 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/




-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Encryption related question

2012-01-20 Thread A J
Hello,
I am trying to use internode encryption in Cassandra (1.0.6) for the first time.

1. Followed the steps 1 to 5 at
http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore
Q. In cassandra.yaml , what value goes for keystore ? I exported the
certificate per step #3 above in duke.cer. Do I put the location and
name of that file for this parameter ?
Similarly, what value goes for truststore? The steps 1-5 don't
indicate any other file to be exported that would possibly go here.

Also, do I need to follow these steps on each of the nodes?

Thanks
AJ


Triggers?

2012-01-20 Thread Brian O'Neill
Anyone know if there is any activity to deliver triggers?

I saw this quote:

http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php

Ellis says that he's just starting to think about the post-1.0 world for
Cassandra. Two features do come to mind, though, that missed the boat for
1.0 and that were on a lot of wishlists. The first is triggers.

Database triggers let you define rules in the database, such as updating
table X when table Y is updated. Ellis says that triggers will be necessary
for Cassandra as it grows in popularity. As more tools use it, that's
something more users are going to be asking for.

But grepping the trunk code, I don't see any work on triggers.

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Encryption related question

2012-01-20 Thread Vijay
I had the following writeup from when I did the KS and TS creation... Hope this
helps.

Step 1: Download your organisation cert/cert chain, or generate one.

Step 2: Log in to any one machine and do the following to create the p12:

# openssl pkcs12 -export -in cassandra-app.cert -inkey cassandra-app.key
-certfile cassandra-app.cert -name cassandra-app -out cassandra-app.p12

Step 3: Now you can create the keystore:

# keytool -importkeystore -srckeystore cassandra-app.p12 -srcstoretype
pkcs12 -destkeystore cassandra-app.jks -deststoretype JKS

- You might need the password at this stage.

Step 4: List it to make sure you have the right one:

# keytool -list -v  -keystore cassandra-app.jks -storepass Password


TrustStore:

Step 1: Download the certificate chain from Perforce.

Do all the steps as above and you have a truststore (name it sensibly
to differentiate it in the future):

keytool -import -keystore cassandra-app.truststore -file ca.pem -alias
cassandra-app -storepass <different pass>

Finally: Check the files into the conf dir in Perforce.

Open the yaml file and add:

encryption_options:
    internode_encryption: dc
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
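
As an optional sanity check (plain JDK, nothing Cassandra-specific), you can verify
that the files and passwords referenced above actually load before restarting the
nodes; the paths and passwords below are simply the ones from the yaml snippet:

import java.io.FileInputStream;
import java.security.KeyStore;

public class CheckStores {
    public static void main(String[] args) throws Exception {
        KeyStore ks = KeyStore.getInstance("JKS");
        ks.load(new FileInputStream("conf/.keystore"), "cassandra".toCharArray());
        KeyStore ts = KeyStore.getInstance("JKS");
        ts.load(new FileInputStream("conf/.truststore"), "cassandra".toCharArray());
        System.out.println("keystore entries: " + ks.size()
                + ", truststore entries: " + ts.size());
    }
}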


Regards,
/VJ



On Fri, Jan 20, 2012 at 11:16 AM, A J s5a...@gmail.com wrote:

 Hello,
 I am trying to use internode encryption in Cassandra (1.0.6) for the first
 time.

 1. Followed the steps 1 to 5 at

 http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore
 Q. In cassandra.yaml , what value goes for keystore ? I exported the
 certificate per step #3 above in duke.cer. Do I put the location and
 name of that file for this parameter ?
 Siminarly, what value goes for truststore ? The steps 1-5 don't
 indicate any other file to be exported that would possibly go here.

 Also do I need to follow these steps on each of the node ?

 Thanks
 AJ



Re: delay in data deleting in Cassandra

2012-01-20 Thread Maxim Potekhin

Did you run repairs within GC_GRACE all the time?



On 1/20/2012 3:42 AM, Shammi Jayasinghe wrote:

Hi,
  I am experiencing a delay in delete operations in cassandra. Its as 
follows. I am running a thread which contains following three steps.


Step 01: Read data from column family foo[1]
Step 02: Process received data eg: bar1,bar2,bar3,bar4,bar5
Step 03: Remove those processed data from foo.[2]

 The problem occurs when this thread is invoked for the second time.
In that step , it returns some of data that i already deleted in the 
third step of the previous cycle.


Eg: it returns bar2,bar3,bar4,bar5

It seems though i called the remove operation as follows [2], it takes 
time to replicate it to
the file system. If i make a thread sleep of 5 secs between the thread 
cycles, it does not

give me any data that i deleted in the third step.

 [1]   SliceQuery<String, String, byte[]> sliceQuery =
            HFactory.createSliceQuery(keyspace,
                stringSerializer, stringSerializer, bs);
        sliceQuery.setKey(queueName);
        sliceQuery.setRange("", "", false, messageCount);
        sliceQuery.setColumnFamily(USER_QUEUES_COLUMN_FAMILY);

 [2]   Mutator<String> mutator = HFactory.createMutator(keyspace,
            stringSerializer);
        mutator.addDeletion(queueName, USER_QUEUES_COLUMN_FAMILY,
            messageId, stringSerializer);
        mutator.execute();


Is there a solution for this.

Cassadra version : 0.8.0
Libthrift version : 0.6.1


Thanks
Shammi
--
Best Regards,

Shammi Jayasinghe
Senior Software Engineer; WSO2, Inc.; http://wso2.com,
mobile: +94 71 4493085






Re: Cassandra to Oracle?

2012-01-20 Thread Maxim Potekhin

What makes you think that RDBMS will give you acceptable performance?

I guess you will try to index it to death (because otherwise the ad 
hoc queries won't work well if at all), and at this point you may be 
hit with a performance penalty.


It may be a good idea to interview users and build denormalized views in 
Cassandra, maybe on a separate look-up cluster. A few percent of users 
will be unhappy, but you'll find it hard to do better. I'm talking from 
my experience with an industrial strength RDBMS which doesn't scale very 
well for what you call ad-hoc queries.


Regards,
Maxim




On 1/20/2012 9:28 AM, Brian O'Neill wrote:


I can't remember if I asked this question before, but

We're using Cassandra as our transactional system, and building up 
quite a library of map/reduce jobs that perform data quality analysis, 
statistics, etc.

( 100 jobs now)

But... we are still struggling to provide an ad-hoc query mechanism 
for our users.


To fill that gap, I believe we still need to materialize our data in 
an RDBMS.


Anyone have any ideas?  Better ways to support ad-hoc queries?

Effectively, our users want to be able to select count(distinct Y) 
from X group by Z.

Where Y and Z are arbitrary columns of rows in X.

We believe we can create column families with different key structures 
(using Y an Z as row keys), but some column names we don't know / 
can't predict ahead of time.


Are people doing bulk exports?
Anyone trying to keep an RDBMS in synch in real-time?

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/





Re: Cassandra to Oracle?

2012-01-20 Thread Mohit Anchlia
I think the problem stems from having data in a column that you need
to run an ad-hoc query on and which is not denormalized. In most cases it's
difficult to predict the type of query that would be required.

Another way of solving this could be to index the fields in a search engine.

On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote:
 What makes you think that RDBMS will give you acceptable performance?

 I guess you will try to index it to death (because otherwise the ad hoc
 queries won't work well if at all), and at this point you may be hit with a
 performance penalty.

 It may be a good idea to interview users and build denormalized views in
 Cassandra, maybe on a separate look-up cluster. A few percent of users
 will be unhappy, but you'll find it hard to do better. I'm talking from my
 experience with an industrial strength RDBMS which doesn't scale very well
 for what you call ad-hoc queries.

 Regards,
 Maxim





 On 1/20/2012 9:28 AM, Brian O'Neill wrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite a
 library of map/reduce jobs that perform data quality analysis, statistics,
 etc.
 ( 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y an Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/




Re: Cassandra to Oracle?

2012-01-20 Thread Maxim Potekhin

I certainly agree with "difficult to predict". There is a Danish
proverb which goes "it's difficult to make predictions, especially
about the future".

My point was that it's equally difficult with noSQL and RDBMS.
The latter requires indexing to operate well, and that's a potential
performance problem.

On 1/20/2012 7:55 PM, Mohit Anchlia wrote:

I think the problem stems when you have data in a column that you need
to run adhoc query on which is not denormalized. In most cases it's
difficult to predict the type of query that would be required.

Another way of solving this could be to index the fields in search engine.

On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhinpotek...@bnl.gov  wrote:

What makes you think that RDBMS will give you acceptable performance?

I guess you will try to index it to death (because otherwise the ad hoc
queries won't work well if at all), and at this point you may be hit with a
performance penalty.

It may be a good idea to interview users and build denormalized views in
Cassandra, maybe on a separate look-up cluster. A few percent of users
will be unhappy, but you'll find it hard to do better. I'm talking from my
experience with an industrial strength RDBMS which doesn't scale very well
for what you call ad-hoc queries.

Regards,
Maxim





On 1/20/2012 9:28 AM, Brian O'Neill wrote:


I can't remember if I asked this question before, but

We're using Cassandra as our transactional system, and building up quite a
library of map/reduce jobs that perform data quality analysis, statistics,
etc.
(  100 jobs now)

But... we are still struggling to provide an ad-hoc query mechanism for
our users.

To fill that gap, I believe we still need to materialize our data in an
RDBMS.

Anyone have any ideas?  Better ways to support ad-hoc queries?

Effectively, our users want to be able to select count(distinct Y) from X
group by Z.
Where Y and Z are arbitrary columns of rows in X.

We believe we can create column families with different key structures
(using Y an Z as row keys), but some column names we don't know / can't
predict ahead of time.

Are people doing bulk exports?
Anyone trying to keep an RDBMS in synch in real-time?

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/





Re: ideal cluster size

2012-01-20 Thread Maxim Potekhin

You can also scale not just horizontally but diagonally,
i.e. RAID SSDs and use multicore CPUs. This means that
you'll have the same performance with fewer nodes, making
the cluster far easier to manage.

SSDs by themselves will give you an order of magnitude
improvement on I/O.


On 1/19/2012 9:17 PM, Thorsten von Eicken wrote:

We're embarking on a project where we estimate we will need on the order
of 100 cassandra nodes. The data set is perfectly partitionable, meaning
we have no queries that need to have access to all the data at once. We
expect to run with RF=2 or =3. Is there some notion of ideal cluster
size? Or perhaps asked differently, would it be easier to run one large
cluster or would it be easier to run a bunch of, say, 16 node clusters?
Everything we've done to date has fit into 4-5 node clusters.