Re: Can the same key exist for two rows in two different column families without clashing?

2011-02-02 Thread Ertio Lew
Thanks Stephen for the Great Explanation!



On Wed, Feb 2, 2011 at 4:31 PM, Stephen Connolly 
stephen.alan.conno...@gmail.com wrote:

 On 2 February 2011 10:03, Ertio Lew ertio...@gmail.com wrote:
  Can the same key exist for two rows in two different column families
  without clashing? In other words, does the same key-generation
  algorithm need to be enforced across column families, or can a
  different algorithm be used for each column family?
 
  I have tried this out and it works, but I wanted to know if there may
  be any problems associated with it.
 
  Thanks.
  Ertio Lew
 

 It is a bad analogy for many reasons, but if you replace "row key" with
 "primary key" and "column family" with "table", then you might get an
 answer.

 A better analogy is to think of the following:

 public class Keyspace {

  public final Map<String, Map<String, byte[]>> columnFamily1;

  public final Map<String, Map<String, byte[]>> columnFamily2;

  public final Map<String, Map<String, Map<String, byte[]>>> superColumnFamily3;

 }

 (still not quite correct, but mostly so for our purposes)

 You are asking: given

 Keyspace keyspace;
 String key1 = makeKeyAlg1();
 keyspace.columnFamily1.put(key1, ...);

 String key2 = makeKeyAlg2();
 keyspace.columnFamily2.put(key2, ...);

 when key1.equals(key2), is there a problem?

 They are two separate maps... why would there be?
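
 Stephen's two-maps point can be demonstrated directly; a minimal,
 self-contained sketch with plain HashMaps standing in for column
 families (names and values are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class TwoFamilies {
    // Two independent "column families": separate maps that happen to
    // share the same key type, as in the Keyspace analogy above.
    static final Map<String, Map<String, byte[]>> columnFamily1 = new HashMap<>();
    static final Map<String, Map<String, byte[]>> columnFamily2 = new HashMap<>();

    // Helper: build a one-column "row".
    static Map<String, byte[]> row(String column, byte[] value) {
        Map<String, byte[]> r = new HashMap<>();
        r.put(column, value);
        return r;
    }

    public static void main(String[] args) {
        String key = "user:42";  // the very same key, used in both families
        columnFamily1.put(key, row("name", "alice".getBytes()));
        columnFamily2.put(key, row("city", "oslo".getBytes()));

        // No clash: each family resolves the key independently.
        System.out.println(columnFamily1.get(key).containsKey("name")); // true
        System.out.println(columnFamily2.get(key).containsKey("city")); // true
    }
}
```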

 -Stephen



Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Hey all,

I need to store supercolumns, each with around 8 subcolumns.
All the data for a supercolumn is written at once, and all subcolumns
need to be retrieved together. The data in each subcolumn is not big;
it just contains keys to other rows.

Would it be preferable to have a supercolumn family, or just a standard
column family with all the subcolumn data serialized into a single
column?

Thanks
Aditya Narayan


Re: reduced cached mem; resident set size growth

2011-02-02 Thread Chris Burroughs
On 01/28/2011 09:19 PM, Chris Burroughs wrote:
 Thanks Oleg and Zhu.  I swear that wasn't a new hotspot version when I
 checked, but that's obviously not the case.  I'll update one node to the
 latest as soon as I can and report back.


RSS over 48 hours with java 6 update 23:

http://img716.imageshack.us/img716/5202/u2348hours.png

I'll continue monitoring, but RSS still appears to grow without bound.
Zhu reported a similar problem with Ubuntu 10.04.  While possible, it
would seem extraordinarily unlikely that there is a glibc or kernel
bug affecting us both.



Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Actually, I am trying to use Cassandra to display to users of my
application the list of all Reminders they have set for themselves on
the application.

I need to store rows containing the timeline of daily Reminders set by
the users. The reminders need to be presented to the user in
chronological order, like a news feed. Each reminder has certain tags
associated with it (so that, at times, the user may also choose to see
the reminders filtered by tags, in chronological order).

So I thought of a schema something like this:

- Each Reminder's details may be stored as a separate row in a column family.
- For presenting the timeline of reminders set by a user, the timeline
row of each user would contain the Ids/Keys of the Reminder rows as
the supercolumn names, and the subcolumns inside those supercolumns
would contain the list of tags associated with the particular
reminder. All tags are set at once during the first write. The number
of tags (subcolumns) will be around 8 at most.

Any comments, suggestions and feedback on the schema design are welcome.

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan ady...@gmail.com wrote:
 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan



Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
To reiterate, so I know we're both on the same page, your schema would be 
something like this:


- A column family (as you describe) to store the details of a reminder. One 
reminder per row. The row key would be a TimeUUID.


- A super column family to store the reminders for each user, for each day. The 
row key would be something like: MMDD:user_id. The column names would simply 
be the TimeUUID of the messages. The sub column names would be the tag names of 
the various reminders.


The idea is that you would then get a slice of each row for a user, for a day, 
that would contain only sub column names with the tags you're looking for. Then, 
based upon the column names returned, you'd look up the reminders.


That seems like a solid schema to me.
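
Under the assumption that the date part of Bill's row key is a yyyyMMdd
stamp, the key construction he describes could look like this (a sketch
only; the helper name is hypothetical, and java.util.Date stands in for
a real TimeUUID-based scheme):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ReminderKeys {
    // Builds the per-user, per-day timeline row key described above.
    // The "yyyyMMdd" date part is an assumption about the intended format.
    static String timelineRowKey(Date day, String userId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // fix the zone so keys are stable
        return fmt.format(day) + ":" + userId;
    }

    public static void main(String[] args) {
        System.out.println(timelineRowKey(new Date(0L), "u42")); // 19700101:u42
    }
}
```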

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:

Actually, I am trying to use Cassandra to display to users on my
applicaiton, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com  wrote:

Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferred to have a supercolumn family or just a standard
column family containing all the subcolumns data serialized in single
column(s)  ?

Thanks
Aditya Narayan



Re: Does HH work (or make sense) for counters?

2011-02-02 Thread Sylvain Lebresne
When you create a counter column family, there is an option called
replicate_on_write. When this option is off, an increment is written
to only one node during a write and is not replicated at all. In
particular, it is not hinted to any node.

While unsafe, if you can accept its potential consequences, this
option can make sense if you want to sustain very fast increments,
because replication for counters (unlike for normal writes) implies a read.

Right now, replicate_on_write is off by default, but if you turn it on, HH
should work as expected (otherwise, that would likely be a bug).

Sylvain

On Tue, Feb 1, 2011 at 8:23 PM, Narendra Sharma
narendra.sha...@gmail.comwrote:

 Version: Cassandra 0.7.1 (build from trunk)

 Setup:
 - Cluster of 2 nodes (Say A and B)
 - HH enabled
 - Using the default Keyspace definition in cassandra.yaml
 - Using SuperCounter1 CF

 Client:
 - Using CL of ONE

 I started the two Cassandra nodes, created schema and then shutdown one of
 the instances (say B). Executed counter update and read operations on A with
 CL=ONE. Everything worked fine. All counters were returned with correct
 values. Now started node B, waited for couple of mins. Executed only counter
 read operation on B with CL=ONE. Initially got no counters for any of the
 rows. On second (and subsequent tries) try got counters for only one (same
 row always) out of ten rows.

 After doing one read with CL=QUORUM, reads with CL=ONE started returning
 correct data.

 Thanks,
 Naren






Commit log compaction

2011-02-02 Thread buddhasystem

How often and by what criteria is the commit log compacted/truncated?

Thanks,

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Commit-log-compaction-tp5985221p5985221.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Secondary indexes on super columns

2011-02-02 Thread Sébastien Druon
Hi!

I would like to know whether secondary indexes are foreseen for super columns /
columns inside of super columns?
If yes, will it be in the near future?

Thanks a lot in advance

Sébastien Druon


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
I think you got exactly what I wanted to convey, except for a few
things I want to clarify:

I was thinking of a single row containing all reminders (not split
by day). The history of the reminders needs to be maintained for some
time. After a certain time (say 3 or 6 months) they may be deleted via
the TTL facility.

While presenting the reminders timeline to the user, the latest
supercolumns (around 50 from the start_end) will be picked up and
their subcolumn values will be compared to the tags the user has chosen
to see; corresponding to the filtered subcolumn values (tags), the
rows with the reminder details would be picked up.

Is a supercolumn a preferable choice for this? Could there be a better
schema than this?
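
The filtering step described above (slice the latest supercolumns, keep
the ones whose tag set matches what the user chose) can be sketched in
plain Java; all names are hypothetical, and an in-memory map stands in
for the slice result:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TagFilter {
    // slice: reminderId -> tags, in the order returned by the timeline row.
    // Keeps only reminder ids that carry at least one of the chosen tags.
    static List<String> filterByTags(Map<String, Set<String>> slice,
                                     Set<String> chosenTags) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : slice.entrySet()) {
            for (String tag : e.getValue()) {
                if (chosenTags.contains(tag)) {
                    matches.add(e.getKey());
                    break; // one matching tag is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> slice = new LinkedHashMap<>();
        slice.put("rem1", new HashSet<>(Arrays.asList("work", "urgent")));
        slice.put("rem2", new HashSet<>(Arrays.asList("home")));
        System.out.println(filterByTags(slice, Collections.singleton("work"))); // [rem1]
    }
}
```

The matched ids would then drive the second lookup against the Reminder
details column family.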


-Aditya Narayan



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote:
 To reiterate, so I know we're both on the same page, your schema would be
 something like this:

 - A column family (as you describe) to store the details of a reminder. One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names would be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for a
 day, that would only contain sub column names with the tags you're looking
 for? Then based upon the column names returned, you'd look-up the reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside that supercolumns could contain the list of tags associated
 with particular reminder. All tags set at once during first write. The
 no of tags(subcolumns) will be around 8 maximum.

 Any comments, suggestions and feedback on the schema design are
 requested..

 Thanks
 Aditya Narayan


 On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com  wrote:

 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan




Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
Any time I see/hear "a single row containing all ..." I get nervous. That single 
row is going to reside on a single node. That is potentially a lot of load 
(I don't know your system) for that single node. Why wouldn't you split it by at 
least user? If it won't be a lot of load, then why are you using Cassandra? This 
seems like something that could easily fit into an SQL/relational-style DB. If 
it's too much data (millions of users, 100s of millions of reminders) for a 
standard SQL/relational model, then it's probably too much for a single row.


I'm not familiar with the TTL functionality of Cassandra... sorry, I cannot 
help/comment there; still learning :-)


Yeah, my $0.02 is that this is an effective way to leverage super columns.

Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:

I think you got it exactly what I wanted to convey except for few
things I want to clarify:

I was thinking of a single row containing all reminders (  not split
by day). History of the reminders need to be maintained for some time.
After certain time (say 3 or 6 months) they may be deleted by ttl
facility.

While presenting the reminders timeline to the user, latest
supercolumns like around 50 from the start_end will be picked up and
their subcolumns values will be compared to the Tags user has chosen
to see and, corresponding to the filtered subcolumn values(tags), the
rows of the reminder details would be picked up..

Is supercolumn a preferable choice for this ? Can there be a better
schema than this ?


-Aditya Narayan



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com  wrote:

To reiterate, so I know we're both on the same page, your schema would be
something like this:

- A column family (as you describe) to store the details of a reminder. One
reminder per row. The row key would be a TimeUUID.

- A super column family to store the reminders for each user, for each day.
The row key would be something like: MMDD:user_id. The column names
would simply be the TimeUUID of the messages. The sub column names would be
the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a
day, that would only contain sub column names with the tags you're looking
for? Then based upon the column names returned, you'd look-up the reminders.

That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:


Actually, I am trying to use Cassandra to display to users on my
applicaiton, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are
requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.comwrote:


Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferred to have a supercolumn family or just a standard
column family containing all the subcolumns data serialized in single
column(s)  ?

Thanks
Aditya Narayan





unsubscribe

2011-02-02 Thread JJ


Sent from my iPad


Subscribe

2011-02-02 Thread jj_chandu


Sent from my iPad


Re: Secondary indexes on super columns

2011-02-02 Thread Jonathan Ellis
On Wed, Feb 2, 2011 at 7:37 AM, Sébastien Druon sdr...@spotuse.com wrote:
 Hi!
 I would like to know if secondary indexes are foreseen for super columns /
 columns inside of super columns?

No.

 If yes, will it be in a near future?

Probably not.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: unsubscribe

2011-02-02 Thread Jonathan Ellis
http://wiki.apache.org/cassandra/FAQ#unsubscribe

On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote:


 Sent from my iPad




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: CQL

2011-02-02 Thread Eric Evans
On Wed, 2011-02-02 at 06:57 +, Vivek Mishra wrote:
 I am trying to run CQL from a java client and facing one issue.
 Keyspace is passed as null. When I execute Use Keyspace1 followed by
 my Select query it is still not working.

Can you provide some minimal sample code that demonstrates the problem
you're seeing?

-- 
Eric Evans
eev...@rackspace.com



Re: unsubscribe

2011-02-02 Thread Eric Evans
On Wed, 2011-02-02 at 07:55 -0800, JJ wrote:
 Sent from my iPad

This won't work (even from an iPad), you need to mail
user-unsubscr...@cassandra.apache.org

-- 
Eric Evans
eev...@rackspace.com



Re: cassandra as session store

2011-02-02 Thread Omer van der Horst Jansen
We're using Cassandra as the back end for a home-grown session
management system. That system was originally built back in 2005 using
BerkeleyDB/Java and a data distribution system that used UDP multicast.
Maintenance was becoming increasingly painful.

I wrote a prototype replacement service using Cassandra 0.6 but
decided to wait for the availability of official TTL support in 0.7
before switching over.

The new system has been running in production now for a little over a
week. My main issue is that Cassandra is using far more disk space
than I expected it to. The vast bulk of disk space seems to be used
for *Index.db files. I'm hoping that the 10-day GCGraceSeconds
interval that kicks in on Friday will help me there.

Most of our apps that use this service generate their own session
keys. I assume by hashing and salting a user ID and/or calling
something like java.util.UUID.randomUUID().
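
Both key-generation approaches mentioned above can be sketched with
standard JDK APIs (a sketch of the general idea, not the poster's
actual code; the salt and user id are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class SessionKeys {
    // Option 1: a random type-4 UUID as the session key.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Option 2: a salted SHA-256 hash of the user id, hex-encoded.
    static String hashedKey(String userId, String salt) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest((salt + userId).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(randomKey());                   // opaque 36-char key
        System.out.println(hashedKey("user42", "s3cr3t")); // 64 hex chars
    }
}
```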

My schema is currently very simple -- there's a single CF containing a
(binary) payload column and a column that indicates whether or not the
data has been compressed. We have a few rogue apps that store
humongous XML documents in the session and compression helps to deal
with that. That's also why memcached wasn't going to work in our
scenario.



On Tue, Feb 1, 2011 at 12:18 PM, Kallin Nagelberg
kallin.nagelb...@gmail.com wrote:
 Hey,
 I am currently investigating Cassandra for storing what are
 effectively web sessions. Our production environment has about 10 high
 end servers behind a load balancer, and we'd like to add distributed
 session support. My main concerns are performance, consistency, and
 the ability to create unique session keys. The last thing we would
 want is users picking up each others sessions. After spending a few
 days investigating Cassandra I'm thinking of creating a single
 keyspace with a single super-column-family. The scf would store a few
 standard columns, and a supercolumn of arbitrary session attributes,
 like:

 0s809sdf8s908sf90s: {
     prop1: x,
     created: timestamp,
     lastAccessed: timestamp,
     prop2: y,
     arbitraryProperties: {
         someRandomProperty1: xxyyzz,
         someRandomProperty2: xxyyzz,
         someRandomProperty3: xxyyzz
     }
 }

 Does this sound like a reasonable use case? We are on a tight timeline
 and I'm currently on the fence about getting something like this up
 and running.

 Thanks,
 -Kal



Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
You got me wrong, perhaps..

I am already splitting the rows on a per-user basis, of course; otherwise
the schema wouldn't make sense for my usage. Each row contains only the
*reminders of a single user*, sorted in chronological order. The
reminder Ids are stored as supercolumn names and the subcolumns contain
the tags for each reminder.



On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote:
 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot of
 load (don't know the system) for that single node. Why wouldn't you split it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

 I was thinking of a single row containing all reminders (  not split
 by day). History of the reminders need to be maintained for some time.
 After certain time (say 3 or 6 months) they may be deleted by ttl
 facility.

 While presenting the reminders timeline to the user, latest
 supercolumns like around 50 from the start_end will be picked up and
 their subcolumns values will be compared to the Tags user has chosen
 to see and, corresponding to the filtered subcolumn values(tags), the
 rows of the reminder details would be picked up..

 Is supercolumn a preferable choice for this ? Can there be a better
 schema than this ?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 applicaiton, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside that supercolumns could contain the list of tags associated
 with particular reminder. All tags set at once during first write. The
 no of tags(subcolumns) will be around 8 maximum.

 Any comments, suggestions and feedback on the schema design are
 requested..

 Thanks
 Aditya Narayan


 On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com
  wrote:

 Hey all,

 I need to store supercolumns each with around 8 subcolumns;
 All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together. The data in each subcolumn is not big,
 it just contains keys to other rows.

 Would it be preferred to have a supercolumn family or just a standard
 column family containing all the subcolumns data serialized in single
 column(s)  ?

 Thanks
 Aditya Narayan





changing JMX port in 0.7

2011-02-02 Thread Sasha Dolgy
An instance of Cassandra starts and is listening on the ports described
below:

 Port  Description                              Defined In
 9160  Client traffic via the Thrift protocol   cassandra.yaml
 7000  Cluster traffic via gossip               cassandra.yaml
 8080  Port for monitoring attributes via JMX   cassandra.in.sh

My $CASSANDRA_HOME/conf/cassandra.in.sh has no configuration for JMX.

In $CASSANDRA_HOME/conf/cassandra-env.sh :

JMX_PORT=8080

When I change this, the port change isn't reflected.

I am starting cassandra with:  cassandra -f

I'd like to change the default port...
-sd

-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs

I did not understand before... sorry.

Again, depending upon how many reminders you have for a single user, this could 
be a long/wide row. It really comes down to how many reminders we are 
talking about and how often they will be read/written. While a single row can 
contain millions (maybe more) of columns, that doesn't mean it's a good idea.


I'm working on a logging system with Cassandra and ran into this same type of 
problem. Do I put all of the messages for a single system into a single row 
keyed off that system's name? I quickly came to the answer of no, and now I 
break my row keys into POSIX_timestamp:system, where my timestamps are buckets 
of five minutes. This nicely distributes the load across the nodes in my system.
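
Bill's five-minute bucketing can be sketched as follows (a minimal
sketch; the helper name and system label are hypothetical):

```java
public class LogBuckets {
    static final long BUCKET_SECONDS = 5 * 60;

    // Rounds a POSIX timestamp (seconds) down to its 5-minute bucket and
    // builds the "POSIX_timestamp:system" row key described above.
    static String rowKey(long posixSeconds, String system) {
        long bucket = (posixSeconds / BUCKET_SECONDS) * BUCKET_SECONDS;
        return bucket + ":" + system;
    }

    public static void main(String[] args) {
        // All timestamps within the same 5-minute window share one row key,
        // so writes spread across rows (and thus nodes) over time.
        System.out.println(rowKey(1296666123L, "webapp")); // 1296666000:webapp
    }
}
```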


Bill-

On 02/02/2011 11:18 AM, Aditya Narayan wrote:

You got me wrong perhaps..

I am already splitting the row on per user basis ofcourse, otherwise
the schema wont make sense for my usage. The row contains only
*reminders of a single user* sorted in chronological order. The
reminder Id are stored as supercolumn name and subcolumn contain tags
for that reminder.



On Wed, Feb 2, 2011 at 9:19 PM, William R Speirsbill.spe...@gmail.com  wrote:

Any time I see/hear a single row containing all ... I get nervous. That
single row is going to reside on a single node. That is potentially a lot of
load (don't know the system) for that single node. Why wouldn't you split it
by at least user? If it won't be a lot of load, then why are you using
Cassandra? This seems like something that could easily fit into an
SQL/relational style DB. If it's too much data (millions of users, 100s of
millions of reminders) for a standard SQL/relational model, then it's
probably too much for a single row.

I'm not familiar with the TTL functionality of Cassandra... sorry cannot
help/comment there, still learning :-)

Yea, my $0.02 is that this is an effective way to leverage super columns.

Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:


I think you got it exactly what I wanted to convey except for few
things I want to clarify:

I was thinking of a single row containing all reminders (not split
by day). History of the reminders need to be maintained for some time.
After certain time (say 3 or 6 months) they may be deleted by ttl
facility.

While presenting the reminders timeline to the user, latest
supercolumns like around 50 from the start_end will be picked up and
their subcolumns values will be compared to the Tags user has chosen
to see and, corresponding to the filtered subcolumn values(tags), the
rows of the reminder details would be picked up..

Is supercolumn a preferable choice for this ? Can there be a better
schema than this ?


-Aditya Narayan



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:


To reiterate, so I know we're both on the same page, your schema would be
something like this:

- A column family (as you describe) to store the details of a reminder.
One
reminder per row. The row key would be a TimeUUID.

- A super column family to store the reminders for each user, for each
day.
The row key would be something like: MMDD:user_id. The column names
would simply be the TimeUUID of the messages. The sub column names would
be
the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a
day, that would only contain sub column names with the tags you're
looking
for? Then based upon the column names returned, you'd look-up the
reminders.

That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:


Actually, I am trying to use Cassandra to display to users on my
applicaiton, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are
requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com
  wrote:


Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
 need to be retrieved together.

Re: changing JMX port in 0.7

2011-02-02 Thread Sasha Dolgy
Silly me.  On Windows it has to be changed in
$CASSANDRA_HOME/bin/cassandra.bat

On Wed, Feb 2, 2011 at 5:39 PM, Sasha Dolgy sdo...@gmail.com wrote:

 An instance of Cassandra starts and is listening on the ports described
 below:

  Port  Description                              Defined In
  9160  Client traffic via the Thrift protocol   cassandra.yaml
  7000  Cluster traffic via gossip               cassandra.yaml
  8080  Port for monitoring attributes via JMX   cassandra.in.sh

 My $CASSANDRA_HOME/conf/cassandra.in.sh has no configuration for JMX.

 In $CASSANDRA_HOME/conf/cassandra-env.sh :

 JMX_PORT=8080

 When I change this, the port change isn't reflected.

 I am starting cassandra with:  cassandra -f

 I'd like to change the default port...
 -sd

 --
 Sasha Dolgy
 sasha.do...@gmail.com




-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: changing JMX port in 0.7

2011-02-02 Thread Roshan Dawrani
:-)

On Wed, Feb 2, 2011 at 10:14 PM, Sasha Dolgy sdo...@gmail.com wrote:

 Silly me.  On windows it has to be changed in
 $CASSANDRA_HOME/bin/cassandra.bat


 On Wed, Feb 2, 2011 at 5:39 PM, Sasha Dolgy sdo...@gmail.com wrote:

 An instance of Cassandra starts and is listening on the ports described
 below:
   Port  Description                              Defined In
   9160  Client traffic via the Thrift protocol   cassandra.yaml
   7000  Cluster traffic via gossip               cassandra.yaml
   8080  Port for monitoring attributes via JMX   cassandra.in.sh
 My $CASSANDRA_HOME/conf/cassandra.in.sh has no configuration for JMX.

 In $CASSANDRA_HOME/conf/cassandra-env.sh :

 JMX_PORT=8080

 When I change this, the port change isn't reflected.

 I am starting cassandra with:  cassandra -f

 I'd like to change the default port...
 -sd

 --
 Sasha Dolgy
 sasha.do...@gmail.com




 --
 Sasha Dolgy
 sasha.do...@gmail.com



Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
@Bill
Thank you BIll!

@Cassandra users
Can others also leave their suggestions and comments about my schema, please.
Also my question about whether to use a superColumn or alternatively,
just store the data (that would otherwise be stored in subcolumns) as
serialized into a single column in standard type column family.

Thanks

-Aditya Narayan



On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com wrote:
 I did not understand before... sorry.

 Again, depending upon how many reminders you have for a single user, this
 could be a long/wide row. Again, it really comes down to how many reminders
 are we talking about and how often will they be read/written. While a single
 row can contain millions (maybe more) columns, that doesn't mean it's a good
 idea.

 I'm working on a logging system with Cassandra and ran into this same type
 of problem. Do I put all of the messages for a single system into a single
 row keyed off that system's name? I quickly came to the answer of no and
 now I break my row keys into POSIX_timestamp:system where my timestamps are
 buckets for every 5 minutes. This nicely distributes the load across the
 nodes in my system.

 Bill-
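For what it's worth, the 5-minute bucketing Bill describes can be sketched in a few lines (a Python sketch under my own assumptions; the key format is the "POSIX_timestamp:system" string from his mail):

```python
import time

BUCKET_SECONDS = 5 * 60  # 5-minute buckets, as described above

def bucket_row_key(system_name, ts=None):
    """Build a "POSIX_timestamp:system" row key, rounding the timestamp
    down to the start of its 5-minute bucket."""
    ts = int(time.time() if ts is None else ts)
    bucket = ts - (ts % BUCKET_SECONDS)
    return "%d:%s" % (bucket, system_name)

# Messages 90 seconds apart land in the same row...
print(bucket_row_key("web01", 1296666000))  # 1296666000:web01
print(bucket_row_key("web01", 1296666090))  # 1296666000:web01
# ...while a message 5 minutes later starts a new row.
print(bucket_row_key("web01", 1296666300))  # 1296666300:web01
```

Because the bucket changes every 5 minutes, writes rotate across row keys and thus across the nodes that own them, which is the load-spreading effect described above.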

 On 02/02/2011 11:18 AM, Aditya Narayan wrote:

 You got me wrong perhaps..

 I am already splitting the row on a per-user basis of course, otherwise
 the schema wouldn't make sense for my usage. The row contains only
 *reminders of a single user* sorted in chronological order. The
 reminder Ids are stored as supercolumn names and the subcolumns contain
 tags for that reminder.



 On Wed, Feb 2, 2011 at 9:19 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot
 of
 load (don't know the system) for that single node. Why wouldn't you split
 it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s
 of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

 I was thinking of a single row containing all reminders (not split
 by day). History of the reminders needs to be maintained for some time.
 After a certain time (say 3 or 6 months) they may be deleted via the
 ttl facility.

 While presenting the reminders timeline to the user, the latest
 supercolumns (around 50 from the start/end) will be picked up and
 their subcolumn values will be compared to the tags the user has chosen
 to see, and, corresponding to the filtered subcolumn values (tags), the
 rows of the reminder details would be picked up.

 Is a supercolumn a preferable choice for this? Can there be a better
 schema than this?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would
 be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names
 would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for
 a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
 application, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column
 family.
 -For presenting the timeline of reminders set by user to be presented
 to the user, the timeline row of each user would contain the Id/Key(s)
 (of the Reminder rows) as the supercolumn names and the subcolumns
 inside that 

Re: cassandra as session store

2011-02-02 Thread Jonathan Ellis
Sounds like you're seeing the bug in 0.7.0 preventing deletion of
non-Data.db files (i.e. your Index.db) post-compaction.  This is fixed
for 0.7.1.  (https://issues.apache.org/jira/browse/CASSANDRA-2059)

On Wed, Feb 2, 2011 at 8:15 AM, Omer van der Horst Jansen
ome...@gmail.com wrote:
 We're using Cassandra as the back end for a home-grown session
 management system. That system was originally built back in 2005 using
 BerkeleyDB/Java and a data distribution system that used UDP multicast.
 Maintenance was becoming increasingly painful.

 I wrote a prototype replacement service using Cassandra 0.6 but
 decided to wait for the availability of official TTL support in 0.7
 before switching over.

 The new system has been running in production now for a little over a
 week. My main issue is that Cassandra is using far more disk space
 than I expected it to. The vast bulk of disk space seems to be used
 for *Index.db files. I'm hoping that the 10-day GCGraceSeconds
 interval that kicks in on Friday will help me there.

 Most of our apps that use this service generate their own session
 keys. I assume by hashing and salting a user ID and/or calling
 something like java.util.UUID.randomUUID().

 My schema is currently very simple -- there's a single CF containing a
 (binary) payload column and a column that indicates whether or not the
 data has been compressed. We have a few rogue apps that store
 humongous XML documents in the session and compression helps to deal
 with that. That's also why memcached wasn't going to work in our
 scenario.
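For illustration, the compress-above-a-threshold idea Omer describes can be sketched like this (a Python sketch under my own assumptions; the flag column and the threshold value are hypothetical, not his actual schema):

```python
import zlib

COMPRESS_THRESHOLD = 4096  # hypothetical cutoff; tune for your payloads

def encode_payload(raw):
    """Return (payload, compressed_flag) column values for the session CF,
    compressing only when the payload is large and compression helps."""
    if len(raw) > COMPRESS_THRESHOLD:
        packed = zlib.compress(raw)
        if len(packed) < len(raw):
            return packed, b"1"
    return raw, b"0"

def decode_payload(payload, compressed_flag):
    """Reverse encode_payload on read."""
    return zlib.decompress(payload) if compressed_flag == b"1" else payload

# A "humongous XML document" compresses well and round-trips intact.
doc = b"<session>" + b"x" * 10000 + b"</session>"
payload, flag = encode_payload(doc)
assert flag == b"1" and decode_payload(payload, flag) == doc
```

Storing the flag as its own column, as described above, lets readers decode correctly even if the threshold changes later.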



 On Tue, Feb 1, 2011 at 12:18 PM, Kallin Nagelberg
 kallin.nagelb...@gmail.com wrote:
 Hey,
 I am currently investigating Cassandra for storing what are
 effectively web sessions. Our production environment has about 10 high
 end servers behind a load balancer, and we'd like to add distributed
 session support. My main concerns are performance, consistency, and
 the ability to create unique session keys. The last thing we would
 want is users picking up each others sessions. After spending a few
 days investigating Cassandra I'm thinking of creating a single
 keyspace with a single super-column-family. The scf would store a few
 standard columns, and a supercolumn of arbitrary session attributes,
 like:

 0s809sdf8s908sf90s: {
 prop1: x,
 created : timestamp,
 lastAccessed: timestamp,
 prop2: y,
 arbirtraryProperties : {
     someRandomProperty1:xxyyzz,
     someRandomProperty2:xxyyzz,
     someRandomProperty3:xxyyzz
 }

 Does this sound like a reasonable use case? We are on a tight timeline
 and I'm currently on the fence about getting something up and running
 like this on a tight timeline.

 Thanks,
 -Kal





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


quick shout-out to the riptano/datastax folks!

2011-02-02 Thread Dave Viner
Just a quick shout-out to the riptano folks and becoming part of/forming
DataStax!

Congrats!


Re: unsubscribe

2011-02-02 Thread F. Hugo Zwaal
Can't the mailing list server be changed to treat messages with
"unsubscribe" as the subject as an unsubscribe as well? Otherwise it will
just keep happening, as people simply don't remember or take the time to
find out.


Just my 2 cents...

Groets, Hugo.

On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote:


http://wiki.apache.org/cassandra/FAQ#unsubscribe

On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote:



Sent from my iPad





--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: unsubscribe

2011-02-02 Thread Sasha Dolgy
I'm afraid that would unsubscribe us, no?

On Wed, Feb 2, 2011 at 6:37 PM, F. Hugo Zwaal h...@unitedgames.com wrote:

 Can't the mailinglist server be changed to treat messages with unsubscribe
 as subject as an unsubscribe as well? Otherwise it will just keep happening,
 as people simply don't remember or take time to find out?

 Just my 2 cents...

 Groets, Hugo.




Re: unsubscribe

2011-02-02 Thread Norman Maurer
To make it short: no, it can't.

Bye,
Norman

(ASF Infrastructure Team)

2011/2/2 F. Hugo Zwaal h...@unitedgames.com:
 Can't the mailinglist server be changed to treat messages with unsubscribe
 as subject as an unsubscribe as well? Otherwise it will just keep happening,
 as people simply don't remember or take time to find out?

 Just my 2 cents...

 Groets, Hugo.

 On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote:

 http://wiki.apache.org/cassandra/FAQ#unsubscribe

 On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote:


 Sent from my iPad




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Slow network writes

2011-02-02 Thread ruslan usifov
Hello

I try make little cluster of 2 cassandra (0.7.0) nodes and I make little
test in php:


<?php
define("LIBPATH", "lib/");
define("RECORDSSETCOUNT", 100);

require_once("thrift/Thrift.php");
require_once("thrift/transport/TSocket.php");
require_once("thrift/transport/TFramedTransport.php");
require_once("thrift/protocol/TBinaryProtocol.php");

require_once(LIBPATH."cassandra/Cassandra.php");
require_once(LIBPATH."cassandra/cassandra_types.php");

//-
$transport = new TFramedTransport(new TSocket("10.24.84.4", 9160));
$protocol  = new TBinaryProtocolAccelerated($transport);

$client = new CassandraClient($protocol);
$transport->open();

$client->set_keyspace("test");

//-
$l_row = array("qw" => 12, "as" => 67, "df" => "df", "id" => "uid",
"uid" => 1212);
$l_begin = microtime(true);

for($i = 0; $i < 100; ++$i)
{
    $l_columns = array();

    foreach($l_row as $l_key => $l_value)
    {
        $l_columns[] = new cassandra_Column(array("name" => $l_key, "value"
=> $l_value, "timestamp" => time()));
    };

    $l_supercolumn = new cassandra_SuperColumn(array("name" => $l_row["id"],
"columns" => $l_columns));
    $l_c_or_sc = new cassandra_ColumnOrSuperColumn(array("super_column" =>
$l_supercolumn));
    $l_mutation = new cassandra_Mutation(array("column_or_supercolumn" =>
$l_c_or_sc));

    $client->batch_mutate(array($l_row["uid"] => array('adsdfsdfsd' =>
array($l_mutation))), cassandra_ConsistencyLevel::ONE);

    if($i && !($i % 1000))
    {
        print (microtime(true) - $l_begin)."\n";
        $l_begin = microtime(true);
    };
};

print "done\n";
sleep(20);
?>



When I run this test on the same machine that runs the Cassandra daemon
(ip 10.24.84.4), I get the following results:

0.64255094528198
0.53704404830933
0.4430079460144
0.43299198150635


But when I run the test against the other Cassandra daemon (ip 10.24.84.7),
so that the test and the Cassandra daemon run on separate machines, I get
the following results:
2.4974539279938
2.3667190074921
2.2672221660614
2.3015670776367
2.2397489547729

So in my case performance degrades up to 5 times. Why does this happen, and
how can I solve it? The latency of my network is good; ping gives:

PING 10.24.84.7 (10.24.84.7) 56(84) bytes of data.
64 bytes from 10.24.84.7: icmp_seq=1 ttl=64 time=0.758 ms
64 bytes from 10.24.84.7: icmp_seq=2 ttl=64 time=0.696 ms
64 bytes from 10.24.84.7: icmp_seq=3 ttl=64 time=0.687 ms
64 bytes from 10.24.84.7: icmp_seq=4 ttl=64 time=0.735 ms
64 bytes from 10.24.84.7: icmp_seq=5 ttl=64 time=0.689 ms
64 bytes from 10.24.84.7: icmp_seq=6 ttl=64 time=0.631 ms
^V64 bytes from 10.24.84.7: icmp_seq=7 ttl=64 time=0.379 ms

PS: my system is Linux 2.6.32-311-ec2 #23-Ubuntu SMP Thu Dec 2 11:14:35 UTC
2010 x86_64 GNU/Linux


Re: reduced cached mem; resident set size growth

2011-02-02 Thread Ryan King
On Wed, Feb 2, 2011 at 6:22 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:
 On 01/28/2011 09:19 PM, Chris Burroughs wrote:
 Thanks Oleg and Zhu.  I swear that wasn't a new hotspot version when I
 checked, but that's obviously not the case.  I'll update one node to the
 latest as soon as I can and report back.


 RSS over 48 hours with java 6 update 23:

 http://img716.imageshack.us/img716/5202/u2348hours.png

 I'll continue monitoring but RSS still appears to grow without bounds.
 Zhu reported a similar problem with Ubuntu 10.04.  While possible, it
 would seem extraordinarily unlikely that there is a glibc or kernel
 bug affecting us both.

We're seeing a similar problem with one of our clusters (but over a
longer time scale). It's possible that it's not a leak, but just
fragmentation. Unless you've told it otherwise, the JVM uses glibc's
malloc implementation for off-heap allocations. We're currently
running a test with jemalloc on one node to see if the problem goes
away.

-ryan


Re: reduced cached mem; resident set size growth

2011-02-02 Thread Chris Burroughs
On 02/02/2011 12:49 PM, Ryan King wrote:
 We're seeing a similar problem with one of our clusters (but over a
 longer time scale). It's possible that it's not a leak, but just
 fragmentation. Unless you've told it otherwise, the JVM uses glibc's
 malloc implementation for off-heap allocations. We're currently
 running a test with jemalloc on one node to see if the problem goes
 away.
 

Thanks Ryan.

Is it over a longer time scale because of some action taken to mitigate
the problem, or has it always been that long for you?


Re: reduced cached mem; resident set size growth

2011-02-02 Thread Ryan King
On Wed, Feb 2, 2011 at 10:29 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:
 On 02/02/2011 12:49 PM, Ryan King wrote:
 We're seeing a similar problem with one of our clusters (but over a
 longer time scale). It's possible that it's not a leak, but just
 fragmentation. Unless you've told it otherwise, the JVM uses glibc's
 malloc implementation for off-heap allocations. We're currently
 running a test with jemalloc on one node to see if the problem goes
 away.


 Thanks Ryan.

 Is it over a longer time scale because of some action taken to mitigate
 the problem, or has it always been that long for you?

My guess is that it's a longer timeframe because the cluster is really
low traffic (around 100 qps across 10 nodes).

-ryan


0.7.0 mx4j, get attribute

2011-02-02 Thread Chris Burroughs
I'm using 0.7.0 and experimenting with the new mx4j support.

http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage

Returns a nice pretty HTML page.  For monitoring purposes I would
like to get a single attribute as XML.  The docs [1] describe a
getattribute endpoint, but I have been unable to get anything other
than a blank response from it.  mx4j does not seem to include any
logging for troubleshooting.

Example:
http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks

returns 200 OK with no data.

If anyone could point out what embarrassingly simple mistake I am making
I would be much obliged.


[1] http://mx4j.sourceforge.net/docs/ch05.html


Re: 0.7.0 mx4j, get attribute

2011-02-02 Thread Ryan King
On Wed, Feb 2, 2011 at 10:40 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:
 I'm using 0.7.0 and experimenting with the new mx4j support.

 http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage

  Returns a nice pretty HTML page.  For monitoring purposes I would
  like to get a single attribute as XML.  The docs [1] describe a
  getattribute endpoint, but I have been unable to get anything other
  than a blank response from it.  mx4j does not seem to include any
  logging for troubleshooting.

 Example:
 http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks

 returns 200 OK with no data.

 If anyone could point out what embarrassingly simple mistake I am making
 I would be much obliged.


 [1] http://mx4j.sourceforge.net/docs/ch05.html


Note that many objects in cassandra aren't initialized until they're
used for the first time.

-ryan


Re: unsubscribe

2011-02-02 Thread Janne Jalkanen

How about adding an autosignature with unsubscription info?

/Janne

On Feb 2, 2011, at 19:42 , Norman Maurer wrote:

 To make it short.. No it can't.
 
 Bye,
 Norman
 
 (ASF Infrastructure Team)
 
 2011/2/2 F. Hugo Zwaal h...@unitedgames.com:
 Can't the mailinglist server be changed to treat messages with unsubscribe
 as subject as an unsubscribe as well? Otherwise it will just keep happening,
 as people simply don't remember or take time to find out?
 
 Just my 2 cents...
 
 Groets, Hugo.
 
 On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote:
 
 http://wiki.apache.org/cassandra/FAQ#unsubscribe
 
 On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote:
 
 
 Sent from my iPad
 
 
 
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 



Re: Counters in 0.8 -- conditional?

2011-02-02 Thread Peter Schuller
 I'm looking at
 http://wiki.apache.org/cassandra/Counters

 So, the counter feature -- it doesn't seem to count rows based on criteria,
 such as an index condition. Is that correct?

Yes, it's just about supporting counters in and of themselves (which
is non-trivial in a distributed system). It is unrelated to counting
rows or columns, unless the application happens to use them for that.

-- 
/ Peter Schuller
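For intuition only, here is a toy model of why counters in a distributed system are non-trivial but doable: each replica keeps a local sub-count and the logical value is the sum, so increments avoid a cross-node read-modify-write. (This is an illustration of the general technique, not Cassandra's actual counter design.)

```python
from collections import defaultdict

class ShardedCounter:
    """Toy sharded counter: each replica increments only its own shard,
    so no increment ever needs to read another replica's state."""

    def __init__(self):
        self.shards = defaultdict(int)  # replica_id -> local count

    def increment(self, replica_id, delta=1):
        self.shards[replica_id] += delta

    def value(self):
        # The logical counter value is the sum over all shards.
        return sum(self.shards.values())

c = ShardedCounter()
c.increment("node-a")
c.increment("node-b", 5)
assert c.value() == 6
```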


Re: Commit log compaction

2011-02-02 Thread buddhasystem

Thank you. So what exactly is the condition that causes the older commit log
files to actually be removed? I observe that indeed they are rotated out
when the threshold is reached, but then new ones are placed in the directory
and the older ones are still there.

Thanks,
Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Commit-log-compaction-tp5985221p5986399.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Counters in 0.8 -- conditional?

2011-02-02 Thread buddhasystem

Thanks. Just wanted to note that counting the number of rows where foo=bar is
a fairly ubiquitous task in db applications. With big data, shipping all of
that data to the client just to count something isn't optimal at all.

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Counters-in-0-8-conditional-tp5985214p5986442.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: py_stress error in Cassandra 0.7

2011-02-02 Thread Brandon Williams
As the README suggests, you need to run ant gen-thrift-py first.

On Wed, Feb 2, 2011 at 2:53 PM, shan...@accenture.com wrote:

 Hi,



 I am trying to get the py_stress to work in Cassandra 0.7. I keep getting
 this error:



 ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python
 stress.py

 Traceback (most recent call last):

   File stress.py, line 520, in module

 make_keyspaces()

   File stress.py, line 185, in make_keyspaces

 cfams = [CfDef(keyspace='Keyspace1', name='Standard1',
 column_metadata=colms),

 NameError: global name 'CfDef' is not defined



 Any suggestions?



 Thanks,

 Shan (Susie) Lu, Analyst

 Accenture Technology Labs - Silicon Valley

 cell  +1 425.749.2546

 email shan...@accenture.com



 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise private information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the email by you is prohibited.



Re: quick shout-out to the riptano/datastax folks!

2011-02-02 Thread Jonathan Ellis
Thanks, Dave!

On Wed, Feb 2, 2011 at 9:17 AM, Dave Viner davevi...@gmail.com wrote:
 Just a quick shout-out to the riptano folks and becoming part of/forming
 DataStax!
 Congrats!



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Slow network writes

2011-02-02 Thread Jonathan Ellis
You need to use multiple threads to measure throughput.  I strongly
recommend starting with contrib/stress from the source distribution,
which is multithreaded out of the box.
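A back-of-the-envelope model of the numbers in this thread (my own simplification, not a benchmark) shows why: each synchronous Thrift call pays the server's service time plus one network round trip, so a single client thread is capped at 1 / (service + rtt) calls per second, while concurrent threads keep more requests in flight and multiply that ceiling (until the server saturates).

```python
def ops_per_sec(service_ms, rtt_ms, threads=1):
    """Throughput ceiling for synchronous request/response clients:
    one call per (service + round-trip) interval, per thread."""
    return threads * 1000.0 / (service_ms + rtt_ms)

single_local  = ops_per_sec(0.4, 0.0)         # ~0.4 ms/call seen locally
single_remote = ops_per_sec(0.4, 0.7)         # plus the ~0.7 ms ping
pooled_remote = ops_per_sec(0.4, 0.7, threads=10)

# One thread over the network is much slower than one local thread,
# but ten concurrent threads more than recover the difference.
assert single_remote < single_local < pooled_remote
```

This is why adding machines makes a single-threaded test *slower*: the per-call round trip grows, and only concurrency hides it.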

On Wed, Feb 2, 2011 at 9:43 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Hello

 I try make little cluster of 2 cassandra (0.7.0) nodes and I make little
 test in php:


 <?php
 define("LIBPATH", "lib/");
 define("RECORDSSETCOUNT", 100);

 require_once("thrift/Thrift.php");
 require_once("thrift/transport/TSocket.php");
 require_once("thrift/transport/TFramedTransport.php");
 require_once("thrift/protocol/TBinaryProtocol.php");

 require_once(LIBPATH."cassandra/Cassandra.php");
 require_once(LIBPATH."cassandra/cassandra_types.php");

 //-
 $transport = new TFramedTransport(new TSocket("10.24.84.4", 9160));
 $protocol  = new TBinaryProtocolAccelerated($transport);

 $client = new CassandraClient($protocol);
 $transport->open();

 $client->set_keyspace("test");

 //-
 $l_row = array("qw" => 12, "as" => 67, "df" => "df", "id" => "uid",
 "uid" => 1212);
 $l_begin = microtime(true);

 for($i = 0; $i < 100; ++$i)
 {
     $l_columns = array();

     foreach($l_row as $l_key => $l_value)
     {
         $l_columns[] = new cassandra_Column(array("name" => $l_key, "value"
 => $l_value, "timestamp" => time()));
     };

     $l_supercolumn = new cassandra_SuperColumn(array("name" => $l_row["id"],
 "columns" => $l_columns));
     $l_c_or_sc = new cassandra_ColumnOrSuperColumn(array("super_column" =>
 $l_supercolumn));
     $l_mutation = new cassandra_Mutation(array("column_or_supercolumn" =>
 $l_c_or_sc));

     $client->batch_mutate(array($l_row["uid"] => array('adsdfsdfsd' =>
 array($l_mutation))), cassandra_ConsistencyLevel::ONE);

     if($i && !($i % 1000))
     {
         print (microtime(true) - $l_begin)."\n";
         $l_begin = microtime(true);
     };
 };

 print "done\n";
 sleep(20);
 ?>



 When I run this test on the same machine that runs the Cassandra daemon
 (ip 10.24.84.4), I get the following results:

 0.64255094528198
 0.53704404830933
 0.4430079460144
 0.43299198150635


 But when I run the test against the other Cassandra daemon (ip 10.24.84.7),
 so that the test and the Cassandra daemon run on separate machines, I get
 the following results:
 2.4974539279938
 2.3667190074921
 2.2672221660614
 2.3015670776367
 2.2397489547729

 So in my case performance degrades up to 5 times. Why does this happen, and
 how can I solve it? The latency of my network is good; ping gives:

 PING 10.24.84.7 (10.24.84.7) 56(84) bytes of data.
 64 bytes from 10.24.84.7: icmp_seq=1 ttl=64 time=0.758 ms
 64 bytes from 10.24.84.7: icmp_seq=2 ttl=64 time=0.696 ms
 64 bytes from 10.24.84.7: icmp_seq=3 ttl=64 time=0.687 ms
 64 bytes from 10.24.84.7: icmp_seq=4 ttl=64 time=0.735 ms
 64 bytes from 10.24.84.7: icmp_seq=5 ttl=64 time=0.689 ms
 64 bytes from 10.24.84.7: icmp_seq=6 ttl=64 time=0.631 ms
 ^V64 bytes from 10.24.84.7: icmp_seq=7 ttl=64 time=0.379 ms

 PS: my system is Linux 2.6.32-311-ec2 #23-Ubuntu SMP Thu Dec 2 11:14:35 UTC
 2010 x86_64 GNU/Linux







-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Can I have some more feedback about my schema, perhaps somewhat more
critical/harsh?


Thanks again,
Aditya Narayan

On Wed, Feb 2, 2011 at 10:27 PM, Aditya Narayan ady...@gmail.com wrote:
 @Bill
 Thank you BIll!

 @Cassandra users
 Can others also leave their suggestions and comments about my schema, please.
 Also my question about whether to use a superColumn or alternatively,
 just store the data (that would otherwise be stored in subcolumns) as
 serialized into a single column in standard type column family.

 Thanks

 -Aditya Narayan



 On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com 
 wrote:
 I did not understand before... sorry.

 Again, depending upon how many reminders you have for a single user, this
 could be a long/wide row. Again, it really comes down to how many reminders
 are we talking about and how often will they be read/written. While a single
 row can contain millions (maybe more) columns, that doesn't mean it's a good
 idea.

 I'm working on a logging system with Cassandra and ran into this same type
 of problem. Do I put all of the messages for a single system into a single
 row keyed off that system's name? I quickly came to the answer of no and
 now I break my row keys into POSIX_timestamp:system where my timestamps are
 buckets for every 5 minutes. This nicely distributes the load across the
 nodes in my system.

 Bill-

 On 02/02/2011 11:18 AM, Aditya Narayan wrote:

 You got me wrong perhaps..

  I am already splitting the row on a per-user basis of course, otherwise
  the schema wouldn't make sense for my usage. The row contains only
  *reminders of a single user* sorted in chronological order. The
  reminder Ids are stored as supercolumn names and the subcolumns contain
  tags for that reminder.



 On Wed, Feb 2, 2011 at 9:19 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 Any time I see/hear a single row containing all ... I get nervous. That
 single row is going to reside on a single node. That is potentially a lot
 of
 load (don't know the system) for that single node. Why wouldn't you split
 it
 by at least user? If it won't be a lot of load, then why are you using
 Cassandra? This seems like something that could easily fit into an
 SQL/relational style DB. If it's too much data (millions of users, 100s
 of
 millions of reminders) for a standard SQL/relational model, then it's
 probably too much for a single row.

 I'm not familiar with the TTL functionality of Cassandra... sorry cannot
 help/comment there, still learning :-)

 Yea, my $0.02 is that this is an effective way to leverage super columns.

 Bill-

 On 02/02/2011 10:43 AM, Aditya Narayan wrote:

 I think you got it exactly what I wanted to convey except for few
 things I want to clarify:

  I was thinking of a single row containing all reminders (not split
  by day). History of the reminders needs to be maintained for some time.
  After a certain time (say 3 or 6 months) they may be deleted via the
  ttl facility.

  While presenting the reminders timeline to the user, the latest
  supercolumns (around 50 from the start/end) will be picked up and
  their subcolumn values will be compared to the tags the user has chosen
  to see, and, corresponding to the filtered subcolumn values (tags), the
  rows of the reminder details would be picked up.

  Is a supercolumn a preferable choice for this? Can there be a better
  schema than this?


 -Aditya Narayan



 On Wed, Feb 2, 2011 at 8:54 PM, William R Speirsbill.spe...@gmail.com
  wrote:

 To reiterate, so I know we're both on the same page, your schema would
 be
 something like this:

 - A column family (as you describe) to store the details of a reminder.
 One
 reminder per row. The row key would be a TimeUUID.

 - A super column family to store the reminders for each user, for each
 day.
 The row key would be something like: MMDD:user_id. The column names
 would simply be the TimeUUID of the messages. The sub column names
 would
 be
 the tag names of the various reminders.

 The idea is that you would then get a slice of each row for a user, for
 a
 day, that would only contain sub column names with the tags you're
 looking
 for? Then based upon the column names returned, you'd look-up the
 reminders.

 That seems like a solid schema to me.

 Bill-

 On 02/02/2011 09:37 AM, Aditya Narayan wrote:

 Actually, I am trying to use Cassandra to display to users on my
  application, the list of all Reminders set by themselves for
 themselves, on the application.

 I need to store rows containing the timeline of daily Reminders put by
 the users, for themselves, on application. The reminders need to be
 presented to the user in a chronological order like a news feed.
 Each reminder has got certain tags associated with it(so that, at
 times, user may also choose to see the reminders filtered by tags in
 chronological order).

 So I thought of a schema something like this:-

 -Each Reminder details may be stored as separate rows in column
 family.
 -For presenting the timeline 

Re: Commit log compaction

2011-02-02 Thread Jonathan Ellis
On Wed, Feb 2, 2011 at 12:29 PM, buddhasystem potek...@bnl.gov wrote:

 Thank you. So what is exactly the condition that causes the older commit log
 files to actually be removed?

Commit log segments (whose size is controllable via the
commitlog_rotation_threshold_in_mb option) are eligible for removal
when they do not contain any data that has yet to be flushed to
memtables.
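A toy model of that rule (my own sketch, not Cassandra's code): a segment tracks which column families still have unflushed writes in it, and becomes removable only once that set is empty.

```python
class Segment:
    """Toy commit log segment: removable once every column family
    with data in this segment has flushed that data to an SSTable."""

    def __init__(self):
        self.dirty_cfs = set()  # CFs with unflushed writes in this segment

    def write(self, cf):
        self.dirty_cfs.add(cf)

    def on_flush(self, cf):
        # Called when a CF's memtable covering this segment is flushed.
        self.dirty_cfs.discard(cf)

    def removable(self):
        return not self.dirty_cfs

seg = Segment()
seg.write("Users")
seg.write("Sessions")
seg.on_flush("Users")
assert not seg.removable()   # Sessions still has unflushed data here
seg.on_flush("Sessions")
assert seg.removable()       # now the segment can be deleted
```

This is why old segments linger: a single rarely-written CF with one unflushed row in a segment keeps the whole segment alive.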

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Pig not reading all cassandra data

2011-02-02 Thread Matthew E. Kennedy

I noticed in the jobtracker log that when the pig job kicks off, I get the 
following info message:

2011-02-02 09:13:07,269 INFO org.apache.hadoop.mapred.JobInProgress: Input size 
for job job_201101241634_0193 = 0. Number of splits = 1

So I looked at the job.split file that is created for the Pig job and compared 
it to the job.split file created for the map-reduce job.  The map reduce file 
contains an entry for each split, whereas the  job.split file for the Pig job 
contains just the one split.

I added some code to ColumnFamilyInputFormat to log what it sees as it 
creates input splits for the Pig jobs, and the call to getSplits() appears 
to be returning the correct list of splits.  I can't figure out, though, 
where it goes wrong when the splits should be written to the job.split file.

Does anybody know the specific class responsible for creating that file in a 
Pig job, and why it might be affected by using the pig CassandraStorage module?

Is anyone else successfully running Pig jobs against a 0.7 cluster?

Thanks,
Matt

Does an unused ColumnFamily consume resources better used by live CFs?

2011-02-02 Thread David Dabbs
We have an old "test" CF and I was wondering if it might be taking resources
better used by our app's CFs.

 

Thank you.

 

David

 



Re: how to change compare_with

2011-02-02 Thread Vedarth Kulkarni
I tried 'help update column family;'.
It gave me:

valid attributes are:
- column_type: Super or Standard
- comment: Human-readable column family description. Any string is
acceptable
- rows_cached: Number or percentage of rows to cache
- row_cache_save_period: Period with which to persist the row cache, in
seconds
- keys_cached: Number or percentage of keys to cache
- key_cache_save_period: Period with which to persist the key cache, in
seconds
- read_repair_chance: Probability (0.0-1.0) with which to perform read
repairs on CL.ONE reads
- gc_grace: Discard tombstones after this many seconds
- column_metadata: null
- memtable_operations: Flush memtables after this many operations
- memtable_throughput: ... or after this many bytes have been written
- memtable_flush_after: ... or after this many seconds
- default_validation_class: null
- min_compaction_threshold: Avoid minor compactions of less than this
number of sstable files
- max_compaction_threshold: Compact no more than this number of sstable
files at once
- column_metadata: Metadata which describes columns of column family.
Supported format is [{ k:v, k:v, ... }, { ... }, ...]
Valid attributes: column_name, validation_class (see comparator),
  index_type (integer), index_name.

So what is to be used?
And also if possible please provide information on how do that in Java using
Hector.
Thank you.


Vedarth Kulkarni,
TYBSc (Computer Science).



On Thu, Feb 3, 2011 at 2:58 AM, Jonathan Ellis jbel...@gmail.com wrote:

 On Wed, Feb 2, 2011 at 12:48 PM, Vedarth Kulkarni vedar...@gmail.com
 wrote:
  Hello there,
 
  I am using Cassandra 0.7. Is there any way to change the 'compare_with'
 from
  my program ?, I am using Hector and I am programming in Java.

 Yes.

  Is it possible to change it from the bin/cassandra-cli ?

 Yes.  help update column family;

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
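
For example, updating one of the attributes listed above from the
0.7 cassandra-cli looks like this (adjust the column family name and
values to your own schema):

```
update column family Standard1 with rows_cached=10000 and keys_cached=50000;
```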



Cassandra memory needs

2011-02-02 Thread Oleg Proudnikov
Hi All,

I am trying to understand the relationship between data set/SSTable(s) size and
Cassandra heap. 

Q1. Here is the memory calc from the Wiki:

For a rough rule of thumb, Cassandra's internal datastructures will require
about  memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal 
caches. 

This formula does not depend on the data set size. Does this mean that provided
Cassandra has sufficient disk space to accommodate growing data set,  it can run
in fixed memory for bulk load? Am I right that memory impact of compacting
increasing SSTable sizes is capped by a parameter 
in_memory_compaction_limit_in_mb?

Q2. What would I need to monitor to predict ahead the need to double the number
of nodes assuming sufficient storage per node? Is there a simple rule of thumb
saying that for a heap of size X a node can handle SSTable of size Y? I do
realize that the i/o and CPU play a role here but could that be reduced to a
factor: Y = f(X) * z where z is 1 for a specified server config. I am assuming
random partitioner and a fixed number of write clients.

Q3. Does the formula account for deserialization during reads? What does 1G
represent? 

Thank you very much,
Oleg




Re: Does an unused ColumnFamily consume resources better used by live CFs?

2011-02-02 Thread Jonathan Ellis
Not if it's been flushed since the last time it was written to.

On Wed, Feb 2, 2011 at 1:34 PM, David Dabbs dmda...@gmail.com wrote:
 We have an old “test” CF and I was wondering if it might be taking resources
 better used by our app’s CFs.



 Thank you.



 David





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Slow network writes

2011-02-02 Thread Oleg Proudnikov
Is it possible that the key 1212 maps to the first node? I am assuming RF=1.
You could try random keys to test this theory...

Oleg




RE: py_stress error in Cassandra 0.7

2011-02-02 Thread shan.lu
I tried running with the 0.7 version and get this error:

Buildfile: build.xml

gen-thrift-py:
 [echo] Generating Thrift Python code from 
/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift 
 [exec] 
[WARNING:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] 
Constant strings should be quoted: ConsistencyLevel.ONE
 [exec]
 [exec]
 [exec] 
[FAILURE:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] 
type error: const consistency_level was declared as enum

BUILD FAILED
/home/ubuntu/apache-cassandra-0.7.0/build.xml:250: exec returned: 1

Total time: 0 seconds

Thank you,
Shan (Susie) Lu, Accenture Tech Labs SV
email shan...@accenture.com

From: Brandon Williams [mailto:dri...@gmail.com]
Sent: Wednesday, February 02, 2011 1:18 PM
To: user@cassandra.apache.org
Subject: Re: py_stress error in Cassandra 0.7

As the README suggests, you need to run ant gen-thrift-py first.
On Wed, Feb 2, 2011 at 2:53 PM, shan.lu@accenture.com wrote:
Hi,

I am trying to get the py_stress to work in Cassandra 0.7. I keep getting this 
error:

ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python stress.py
Traceback (most recent call last):
  File stress.py, line 520, in module
make_keyspaces()
  File stress.py, line 185, in make_keyspaces
cfams = [CfDef(keyspace='Keyspace1', name='Standard1', 
column_metadata=colms),
NameError: global name 'CfDef' is not defined

Any suggestions?

Thanks,
Shan (Susie) Lu,  Analyst
Accenture Technology Labs - Silicon Valley
cell  +1 425.749.2546
email shan...@accenture.com


This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise private information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the email by you is prohibited.



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise private information.  If you have received it in 
error, please notify the sender immediately and delete the original.  Any other 
use of the email by you is prohibited.


Re: py_stress error in Cassandra 0.7

2011-02-02 Thread Jonathan Ellis
That means you have an old version of the Thrift compiler.

On Wed, Feb 2, 2011 at 1:54 PM,  shan...@accenture.com wrote:
 I tried running with the 0.7 version and get this error:



 Buildfile: build.xml



 gen-thrift-py:

  [echo] Generating Thrift Python code from
 /home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift 

  [exec]
 [WARNING:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375]
 Constant strings should be quoted: ConsistencyLevel.ONE

  [exec]

  [exec]

  [exec]
 [FAILURE:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375]
 type error: const consistency_level was declared as enum



 BUILD FAILED

 /home/ubuntu/apache-cassandra-0.7.0/build.xml:250: exec returned: 1



 Total time: 0 seconds



 Thank you,

 Shan (Susie) Lu, Accenture Tech Labs SV

 email shan...@accenture.com



 From: Brandon Williams [mailto:dri...@gmail.com]
 Sent: Wednesday, February 02, 2011 1:18 PM
 To: user@cassandra.apache.org
 Subject: Re: py_stress error in Cassandra 0.7



 As the README suggests, you need to run ant gen-thrift-py first.

 On Wed, Feb 2, 2011 at 2:53 PM, shan...@accenture.com wrote:

 Hi,



 I am trying to get the py_stress to work in Cassandra 0.7. I keep getting
 this error:



 ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python
 stress.py

 Traceback (most recent call last):

   File stress.py, line 520, in module

     make_keyspaces()

   File stress.py, line 185, in make_keyspaces

     cfams = [CfDef(keyspace='Keyspace1', name='Standard1',
 column_metadata=colms),

 NameError: global name 'CfDef' is not defined



 Any suggestions?



 Thanks,

 Shan (Susie) Lu,  Analyst

 Accenture Technology Labs - Silicon Valley

 cell  +1 425.749.2546

 email shan...@accenture.com



 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise private information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the email by you is prohibited.



 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise private information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the email by you is prohibited.



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: py_stress error in Cassandra 0.7

2011-02-02 Thread Oleg Proudnikov
Have you generated Cassandra Thrift interface?

You will need to install Thrift first: 

http://wiki.apache.org/cassandra/InstallThrift

Then, in the interface directory under Cassandra's home you can run

thrift --gen py cassandra.thrift

If the above does not install the generated cassandra thrift module, copy it
manually to the site-packages directory of your Python installation. On my
server it is in

/usr/lib/python/site-packages

I hope this helps...

Oleg




Re: Counters in 0.8 -- conditional?

2011-02-02 Thread Peter Schuller
 Thanks. Just wanted to note that counting the number of rows where foo=bar is
 a fairly ubiquitous task in db applications. In case of big data,
 trafficking all these data to client just to count something isn't optimal
 at all.

You can ask Cassandra to do the counting, but the cost is still going
to involve reading the data on the Cassandra end. Hence, O(n) rather
than O(1). (It would obviously be nice if counts could be done O(1),
but it's not trivial to implement or obvious how to do it in order for
it to be generally useful. Even non-distributed databases like
PostgreSQL have issues with that.)

-- 
/ Peter Schuller


Re: Slow network writes

2011-02-02 Thread ruslan usifov
2011/2/3 Oleg Proudnikov ol...@cloudorange.com

 Is it possible that the key 1212 maps to the first node? I am assuming
 RF=1.
 You could try random keys to test this theory...


Yes, you are right: 1212 goes to the first node. I distributed tokens as
described in Operations (http://wiki.apache.org/cassandra/Operations):

0
85070591730234615865843651857942052864


So the delay in my second experiment (where I got a big delay on insert)
appears to be a result of communication delay between the nodes?


Re: Counters in 0.8 -- conditional?

2011-02-02 Thread buddhasystem

Thanks. Yes I know it's by no means trivial. I thought in case there was an
index on the column on which I want to place condition, the index machinery
itself can do the counting (i.e. when the index is updated, the counter is
incremented). It doesn't seem too orthogonal to the current implementation,
at least from my very limited experience.

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Counters-in-0-8-conditional-tp5985214p5986871.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Slow network writes

2011-02-02 Thread Oleg Proudnikov
ruslan usifov ruslan.usifov at gmail.com writes:

 
 
 2011/2/3 Oleg Proudnikov olegp at cloudorange.com
 Is it possible that the key 1212 maps to the first node? I am assuming RF=1.
 You could try random keys to test this theory...
 
 
 Yes, you are right, 1212 goes to the first node. I distribute tokens like
 described in Operations (http://wiki.apache.org/cassandra/Operations):

 0
 85070591730234615865843651857942052864

 So the delay in my second experiment (where I got a big delay on insert)
 appears to be a result of communication delay between the nodes?
 

That was the theory, assuming you are using replication factor of 1.

It is difficult to say where the key falls just by looking at the ring - the
random partitioner could throw this key onto either node. After writing 1
million rows you could actually see some SSTables in the data directory on
one node and none on the other.
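
Assuming RandomPartitioner, where '1212' lands can be checked directly: the
key's token is its MD5 hash interpreted as a non-negative integer, and a key
belongs to the first node walking clockwise whose token is >= the key's token,
wrapping around the ring. A minimal sketch (not Cassandra's actual classes,
and assuming RF=1):

```python
import hashlib
from bisect import bisect_left

def token_for(key: bytes) -> int:
    # RandomPartitioner-style token: MD5 digest as a non-negative
    # integer reduced to the [0, 2**127) token space.
    return int.from_bytes(hashlib.md5(key).digest(), "big") % (2 ** 127)

def owner(key: bytes, node_tokens: list[int]) -> int:
    # The key is owned by the first node whose token is >= the key's
    # token, wrapping around the ring.
    tokens = sorted(node_tokens)
    i = bisect_left(tokens, token_for(key))
    return tokens[i % len(tokens)]

# The two tokens from this thread: 0 and 2**127 // 2.
ring = [0, 85070591730234615865843651857942052864]
print(owner(b"1212", ring))
```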




Re: Cassandra memory needs

2011-02-02 Thread Peter Schuller
 I am trying to understand the relationship between data set/SSTable(s) size 
 and
 Cassandra heap.

http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 For a rough rule of thumb, Cassandra's internal datastructures will require
 about  memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal 
 caches.

 This formula does not depend on the data set size. Does this mean that 
 provided
 Cassandra has sufficient disk space to accommodate growing data set,  it can 
 run
 in fixed memory for bulk load?

No, for reasons that I hope are covered at the above URL. The
calculation you refer to has more to with how you tweak your memtables
for performance which is only loosely coupled to data size.

The cost of index sampling and bloom filters are very directly related
to database size however (see wiki url). It is essentially a
trade-off; where a typical b-tree database would simply start
demanding additional seeks as the index size grows larger, Cassandra
does limit the seeks but instead has a stricter memory requirements.
If you're only looking to smack huge amounts of data into the database
without every reading them, or reading them very very rarely, it is
sub-optimal from a memory perspective.

Note though that these are memory requirements per row key, rather
than per byte of data.

Am I right that memory impact of compacting
 increasing SSTable sizes is capped by a parameter
 in_memory_compaction_limit_in_mb?

That limits the amount of memory allocated for individual row
compactions yes, and will put a cap on the GC pressure generated in
addition to allowing huge rows to be compacted independently of heap
size.

 Q2. What would I need to monitor to predict ahead the need to double the 
 number
 of nodes assuming sufficient storage per node? Is there a simple rule of thumb
 saying that for a heap of size X a node can handle SSTable of size Y? I do
 realize that the i/o and CPU play a role here but could that be reduced to a
 factor: Y = f(X) * z where z is 1 for a specified server config. I am assuming
 random partitioner and a fixed number of write clients.

Disregarding memtable tweaking that will have more to do with
throughput, the most important factor in terms of scaling memory
requirements w.r.t. data size, is the number of row keys and the
length of the average row.

I recommend just empirically inserting say 10 million rows with
realistic row keys and observing the size of the resulting index and
bloom filter files. Take into account to what extent compaction will
cause memory usage to temporarily spike.

Also take into account that if you plan on having very large rows, the
indexes will begin having more than one entry per row (see
column_index_size_in_kb in the configuration).

If your use-case is somehow truly extreme in the sense of huge data
sets with little to no requirement on query efficiency, the per row
key costs can be cut down by adjusting index_interval in the
configuration to affect the cost of index sampling, and the target
false positive rates of bloom filters could be adjusted (in source,
not conf) to cut down on that. But really, that would be an unusual
thing to do I think and I wouldn't recommend touching that without
careful consideration and deep understanding of your expected
use-case.

 Q3. Does the formula account for deserialization during reads? What does 1G
 represent?

I don't know the background of that particular wiki statement, but my
guess is that the 1G is just a general gut-feel, good-to-have base
memory size rather than something very specifically calculated.

-- 
/ Peter Schuller
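
The wiki rule of thumb quoted above can be turned into a quick estimate (a
sketch only; per-row-key index sampling and bloom filter overhead must be
added on top, as discussed above):

```python
def rough_heap_mb(memtable_throughput_mb: int,
                  hot_column_families: int,
                  cache_mb: int = 0) -> int:
    """Wiki rule of thumb: memtable_throughput_in_mb * 3 * number of
    hot CFs + 1G base + internal caches. Returns an estimate in MB."""
    return memtable_throughput_mb * 3 * hot_column_families + 1024 + cache_mb

# e.g. 64MB memtables, 4 hot CFs, ~256MB of key/row cache:
print(rough_heap_mb(64, 4, 256))  # -> 2048 MB
```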


Re: how to change compare_with

2011-02-02 Thread Tyler Hobbs
I think Jonathan misspoke.

You cannot change the 'compare_with' attribute of an existing column
family.  The solution is to create a new column family with the data type
that you need.

See 'help create column family;'

-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library


Re: Cassandra memory needs

2011-02-02 Thread buddhasystem

Oleg,

I just wanted to add that I confirmed the importance of that rule of thumb
the hard way. I created two extra CFs and was able to reliably crash the
nodes during writes. I guess for the final setting I'll rely on results of
my testing.

But it's also important to not cause the swap death of your machine (i.e.
when you go too high on JVM memory).

Regards

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-needs-tp5986663p5986911.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: how to change compare_with

2011-02-02 Thread Jonathan Ellis
On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote:
 I think Jonathan misspoke.

I thought I was mistaken, but I was wrong. :)

 You cannot change the 'compare_with' attribute of an existing column
 family.

You can, but it's up to you to make sure that the new type makes
sense.  Most frequently, you see this when changing from BytesType to
something more structured.

(If you screw up and specify a compare_with that is nonsensical for
your data, just change it back.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


How do I get 0.7.1?

2011-02-02 Thread buddhasystem

Thanks.

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5986927.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Slow network writes

2011-02-02 Thread buddhasystem

Jonathan,

where do I find that contrib/stress?

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Slow-network-writes-tp5985757p5986937.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: How do I get 0.7.1?

2011-02-02 Thread Sal Fuentes
I don't think 0.7.1 is out yet, so you'll have to wait.

On Wed, Feb 2, 2011 at 3:17 PM, buddhasystem potek...@bnl.gov wrote:


 Thanks.

 Maxim

 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5986927.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.




-- 
Salvador Fuentes Jr.


Re: how to change compare_with

2011-02-02 Thread Stu Hood
Not only does the type need to make sense, but it also needs to sort in
exactly the same order as the previous type did... in which case there would
be no reason to change it?

We should probably just say no, you cannot do this, and explicitly prevent
it.

On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote:
  I think Jonathan misspoke.

 I thought I was mistaken, but I was wrong. :)

  You cannot change the 'compare_with' attribute of an existing column
  family.

 You can, but it's up to you to make sure that the new type makes
 sense.  Most frequently, you see this when changing from BytesType to
 something more structured.

 (If you screw up and specify a compare_with that is nonsensical for
 your data, just change it back.)

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: How do I get 0.7.1?

2011-02-02 Thread Stephen Connolly
the take #2 vote was canceled due to a couple of issues... take #3 had not
been called yet

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 2 Feb 2011 23:29, Sal Fuentes fuente...@gmail.com wrote:


Re: how to change compare_with

2011-02-02 Thread Jonathan Ellis
Correct.  But with more and more clients being able to do intelligent
things based on metadata it's not just decoration.  (UTF8Type,
LexicalUUIDType, BytesType, and AsciiType all have the same ordering.
I believe IntegerType and LongType are equivalent orderings as well.)

On Wed, Feb 2, 2011 at 3:35 PM, Stu Hood stuh...@gmail.com wrote:
 Not only does the type need to make sense, but it also needs to sort in
 exactly the same order as the previous type did... in which case there would
 be no reason to change it?
 We should probably just say no, you cannot do this, and explicitly prevent
 it.

 On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote:
  I think Jonathan misspoke.

 I thought I was mistaken, but I was wrong. :)

  You cannot change the 'compare_with' attribute of an existing column
  family.

 You can, but it's up to you to make sure that the new type makes
 sense.  Most frequently, you see this when changing from BytesType to
 something more structured.

 (If you screw up and specify a compare_with that is nonsensical for
 your data, just change it back.)

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: How do I get 0.7.1?

2011-02-02 Thread buddhasystem

Stephen, sorry I didn't understand your missive.

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5987184.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


unsubscribe

2011-02-02 Thread Ronald Bradford
unsubscribe


Re: unsubscribe

2011-02-02 Thread Robert Coli
http://wiki.apache.org/cassandra/FAQ#unsubscribe

How do I unsubscribe from the email list?

Send an email to user-unsubscr...@cassandra.apache.org



On Wed, Feb 2, 2011 at 5:24 PM, Ronald Bradford
ronald.bradf...@gmail.com wrote:
 unsubscribe




rolling window of data

2011-02-02 Thread Jeffrey Wang
Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. 
last 90 days). We use the timestamp of the log entries as the column names so 
we can do time range queries. Everything seems to be working fine, but it's not 
clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, 
but that apparently is not supported yet. Another idea I had was to store the 
timestamp of the log entry as Cassandra's timestamp and pass in artificial 
timestamps to remove (thrift API), but that seems hacky. Does anyone know if 
there is a good way to support this kind of rolling window of data efficiently? 
Thanks.

-Jeffrey
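
One common workaround, offered here as a hedged suggestion rather than an
established answer from this thread: bucket rows by day, so expiring old data
becomes deleting whole rows whose bucket falls outside the window. A sketch of
the key scheme (names are illustrative):

```python
from datetime import date, timedelta

WINDOW_DAYS = 90

def row_key(day: date) -> str:
    # One row per day; columns within it are the per-entry timestamps,
    # so time range queries within a day still work as before.
    return "log-%s" % day.isoformat()

def keys_to_delete(today: date, retained_days: int = WINDOW_DAYS,
                   sweep: int = 7) -> list[str]:
    # Row keys just past the window; `sweep` bounds how far back the
    # periodic cleanup job looks.
    cutoff = today - timedelta(days=retained_days)
    return [row_key(cutoff - timedelta(days=i)) for i in range(1, sweep + 1)]

print(row_key(date(2011, 2, 2)))            # log-2011-02-02
print(keys_to_delete(date(2011, 2, 2))[0])  # oldest row just past the window
```

Each cleanup pass then issues plain row deletions for those keys, avoiding
the unsupported slice-range deletion entirely.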



Re: rolling window of data

2011-02-02 Thread Aaron Morton
This project may provide some inspiration for you:
https://github.com/thobbs/logsandra

Not sure if it has a rolling window; if you find out let me know :)

Aaron

On 03 Feb, 2011, at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote:

Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g.
last 90 days). We use the timestamp of the log entries as the column names so
we can do time range queries. Everything seems to be working fine, but it's not
clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick,
but that apparently is not supported yet. Another idea I had was to store the
timestamp of the log entry as Cassandra's timestamp and pass in artificial
timestamps to remove (thrift API), but that seems hacky. Does anyone know if
there is a good way to support this kind of rolling window of data efficiently?
Thanks.

-Jeffrey

Tracking down read latency

2011-02-02 Thread David Dabbs
Hello.

We’re encountering some high read latency issues. But our main Cass expert
is out-of-office so it falls to me.
We're more read-heavy than write-heavy, though there don't seem to be many
pending reads.
I have seen active/pending row-read at three or four, though.

Pool NameActive   Pending  Completed
FILEUTILS-DELETE-POOL 0 0 46
STREAM-STAGE  0 0  0
RESPONSE-STAGE0 0   17471880
ROW-READ-STAGE1 1   37652361
LB-OPERATIONS 0 0  0
MISCELLANEOUS-POOL0 0  0
GMFD  0 0 154630
LB-TARGET 0 0  0
CONSISTENCY-MANAGER   0 02993464
ROW-MUTATION-STAGE0 0   16383305
MESSAGE-STREAMING-POOL0 0  0
LOAD-BALANCER-STAGE   0 0  0
FLUSH-SORTER-POOL 0 0  0
MEMTABLE-POST-FLUSHER 0 0116
FLUSH-WRITER-POOL 0 0116
AE-SERVICE-STAGE  0 0  0
HINTED-HANDOFF-POOL   0 0 16


Do the high iops on our data disk mean we need to tune the key or other caches?
 
$ iostat
Linux 2.6.18-194.11.3.el5 02/03/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1.00    0.00    0.25    1.32    0.00   97.43

Device:    tps   Blk_read/s   Blk_wrtn/s     Blk_read    Blk_wrtn
sda   1.52 6.66    20.53     6262   273904650
sda1  0.00 0.00 0.00     2484  18
sda2  1.46 5.68    18.95     75741946   252831120
sda3  0.06 0.98 1.58     13141400    21073512
# data here
sdb 103.06 13964.72  2718.28 186315436859 36266884800
sdb1    103.06 13964.72  2718.28 186315436235 36266884800
# commit logs here
sdc   1.47 1.71   309.36     22800725  4127423000
sdc1  1.47 1.71   309.36     22799901  4127423000



We're running on a beefy 64-bit Nehalem, so mmap should be
available/possible.
I need to check with our Cassandra lead when he's available as to why we're
not using mmap or auto.

From /opt/cassandra/conf/storage-conf.xml. 

<DiskAccessMode>standard</DiskAccessMode>



Heap size is 16gb.

JVM_OPTS= \
-ea \
-Xms16G \
-Xmx16G \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=8 \
-XX:MaxTenuringThreshold=1 \
-XX:CMSInitiatingOccupancyFraction=75 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:+UseCompressedOops \
-XX:+UseThreadPriorities \
-XX:ThreadPriorityPolicy=42 \
-Dcassandra.compaction.priority=1




If I've omitted any key infos, please advise and I'll provide.


Thanks,

David





RE: rolling window of data

2011-02-02 Thread Jeffrey Wang
Thanks for the link, but unfortunately it doesn't look like it uses a rolling 
window. As far as I can tell, log entries just keep getting inserted into 
Cassandra.

-Jeffrey

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, February 02, 2011 9:21 PM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

This project may provide some inspiration for you 
https://github.com/thobbs/logsandra

Not sure if it has a rolling window, if you find out let me know :)

Aaron


On 03 Feb, 2011,at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote:
Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. 
last 90 days). We use the timestamp of the log entries as the column names so 
we can do time range queries. Everything seems to be working fine, but it's not 
clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, 
but that apparently is not supported yet. Another idea I had was to store the 
timestamp of the log entry as Cassandra's timestamp and pass in artificial 
timestamps to remove (thrift API), but that seems hacky. Does anyone know if 
there is a good way to support this kind of rolling window of data efficiently? 
Thanks.

-Jeffrey



Re: Tracking down read latency

2011-02-02 Thread Robert Coli
On Wed, Feb 2, 2011 at 9:35 PM, David Dabbs dmda...@gmail.com wrote:
 We’re encountering some high read latency issues.

What is reporting high read latency?

 We're more read than write, though there doesn't seem to be many pending
 reads.
 I have seen active/pending row-read at three or four, though.

In general if you were I/O bound on reads (the most common
pathological case) you would see much higher row-read stage pending.

 [ sane looking tpstats ]

Your tpstats does not look like a node which is struggling.

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    1.00    0.00    0.25    1.32    0.00   97.43

Your system also seems to not be breaking a sweat.

 We're running on a beefy 64-bit Nehalem, so mmap should be
 available/possible.
 I need to check with our Cassandra lead when he's available as to why we're
 not using mmap or auto.

Probably because of :

https://issues.apache.org/jira/browse/CASSANDRA-1214

 Heap size is 16gb.

16gb out of how much total?

Do your GC logs seem to indicate reasonable GC performance? Do all
nodes generally have a complete view of the ring and all nodes
generally seem to be up?

=Rob


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Tyler Hobbs
On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote:

 Can I have some more feedback about my schema perhaps somewhat more
 criticisive/harsh ?


It sounds reasonable to me.

Since you're writing/reading all of the subcolumns at the same time, I would
opt for a standard column with the tags serialized into a column value.

I don't think you need to worry about row lengths here.

Depending on the reminder size and how many times it's likely to be repeated
in the timeline, you could explore denormalizing a bit more by storing the
reminders in the timelines themselves, perhaps with a separate row per
(user, tag) combination.  This would cut down on your seeks quite a bit, but
it may not be necessary at this point (or at all).

-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library
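
The serialize-into-one-column approach suggested above might look like this
(JSON is used purely for illustration; any compact serialization works, and
the function names are hypothetical):

```python
import json

def pack_reminder(text: str, tags: list[str], ref_keys: list[str]) -> bytes:
    # All ~8 subcolumns are written and read together, so one serialized
    # blob in a single standard column avoids supercolumn overhead.
    return json.dumps({"text": text, "tags": tags,
                       "refs": ref_keys}).encode("utf-8")

def unpack_reminder(value: bytes) -> dict:
    return json.loads(value.decode("utf-8"))

blob = pack_reminder("pay rent", ["finance"], ["row:123", "row:456"])
r = unpack_reminder(blob)
print(r["tags"])  # ['finance']
```

The blob then becomes the value of a single column (e.g. keyed by the
reminder's TimeUUID) in a standard column family.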


Re: how to change compare_with

2011-02-02 Thread Vedarth Kulkarni
Thank you.

I got it from the examples provided by Hector.

Vedarth Kulkarni,
TYBSc (Computer Science).



On Thu, Feb 3, 2011 at 6:22 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Correct.  But with more and more clients being able to do intelligent
 things based on metadata it's not just decoration.  (UTF8Type,
 LexicalUUIDType, BytesType, and AsciiType all have the same ordering.
 I believe IntegerType and LongType are equivalent orderings as well.)

 On Wed, Feb 2, 2011 at 3:35 PM, Stu Hood stuh...@gmail.com wrote:
  Not only does the type need to make sense, but it also needs to sort in
  exactly the same order as the previous type did... in which case there
 would
  be no reason to change it?
  We should probably just say no, you cannot do this, and explicitly
 prevent
  it.
 
  On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote:
   I think Jonathan misspoke.
 
  I thought I was mistaken, but I was wrong. :)
 
   You cannot change the 'compare_with' attribute of an existing column
   family.
 
  You can, but it's up to you to make sure that the new type makes
  sense.  Most frequently, you see this when changing from BytesType to
  something more structured.
 
  (If you screw up and specify a compare_with that is nonsensical for
  your data, just change it back.)
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra support
  http://www.datastax.com
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com

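Jonathan's point that several comparators share one ordering can be checked outside Cassandra: UTF8Type, AsciiType, and BytesType all compare raw bytes, and UTF-8 byte order matches code-point order, so re-tagging a column family among these types never reorders existing columns. A small sketch (plain Python, nothing Cassandra-specific):

```python
# Column names as BytesType sees them (raw bytes) vs. as
# UTF8Type/AsciiType present them (text): one and the same ordering.
names = [u"Banana", u"apple", u"cherry", u"Zebra"]

# Sort the raw UTF-8 encodings, then decode back to text.
byte_sorted = sorted(n.encode("utf-8") for n in names)
text_view = [b.decode("utf-8") for b in byte_sorted]

# Byte order and code-point (text) order agree, so changing the
# comparator between these types leaves column order untouched.
assert text_view == sorted(names)
```

This is why swapping BytesType for UTF8Type is safe, while e.g. BytesType to TimeUUIDType is not: the latter sorts by embedded timestamp, a genuinely different order.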


Is Consistency QUORUM broken on Cassandra 0.7.0 and 0.6.11?

2011-02-02 Thread ruslan usifov
As noted in this issue,
https://issues.apache.org/jira/browse/CASSANDRA-2081. Does this mean that
QUORUM doesn't work on 0.7.0 and 0.6.11?


performance degradation in cluster

2011-02-02 Thread abhinav prakash rai
The first time, I ran a single instance of Cassandra and my application on one
system (16GB RAM and 8 cores), and the time taken was 480 sec.

When I added one more system (meaning this time I was running 2 instances of
Cassandra in a cluster) and ran the application from a single client, I found
the time taken increased to 1000 sec. I also found that the data distribution
was very odd across the two systems (one held about 2.5GB of data, the other
140MB).

Is any configuration required when running Cassandra in a cluster, other than
adding seeds?

Thanks & Regards,
abhinav


Re: Slow network writes

2011-02-02 Thread ruslan usifov
2011/2/3 Oleg Proudnikov ol...@cloudorange.com

 ruslan usifov ruslan.usifov at gmail.com writes:

 
 
  2011/2/3 Oleg Proudnikov olegp at cloudorange.com
  Is it possible that the key 1212 maps to the first node? I am assuming
 RF=1.
  You could try random keys to test this theory...
 
 
   Yes, you're right: 1212 goes to the first node. I distributed tokens as
  described in Operations
  (http://wiki.apache.org/cassandra/Operations), i.e. 0 and
  85070591730234615865843651857942052864. So does the delay in my second
  experiment (where I got a big delay on insert) appear as a result of
  delayed communication between the nodes?
 

 That was the theory, assuming you are using replication factor of 1.

  It is difficult to say where the key falls just by looking at the ring - the
  random partitioner could throw this key on either node. After writing 1
  million rows


Hm, this is very simple to calculate for the random partitioner; this Python
script does it:

from hashlib import md5

def tokens(nodes):
    # Evenly spaced initial tokens for a ring of `nodes` nodes
    # (the RandomPartitioner token space is 0 .. 2**127).
    l_retval = []
    for x in xrange(nodes):
        l_retval.append(2 ** 127 / nodes * x)
    return l_retval

def wherekey(key, orderednodetokens):
    # Index (in token order) of the node that owns `key`: the first
    # node whose token is >= the key's md5 token, wrapping to node 0.
    l_keytoken = long(md5(key).hexdigest(), 16)

    for l_i, l_nodetoken in enumerate(orderednodetokens):
        if l_keytoken <= l_nodetoken:
            return l_i

    return 0

ring = tokens(2)
print wherekey("1212", ring)  # the key must be a string for md5


So for key 1212, node 0 will be chosen (10.24.84.4 in my case).