Re: Heap is N.N full. Immediately on startup

2013-02-26 Thread Andras Szerdahelyi
(I am monitoring via a VisualVM plugin that shows generation sizes)
I've increased the index sampling interval from 64 to 512 and the bloom filter fp 
ratio from the default to 0.1 (and rebuilt sstables) - still getting 1-2 sstable 
reads with LCS.
However, at startup I see a 5GB old gen (that seems to be very stable at around 
5.5GB under moderate 90:10 read:write load - a couple hundred q/s). Is that a 
realistic size with these values and a ~200GB CF with 300-400 million rows? (no 
secondary indexes, skinny rows of 1-5K each)

Also, hinted handoff is off, as full GC pauses of 1-3 seconds on the nodes I have not 
yet attended to (sampling interval at 64, bloom fp at the default) are kicking off the 
failure detector even at a phi_convict_threshold of 12 (apparently?) and 
everything is in an up/down dance :-)
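For context on why a long pause can trip the detector, here is a toy sketch of the 
accrual ("phi") failure detector idea - an assumption-laden simplification, not 
Cassandra's actual FailureDetector code:

```python
import math

# Toy accrual failure detector. Assuming heartbeat inter-arrival times are
# exponentially distributed with a given mean, phi is -log10 of the
# probability that a heartbeat would arrive even later than it already has.
def phi(ms_since_last_heartbeat, mean_interval_ms):
    p_later = math.exp(-ms_since_last_heartbeat / mean_interval_ms)
    return -math.log10(p_later)

# A GC pause inflates the time since the last heartbeat, so phi climbs;
# once it passes phi_convict_threshold the node is marked down.
print(round(phi(3000, 1000), 2))   # 3 s pause vs. a 1 s mean interval
print(round(phi(30000, 1000), 2))  # a much longer pause
```

Under this model a pause has to be many multiples of the mean heartbeat interval 
before phi reaches 12, which is why 1-3 second pauses convicting a node at that 
threshold is surprising.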

Thanks!
Andras

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Friday 22 February 2013 17:44
To: user@cassandra.apache.org
Subject: Re: Heap is N.N full. Immediately on startup
Subject: Re: Heap is N.N full. Immediately on startup

To get a good idea of how GC is performing, turn on GC logging in 
cassandra-env.sh.

After a full CMS GC event, see how big the tenured heap is. If it's not 
reducing enough, then GC will never get far enough ahead.
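Aaron's suggestion corresponds to uncommenting the GC logging block in 
cassandra-env.sh - roughly the lines below (exact option names vary a little 
between versions, and the log path here is just an example):

```shell
# Enable verbose GC logging (uncomment the equivalents in cassandra-env.sh).
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# Example log path; point it wherever you keep Cassandra logs.
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```

The CMS lines in the resulting log show the tenured (old) generation occupancy 
after each full collection - if that number is not dropping, the heap is simply 
full of live data.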

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/02/2013, at 8:37 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com 
wrote:

Thank you - indeed my index interval is 64 with a CF of 300M rows, and the bloom 
filter false positive chance was at the default.
Raising the index interval to 512 didn't fix this alone, so I guess I'll have 
to set the bloom filter to some reasonable value and scrub.

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Thursday 21 February 2013 17:58
To: user@cassandra.apache.org
Subject: Re: Heap is N.N full. Immediately on startup
Subject: Re: Heap is N.N full. Immediately on startup

My first guess would be the bloom filter and index sampling from lots-o-rows.

Check the row count in cfstats.
Check the bloom filter size in cfstats.

Background on memory requirements 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
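As a back-of-the-envelope check on how much heap those two structures can pin, 
here is a rough estimate. The bloom filter formula is the standard one; the 
per-sample byte cost for index sampling is a guess, and real numbers depend on 
key sizes and Cassandra version:

```python
import math

# Standard bloom filter sizing: bits per element = -ln(p) / (ln 2)^2,
# where p is the false-positive chance.
def bloom_filter_bytes(row_count, fp_chance):
    bits_per_row = -math.log(fp_chance) / (math.log(2) ** 2)
    return row_count * bits_per_row / 8

# One index sample is kept in memory per index_interval rows; the
# per-sample cost (key bytes + position + object overhead) is a guess.
def index_sample_bytes(row_count, index_interval, bytes_per_sample=64):
    return row_count / index_interval * bytes_per_sample

rows = 300_000_000
gb = 1024 ** 3
print(round(bloom_filter_bytes(rows, 0.000744) / gb, 2))  # old default fp chance
print(round(bloom_filter_bytes(rows, 0.1) / gb, 2))       # raised fp chance
print(round(index_sample_bytes(rows, 64) / gb, 2))        # interval 64
print(round(index_sample_bytes(rows, 512) / gb, 2))       # interval 512
```

At ~300M rows the bloom filter and index samples alone account for hundreds of 
MB per CF at the defaults, which fits the "heap is full immediately on startup" 
symptom when a large CF is loaded.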

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com 
wrote:

Hey list,

Any ideas (before I take a heap dump) what might be consuming my 8GB JVM heap 
at startup in Cassandra 1.1.6, besides:

  *   Row cache: not persisted, and at 0 keys when this warning is produced
  *   Memtables: no write traffic at startup, and my app's column families are 
durable_writes:false
  *   Pending tasks: no pending tasks, except for 928 compactions (not sure 
where those are coming from)

I drew these conclusions from the StatusLogger output below:

INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name                    Active   Pending   Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage                         0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage              0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage                   0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage                     0        -1         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage             0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage                       0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage                  0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage                    0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage                       0         0         0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72)

Re: Data Model - Additional Column Families or one CF?

2013-02-26 Thread Raman

Greetings!

Thank you very much for sharing your insight and experience.

I am trying to migrate a normalized schema -- a 1 TB database. The data 
is hierarchical: child entities carry foreign keys to their parent entities. 
There are several specialization hierarchies, for instance

ShapeTable, Circle, Square, Rectangle, etc.

Often, child records are added to a parent.
Some fields are updated.
Rarely, some child records are deleted.

What is the ideal schema for Cassandra?
Should I create one column family with many super columns, one per shape?

I am trying to utilize the partitioning feature to distribute the load 
around the world according to the jurisdiction of the parent record.

I am particularly interested in learning how to define the Shape CF but 
insert Rectangle, Circle, etc. with different columns.

Thank you for your help
regards
Raman


please explain read path when key not in database

2013-02-26 Thread Igor

Hello!

Please explain how this works when I request a key which is not in the 
database:

 * The closest node (as determined by proximity sorting as described
   above) will be sent a command to perform an actual data read (i.e.,
   return data to the co-ordinating node).
 * As required by consistency level, additional nodes may be sent
   digest commands, asking them to perform the read locally but send
   back the digest only.
   For example, at replication factor 3 a read at consistency level
   QUORUM would require one digest read in addition to the data
   read sent to the closest node. (See ReadCallback
   http://wiki.apache.org/cassandra/ReadCallback, instantiated
   by StorageProxy http://wiki.apache.org/cassandra/StorageProxy)

I have a multi-DC setup with NetworkTopologyStrategy and RF:1 per datacenter, 
and reads are at consistency level ONE. If the local node responsible for the 
key replies that it has no data for this key - will the coordinator send digest 
commands?


Thanks!


RE: Read Perf

2013-02-26 Thread Kanwar Sangha
Thanks. For our case, the number of rows will more or less be the same. The only 
thing that changes is the columns, and they keep getting added.

-Original Message-
From: Hiller, Dean [mailto:dean.hil...@nrel.gov] 
Sent: 26 February 2013 09:21
To: user@cassandra.apache.org
Subject: Re: Read Perf

To find stuff on disk, there is a bloom filter for each file in memory. Per the 
docs, 1 billion rows takes about 2 GB of RAM, so it really will have a huge 
dependency on your number of rows. As you get more rows, you may need to raise 
the bloom filter false-positive chance to use less RAM, but that means slower 
reads. I.e., as you add more rows, you will have slower reads on a single 
machine.

We hit the RAM limit on one machine with 1 billion rows, so we are in the 
process of tweaking the ratio from 0.000744 (the default) to 0.1 to give us more 
time to solve this. Since we see no I/O load on our machines (or rather 
extremely little), we plan on moving to leveled compaction, where 0.1 is the 
default in new releases; the new size-tiered default I think is 0.01.

I.e., if you store more data per row this is not as much of an issue, but still 
something to consider. (Also, rows have a limit on data size as well, I think, 
but I'm not sure what that is. I know the column limit on a row is in the 
millions, somewhere lower than 10 million.)
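To put numbers on that tradeoff, a quick sketch - this is standard bloom filter 
math, nothing Cassandra-specific, and the "20 SSTables" figure is just an 
illustrative assumption:

```python
import math

# Bits of bloom filter needed per row for a target false-positive chance p.
def bits_per_row(p):
    return -math.log(p) / (math.log(2) ** 2)

# Each SSTable that does NOT contain the requested row is still read
# with probability p - the read-side cost of shrinking the filter.
def expected_wasted_reads(sstables_without_row, p):
    return sstables_without_row * p

for p in (0.000744, 0.1):
    print(round(bits_per_row(p), 1), expected_wasted_reads(20, p))
```

Going from the old default of 0.000744 to 0.1 cuts the filter to roughly a third 
of the RAM, at the cost of needlessly touching extra SSTables on a small 
fraction of reads.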

Later,
Dean

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 25, 2013 8:31 PM
To: user@cassandra.apache.org
Subject: Read Perf
Subject: Read Perf

Hi - I am doing a performance run using a modified YCSB client and was able to 
populate 8 TB on a node and then ran some read workloads. I am seeing an average 
TPS of 930 ops/sec for random reads. There is no key cache/row cache. Question -

Will the read TPS degrade if the data size increases to say 20 TB, 50 TB, 100 
TB? If I understand correctly, the read should remain constant irrespective of 
the data size, since we eventually have sorted SSTables and a binary search 
would be done on the index filter to find the row?


Thanks,
Kanwar


Re: Data Model - Additional Column Families or one CF?

2013-02-26 Thread Hiller, Dean
The bottleneck is RAM; each CF uses more RAM. We tried to go above 15,000 
column families and that hurt big time, so we added a feature to PlayOrm and now 
have 60,000 virtual column families all in one column family. This turned out 
to be a HUGE benefit though, as those 60,000 tables can now have their cluster 
settings modified in one shot. Right now, we are modifying the false-positive 
ratio to free up RAM on just the one column family that all 60,000 run in.

You can always follow the same pattern PlayOrm uses, which is just to prefix 
every row key, but we preferred to have something else do the heavy lifting for 
us (i.e., PlayOrm); plus the command line tool isn't too bad for inspecting the 
data with PlayOrm either. Of course, we now have 7 billion rows and that is a 
new issue for us. While we could just add more nodes and scale, they are pushing 
us to make it even more cost effective, so we are trying to scale on a single 
node first, then scale out.

Later,
Dean

From: Javier Sotelo javier.a.sot...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, February 26, 2013 12:27 AM
To: user@cassandra.apache.org
Subject: Re: Data Model - Additional Column Families or one CF?
Subject: Re: Data Model - Additional Column Families or one CF?

Aaron,

Would 50 CFs be pushing it? According to 
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management,
"This has been tested to work across hundreds or even thousands of 
ColumnFamilies."

What is the bottleneck, IO?

Thanks,
Javier


On Sun, Feb 24, 2013 at 5:51 PM, Adam Venturella aventure...@gmail.com wrote:

Thanks Aaron, this was a big help!

—
Sent from Mailbox (https://bit.ly/SZvoJe) for iPhone



On Thu, Feb 21, 2013 at 9:27 AM, aaron morton aa...@thelastpickle.com wrote:

If you have a limited / known number (say < 30) of types, I would create a CF 
for each of them.

If the number of types is unknown or very large I would have one CF with the 
row key you described.

Generally I avoid data models that require new CF's as the data grows. 
Additionally, having different CF's allows you to use different cache settings, 
compaction settings and even storage mediums.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:43 AM, Adam Venturella aventure...@gmail.com wrote:

My data needs only require me to store JSON, and I can handle this in 1 column 
family by prefixing row keys with a type, for example:

comments:{message_id}

Where comments: represents the prefix and {message_id} represents some row key 
to a message object in the same column family.

In this case comments:{message_id} would be a wide row using comment creation 
time and descending clustering order to sort the messages as they are added.

My question is: would I be better off splitting comments into their own column 
family, or is storing them in with the Messages column family sufficient - they 
are all messages after all?

Or do column families really just provide a nice organizational front for data? 
I'm just storing JSON.







Re: Read Perf

2013-02-26 Thread Hiller, Dean
In that case, make sure you don't plan on going into the millions, or test
the limit, as I'm pretty sure it can't go above 10 million (from previous
posts on this list).

Dean

On 2/26/13 8:23 AM, Kanwar Sangha kan...@mavenir.com wrote:

Thanks. For our case, the no of rows will more or less be the same. The
only thing which changes is the columns and they keep getting added.

-Original Message-
From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
Sent: 26 February 2013 09:21
To: user@cassandra.apache.org
Subject: Re: Read Perf

To find stuff on disk, there is a bloomfilter for each file in memory.
On the docs, 1 billion rows has 2Gig of RAM, so it really will have a
huge dependency on your number of rows.  As you get more rows, you may
need to modify the bloomfilter false positive to use less RAM but that
means slower reads.  Ie. As you add more rows, you will have slower reads
on a single machine.

We hit the RAM limit on one machine with 1 billion rows so we are in the
process of tweaking the ratio of 0.000744(the default) to 0.1 to give us
more time to solve.  Since we see no I/o load on our machines(or rather
extremely little), we plan on moving to leveled compaction where 0.1 is
the default in new releases and size tiered new default I think is 0.01.

Ie. If you store more data per row, this is not an issue as much but
still something to consider.  (Also, rows have a limit I think as well on
data size but not sure what that is.  I know the column limit on a row is
in the millions, somewhere lower than 10 million).

Later,
Dean

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 25, 2013 8:31 PM
To: user@cassandra.apache.org
Subject: Read Perf
Subject: Read Perf

Hi - I am doing a performance run using modified YCSB client and was able
to populate 8TB on a node and then ran some read workloads. I am seeing
an average TPS of 930 ops/sec for random reads. There is no key cache/row
cache. Question -

Will the read TPS degrade if the data size increases to say 20 TB , 50
TB, 100 TB ? If I understand correctly, the read should remain constant
irrespective of the data size since we eventually have sorted SStables
and binary search would be done on the index filter to find the row ?


Thanks,
Kanwar



Re: Data Model - Additional Column Families or one CF?

2013-02-26 Thread Hiller, Dean
Oh, and 50 CF's should be fine.

Dean

From: Javier Sotelo javier.a.sot...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, February 26, 2013 12:27 AM
To: user@cassandra.apache.org
Subject: Re: Data Model - Additional Column Families or one CF?
Subject: Re: Data Model - Additional Column Families or one CF?

Aaron,

Would 50 CFs be pushing it? According to 
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management,
 This has been tested to work across hundreds or even thousands of 
ColumnFamilies.

What is the bottleneck, IO?

Thanks,
Javier


On Sun, Feb 24, 2013 at 5:51 PM, Adam Venturella aventure...@gmail.com wrote:

Thanks Aaron, this was a big help!

—
Sent from Mailbox (https://bit.ly/SZvoJe) for iPhone



On Thu, Feb 21, 2013 at 9:27 AM, aaron morton aa...@thelastpickle.com wrote:

If you have a limited / known number (say < 30) of types, I would create a CF 
for each of them.

If the number of types is unknown or very large I would have one CF with the 
row key you described.

Generally I avoid data models that require new CF's as the data grows. 
Additionally having different CF's allows you to use different cache settings, 
compactions settings and even storage mediums.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:43 AM, Adam Venturella aventure...@gmail.com wrote:

My data needs only require me to store JSON, and I can handle this in 1 column 
family by prefixing row keys with a type, for example:

comments:{message_id}

Where comments: represents the prefix and {message_id} represents some row key 
to a message object in the same column family.

In this case comments:{message_id} would be a wide row using comment creation 
time and descending clustering order to sort the messages as they are added.

My question is, would I be better off splitting comments into their own Column 
Family or is storing them in with the Messages Column Family sufficient, they 
are all messages after all.

Or do Column Families really just provide a nice organizational front for data. 
I'm just storing JSON.







Re: please explain read path when key not in database

2013-02-26 Thread Hiller, Dean
This is my understanding from using Cassandra for probably around 2 
years…. (though I still make mistakes sometimes)….

For a CL.ONE read:

Depending on the client, the client may go through one of its known 
nodes (the co-ordinating node), which goes to the real node (clients like 
Astyanax/Hector read in the ring information and usually go direct, so for 
CL.ONE no co-ordination is really needed). The node it finally gets to may not 
have the data yet and will return no row, while the other 2 nodes might have 
the data.

For a CL.QUORUM read and RF=3:

The client goes to a node with the data (again depending on the client), and 
that node sends off a request to one of the other 2. Let's say A does not have 
the row yet but B does: the results are compared, the latest wins, and a repair 
for that row is kicked off to get all nodes in sync on that row.
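A toy model of the digest-read mechanics described above - illustrative only, 
with MD5 standing in for Cassandra's digest, dicts standing in for replicas, and 
none of the real timestamp-based reconciliation:

```python
import hashlib

def digest(value):
    # Stand-in for the replica-side digest of a row.
    return hashlib.md5(repr(value).encode()).hexdigest()

def quorum_read(data_replica, digest_replica, key):
    data = data_replica.get(key)              # full data read from one replica
    remote = digest(digest_replica.get(key))  # digest-only read from another
    if digest(data) == remote:
        return data, False   # digests match: no repair needed
    # Mismatch (including "row missing on one replica"): in the real system
    # full data is fetched, the newest version wins, and the row is repaired.
    return data, True

has_row = {"k": "v1"}
missing = {}  # replica that has not received the write yet
print(quorum_read(has_row, has_row, "k"))
print(quorum_read(has_row, missing, "k"))
```

The second call shows why a missing row behaves like any other mismatch: the 
digest of "no data" differs from the digest of the row, so a repair is triggered.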

If the local node responsible for the key replies that it has no data for this 
key - will the coordinator send digest commands?

It looks like CL.ONE does trigger a read repair, according to this doc (found by 
googling "CL_ONE read repair cassandra"):

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CL-ONE-reads-RR-badness-threshold-interaction-td6247418.html

http://wiki.apache.org/cassandra/ReadRepair

Later,
Dean

Explain please, how this work when I request for key which is not in database


 *   The closest node (as determined by proximity sorting as described above) 
will be sent a command to perform an actual data read (i.e., return data to the 
co-ordinating node).
 *   As required by consistency level, additional nodes may be sent digest 
commands, asking them to perform the read locally but send back the digest only.
 *   For example, at replication factor 3 a read at consistency level QUORUM 
would require one digest read in addition to the data read sent to the 
closest node. (See ReadCallback http://wiki.apache.org/cassandra/ReadCallback, 
instantiated by StorageProxy http://wiki.apache.org/cassandra/StorageProxy)

I have a multi-DC setup with NetworkTopologyStrategy and RF:1 per datacenter, 
and reads are at consistency level ONE. If the local node responsible for the 
key replies that it has no data for this key - will the coordinator send digest 
commands?

Thanks!


Re: Read Perf

2013-02-26 Thread Hiller, Dean
Depends. Are you:

1. Reading the same size of data as the data set size grows? (Reading
more data generally does get slower - e.g., reading 1MB vs. 10MB.)
2. Reading the same number of columns as the data set size grows?
3. Never reading in the entire row?

If the answer to all of the above is yes, yes, yes, then it should be fine,
but it is always better to test.

ALSO, a big note: you MUST test doing a read repair, as that will slow
things down BIG TIME. We only have 130GB per node, and in general
Cassandra is made for 300GB to 500GB per node on a 1TB drive (typical config).
This is due to the maintenance, so TEST your maintenance stuff before you
get burned there.

Just run nodetool upgradesstables and time it. This definitely gets
slower as your data grows and gives you a good idea of how long operations
will take. Of course, better yet, take a node completely out, wipe it, and
put it back in, and see how long it takes to get all the data back in by
running the repair. With 10TB, you will have a lot of issues I imagine.

Dean

On 2/26/13 8:43 AM, Kanwar Sangha kan...@mavenir.com wrote:

Yep. So the read will remain constant in this case?


-Original Message-
From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
Sent: 26 February 2013 09:32
To: user@cassandra.apache.org
Subject: Re: Read Perf

In that case, make sure you don't plan on going into the millions or test
the limit as I pretty sure it can't go above 10 million. (from previous
posts on this list).

Dean

On 2/26/13 8:23 AM, Kanwar Sangha kan...@mavenir.com wrote:

Thanks. For our case, the no of rows will more or less be the same. The
only thing which changes is the columns and they keep getting added.

-Original Message-
From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
Sent: 26 February 2013 09:21
To: user@cassandra.apache.org
Subject: Re: Read Perf

To find stuff on disk, there is a bloomfilter for each file in memory.
On the docs, 1 billion rows has 2Gig of RAM, so it really will have a
huge dependency on your number of rows.  As you get more rows, you may
need to modify the bloomfilter false positive to use less RAM but that
means slower reads.  Ie. As you add more rows, you will have slower
reads on a single machine.

We hit the RAM limit on one machine with 1 billion rows so we are in
the process of tweaking the ratio of 0.000744(the default) to 0.1 to
give us more time to solve.  Since we see no I/o load on our
machines(or rather extremely little), we plan on moving to leveled
compaction where 0.1 is the default in new releases and size tiered new
default I think is 0.01.

Ie. If you store more data per row, this is not an issue as much but
still something to consider.  (Also, rows have a limit I think as well
on data size but not sure what that is.  I know the column limit on a
row is in the millions, somewhere lower than 10 million).

Later,
Dean

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 25, 2013 8:31 PM
To: user@cassandra.apache.org
Subject: Read Perf
Subject: Read Perf

Hi - I am doing a performance run using modified YCSB client and was
able to populate 8TB on a node and then ran some read workloads. I am
seeing an average TPS of 930 ops/sec for random reads. There is no key
cache/row cache. Question -

Will the read TPS degrade if the data size increases to say 20 TB , 50
TB, 100 TB ? If I understand correctly, the read should remain constant
irrespective of the data size since we eventually have sorted SStables
and binary search would be done on the index filter to find the row ?


Thanks,
Kanwar




Re: Data Model - Additional Column Families or one CF?

2013-02-26 Thread Javier Sotelo
Thanks Dean, very helpful info.

Javier


On Tue, Feb 26, 2013 at 7:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Oh, and 50 CF's should be fine.

 Dean

 From: Javier Sotelo javier.a.sot...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, February 26, 2013 12:27 AM
 To: user@cassandra.apache.org
 Subject: Re: Data Model - Additional Column Families or one CF?
 Subject: Re: Data Model - Additional Column Families or one CF?

 Aaron,

 Would 50 CFs be pushing it? According to
 http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management,
 This has been tested to work across hundreds or even thousands of
 ColumnFamilies.

 What is the bottleneck, IO?

 Thanks,
 Javier


 On Sun, Feb 24, 2013 at 5:51 PM, Adam Venturella aventure...@gmail.com wrote:

 Thanks Aaron, this was a big help!

 —
 Sent from Mailbox (https://bit.ly/SZvoJe) for iPhone



 On Thu, Feb 21, 2013 at 9:27 AM, aaron morton aa...@thelastpickle.com wrote:

 If you have a limited / known number (say < 30) of types, I would create
 a CF for each of them.

 If the number of types is unknown or very large I would have one CF with
 the row key you described.

 Generally I avoid data models that require new CF's as the data grows.
 Additionally having different CF's allows you to use different cache
 settings, compactions settings and even storage mediums.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 21/02/2013, at 7:43 AM, Adam Venturella aventure...@gmail.com wrote:

 My data needs only require me to store JSON, and I can handle this in 1
 column family by prefixing row keys with a type, for example:

 comments:{message_id}

 Where comments: represents the prefix and {message_id} represents some row
 key to a message object in the same column family.

 In this case comments:{message_id} would be a wide row using comment
 creation time and descending clustering order to sort the messages as they
 are added.

 My question is, would I be better off splitting comments into their own
 Column Family or is storing them in with the Messages Column Family
 sufficient, they are all messages after all.

 Or do Column Families really just provide a nice organizational front for
 data. I'm just storing JSON.








Re: Data Model - Additional Column Families or one CF?

2013-02-26 Thread Edward Capriolo
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/whytf_would_i_need_with

On Tue, Feb 26, 2013 at 12:28 PM, Javier Sotelo
javier.a.sot...@gmail.com wrote:
 Thanks Dean, very helpful info.

 Javier


 On Tue, Feb 26, 2013 at 7:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Oh, and 50 CF's should be fine.

 Dean

 From: Javier Sotelo javier.a.sot...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, February 26, 2013 12:27 AM
 To: user@cassandra.apache.org
 Subject: Re: Data Model - Additional Column Families or one CF?
 Subject: Re: Data Model - Additional Column Families or one CF?

 Aaron,

 Would 50 CFs be pushing it? According to
 http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management,
 This has been tested to work across hundreds or even thousands of
 ColumnFamilies.

 What is the bottleneck, IO?

 Thanks,
 Javier


 On Sun, Feb 24, 2013 at 5:51 PM, Adam Venturella aventure...@gmail.com wrote:

 Thanks Aaron, this was a big help!

 —
 Sent from Mailbox (https://bit.ly/SZvoJe) for iPhone



 On Thu, Feb 21, 2013 at 9:27 AM, aaron morton aa...@thelastpickle.com wrote:

 If you have a limited / known number (say < 30) of types, I would create
 a CF for each of them.

 If the number of types is unknown or very large I would have one CF with
 the row key you described.

 Generally I avoid data models that require new CF's as the data grows.
 Additionally having different CF's allows you to use different cache
 settings, compactions settings and even storage mediums.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 21/02/2013, at 7:43 AM, Adam Venturella aventure...@gmail.com wrote:

 My data needs only require me to store JSON, and I can handle this in 1
 column family by prefixing row keys with a type, for example:

 comments:{message_id}

 Where comments: represents the prefix and {message_id} represents some row
 key to a message object in the same column family.

 In this case comments:{message_id} would be a wide row using comment
 creation time and descending clustering order to sort the messages as they
 are added.

 My question is, would I be better off splitting comments into their own
 Column Family or is storing them in with the Messages Column Family
 sufficient, they are all messages after all.

 Or do Column Families really just provide a nice organizational front for
 data. I'm just storing JSON.








Re: Authentication and Authorization with Cassandra 1.2.2.

2013-02-26 Thread Jeremy Hanna
Does this help? The links at the bottom show the CQL statements to add/modify 
users:
http://www.datastax.com/docs/1.2/security/native_authentication
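For the concrete questions below (raising system_auth replication, adding 
users), the CQL 3 statements look roughly like this - a sketch based on the 1.2 
docs, with the keyspace and user names taken from the mail and a placeholder 
password; run them as a superuser, and run a repair of system_auth after 
changing its replication:

```sql
-- Raise the replication factor of the existing system_auth keyspace:
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Add a user and grant per-keyspace privileges:
CREATE USER carlos WITH PASSWORD 'change_me' NOSUPERUSER;
GRANT SELECT ON KEYSPACE nando TO carlos;  -- read
GRANT MODIFY ON KEYSPACE nando TO carlos;  -- write
```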

On Feb 26, 2013, at 4:06 PM, C.F.Scheidecker Antunes cf.antu...@gmail.com 
wrote:

 Hello all,
 
 Cassandra has changed and now has a default authentication and authorization 
 mechanism.
 
 The classes org.apache.cassandra.auth.PasswordAuthenticator (authenticator) 
 and
 org.apache.cassandra.auth.CassandraAuthorizer (authorization) provide that.
 
 They both write to a keyspace called system_auth and there are 2 column 
 families
 that are used for it, namely credentials and permissions.
 
 The permissions table is defined in CassandraAuthorizer as follows:
 
 CREATE TABLE system_auth.permissions (
 username text,
 resource text,
 permissions set<text>,
 PRIMARY KEY(username, resource)
 ) WITH gc_grace_seconds=(90 * 24 * 60 * 60)  // 3 months
 
 The credentials table is created in PasswordAuthenticator as follows:
 
 CREATE TABLE system_auth.credentials (
 username text,
 salted_hash text,  // salt + hash + number of rounds
 options map<text,text>,  // for future extensions
 PRIMARY KEY(username)
 ) WITH gc_grace_seconds=(90 * 24 * 60 * 60)  // 3 months
 
 
 The password is hashed as BCrypt.hashpw(password, 
 BCrypt.gensalt(GENSALT_LOG2_ROUNDS)), where GENSALT_LOG2_ROUNDS is set to 10.
 
 
 Out of the box, the keyspace system_auth is there, but the CFs are not defined 
 when one issues a "describe system_auth" inside the
 cassandra-cli application.
 
 The configuration file says:
 
 "PasswordAuthenticator relies on username/password pairs to authenticate
 users. It keeps usernames and hashed passwords in system_auth.credentials 
 table. Please increase system_auth keyspace replication factor if you use this 
 authenticator."
 
 On the configuration file /etc/cassandra/cassandra.yaml I have set:
 
 authenticator: org.apache.cassandra.auth.PasswordAuthenticator
 authorizer: org.apache.cassandra.auth.CassandraAuthorizer
 
 Therefore I have 3 questions.
 
 1) How can I increase the replication factor if the keyspace system_auth is 
 already there? Can I do this?
 Currently the replication factor is 1:
 [cassandra@system_auth] describe;
 Keyspace: system_auth:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 
 2) Shall I create the CFs credentials and permissions via cassandra-cli as 
 well?
 If I issue a select command from cqlsh I can see:
 
 cqlsh:system_auth SELECT * FROM credentials;
 
  username  | options | salted_hash
 ---+-+--
  cassandra |null |
 
 Even though there is no credentials CF defined in the schema yet.
 
 3) What is the process for adding more users? Shall I do it via cassandra-cli 
 or cqlsh? How shall I specify the read and write privileges, as well
 as the keyspaces for which a user has rights?
 Something like this:
 OpsCenter.rw=carlos
 system.rw=carlos
 system_traces.rw=carlos
 nando.rw=carlos
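 As a hedged sketch of the CQL3 user-management statements available in 
 Cassandra 1.2 with PasswordAuthenticator and CassandraAuthorizer (the 
 password value is a placeholder):

```sql
CREATE USER carlos WITH PASSWORD 'changeme' NOSUPERUSER;
GRANT SELECT ON KEYSPACE nando TO carlos;
GRANT MODIFY ON KEYSPACE nando TO carlos;
```

 These statements are issued from cqlsh while logged in as a superuser 
 (the default cassandra/cassandra account, for instance).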
 



Re: Bulk Loading-Unable to select from CQL3 tables with NO COMPACT STORAGE option after Bulk Loading - Cassandra version 1.2.1

2013-02-26 Thread aaron morton
CQL 3 tables that do not use compact storage use Composite Types, which 
other code may not be expecting. 

Take a look at the CQL 3 table definitions through cassandra-cli and you may 
see the changes you need to make when creating the SSTables. 
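For context, a small sketch (plain Python, function name hypothetical) of how CompositeType serializes a cell name on the wire: each component is encoded as a 2-byte big-endian length, the component bytes, and a trailing end-of-component byte. A bulk loader writing to a non-compact CQL3 table has to emit column names in this form rather than as raw UTF-8:

```python
import struct

def composite(*components: bytes) -> bytes:
    """Serialize components in Cassandra's CompositeType wire format:
    per component, a 2-byte big-endian length, the component bytes,
    then an end-of-component byte (0x00)."""
    out = b""
    for c in components:
        out += struct.pack(">H", len(c)) + c + b"\x00"
    return out

# A CQL3 column "field2" in a non-compact table is stored under the
# composite cell name ("field2",), not under the raw bytes b"field2".
name = composite(b"field2")
```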

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/02/2013, at 3:44 AM, praveen.akun...@wipro.com wrote:

 Hi All, 
 
 I am using the bulk loader program provided in Datastax website. 
 http://www.datastax.com/dev/blog/bulk-loading
 
 I am able to load data into tables created with the COMPACT STORAGE option and 
 also into tables created without this option. However, I am unable to read 
 data from the table created without COMPACT STORAGE. 
 
 I created 2 tables as below: 
 
 CREATE TABLE TABLE1(
 field1 text PRIMARY KEY,
 field2 text,
 field3 text,
 field4 text
  ) WITH COMPACT STORAGE;
 
 CREATE TABLE TABLE2(
 field1 text PRIMARY KEY,
 field2 text,
 field3 text,
 field4 text
  );
  
 Now, I loaded these 2 tables using the Java bulk loader program(Create 
 SSTables and load them using SSTableloader utility). 
 
 I can read the data from TABLE1, but, when I try to read data from TABLE2, I 
 am getting timeouts from both cqlsh & cli. 
 
 (Attached screenshots: Screen Shot 2013-02-25 at 8.10.58 PM.png, 
 Screen Shot 2013-02-25 at 8.10.38 PM.png)
 
 Is this expected behavior, or am I doing something wrong? Can anyone please 
 help. 
 
 Thanks & Best Regards, 
 Praveen
 



Re: no other nodes seen on priam cluster

2013-02-26 Thread Ben Bromhead
Hi Marcelo

A few questions:

Have you added the Priam java agent to Cassandra's JVM arguments (e.g.
-javaagent:$CASS_HOME/lib/priam-cass-extensions-1.1.15.jar), and does the
web container running Priam have permission to write to the Cassandra
config directory? Also, what do the Priam logs say?

If you want to get up and running quickly with Cassandra, AWS and Priam,
check out www.instaclustr.com.
We deploy Cassandra under your AWS account and you have full root access to
the nodes if you want to explore and play around + there is a free tier
which is great for experimenting and trying Cassandra out.

Cheers

Ben

On Wed, Feb 27, 2013 at 6:09 AM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 Hello,

  I am using cassandra 1.2.1 and I am trying to set up a Priam cluster
 on AWS with two nodes. However, I can't get both nodes up and running
 because of a weird error (at least to me).
  When I start both nodes, they are both able to connect to each other
 and do some communication. However, after some seconds, I just see
 java.lang.RuntimeException: No other nodes seen!, so they disconnect and
 die. I tried to test all ports (7000, 9160 and 7199) between both nodes
 and there is no firewall. On the second node, before the above exception, I
 get a broken pipe, as shown below.
   Any hint?

 DEBUG 18:54:31,776 attempting to connect to /10.224.238.170
 DEBUG 18:54:32,402 Reseting version for /10.224.238.170
 DEBUG 18:54:32,778 Connection version 6 from /10.224.238.170
 DEBUG 18:54:32,779 Upgrading incoming connection to be compressed
 DEBUG 18:54:32,779 Max version for /10.224.238.170 is 6
 DEBUG 18:54:32,779 Setting version 6 for /10.224.238.170
 DEBUG 18:54:32,780 set version for /10.224.238.170 to 6
 DEBUG 18:54:33,455 Disseminating load info ...
 DEBUG 18:54:59,082 Reseting version for /10.224.238.170
 DEBUG 18:55:00,405 error writing to /10.224.238.170
 java.io.IOException: Broken pipe
     at sun.nio.ch.FileDispatcher.write0(Native Method)
     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
     at sun.nio.ch.IOUtil.write(IOUtil.java:43)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
     at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
     at java.nio.channels.Channels.writeFully(Channels.java:81)
     at java.nio.channels.Channels.access$000(Channels.java:47)
     at java.nio.channels.Channels$1.write(Channels.java:155)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:272)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:189)
     at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)
 DEBUG 18:55:01,405 attempting to connect to /10.224.238.170
 DEBUG 18:55:01,461 Started replayAllFailedBatches
 DEBUG 18:55:01,462 forceFlush requested but everything is clean in batchlog
 DEBUG 18:55:01,463 Finished replayAllFailedBatches
  INFO 18:55:01,472 JOINING: schema complete, ready to bootstrap
 DEBUG 18:55:01,473 ... got ring + schema info
  INFO 18:55:01,473 JOINING: getting bootstrap token
 ERROR 18:55:01,475 Exception encountered during startup
 java.lang.RuntimeException: No other nodes seen!  Unable to bootstrap.If
 you intended to start a single-node cluster, you should make sure your
 broadcast_address (or listen_address) is listed as a seed.  Otherwise, you
 need to determine why the seed being contacted has no knowledge of the rest
 of the cluster.  Usually, this can be solved by giving all nodes the same
 seed list.
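  The exception above points at the seed configuration. With Priam the seed 
 list is normally generated by Priam's own seed provider, but as a manual 
 sanity check, a consistent cassandra.yaml seed section on both nodes would 
 look something like this (IP taken from the logs above):

```yaml
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # every node must agree on the same seed list; a node whose own
      # broadcast_address appears here will not try to bootstrap
      - seeds: "10.224.238.170"
```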


 and on the first node:

 DEBUG 18:54:30,833 Disseminating load info ...
 DEBUG 18:54:31,532 Connection version 6 from /10.242.139.159
 DEBUG 18:54:31,533 Upgrading incoming connection to be compressed
 DEBUG 18:54:31,534 Max version for /10.242.139.159 is 6
 DEBUG 18:54:31,534 Setting version 6 for /10.242.139.159
 DEBUG 18:54:31,534 set version for /10.242.139.159 to 6
 DEBUG 18:54:31,542 Reseting version for /10.242.139.159
 DEBUG 18:54:31,791 Connection version 6 from /10.242.139.159
 DEBUG 18:54:31,792 Upgrading incoming connection to be compressed
 DEBUG 18:54:31,792 Max version for /10.242.139.159 is 6
 DEBUG 18:54:31,792 Setting version 6 for /10.242.139.159
 DEBUG 18:54:31,793 set version for /10.242.139.159 to 6
  INFO 18:54:32,414 Node /10.242.139.159 is now part of the cluster
 DEBUG 18:54:32,415 Resetting pool for /10.242.139.159
 DEBUG 18:54:32,415 removing expire time for endpoint : /10.242.139.159
  INFO 18:54:32,415 InetAddress /10.242.139.159 is now UP
 DEBUG 18:54:32,789 attempting to connect to
 ec2-75-101-233-115.compute-1.amazonaws.com/10.242.139.159
 DEBUG 18:54:58,840 Started replayAllFailedBatches
 DEBUG