Re: Effective allocation of multiple disks

2010-03-10 Thread Stu Hood
Yea, I suppose major compactions are the wildcard here. Nonetheless, the 
situation where you only have 1 SSTable should be very rare.

I'll open a ticket though, because we really ought to be able to utilize those 
disks more thoroughly, and I have some ideas there.
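A note for readers: each SSTable file lives in exactly one DataFileDirectory, so whatever rule picks the directory, a single large post-compaction file ends up on a single disk. The sketch below only illustrates that point; it is not Cassandra's actual selection code, and the "most free space" rule and class name are assumptions made for the example.

    import java.io.File;

    // Illustrative only: whichever directory is returned, the whole new SSTable
    // is written beneath it, so listing several directories does not stripe a file.
    public class DataDirectoryChoiceSketch {
        static File pickDataDirectory(File[] dataFileDirectories) {
            File best = dataFileDirectories[0];
            for (File dir : dataFileDirectories) {
                // assumed rule for the sake of the example: most free space wins
                if (dir.getUsableSpace() > best.getUsableSpace()) {
                    best = dir;
                }
            }
            return best;
        }
    }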


-Original Message-
From: "Anthony Molinaro" 
Sent: Wednesday, March 10, 2010 3:38pm
To: cassandra-user@incubator.apache.org
Subject: Re: Effective allocation of multiple disks

This is incorrect, as discussed a few weeks ago.  I have a setup with multiple
disks, and as soon as compaction occurs all the data ends up on one disk.  If
you need the additional io, you will want raid0.  But simply listing multiple
DataFileDirectories will not work.

-Anthony

On Wed, Mar 10, 2010 at 02:08:13AM -0600, Stu Hood wrote:
> You can list multiple DataFileDirectories, and Cassandra will scatter files 
> across all of them. Use 1 disk for the commitlog, and 3 disks for data 
> directories.
> 
> See http://wiki.apache.org/cassandra/CassandraHardware#Disk
> 
> Thanks,
> Stu
> 
> -Original Message-
> From: "Eric Rosenberry" 
> Sent: Wednesday, March 10, 2010 2:00am
> To: cassandra-user@incubator.apache.org
> Subject: Effective allocation of multiple disks
> 
> Based on the documentation, it is clear that with Cassandra you want to have
> one disk for commitlog, and one disk for data.
> 
> My question is: If you think your workload is going to require more io
> performance to the data disks than a single disk can handle, how would you
> recommend effectively utilizing additional disks?
> 
> It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
>  If we use one for commitlog, is there a way to have Cassandra itself
> equally split data across the three remaining disks?  Or is this something
> that needs to be handled by the hardware level, or operating system/file
> system level?
> 
> Options include a hardware RAID controller in a RAID 0 stripe (this is more
> $$$ and for what gain?), or utilizing a volume manager like LVM.
> 
> Along those same lines, if you do implement some type of striping, what RAID
> stripe size is recommended?  (I think Todd Burruss asked this earlier but I
> did not see a response)
> 
> Thanks for any input!
> 
> -Eric
> 
> 

-- 

Anthony Molinaro   




RE: CassandraHardware link on the wiki FrontPage

2010-03-10 Thread Stu Hood
Anyone can edit any page once they have an account: click the "Login" link at 
the top right next to the search box to create an account.

Thanks,
Stu

-Original Message-
From: "Eric Rosenberry" 
Sent: Wednesday, March 10, 2010 2:52am
To: cassandra-user@incubator.apache.org
Subject: CassandraHardware link on the wiki FrontPage

Would it be possible to add a link to the CassandraHardware page from the
FrontPage of the wiki?

I think other new folks to Cassandra may find it useful.  ;-)

(I would do it myself, though that page is Immutable)

http://wiki.apache.org/cassandra/FrontPage

http://wiki.apache.org/cassandra/CassandraHardware

Thanks!

-Eric




RE: Effective allocation of multiple disks

2010-03-10 Thread Stu Hood
You can list multiple DataFileDirectories, and Cassandra will scatter files 
across all of them. Use 1 disk for the commitlog, and 3 disks for data 
directories.

See http://wiki.apache.org/cassandra/CassandraHardware#Disk

Thanks,
Stu

-Original Message-
From: "Eric Rosenberry" 
Sent: Wednesday, March 10, 2010 2:00am
To: cassandra-user@incubator.apache.org
Subject: Effective allocation of multiple disks

Based on the documentation, it is clear that with Cassandra you want to have
one disk for commitlog, and one disk for data.

My question is: If you think your workload is going to require more io
performance to the data disks than a single disk can handle, how would you
recommend effectively utilizing additional disks?

It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
 If we use one for commitlog, is there a way to have Cassandra itself
equally split data across the three remaining disks?  Or is this something
that needs to be handled by the hardware level, or operating system/file
system level?

Options include a hardware RAID controller in a RAID 0 stripe (this is more
$$$ and for what gain?), or utilizing a volume manager like LVM.

Along those same lines, if you do implement some type of striping, what RAID
stripe size is recommended?  (I think Todd Burruss asked this earlier but I
did not see a response)

Thanks for any input!

-Eric




Re: Hackathon?!?

2010-03-09 Thread Stu Hood
Definitely on board!

-Original Message-
From: "Dan Di Spaltro" 
Sent: Tuesday, March 9, 2010 8:05pm
To: cassandra-user@incubator.apache.org
Subject: Re: Hackathon?!?

Alright guys, we have settled on a date for the Cassandra meetup on...

April 15th, better known as, Tax day!

We can host it here at Cloudkick, unless a cooler startup wants to host it.
http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=100290781618196563860.000478354937656785449&z=19
1499 Potrero Ave, San Francisco, CA 94110

Bottom line, it would be great to get some folks together and spend some
time doing an intro, cover some deployments, data models and try to address
all the other burning questions out there.

We pushed it out from PyCON and hopefully settled on a good day, let's get a
count for how many folks are interested!

Thanks,

On Tue, Feb 9, 2010 at 3:10 PM, Reuben Smith  wrote:

> I live in the city and I'd like to add my vote for an "Intro to
> Cassandra" night.
>
> Reuben
>
> On Tue, Feb 9, 2010 at 10:43 AM, Dan Di Spaltro 
> wrote:
> > I think the tentative plans would be to push this out a bit farther
> > away from PyCon, to get a bigger attendance.
> >
> > It sounds like an "Intro to Cassandra" would be a better theme; focus
> > on the education piece.
> >
> > But it will happen! So stay tuned.
> >
> > On Tue, Feb 9, 2010 at 3:53 AM, Wayne Lewis  wrote:
> >>
> >> Hi Dan,
> >>
> >> Are you still planning for end of Feb?
> >>
> >> Please add me to the "very interested" list.
> >>
> >> Thanks!
> >> Wayne Lewis
> >>
> >>
> >> On Jan 26, 2010, at 8:42 PM, Dan Di Spaltro wrote:
> >>
> >>> Would anyone be interested in a Cassandra hack-a-thon at the end of
> >>> February in San Francisco?
> >>>
> >>> I think it would be great to get everyone together, since the last
> >>> hack-a-thon was at the Twitter office back around OSCON time.   We
> >>> could provide space in the Mission area or someone else could too, our
> >>> office is in a pretty interesting area
> >>>
> >>> (
> http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=100290781618196563860.000478354937656785449&z=17
> ).
> >>>
> >>> Tell me what you guys think!
> >>>
> >>> --
> >>> Dan Di Spaltro
> >>
> >>
> >
> >
> >
> > --
> > Dan Di Spaltro
> >
>



-- 
Dan Di Spaltro




RE: Latest check-in to trunk/ is broken

2010-03-08 Thread Stu Hood
Run `ant clean` before building. A few files moved around.

-Original Message-
From: "Cool BSD" 
Sent: Monday, March 8, 2010 5:18pm
To: "cassandra-user" 
Subject: Latest check-in to trunk/ is broken

version info:
$ svn info
Path: .
URL: https://svn.apache.org/repos/asf/incubator/cassandra/trunk
Repository Root: https://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 920560
Node Kind: directory
Schedule: normal
Last Changed Author: gdusbabek
Last Changed Rev: 920537
Last Changed Date: 2010-03-08 14:00:51 -0800 (Mon, 08 Mar 2010)

and error message:

build-project:
 [echo] apache-cassandra:
/net/f5/shared/nosql/cassandra/archive/svn/build.xml
[javac] Compiling 277 source files to
/net/f5/shared/nosql/cassandra/archive/svn/build/classes
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:112:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] private void updateEstimateFor(ColumnFamilyStore cfs,
Set> buckets)

[javac]^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:138:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] public Future>
submitAnticompaction(final ColumnFamilyStore cfStore, final
Collection ranges, final InetAddress target)
[javac]^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:240:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] int doCompaction(ColumnFamilyStore cfs,
Collection sstables, int gcBefore) throws IOException
[javac]^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:341:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] private List
doAntiCompaction(ColumnFamilyStore cfs, Collection sstables,
Collection ranges, InetAddress target)

[javac]
^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:341:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] private List
doAntiCompaction(ColumnFamilyStore cfs, Collection sstables,
Collection ranges, InetAddress target)
[javac]  ^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:451:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] static Set>
getBuckets(Iterable files, long min)
[javac] ^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:451:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] static Set>
getBuckets(Iterable files, long min)
[javac] ^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:498:
reference to SSTableScanner is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableScanner in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableScanner in org.apache.cassandra.io match
[javac] private Set scanners;
[javac] ^
[javac]
/net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:500:
reference to SSTableReader is ambiguous, both class
org.apache.cassandra.io.sstable.SSTableReader in
org.apache.cassandra.io.sstable and class
org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
[javac] public AntiCompactionIterator(Collection
sstables, Collection ranges, int gcBefo

Re: Dynamically Switching from Ordered Partitioner to Random?

2010-03-05 Thread Stu Hood
But rather than switching, you should definitely try the 'loadbalance' approach 
first, and see whether OrderPP works out for you.

-Original Message-
From: "Chris Goffinet" 
Sent: Friday, March 5, 2010 1:43pm
To: cassandra-user@incubator.apache.org
Subject: Re: Dynamically Switching from Ordered Partitioner to Random?

At this time, you have to re-import the data.

-Chris

On Fri, Mar 5, 2010 at 11:42 AM, shiv shivaji  wrote:

> I started with the ordered partitioner as I was hoping to make use of the
> map-reduce functionality. However, my data was likely lopped onto 2 key
> machines with most of it on one (as seen from another thread. There were
> also machine failures to blame for the uneven distribution). One solution
> which I am trying is to load balance. Is there any other thing I can try to
> convert the partitioner to random on a live system?
>
> I know this sounds like an odd request. Curious about my options though. I
> did see a post mentioning that one can compute the md5 hash of each key and
> then insert using that and have a mapping table from key to md5 hash.
> Unfortunately, the data is already loaded using an ordered partitioner and I
> was wondering if there is a way to switch to random now.
>
> Shiv
>



-- 
Chris Goffinet
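The md5-hash mapping Shiv mentions can be sketched in a few lines of Java. The class and method names are illustrative only, and the mapping column family from original key back to hashed key is left to the application; nothing here is Cassandra API.

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class Md5KeySketch {
        // Returns a 32-character hex digest to insert rows under, so keys spread
        // evenly as they would under RandomPartitioner; keep a separate mapping
        // from originalKey back to this value for lookups.
        static String md5Key(String originalKey) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(originalKey.getBytes("UTF-8"));
            return String.format("%032x", new BigInteger(1, digest));
        }
    }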




Re: Connect during bootstrapping?

2010-03-02 Thread Stu Hood
You are probably in the portion of bootstrap where data to be transferred is 
split out to disk, which can take a while: see 
https://issues.apache.org/jira/browse/CASSANDRA-579

Look for a 'streaming' subdirectory in your data directories to confirm.

-Original Message-
From: "Brian Frank Cooper" 
Sent: Tuesday, March 2, 2010 11:50pm
To: "cassandra-user@incubator.apache.org" 
Subject: Re: Connect during bootstrapping?

Thanks for the note.

Can you help me with something else? I can't seem to get any data to transfer 
during bootstrapping...I must be doing something wrong.

Here is what I did: I took 0.6.0-beta2, loaded 2 machines with 60-70GB each. 
Then I started a third node, with AutoBootstrap true. The node claims it is 
bootstrapping:

INFO - Auto DiskAccessMode determined to be mmap
INFO - Saved Token not found. Using Rb0mePN3PheW3haA
INFO - Creating new commitlog segment 
/home/cooperb/cassandra/commitlog/CommitLog-1267594407761.log
INFO - Starting up server gossip
INFO - Joining: getting load information
INFO - Sleeping 9 ms to wait for load information...
INFO - Node /98.137.30.37 is now part of the cluster
INFO - Node /98.137.30.38 is now part of the cluster
INFO - InetAddress /98.137.30.37 is now UP
INFO - InetAddress /98.137.30.38 is now UP
INFO - Joining: getting bootstrap token
INFO - New token will be user148315419 to assume load from /98.137.30.38
INFO - Joining: sleeping 3 for pending range setup
INFO - Bootstrapping

But when I run nodetool streams, no streams are transferring:

Mode: Bootstrapping
Not sending any streams.
Not receiving any streams.

And it doesn't look like the node is getting any data. Any ideas?

Thanks for the help...

Brian


On 3/2/10 12:22 PM, "Jonathan Ellis"  wrote:

On Tue, Mar 2, 2010 at 1:54 PM, Brian Frank Cooper
 wrote:
> Hi folks,
>
> I'm running 0.5 and I had 2 nodes up and running, then added a 3rd node in
> bootstrap mode. I understand from other discussion list threads that the new
> node doesn't serve reads while it is bootstrapping, but does that mean it
> won't connect at all?

it doesn't start the thrift listener until it is bootstrapped, so yes.

(you can tell when it's bootstrapped by when it appears in nodeprobe
ring.  0.6 also adds bootstrap progress reporting via jmx.)

> When I try to connect from my java client, or
> cassandra-cli, I get the exception below. Is it the expected behavior?
> (Also, cassandra-cli says "Connected to xxx.yahoo.com" even though it isn't
> really connected...)

This is fixed in https://issues.apache.org/jira/browse/CASSANDRA-807
for 0.6, fwiw.

-Jonathan


--
Brian Cooper
Principal Research Scientist
Yahoo! Research





Re: Is Cassandra a document based DB?

2010-03-01 Thread Stu Hood
> In HBase you have table:row:family:key:val:version, which some people
> might consider richer
Cassandra is actually table:family:row:key:val[:subval], where subvals are the 
columns stored in a supercolumn (which can be easily arranged by timestamp to 
give the versioned approach).


-Original Message-
From: "Erik Holstad" 
Sent: Monday, March 1, 2010 3:49pm
To: cassandra-user@incubator.apache.org
Subject: Re: Is Cassandra a document based DB?

On Mon, Mar 1, 2010 at 4:41 AM, Brandon Williams  wrote:

> On Mon, Mar 1, 2010 at 5:34 AM, HHB  wrote:
>
>>
>> What are the advantages/disadvantages of Cassandra over HBase?
>>
>
> Ease of setup: all nodes are the same.
>
> No single point of failure: all nodes are the same.
>
> Speed: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
>
> Richer model: supercolumns.
>
I think that there are people that would be of a different opinion here.
Cassandra has, as I've understood it, table:key:name:val and in cases the val is
a serialized data structure. In HBase you have table:row:family:key:val:version,
which some people might consider richer.

>
> Multi-datacenter awareness.
>
> There are likely other things I'm forgetting, but those stand out for me.
>
> -Brandon
>



-- 
Regards Erik




Re: Anti-compaction Diskspace issue even when latest patch applied

2010-02-28 Thread Stu Hood
`nodetool cleanup` is a very expensive process: it performs a major compaction, 
and should not be done that frequently.

-Original Message-
From: "shiv shivaji" 
Sent: Sunday, February 28, 2010 3:34pm
To: cassandra-user@incubator.apache.org
Subject: Re: Anti-compaction Diskspace issue even when latest patch applied

Seems like the temporary solution was to run a cron job which calls nodetool 
cleanup every 5 mins or so. This stopped the disk space from going too low.

The manual solution you mentioned is likely worthy of consideration as the load 
balancing is taking a while.

I will track the jira issue of anticompaction and diskspace. Thanks for the 
pointer.


Thanks, Shiv





From: Jonathan Ellis 
To: cassandra-user@incubator.apache.org
Sent: Wed, February 24, 2010 11:34:59 AM
Subject: Re: Anti-compaction Diskspace issue even when latest patch applied

as you noticed, "nodeprobe move" first unloads the data, then moves to
the new position.  so that won't help you here.

If you are using replicationfactor=1, scp the data to the previous
node on the ring, then reduce the original node's token so it isn't
responsible for so much, and run cleanup.  (you can do this w/ higher
RF too, you just have to scp the data more places.)

Finally, you could work on
https://issues.apache.org/jira/browse/CASSANDRA-579 so it doesn't need
to anticompact to disk before moving data.

-Jonathan

On Wed, Feb 24, 2010 at 12:06 PM, shiv shivaji  wrote:
> According to the stack trace I get in the log, it makes it look like the
> patch was for anti-compaction but I did not look at the source code in
> detail yet.
>
> java.util.concurrent.ExecutionException:
> java.lang.UnsupportedOperationException: disk full
> at
> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
> at
> org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:570)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.UnsupportedOperationException: disk full
> at
> org.apache.cassandra.db.CompactionManager.doAntiCompaction(CompactionManager.java:344)
> at
> org.apache.cassandra.db.CompactionManager.doCleanupCompaction(CompactionManager.java:405)
> at
> org.apache.cassandra.db.CompactionManager.access$400(CompactionManager.java:49)
> at
> org.apache.cassandra.db.CompactionManager$2.call(CompactionManager.java:130)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> ... 2 more
>
> I tried "nodetool cleanup" before and it did not really stop the disk from
> filling, is there a way to force move the data or some other way to solve
> the issue?
>
> Thanks, Shiv
>
> 
> From: Jonathan Ellis 
> To: cassandra-user@incubator.apache.org
> Sent: Wed, February 24, 2010 7:16:32 AM
> Subject: Re: Anti-compaction Diskspace issue even when latest patch applied
>
> The patch you refer to was to help *compaction*, not *anticompaction*.
>
> If the space is mostly hints for other machines (is that what you
> meant by "due to past problems with others?") you should run nodeprobe
> cleanup on it to remove data that doesn't actually belong on that
> node.
>
> -Jonathan
>
> On Wed, Feb 24, 2010 at 3:09 AM, shiv shivaji  wrote:
>> For about 6TB of  total data size with a replication factor of 2 (6TB x 2)
>> on a five node cluster, I see about 4.6 TB on one machine (due to
>> potential
>> past problems with other machines). The machine has a disk of 6TB.
>>
>> The data folder on this machine has 59,289 files totally 4.6 TB. The files
>> are the data, filter and indexes. I see that anti-compaction is running. I
>> applied a recent patch which does not do anti-compaction if disk space is
>> limited. I still see it happening. I have also called nodetool loadbalance
>> on this machine. Seems like it will run out of disk space anyway.
>>
>> The machine diskspace consumed are: (Each machine has a 6TB hard-drive on
>> RAID).
>>
>> Machine   Space Consumed
>> M1        4.47 TB
>> M2        2.93 TB
>> M3        1.83 GB
>> M4        56.19 GB
>> M5        398.01 GB
>>
>> How can I force M1 to immediately move its load to M3 and M4 for instance
>> (or any other machine). The nodetool move command moves all data, is there
>> a
>> way instead to force move 50% of data to M3 and the remaining 50% to M4
>> and
>> resume anti-compaction after t

Re: StackOverflowError on high load

2010-02-21 Thread Stu Hood
Ran,

There are bounds to how large your data directory will grow, relative to the 
actual data. Please read up on compaction: 
http://wiki.apache.org/cassandra/MemtableSSTable , and if you have a 
significant number of deletes occurring, also read 
http://wiki.apache.org/cassandra/DistributedDeletes

The key mitigation is to ensure that minor compactions get a chance to occur 
regularly. This will happen automatically, but the faster you write data to 
your nodes, the more behind on compactions they can get. We consider this a 
bug, and CASSANDRA-685 will be exploring solutions so that your client 
automatically backs off as a node becomes overloaded.

Thanks,
Stu
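A rough sketch of the client-side back-off idea behind CASSANDRA-685, as seen from the application: nothing below is Cassandra API, and the retry count and sleep times are arbitrary placeholders.

    public class WriteBackoff {
        // Retry a write, sleeping longer each time the node appears overloaded,
        // so compactions get a chance to catch up between attempts.
        static void writeWithBackoff(Runnable write) throws InterruptedException {
            long sleepMs = 100;
            for (int attempt = 0; attempt < 5; attempt++) {
                try {
                    write.run();
                    return;
                } catch (RuntimeException timedOut) {
                    Thread.sleep(sleepMs);
                    sleepMs *= 2;  // exponential back-off
                }
            }
            throw new RuntimeException("node still overloaded after retries");
        }
    }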

-Original Message-
From: "Ran Tavory" 
Sent: Sunday, February 21, 2010 9:01am
To: cassandra-user@incubator.apache.org
Subject: Re: StackOverflowError on high load

This sort of explains this, yes, but what solution can I use?
I do see the OPP writes go faster than the RP, so this makes sense that when
using the OPP there's higher chance that a host will fall behind with
compaction and eventually crash. It's not a nice feature, but hopefully
there are mitigations to this.
So my question is - what are the mitigations? What should I tell my admin to
do in order to prevent this? Telling him "increase the directory size 2x"
isn't going to cut it as the directory just keeps growing and is not
bounded...
I'm also not clear whether CASSANDRA-804 is going to be a real fix.
Thanks

On Sat, Feb 20, 2010 at 9:36 PM, Jonathan Ellis  wrote:

> if OPP is configured w/ imbalanced ranges (or less balanced than RP)
> then that would explain it.
>
> OPP is actually slightly faster in terms of raw speed.
>
> On Sat, Feb 20, 2010 at 2:31 PM, Ran Tavory  wrote:
> > interestingly, I ran the same load but this time with a random
> partitioner
> > and, although from time to time test2 was a little behind with its
> > compaction task, it did not crash and was able to eventually close the
> gaps
> > that were opened.
> > Does this make sense? Is there a reason why random partitioner is less
> > likely to be faulty in this scenario? The scenario is of about 1300
> > writes/sec of small amounts of data to a single CF on a cluster with two
> > nodes and no replication. With the order-preserving-partitioner after a
> few
> > hours of load the compaction pool is behind on one of the hosts and
> > eventually this host crashes, but with the random partitioner it doesn't
> > crash.
> > thanks
> >
> > On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis 
> wrote:
> >>
> >> looks like test1 started gc storming, so test2 treats it as dead and
> >> starts doing hinted handoff for it, which increases test2's load, even
> >> though test1 is not completely dead yet.
> >>
> >> On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory  wrote:
> >> > I found another interesting graph, attached.
> >> > I looked at the write-count and write-latency of the CF I'm writing to
> >> > and I
> >> > see a few interesting things:
> >> > 1. the host test2 crashed at 18:00
> >> > 2. At 16:00, after a few hours of load both hosts dropped their
> >> > write-count.
> >> > test1 (which did not crash) started slowing down first and then test2
> >> > slowed.
> >> > 3. At 16:00 I start seeing high write-latency on test2 only. This
> takes
> >> > about 2h until finally at 18:00 it crashes.
> >> > Does this help?
> >> >
> >> > On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory  wrote:
> >> >>
> >> >> I ran the process again and after a few hours the same node crashed
> the
> >> >> same way. Now I can tell for sure this is indeed what Jonathan
> proposed
> >> >> -
> >> >> the data directory needs to be 2x of what it is, but it looks like a
> >> >> design
> >> >> problem, how large to I need to tell my admin to set it then?
> >> >> Here's what I see when the server crashes:
> >> >> $ df -h /outbrain/cassandra/data/
> >> >> FilesystemSize  Used Avail Use% Mounted on
> >> >> /dev/mapper/cassandra-data
> >> >>97G   46G   47G  50% /outbrain/cassandra/data
> >> >> The directory is 97G and when the host crashes it's at 50% use.
> >> >> I'm also monitoring various JMX counters and I see that
> COMPACTION-POOL
> >> >> PendingTasks grows for a while on this host (not on the other host,
> >> >> btw,
> >> >> which is fine, just this host) and then flats for 3 hours. After 3
> >> >> hours of
> >> >> flat it crashes. I'm attaching the graph.
> >> >> When I restart cassandra on this host (not changed file allocation
> >> >> size,
> >> >> just restart) it does manage to compact the data files pretty fast,
> so
> >> >> after
> >> >> a minute I get 12% use, so I wonder what made it crash before that
> >> >> doesn't
> >> >> now? (could be the load that's not running now)
> >> >> $ df -h /outbrain/cassandra/data/
> >> >> FilesystemSize  Used Avail Use% Mounted on
> >> >> /dev/mapper/cassandra-data
> >> >>97G   11G   82G  12% /outbrain/cassandra/data
> >> >> The question is what siz

Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Stu Hood
> After I ran "nodeprobe compact" on node B its read latency went up to 150ms.
The compaction process can take a while to finish... in 0.5 you need to watch 
the logs to figure out when it has actually finished, and then you should start 
seeing the improvement in read latency.

> Is there any way to utilize all of the heap space to decrease the read 
> latency?
In 0.5 you can adjust the number of keys that are cached by changing the 
'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally 
cache rows. You don't want to use up all of the memory on your box for those 
caches though: you'll want to leave at least 50% for your OS's disk cache, 
which will store the full row content.


-Original Message-
From: "Weijun Li" 
Sent: Tuesday, February 16, 2010 12:16pm
To: cassandra-user@incubator.apache.org
Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 
100ms)?

Thanks for for DataFileDirectory trick and I'll give a try.

Just noticed the impact of number of data files: node A has 13 data files
with read latency of 20ms and node B has 27 files with read latency of 60ms.
After I ran "nodeprobe compact" on node B its read latency went up to 150ms.
The read latency of node A became as low as 10ms. Is this normal behavior?
I'm using random partitioner and the hardware/JVM settings are exactly the
same for these two nodes.

Another problem is that Java heap usage is always 900mb out of 6GB? Is there
any way to utilize all of the heap space to decrease the read latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams  wrote:

> On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li  wrote:
>
>> One more thoughts about Martin's suggestion: is it possible to put the
>> data files into multiple directories that are located in different physical
>> disks? This should help to improve the i/o bottleneck issue.
>>
>>
> Yes, you can already do this, just add more DataFileDirectory directives
> pointed at multiple drives.
>
>
>> Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
>
>
> Row cache and key cache both help tremendously if your read pattern has a
> decent repeat rate.  Completely random io can only be so fast, however.
>
> -Brandon
>




Re: TimeOutExceptions and Cluster Performance

2010-02-13 Thread Stu Hood
The combination of 'too many open files' and lots of memtable flushes could 
mean you have tons and tons of sstables on disk. This can make reads especially 
slow.

If you are seeing the timeouts on reads a lot more often than on writes, then 
this explanation might make sense, and you should watch 
https://issues.apache.org/jira/browse/CASSANDRA-685.

Thanks,
Stu

-Original Message-
From: "Jonathan Ellis" 
Sent: Friday, February 12, 2010 9:43pm
To: cassandra-user@incubator.apache.org
Subject: Re: TimeOutExceptions and Cluster Performance

There's a lot more details that would be useful, but if you are on the
verge of OOMing and something actually running out, then that's
probably the culprit; when the JVM gets low on ram it will consume all
your CPU trying to GC enough to continue.  (you mentioned seeing high
cpu on one core which tends to corroborate this; to confirm you can
look at the thread using the CPU:
http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=/com.ibm.java.doc.igaa/_1vg0001475cb4a-1190e2e0f74-8000_1007.html)

Look at your executor queues, in the output of nodeprobe tpstats if
you have no other metrics system.  You probably are just swamping it
with writes, if you have 1000s of ops in any of the pending queues,
that's bad.

-Jonathan

On Fri, Feb 12, 2010 at 7:40 PM, Stephen Hamer  wrote:
> Hi,
> I'm running a 5 node Cassandra cluster and am having a very tough time
> getting reasonable performance from it. Many of the requests are failing
> with TimeOutException. This is making it difficult to use Cassandra in a
> production setting.
>
> The cluster was running fine for a week or two (it was created 3 weeks ago)
> but has started to degrade in the last week. The cluster was originally only
> 3 nodes but when performance started to degrade I added another two nodes.
> This doesn't seem to have helped though.
>
> Requests being made from the my application are being balanced across the
> cluster in a round robin fashion. Many of these requests are failing with
> TimeOutException. When the occurs I can look at the DB servers and several
> of them fully utilizing 1 core. I can turn off my application when this is
> going on (which stops all reads and writes to Cassandra). The cluster seems
> to stay in this state for another several hour before returning to a resting
> state.
>
> When the CPU is loaded I see lots of messages about en-queuing, sorting, and
> writing memtables so I have tried adjusting the memtable size down to 16MB
> and raised the MemtableFlushAfterMinutes to 1440. This doesn't seem to have
> affected anything though.
>
> I was seeing errors about too many file descriptors being open so I added
> "ulimit -n 32768" to cassandra.in.sh. This seems to have fixed this. I was also
> seeing lots of out of memory exceptions so I raised the heap size to 4GB.
> This has helped but not eliminated the OOM issues.
>
> I'm not sure if it's related to any of the performance issues but I see lots
> of log entries about DigestMismatchExceptions. I've included a sample of the
> exceptions below.
>
> My Cassandra cluster is almost unusable in its current state because of the
> number of timeout exceptions that I'm seeing. I suspect that this is because
> of a configuration or I have improperly set something up. It feels like the
> database has entered a bad state which is causing it to churn as much as it
> is but have no way to verify this.
>
> What steps can I take to address the performance issues I am seeing and the
> consistent stream of TimeOutExceptions?
>
> Thanks,
> Stephen
>
>
> Here are some specifics about the cluster configuration:
>
> 5 Large EC2 instances - 7.5 GB ram, 2 cores (64bit, 1-1.2Ghz), data and
> commit logs stored on separate EBS volumes. Boxes are running Debian 5.
>
> r...@prod-cassandra4 ~/cassandra # bin/nodeprobe -host localhost ring
> Address        Status  Load      Range                                        Ring
>                                  101279862673517536112907910111793343978
> 10.254.55.191  Up      2.94 GB   27246729060092122727944947571993545         |<--|
> 10.214.119.127 Up      3.67 GB   34209800341332764076889844611182786881      |   ^
> 10.215.122.208 Up      11.86 GB  42649376116143870288751410571644302377      v   |
> 10.215.30.47   Up      6.37 GB   81374929113514034361049243620869663203      |   ^
> 10.208.246.160 Up      5.15 GB   101279862673517536112907910111793343978     |-->|
>
>
> I am running the 0.5 release of Cassandra (at commit 44e8c2e...). Here are
> some of my configuration options:
>
> Memory, disk, performance section of storage-conf.xml (I've only included
> options that I've changed from the defaults):
> org.apache.cassandra.dht.RandomPartitioner
> 3
>
> 512
> 64
> 16
> 64
> 16
> 0.1
> 1440
> 8
> 32
> periodic
> 1
> 864000
> 128
>
>
> interesting bits of cassandra.in.sh:
> ulimit -n 32768
> JVM_OPTS=" \
>         -ea \
>         -Xdebug \
>         -Xrunjdwp:transport=dt_socket,server=y,address=,suspend=n \
>         -Xms512M \
>    

Re: OOM Exception

2009-12-13 Thread Stu Hood
PS: If this turns out to actually be the problem, I'll open a ticket for it.

Thanks,
Stu

-Original Message-
From: "Stu Hood" 
Sent: Sunday, December 13, 2009 12:28pm
To: cassandra-user@incubator.apache.org
Subject: Re: OOM Exception

With 248G per box, you probably have slightly more than 1/2 billion items?

One current implementation detail in Cassandra is that it loads 1/128th of the 
index into memory for faster lookups. This means you might have something like 
4.5 million keys in memory at the moment.

The '128' value is a constant at SSTable.INDEX_INTERVAL. You should be able to 
recompile with '1024' to allow for an 8 times larger database, but understand 
that this will have a negative effect on your read performance.

Thanks,
Stu

-Original Message-
From: "Dan Di Spaltro" 
Sent: Sunday, December 13, 2009 12:06pm
To: cassandra-user@incubator.apache.org
Subject: Re: OOM Exception

What consistencyLevel are you inserting the elements?  If you do
./bin/nodeprobe -host localhost tpstats on each machine do you see one
metric that has a lot of pending items?

On Sun, Dec 13, 2009 at 8:14 AM, Brian Burruss  wrote:

> another OOM exception.  the only thing interesting about my testing is that
> there are 2 servers, RF=2, W=1, R=1 ... there is 248G of data on each
> server.  I have -Xmx3G assigned to each server
>
> 2009-12-12 22:04:37,436 ERROR [pool-1-thread-309] [Cassandra.java:734]
> Internal error processing get
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.cassandra.service.StorageProxy.weakReadLocal(StorageProxy.java:523)
>at
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:373)
>at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:92)
>at
> org.apache.cassandra.service.CassandraServer.multigetColumns(CassandraServer.java:265)
>at
> org.apache.cassandra.service.CassandraServer.multigetInternal(CassandraServer.java:320)
>at
> org.apache.cassandra.service.CassandraServer.get(CassandraServer.java:253)
>at
> org.apache.cassandra.service.Cassandra$Processor$get.process(Cassandra.java:724)
>at
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:712)
>at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:636)
>
> 
> From: Brian Burruss
> Sent: Saturday, December 12, 2009 7:45 AM
> To: cassandra-user@incubator.apache.org
> Subject: OOM Exception
>
> this happened after cassandra was running for a couple of days.  I have
> -Xmx3G on JVM.
>
> is there any other info you need so this makes sense?
>
> thx!
>
>
> 2009-12-11 21:38:37,216 ERROR [HINTED-HANDOFF-POOL:1]
> [DebuggableThreadPoolExecutor.java:157] Error in ThreadPoolExecutor
> java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.cassandra.io.BufferedRandomAccessFile.init(BufferedRandomAccessFile.java:151)
>at
> org.apache.cassandra.io.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:144)
>at
> org.apache.cassandra.io.SSTableWriter.<init>(SSTableWriter.java:53)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:911)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:855)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doMajorCompactionInternal(ColumnFamilyStore.java:698)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doMajorCompaction(ColumnFamilyStore.java:670)
>at
> org.apache.cassandra.db.HintedHandOffManager.deliverAllHints(HintedHandOffManager.java:190)
>at
> org.apache.cassandra.db.HintedHandOffManager.access$000(HintedHandOffManager.java:75)
>at
> org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:249)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:636)
>



-- 
Dan Di Spaltro






Re: OOM Exception

2009-12-13 Thread Stu Hood
With 248G per box, you probably have slightly more than 1/2 billion items?

One current implementation detail in Cassandra is that it loads 1/128th of the 
index into memory for faster lookups. This means you might have something like 
4.5 million keys in memory at the moment.

The '128' value is a constant at SSTable.INDEX_INTERVAL. You should be able to 
recompile with '1024' to allow for an 8 times larger database, but understand 
that this will have a negative effect on your read performance.

Thanks,
Stu
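A back-of-the-envelope check of the figures above; the row count is an assumption implied by the quoted numbers, not something reported in the thread.

    public class IndexSampleMath {
        public static void main(String[] args) {
            long rows = 576000000L;  // assumed: "slightly more than 1/2 billion items"
            int interval = 128;      // the SSTable.INDEX_INTERVAL default cited above
            System.out.println(rows / interval);  // 4,500,000 index samples in memory
            System.out.println(rows / 1024);      // ~562,500 if recompiled with 1024
        }
    }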

-Original Message-
From: "Dan Di Spaltro" 
Sent: Sunday, December 13, 2009 12:06pm
To: cassandra-user@incubator.apache.org
Subject: Re: OOM Exception

What consistencyLevel are you inserting the elements?  If you do
./bin/nodeprobe -host localhost tpstats on each machine do you see one
metric that has a lot of pending items?

On Sun, Dec 13, 2009 at 8:14 AM, Brian Burruss  wrote:

> another OOM exception.  the only thing interesting about my testing is that
> there are 2 servers, RF=2, W=1, R=1 ... there is 248G of data on each
> server.  I have -Xmx3G assigned to each server
>
> 2009-12-12 22:04:37,436 ERROR [pool-1-thread-309] [Cassandra.java:734]
> Internal error processing get
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.cassandra.service.StorageProxy.weakReadLocal(StorageProxy.java:523)
>at
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:373)
>at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:92)
>at
> org.apache.cassandra.service.CassandraServer.multigetColumns(CassandraServer.java:265)
>at
> org.apache.cassandra.service.CassandraServer.multigetInternal(CassandraServer.java:320)
>at
> org.apache.cassandra.service.CassandraServer.get(CassandraServer.java:253)
>at
> org.apache.cassandra.service.Cassandra$Processor$get.process(Cassandra.java:724)
>at
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:712)
>at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:636)
>
> 
> From: Brian Burruss
> Sent: Saturday, December 12, 2009 7:45 AM
> To: cassandra-user@incubator.apache.org
> Subject: OOM Exception
>
> this happened after cassandra was running for a couple of days.  I have
> -Xmx3G on JVM.
>
> is there any other info you need so this makes sense?
>
> thx!
>
>
> 2009-12-11 21:38:37,216 ERROR [HINTED-HANDOFF-POOL:1]
> [DebuggableThreadPoolExecutor.java:157] Error in ThreadPoolExecutor
> java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.cassandra.io.BufferedRandomAccessFile.init(BufferedRandomAccessFile.java:151)
>at
> org.apache.cassandra.io.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:144)
>at
> org.apache.cassandra.io.SSTableWriter.<init>(SSTableWriter.java:53)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:911)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:855)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doMajorCompactionInternal(ColumnFamilyStore.java:698)
>at
> org.apache.cassandra.db.ColumnFamilyStore.doMajorCompaction(ColumnFamilyStore.java:670)
>at
> org.apache.cassandra.db.HintedHandOffManager.deliverAllHints(HintedHandOffManager.java:190)
>at
> org.apache.cassandra.db.HintedHandOffManager.access$000(HintedHandOffManager.java:75)
>at
> org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:249)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:636)
>



-- 
Dan Di Spaltro




Re: cassandra over hbase

2009-11-24 Thread Stu Hood
> JR> After chatting with some Facebook guys, we realized that one potential
> JR> benefit from using HDFS is that the recovery from losing partial data in a
> JR> node is more efficient. Suppose that one lost a single disk at a node. 
> HDFS
> JR> can quickly rebuild the blocks on the failed disk in parallel.

HDFS replicates eagerly, which means that having a node down for longer than a 
timeout period will also mean that you do more work than you needed. Cassandra 
replicates (very) lazily, and I prefer laziness for the sake of efficiency.

> JR> So, when this happens, the whole node probably has to be taken out
> JR> and bootstrapped. The same problem exists when a single sstable file
> JR> is corrupted.
> I think recovering a single sstable is a useful thing, and it seems like
> a better problem to solve.

This is why we need to get #193 in. Going to the filesystem and 
deleting/fuzzing an SSTable on a node and then running a repair will cause a 
new SSTable to be created that overlays and repairs the first based on the 
data from the other nodes.

Thanks,
Stu

-Original Message-
From: "Ted Zlatanov" 
Sent: Tuesday, November 24, 2009 8:40am
To: cassandra-user@incubator.apache.org
Subject: Re: cassandra over hbase

On Mon, 23 Nov 2009 11:58:08 -0800 Jun Rao  wrote: 

JR> After chatting with some Facebook guys, we realized that one potential
JR> benefit from using HDFS is that the recovery from losing partial data in a
JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
JR> can quickly rebuild the blocks on the failed disk in parallel. This is a
JR> bit hard to do in cassandra, since we can't easily find the data on the
JR> failed disk from another node. 

This is an architectural issue, right?  IIUC Cassandra simply doesn't
care about disks.  I think that's a plus, actually, because it
simplifies the code and filesystems in my experience are better left up
to the OS.  For instance, we're evaluating Lustre and for many specific
reasons it's significantly better for our needs than HDFS, so HDFS would
be a tough sell.

JR> So, when this happens, the whole node probably has to be taken out
JR> and bootstrapped. The same problem exists when a single sstable file
JR> is corrupted.

I think recovering a single sstable is a useful thing, and it seems like
a better problem to solve.

Ted





Re: quorum / hinted handoff

2009-11-20 Thread Stu Hood
You need a quorum relative to your replication factor. You mentioned in the 
first e-mail that you have RF=2, so you need a quorum of 2. If you use RF=3, 
then you need a quorum of 2 as well.
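In other words, quorum is a majority of the replication factor, not of the number of live nodes. A tiny illustration of the arithmetic (integer division):

    public class QuorumMath {
        static int quorum(int replicationFactor) {
            return replicationFactor / 2 + 1;  // RF=2 -> 2, RF=3 -> 2, RF=5 -> 3
        }

        public static void main(String[] args) {
            System.out.println(quorum(2) + " " + quorum(3));  // prints "2 2"
        }
    }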

-Original Message-
From: "B. Todd Burruss" 
Sent: Friday, November 20, 2009 4:14pm
To: cassandra-user@incubator.apache.org
Subject: Re: quorum / hinted handoff

not really.  it seems that if i start with 3 nodes, remove 1 of them, i
should still have a quorum, which is 2.  this is not what i experience.

On Fri, 2009-11-20 at 16:03 -0600, Jonathan Ellis wrote:
> Oh, okay.  Then it's working as expected.
> 
> Does it make more sense to you now? :)
> 
> -Jonathan
> 
> On Fri, Nov 20, 2009 at 3:43 PM, B. Todd Burruss  wrote:
> > this was on the build i got yesterday, 882359.
> >
> > ... and you are correct about if you start with 2 nodes and take one
> > down - there isn't a quorum and the write/read fails.  i tested that as
> > well.
> >
> > thx!
> >
> >
> > On Fri, 2009-11-20 at 15:30 -0600, Jonathan Ellis wrote:
> >> On Fri, Nov 20, 2009 at 11:31 AM, B. Todd Burruss  
> >> wrote:
> >> > one more point on this .. if i only start a cluster with 2 nodes, and i
> >> > use the same config setup (RF=2, etc) .. it works fine.  it's only when
> >> > i start with the 3 nodes and remove 1.  in fact, i remove the node
> >> > before i do any reads or writes at all, completely fresh database.
> >>
> >> That sounds like a bug.  If you have 2 nodes, RF of 2, and take one
> >> node down then quorum anything should always fail.
> >>
> >> Is this on trunk still?
> >>
> >> -Jonathan
> >
> >
> >






Re: bandwidth limiting Cassandra's replication and access control

2009-11-11 Thread Stu Hood
Hey Ted,

Would you mind creating a ticket for this issue in JIRA? A lot of discussion 
has gone on, and a place to collect the design and feedback would be a good 
start.

Thanks,
Stu

-Original Message-
From: "Ted Zlatanov" 
Sent: Wednesday, November 11, 2009 3:28pm
To: cassandra-user@incubator.apache.org
Cc: cassandra-...@incubator.apache.org
Subject: Re: bandwidth limiting Cassandra's replication and access control

On Wed, 11 Nov 2009 07:40:00 -0800 "Coe, Robin"  wrote: 

CR> Just going to chime in here, because I have experience writing apps
CR> that use JAAS and JNDI to authenticate against LDAP and JDBC
CR> services.  However, I only just started looking at Cassandra this
CR> week, so I'm not certain of the premise behind controlling access to
CR> the Cassandra service.

CR> IMO, auth services should be left to the application layer that
CR> interfaces to Cassandra and not built into Cassandra.  In the
CR> tutorial snippet included below, the access being granted is at the
CR> codebase level, not the transaction level.  Since users of Cassandra
CR> will generally be fronted by a service layer, the java security
CR> manager isn’t going to suffice.  What this snippet could do, though,
CR> and may be the rationale for the request, is to ensure that
CR> unauthorized users cannot instantiate a new Cassandra server.
CR> However, if a user has physical access to the machine on which
CR> Cassandra is installed, they could easily bypass that layer of
CR> security.

CR> So, I guess I'm wondering whether this discussion pertains to
CR> application-layer security, i.e., permission to execute Thrift
CR> transactions, or Cassandra service security?  Or is it strictly a
CR> utility function, to create a map of users to specific Keyspaces, to
CR> simplify the Thrift API?

(note followups to the devel list)

I mentioned I didn't know JAAS so I appreciate any help you can give.
Specifically, I don't know yet what is the difference between the
codebase level and the transaction level in JAAS terms.  Can you
explain?

I am interested in controlling the Thrift client API, not the Gossip
replication service.  The authenticating clients will not have physical
access to the machine and all the authentication tokens will have to be
passed over a Thrift login call. How would you use JAAS+JNDI to control
that?  The access point is CassandraServer.java as Jonathan mentioned.

Ted





RE: Incr/Decr Counters in Cassandra

2009-11-04 Thread Stu Hood
This type of problem is one of the primary examples of something that should be 
handled by pluggable/client-side conflict resolution in an eventually 
consistent system. Currently, all conflicts in Cassandra are handled with 
"highest timestamp wins".

Rather than attempting to add atomic operations, I think we should support one 
of the following in the (near) future:
 1) Client side resolution
   * When two writes conflict (ex: two clients simultaneously read the counter 
at "2", and then write it at "3"), the next client receives a callback with the 
old value and both new values, and can then do application specific resolution 
(ex: both clients incremented by "1", so use "4").
 2) Pluggable resolution
   * When two writes conflict, pluggable logic on the server side decides how 
to merge the writes. You could implement the same algorithm used in the example 
above, but the code would have to exist on the server side.

Personally, I think (1) is the better approach, and for backwards compatibility 
in the API, you could make the resolution optional.
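A hypothetical sketch of what option (1) could look like from the client's side. None of these types exist in Cassandra; they only illustrate the callback idea, using the counter example from the message.

    public class ConflictResolutionSketch {
        // Hypothetical callback shape, not an existing Cassandra interface.
        interface Resolver<V> {
            V resolve(V previous, V writeA, V writeB);
        }

        public static void main(String[] args) {
            // Both clients read the counter at 2 and wrote 3; re-applying both
            // increments yields 4, as in the example above.
            Resolver<Integer> counterMerge = new Resolver<Integer>() {
                public Integer resolve(Integer previous, Integer a, Integer b) {
                    return previous + (a - previous) + (b - previous);
                }
            };
            System.out.println(counterMerge.resolve(2, 3, 3));  // 4
        }
    }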


-Original Message-
From: "Chris Goffinet" 
Sent: Wednesday, November 4, 2009 3:32pm
To: cassandra-user@incubator.apache.org
Cc: cassandra-...@incubator.apache.org
Subject: Incr/Decr Counters in Cassandra

Hey,

At Digg we've been thinking about counters in Cassandra. In a lot of  
our use cases we need this type of support from a distributed storage  
system. Anyone else out there who has such needs as well? Zookeeper  
actually has such support and we might use that if we can't get the  
support in Cassandra.

---
Chris Goffinet
goffi...@digg.com









RE: How does Cassandra store data physically?

2009-07-01 Thread Stu Hood
There is no such thing as a column or supercolumn that is not contained in a 
ColumnFamily. The ColumnFamily is the structure that is stored together on disk.

A supercolumn is not what you think it is: supercolumns are like regular 
columns, except they contain other columns, and you can have an almost infinite 
number of supercolumns within a SuperColumnFamily.

A ColumnFamily is laid out on disk as a sequence of values which is sorted by 
key, then by (super)column name (or column timestamp), then subcolumn 
name/timestamp. Therefore, it is very fast to get contiguous keys from the 
ColumnFamily, but to get a single column name from multiple keys Cassandra 
still needs to seek to the next interesting column on disk.

There is no concept of 'blocks' in the Cassandra representation, because it 
does not use a B-Tree to store data. There is an index for each ColumnFamily on 
disk that allows Cassandra to seek directly to a key in the sorted file.

Please see http://wiki.apache.org/cassandra/DataModel

Thanks,
Stu
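To make that ordering concrete, here is a purely illustrative model of the layout as nested sorted maps; it mirrors the sort order described above, not the actual SSTable file format, and all the names are made up.

    import java.util.TreeMap;

    public class LayoutSketch {
        public static void main(String[] args) {
            // key -> (supercolumn name -> (subcolumn name -> value)), each level kept
            // sorted: contiguous keys are cheap to scan, but reading one column name
            // across many keys still costs a seek per key.
            TreeMap<String, TreeMap<String, TreeMap<String, String>>> columnFamily =
                    new TreeMap<String, TreeMap<String, TreeMap<String, String>>>();
            TreeMap<String, TreeMap<String, String>> row =
                    new TreeMap<String, TreeMap<String, String>>();
            row.put("supercol1", new TreeMap<String, String>());
            row.get("supercol1").put("subcol1", "value");
            columnFamily.put("key1", row);
        }
    }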

-Original Message-
From: "Ivan Chang" 
Sent: Wednesday, July 1, 2009 3:00pm
To: cassandra-user@incubator.apache.org
Subject: How does Cassandra store data physically?

I am wondering how Cassandra stores its columns and super columns in the
database files?

A supercolumn logically groups a set of related columns together, when the
supercolumn is written to file, are the columns also stored in adjacent
blocks to each other so IO cost is minimized for related data?  What about
individual columns not associated with any supercolumn, but related only
through a given key?

Thanks,
Ivan