Re: Mutation dropped

2013-02-21 Thread Wei Zhu
Thanks Aaron for the great information, as always. I just checked cfhistograms 
and only a handful of read latencies are above 100ms, but proxyhistograms 
shows roughly ten times as many above 100ms. We are using QUORUM for reads 
with RF=3, and I understand the coordinator needs to get digests from the other 
nodes and run read repair on a mismatch, etc. But is it normal for the latency 
from proxyhistograms to go beyond 100ms? Is there any way to improve that? 
We track metrics on the client side and the 95th percentile response time 
averages 40ms, which is a bit high. Our 50th percentile is great, under 3ms. 

Any suggestion is very much appreciated.

Thanks.
-Wei

- Original Message -
From: "aaron morton" 
To: "Cassandra User" 
Sent: Thursday, February 21, 2013 9:20:49 AM
Subject: Re: Mutation dropped

> What does rpc_timeout control? Only the reads/writes? 
Yes. 

> like data stream,
streaming_socket_timeout_in_ms in the yaml

> merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

> What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the 
message, plus the time it takes the coordinator to do its thing. You can look 
at cfhistograms and proxyhistograms to get a better idea of how long a request 
takes in your system.  
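
As a rough sketch of where to look (keyspace and column family names below are 
placeholders, and the yaml values shown are the 1.x shipped defaults):

    # per-CF read/write latency measured on this node
    nodetool -h 127.0.0.1 cfhistograms MyKeyspace MyCF

    # whole-request latency as seen when this node is the coordinator
    nodetool -h 127.0.0.1 proxyhistograms

    # cassandra.yaml
    rpc_timeout_in_ms: 10000            # read/write request timeout
    streaming_socket_timeout_in_ms: 0   # 0 disables the streaming socket timeout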
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 6:56 AM, Wei Zhu  wrote:

> What does rpc_timeout control? Only the reads/writes? How about other 
> inter-node communication, like data stream, merkle tree request? What is a 
> reasonable value for rpc_timeout? The default value of 10 seconds is way too 
> long. What is the side effect if it's set to a really small number, say 20ms?
> 
> Thanks.
> -Wei
> 
> From: aaron morton 
> To: user@cassandra.apache.org 
> Sent: Tuesday, February 19, 2013 7:32 PM
> Subject: Re: Mutation dropped
> 
>> Does the rpc_timeout not control the client timeout ?
> No it is how long a node will wait for a response from other nodes before 
> raising a TimedOutException if less than CL nodes have responded. 
> Set the client side socket timeout using your preferred client. 
> 
>> Is there any param which is configurable to control the replication timeout 
>> between nodes ?
> There is no such thing.
> rpc_timeout is roughly like that, but it's not right to think about it that 
> way. 
> i.e. if a message to a replica times out and CL nodes have already responded 
> then we are happy to call the request complete. 
> 
> Cheers
> 
>  
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19/02/2013, at 1:48 AM, Kanwar Sangha  wrote:
> 
>> Thanks Aaron.
>>  
>> Does the rpc_timeout not control the client timeout ? Is there any param 
>> which is configurable to control the replication timeout between nodes ? Or 
>> the same param is used to control that since the other node is also like a 
>> client ?
>>  
>>  
>>  
>> From: aaron morton [mailto:aa...@thelastpickle.com] 
>> Sent: 17 February 2013 11:26
>> To: user@cassandra.apache.org
>> Subject: Re: Mutation dropped
>>  
>> You are hitting the maximum throughput on the cluster. 
>>  
>> The messages are dropped because the node fails to start processing them 
>> before rpc_timeout. 
>>  
>> However the request is still a success because the client requested CL was 
>> achieved. 
>>  
>> Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
>> Both nodes replicate each row, and writes are sent to each replica, so the 
>> only thing the client is waiting on is the local node writing to its 
>> commit log. 
>>  
>> Testing with (and running in prod) RF 3 and CL QUORUM is a more real-world 
>> scenario. 
>>  
>> Cheers
>>  
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>  
>> @aaronmorton
>> http://www.thelastpickle.com
>>  
>> On 15/02/2013, at 9:42 AM, Kanwar Sangha  wrote:
>> 
>> 
>> Hi – Is there a parameter which can be tuned to prevent the mutations from 
>> being dropped ? Is this logic correct ?
>>  
>> Node A and B with RF=2, CL =1. Load balanced between the two.
>>  
>> --  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
>> UN  10.x.x.x  746.78 GB  256     100.0%            dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
>> UN  10.x.x.x  880.77 GB  256     100.0%            95d59054-be99-455f-90d1-f43981d3d778  rack1
>>  
>> Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
>> falling behind and we see the mutation dropped messages. But there are no 
>> failures on the client. Does that mean the other node is not able to persist 
>> the data?

RE: Using Cassandra for read operations

2013-02-21 Thread Viktor Jevdokimov
Bill de hÓra already answered; I'd like to add:

To achieve ~4ms reads (from the client's standpoint):
1. You can't use multi-slice, since different keys may reside on different nodes, 
which requires internode communication. Design your data and reads to use one 
key/row.
2. Use ConsistencyLevel.ONE to avoid waiting for other nodes.
3. Use a smart client that selects endpoints by token (key) to send each request 
to the appropriate node: Astyanax (Java), or write such a client yourself (see 
the sketch below).
4. Turn off the dynamic snitch. Even when the coordinator could read locally, the 
dynamic snitch may redirect the read to another replica.
5. Use SSDs to avoid the re-caching penalty when SSTables are compacted.
6. If you do writes, the remaining issue is GC. Unless you're on the Azul Zing 
JVM, which I can't confirm to be better than Oracle HotSpot or JRockit (both 
have GC issues), you can't tune the JVM to keep young-gen GC pauses as low as 
you need; you will be trading pause frequency against pause length.
So if you can afford Zing, also check Aerospike (ex-CitrusLeaf), an alternative 
to Cassandra, which is written in C and has no GC issues.
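
A minimal sketch of points 2 and 3 using Astyanax (cluster, keyspace and seed 
values are placeholders; method names are from the 1.x Astyanax builder as I 
remember it, so check them against your version):

    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.model.ConsistencyLevel;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
        .forCluster("MyCluster")
        .forKeyspace("MyKeyspace")
        .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
            .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)         // learn the ring
            .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)     // route by token, point 3
            .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ONE))  // point 2
        .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
            .setPort(9160)
            .setMaxConnsPerHost(4)
            .setSeeds("10.0.0.1:9160"))
        .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
        .buildKeyspace(ThriftFamilyFactory.getInstance());
    context.start();
    Keyspace keyspace = context.getClient();

Point 4 maps to dynamic_snitch: false in cassandra.yaml, which, if I recall 
correctly, is honored even when the line is absent from the default file.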


> From: Bill de hÓra [mailto:b...@dehora.net]
> Sent: Thursday, February 21, 2013 22:07
> To: user@cassandra.apache.org
> Subject: Re: Using Cassandra for read operations
>
> In a nutshell -
>
> - Start with defaults and tune based on small discrete adjustments and leave
> time to see the effect of each change. No-one will know your workload
> better than you and the questions you are asking are workload sensitive.
>
> - Allow time for tuning and spending time understanding the memory model
> and JVM GC.
>
> - Be very careful with caches. Leave enough room in the OS for its own disk
> cache.
>
> - Get an SSD
>
>
> Bill
>
>
> On 21 Feb 2013, at 19:03, amulya rattan  wrote:
>
> > Dear All,
> >
> > We are currently evaluating Cassandra for an application involving strict
> SLAs(Service level agreements). We just need one column family with a long
> key and approximately 70-80 bytes row. We are not concerned about write
> performance but are primarily concerned about read. For our SLAs, a read of
> max 15-20 rows at once(using multi slice), should not take more than 4 ms.
> Till now, on a single-node setup, using Cassandra's stress tool, the numbers 
> are promising. But I am guessing that's because there is no network latency
> involved there, and since we set the memtable around 2GB (4GB heap), we never
> had to go to disk I/O.
> >
> > Assuming our nodes having >32GB RAM, a couple of questions regarding
> read:
> >
> > * To avoid disk I/Os, the best option we thought is to have data in memory.
> Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size?
> Obviously flushing will take a lot of time but would that hurt that node's
> performance big time?
> >
> > * Cassandra stress tool only gives out average read latency. Is there a way
> to figure out max read-latency for a bunch of read operations?
> >
> > * How big a row cache can one have? Given that cassandra provides off-
> heap row caching, in a machine >32 gb RAM, would it be wise to have a >10
> gb row cache with 8 gb java heap? And how big should the corresponding key
> cache be then?
> >
> > Any response is appreciated.
> >
> > ~Amulya
> >


Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania





Re: Cassandra with SAN

2013-02-21 Thread Ben Gambley
On Friday, February 22, 2013, Jared Biel wrote:

> > As a counter argument though, anyone running a C* cluster on the Amazon
> cloud is going to be using SAN storage (or some kind of proprietary storage
> array) at the lowest  layers...Amazon isn't going to have a bunch of JBOD
> running their cloud infrastructure.  However, they've invested in the
> infrastructure to do it right.
>
> This is certainly true when using EBS, however it's generally not
> recommended to use EBS when running Cassandra. EBS has proven to be
> unreliable in the past and it's a bit of a SPOF. Instead, it's recommended
> to use the "instance store" disks that come with most instances (handy
> chart here: http://www.ec2instances.info/). These are the rough
> equivalent of local disks (probably host level RAID 10 storage if I'd have
> to guess.)
>
> -Jared
>
> On 22 February 2013 00:40, Michael Morris wrote:
>
> I'm running a 27 node cassandra cluster on SAN without issue.  I will be
> perfectly clear though, the hosts are multi-homed to different
> switches/fabrics in the SAN, we have an _expensive_ EMC array, and other
> than a datacenter-wide power outage, there's no SPOF for the SAN.  We use
> it because it's there, and it's already a sunk cost.
>
> I certainly would not go out of my way to purchase SAN infrastructure for
> a C* cluster, it just doesn't make sense (for all the reasons others have
> mentioned).  Any more, you can load up a single 2U server with multi-TB
> worth of disk, so the aggregate storage capacity of your C* cluster could
> potentially be as much as a SAN you would purchase (and a lot less hassle
> too).
>
> As a counter argument though, anyone running a C* cluster on the Amazon
> cloud is going to be using SAN storage (or some kind of proprietary storage
> array) at the lowest layers...Amazon isn't going to have a bunch of JBOD
> running their cloud infrastructure.  However, they've invested in the
> infrastructure to do it right.
>
> - Mike
>
>
> On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz wrote:
>
> I shouldn't have used the word "spinning"... SSDs are a great option as
> well.
>
> I also agree with all the "expensive SPOF" points others have made.
>
> Sent from my iPhone
>
> On Feb 21, 2013, at 6:56 PM, "P. Taylor Goetz"  wrote:
>
> Cassandra is designed to write and read data in a way that is optimized
> for physical spinning disks.
>
> Running C* on a SAN introduces a layer of abstraction that, at best
> negates those optimizations, and at worst introduces additional overhead.
>
> Sent from my iPhone
>
> On Feb 21, 2013, at 6:42 PM, Kanwar Sangha  wrote:
>
>  Ok. What would be the drawbacks :)
>
> From: Michael Kjellman [mailto:mkjell...@barracuda.com]
> Sent: 21 February 2013 17:12
> To: user@cassandra.apache.org
> Subject: Re: Cassandra with SAN
>
> No, this is a really really bad idea and C* was not designed for this, in
> fact, it was designed so you don't need to have a large expensive SAN.
>
> Don't be tempted by the shiny expensive SAN. :)
>
> If money is no object instead throw SSD's in your nodes and run 10G
> between racks
>
> From: Kanwar Sangha 
> Reply-To: "user@cassandra.apache.org" <
>
>


Re: Cassandra with SAN

2013-02-21 Thread Jared Biel
> As a counter argument though, anyone running a C* cluster on the Amazon
cloud is going to be using SAN storage (or some kind of proprietary storage
array) at the lowest  layers...Amazon isn't going to have a bunch of JBOD
running their cloud infrastructure.  However, they've invested in the
infrastructure to do it right.

This is certainly true when using EBS, however it's generally not
recommended to use EBS when running Cassandra. EBS has proven to be
unreliable in the past and it's a bit of a SPOF. Instead, it's recommended
to use the "instance store" disks that come with most instances (handy
chart here: http://www.ec2instances.info/). These are the rough equivalent
of local disks (probably host level RAID 10 storage if I'd have to guess.)

-Jared

On 22 February 2013 00:40, Michael Morris wrote:

> I'm running a 27 node cassandra cluster on SAN without issue.  I will be
> perfectly clear though, the hosts are multi-homed to different
> switches/fabrics in the SAN, we have an _expensive_ EMC array, and other
> than a datacenter-wide power outage, there's no SPOF for the SAN.  We use
> it because it's there, and it's already a sunk cost.
>
> I certainly would not go out of my way to purchase SAN infrastructure for
> a C* cluster, it just doesn't make sense (for all the reasons others have
> mentioned).  Any more, you can load up a single 2U server with multi-TB
> worth of disk, so the aggregate storage capacity of your C* cluster could
> potentially be as much as a SAN you would purchase (and a lot less hassle
> too).
>
> As a counter argument though, anyone running a C* cluster on the Amazon
> cloud is going to be using SAN storage (or some kind of proprietary storage
> array) at the lowest layers...Amazon isn't going to have a bunch of JBOD
> running their cloud infrastructure.  However, they've invested in the
> infrastructure to do it right.
>
> - Mike
>
>
> On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz wrote:
>
>> I shouldn't have used the word "spinning"... SSDs are a great option as
>> well.
>>
>> I also agree with all the "expensive SPOF" points others have made.
>>
>> Sent from my iPhone
>>
>> On Feb 21, 2013, at 6:56 PM, "P. Taylor Goetz"  wrote:
>>
>> Cassandra is designed to write and read data in a way that is optimized
>> for physical spinning disks.
>>
>> Running C* on a SAN introduces a layer of abstraction that, at best
>> negates those optimizations, and at worst introduces additional overhead.
>>
>> Sent from my iPhone
>>
>> On Feb 21, 2013, at 6:42 PM, Kanwar Sangha  wrote:
>>
>>  Ok. What would be the drawbacks :)
>>
>> From: Michael Kjellman [mailto:mkjell...@barracuda.com]
>> Sent: 21 February 2013 17:12
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra with SAN
>>
>> No, this is a really really bad idea and C* was not designed for this, in
>> fact, it was designed so you don't need to have a large expensive SAN.
>>
>> Don't be tempted by the shiny expensive SAN. :)
>>
>> If money is no object instead throw SSD's in your nodes and run 10G
>> between racks
>>
>> From: Kanwar Sangha 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Thursday, February 21, 2013 2:56 PM
>> To: "user@cassandra.apache.org" 
>> Subject: Cassandra with SAN
>>
>> Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which
>> provides me 8 Petabytes of storage. Would I not be I/O bound irrespective
>> of the no of Cassandra machines and scaling by adding
>> machines won’t help ?
>>
>> Thanks
>> Kanwar
>>
>>
>


Re: unsubscribe

2013-02-21 Thread Eric Evans
On Tue, Feb 19, 2013 at 1:02 PM, Anurag Gujral  wrote:
>
> Unsubscribe me please.
> Thanks

Could I interest you in a picture of a lemur instead?

http://goo.gl/RZw3e


--
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: Counting problem

2013-02-21 Thread Jason Wee
There is a limit option, find it in the doc.
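
For example, in cassandra-cli (which lists 100 rows by default) or in CQL3; 
the column family name is a placeholder:

    list MyCF limit 1000;

    SELECT * FROM MyCF LIMIT 1000;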


On Fri, Feb 22, 2013 at 3:41 AM, Sri Ramya  wrote:

> Hi,
> Cassandra can display a maximum of 100 rows in a column family. Can I increase
> it? If it is possible, please mention here.
>   Thank you
>


Re: Cassandra with SAN

2013-02-21 Thread Michael Morris
I'm running a 27 node cassandra cluster on SAN without issue.  I will be
perfectly clear though, the hosts are multi-homed to different
switches/fabrics in the SAN, we have an _expensive_ EMC array, and other
than a datacenter-wide power outage, there's no SPOF for the SAN.  We use
it because it's there, and it's already a sunk cost.

I certainly would not go out of my way to purchase SAN infrastructure for a
C* cluster, it just doesn't make sense (for all the reasons others have
mentioned).  Any more, you can load up a single 2U server with multi-TB
worth of disk, so the aggregate storage capacity of your C* cluster could
potentially be as much as a SAN you would purchase (and a lot less hassle
too).

As a counter argument though, anyone running a C* cluster on the Amazon
cloud is going to be using SAN storage (or some kind of proprietary storage
array) at the lowest layers...Amazon isn't going to have a bunch of JBOD
running their cloud infrastructure.  However, they've invested in the
infrastructure to do it right.

- Mike

On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz  wrote:

> I shouldn't have used the word "spinning"... SSDs are a great option as
> well.
>
> I also agree with all the "expensive SPOF" points others have made.
>
> Sent from my iPhone
>
> On Feb 21, 2013, at 6:56 PM, "P. Taylor Goetz"  wrote:
>
> Cassandra is designed to write and read data in a way that is optimized
> for physical spinning disks.
>
> Running C* on a SAN introduces a layer of abstraction that, at best
> negates those optimizations, and at worst introduces additional overhead.
>
> Sent from my iPhone
>
> On Feb 21, 2013, at 6:42 PM, Kanwar Sangha  wrote:
>
>  Ok. What would be the drawbacks :)
>
> From: Michael Kjellman [mailto:mkjell...@barracuda.com]
> Sent: 21 February 2013 17:12
> To: user@cassandra.apache.org
> Subject: Re: Cassandra with SAN
>
> No, this is a really really bad idea and C* was not designed for this, in
> fact, it was designed so you don't need to have a large expensive SAN.
>
> Don't be tempted by the shiny expensive SAN. :)
>
> If money is no object instead throw SSD's in your nodes and run 10G
> between racks
>
> From: Kanwar Sangha 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, February 21, 2013 2:56 PM
> To: "user@cassandra.apache.org" 
> Subject: Cassandra with SAN
>
> Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which
> provides me 8 Petabytes of storage. Would I not be I/O bound irrespective
> of the no of Cassandra machines and scaling by adding
> machines won’t help ?
>
> Thanks
> Kanwar
>
>


Re: Cassandra with SAN

2013-02-21 Thread P. Taylor Goetz
I shouldn't have used the word "spinning"... SSDs are a great option as well.

I also agree with all the "expensive SPOF" points others have made.

Sent from my iPhone

On Feb 21, 2013, at 6:56 PM, "P. Taylor Goetz"  wrote:

> Cassandra is designed to write and read data in a way that is optimized for 
> physical spinning disks.
> 
> Running C* on a SAN introduces a layer of abstraction that, at best negates 
> those optimizations, and at worst introduces additional overhead.
> 
> Sent from my iPhone
> 
> On Feb 21, 2013, at 6:42 PM, Kanwar Sangha  wrote:
> 
>> Ok. What would be the drawbacks :)
>>  
>> From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
>> Sent: 21 February 2013 17:12
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra with SAN
>>  
>> No, this is a really really bad idea and C* was not designed for this, in 
>> fact, it was designed so you don't need to have a large expensive SAN.
>>  
>> Don't be tempted by the shiny expensive SAN. :)
>>  
>> If money is no object instead throw SSD's in your nodes and run 10G between 
>> racks
>>  
>> From: Kanwar Sangha 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Thursday, February 21, 2013 2:56 PM
>> To: "user@cassandra.apache.org" 
>> Subject: Cassandra with SAN
>>  
>> Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
>> me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no 
>> of Cassandra machines and scaling by adding
>> machines won’t help ?
>>  
>> Thanks
>> Kanwar
>>  


Re: Cassandra with SAN

2013-02-21 Thread P. Taylor Goetz
Cassandra is designed to write and read data in a way that is optimized for 
physical spinning disks.

Running C* on a SAN introduces a layer of abstraction that, at best negates 
those optimizations, and at worst introduces additional overhead.

Sent from my iPhone

On Feb 21, 2013, at 6:42 PM, Kanwar Sangha  wrote:

> Ok. What would be the drawbacks :)
>  
> From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
> Sent: 21 February 2013 17:12
> To: user@cassandra.apache.org
> Subject: Re: Cassandra with SAN
>  
> No, this is a really really bad idea and C* was not designed for this, in 
> fact, it was designed so you don't need to have a large expensive SAN.
>  
> Don't be tempted by the shiny expensive SAN. :)
>  
> If money is no object instead throw SSD's in your nodes and run 10G between 
> racks
>  
> From: Kanwar Sangha 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, February 21, 2013 2:56 PM
> To: "user@cassandra.apache.org" 
> Subject: Cassandra with SAN
>  
> Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
> me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
> Cassandra machines and scaling by adding
> machines won’t help ?
>  
> Thanks
> Kanwar
>  


Re: Cassandra with SAN

2013-02-21 Thread Michael Kjellman
You'd be adding a single point of failure after choosing a distributed database, 
probably for good reason. You'd also be tempted to put multiple terabytes on 
each node, so you're even more cost-inefficient: you'll still need to buy the 
same number of nodes everyone else does, even though you have the SAN. Then any 
operations on those dense nodes become unbearable (repair, cleanup). And if you 
want to be multi-DC, you'll need two SANs.

I can't think of one good reason to run C* with a SAN.

From: Kanwar Sangha <kan...@mavenir.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 21, 2013 3:42 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: Cassandra with SAN

Ok. What would be the drawbacks :)

From: Michael Kjellman [mailto:mkjell...@barracuda.com]
Sent: 21 February 2013 17:12
To: user@cassandra.apache.org
Subject: Re: Cassandra with SAN

No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha <kan...@mavenir.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 21, 2013 2:56 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra with SAN

Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won’t help ?

Thanks
Kanwar



Re: Cassandra with SAN

2013-02-21 Thread David Schairer
"Who breaks a butterfly upon a wheel?"

It will work, but you'd have a distributed database running on a single point 
of failure storage fabric, thus destroying much of your benefits, unless you 
have enough discrete SAN units that you treat them as racks in your cassandra 
topology to ensure that you have data replicated across redundant SAN 
shelves|controllers|etc.

You also would end up with redundancy at cross purposes in that the SAN will be 
striping data that Cassandra is already distributing efficiently.

If the SAN is free and unused, it'll be fine as a Cassandra test platform.  But 
I wouldn't spend a penny on SAN hardware instead of a much larger distributed 
cluster with commodity hardware.  Derive your redundancy and performance from 
lots of hardware in lots of places, not expensive hardware in one place.  

--DRS

On Feb 21, 2013, at 3:42 PM, Kanwar Sangha  wrote:

> Ok. What would be the drawbacks :)
>  
> From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
> Sent: 21 February 2013 17:12
> To: user@cassandra.apache.org
> Subject: Re: Cassandra with SAN
>  
> No, this is a really really bad idea and C* was not designed for this, in 
> fact, it was designed so you don't need to have a large expensive SAN.
>  
> Don't be tempted by the shiny expensive SAN. :)
>  
> If money is no object instead throw SSD's in your nodes and run 10G between 
> racks
>  
> From: Kanwar Sangha 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, February 21, 2013 2:56 PM
> To: "user@cassandra.apache.org" 
> Subject: Cassandra with SAN
>  
> Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
> me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
> Cassandra machines and scaling by adding
> machines won’t help ?
>  
> Thanks
> Kanwar
>  



RE: Cassandra with SAN

2013-02-21 Thread Kanwar Sangha
Ok. What would be the drawbacks :)

From: Michael Kjellman [mailto:mkjell...@barracuda.com]
Sent: 21 February 2013 17:12
To: user@cassandra.apache.org
Subject: Re: Cassandra with SAN

No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha <kan...@mavenir.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 21, 2013 2:56 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra with SAN

Hi - Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won't help ?

Thanks
Kanwar



Re: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread Edward Capriolo
The theoretical maximum of 10G is not even close to what you actually get.

http://download.intel.com/support/network/sb/fedexcasestudyfinal.pdf


On Thu, Feb 21, 2013 at 12:45 PM, aaron morton  wrote:
> If you are lazy like me wolfram alpha can help
>
> http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbE&a=UnitClash_*TB.*Tebibytes--
>
> 10 hours 15 minutes 43.59 seconds
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 21/02/2013, at 11:31 AM, Wojciech Meler  wrote:
>
> you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb
> link
>
>> On 19 Feb 2013 02:01, "Hiller, Dean"  wrote:
>>
>> I thought about this more, and even with a 10Gbit network, it would take
>> 40 days to bring up a replacement node if mongodb did truly have a 42T /
>> node like I had heard.  I wrote the below email to the person I heard this
>> from going back to basics which really puts some perspective on it….(and a
>> lot of people don't even have a 10Gbit network like we do)
>>
>> Nodes are hooked up by a 10G network at most right now where that is
>> 10gigabit.  We are talking about 10Terabytes on disk per node recently.
>>
>> Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second  (yes I
>> could have divided by 8 in my head but eh…course when I saw the number, I
>> went duh)
>>
>> So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we
>> are bringing online to replace a dead node would take approximately 5
>> days???
>>
>> This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1
>> second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more
>> likely 11 days if we only use 50% of the network.
>>
>> So bringing a new node up to speed is more like 11 days once it is
>> crashed.  I think this is the main reason the 1Terabyte exists to begin
>> with, right?
>>
>> From an ops perspective, this could sound like a nightmare scenario of
>> waiting 10 days…..maybe it is livable though.  Either way, I thought it
>> would be good to share the numbers.  ALSO, that is assuming the bus with
>> its 10 disks can keep up with 10G. Can it?  What is the limit of
>> throughput on a bus / second on the computers we have as on wikipedia there
>> is a huge variance?
>>
>> What is the rate of the disks too (multiplied by 10 of course)?  Will they
>> keep up with a 10G rate for bringing a new node online?
>>
>> This all comes into play even more so when you want to double the size of
>> your cluster of course as all nodes have to transfer half of what they have
>> to all the new nodes that come online(cassandra actually has a very data
>> center/rack aware topology to transfer data correctly to not use up all
>> bandwidth unecessarily…I am not sure mongodb has that).  Anyways, just food
>> for thought.
>>
>> From: aaron morton <aa...@thelastpickle.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Monday, February 18, 2013 1:39 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Vegard Berget
>> <p...@fantasista.no>
>> Subject: Re: cassandra vs. mongodb quick question
>>
>> My experience is repair of 300GB compressed data takes longer than 300GB
>> of uncompressed, but I cannot point to an exact number. Calculating the
>> differences is mostly CPU bound and works on the non compressed data.
>>
>> Streaming uses compression (after uncompressing the on disk data).
>>
>> So if you have 300GB of compressed data, take a look at how long repair
>> takes and see if you are comfortable with that. You may also want to test
>> replacing a node so you can get the procedure documented and understand how
>> long it takes.
>>
>> The idea of the soft 300GB to 500GB limit came about because of a number of
>> cases where people had 1 TB on a single node and they were surprised it took
>> days to repair or replace. If you know how long things may take, and that
>> fits in your operations then go with it.
>>
>> Cheers
>>
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 18/02/2013, at 10:08 PM, Vegard Berget <p...@fantasista.no> wrote:
>>
>>
>>
>> Just out of curiosity :
>>
>> When using compression, does this affect this one way or another?  Is 300G
>> (compressed) SSTable size, or total size of data?
>>
>> .vegard,
>>
>> - Original Message -
>> From: user@cassandra.apache.org
>> To: <user@cassandra.apache.org>
>> Cc:
>>
>> S

Re: Cassandra with SAN

2013-02-21 Thread Michael Kjellman
No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha <kan...@mavenir.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 21, 2013 2:56 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra with SAN

Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won’t help ?

Thanks
Kanwar



Cassandra with SAN

2013-02-21 Thread Kanwar Sangha
Hi - Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won't help ?

Thanks
Kanwar


RE: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread Kanwar Sangha
“The limiting factors are the time it takes to repair, the time it takes to 
replace a node, the memory considerations for 100's of millions of rows. If 
the performance of those operations is acceptable to you, then go crazy”



If I have a node which is attached to a RAID and the node crashes, but the data 
is still good on the drives, it would just mean bringing up the node using the 
same storage? Would this not be fast?




From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: 21 February 2013 11:46
To: user@cassandra.apache.org
Subject: Re: cassandra vs. mongodb quick question(good additional info)

If you are lazy like me wolfram alpha can help

http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbE&a=UnitClash_*TB.*Tebibytes--

10 hours 15 minutes 43.59 seconds

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 11:31 AM, Wojciech Meler <wojciech.me...@gmail.com> wrote:



you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link
On 19 Feb 2013 02:01, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:
I thought about this more, and even with a 10Gbit network, it would take 40 
days to bring up a replacement node if mongodb did truly have a 42T / node like 
I had heard.  I wrote the below email to the person I heard this from going 
back to basics which really puts some perspective on it….(and a lot of people 
don't even have a 10Gbit network like we do)

Nodes are hooked up by a 10G network at most right now where that is 10gigabit. 
 We are talking about 10Terabytes on disk per node recently.

Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second  (yes I could 
have divided by 8 in my head but eh…course when I saw the number, I went duh)

So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we are 
bringing online to replace a dead node would take approximately 5 days???

This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1 
second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more likely 
11 days if we only use 50% of the network.

So bringing a new node up to speed is more like 11 days once it is crashed.  I 
think this is the main reason the 1Terabyte exists to begin with, right?

From an ops perspective, this could sound like a nightmare scenario of waiting 
10 days…..maybe it is livable though.  Either way, I thought it would be good 
to share the numbers.  ALSO, that is assuming the bus with it's 10 disk can 
keep up with 10G  Can it?  What is the limit of throughput on a bus / 
second on the computers we have as on wikipedia there is a huge variance?

What is the rate of the disks too (multiplied by 10 of course)?  Will they keep 
up with a 10G rate for bringing a new node online?

This all comes into play even more so when you want to double the size of your 
cluster of course as all nodes have to transfer half of what they have to all 
the new nodes that come online(cassandra actually has a very data center/rack 
aware topology to transfer data correctly to not use up all bandwidth 
unecessarily…I am not sure mongodb has that).  Anyways, just food for thought.

From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, February 18, 2013 1:39 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Vegard Berget 
<p...@fantasista.no>
Subject: Re: cassandra vs. mongodb quick question

My experience is repair of 300GB compressed data takes longer than 300GB of 
uncompressed, but I cannot point to an exact number. Calculating the 
differences is mostly CPU bound and works on the non compressed data.

Streaming uses compression (after uncompressing the on disk data).

So if you have 300GB of compressed data, take a look at how long repair takes 
and see if you are comfortable with that. You may also want to test replacing a 
node so you can get the procedure documented and understand how long it takes.

The idea of the soft 300GB to 500GB limit came about because of a number of 
cases where people had 1 TB on a single node and they were surprised it took 
days to repair or replace. If you know how long things may take, and that fits 
in your operations then go with it.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/02/201

Re: Using Cassandra for read operations

2013-02-21 Thread Bill de hÓra
> To avoid disk I/Os, the best option we thought is to have data in memory. 
> Is it a good idea to have memtable setup around 1/2 or 3/4 of 
> heap size? Obviously flushing will take a lot of time but would 
> that hurt that node's performance big time?

Start with the defaults and test your workload. If memtables start flushing 
aggressively (because of write load or bad settings), that can cause compaction 
work on the disk, which might impair read I/O. 
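
For what it's worth, these are the main 1.x memtable knobs with their shipped 
defaults, best left alone until the defaults demonstrably misbehave:

    # cassandra.yaml
    # memtable_total_space_in_mb: 2048   # commented out means 1/3 of the heap
    memtable_flush_writers: 1            # defaults to one per data directory
    memtable_flush_queue_size: 4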


>  Is there a way to figure out max read-latency for a bunch of read operations?

Use nodetool's histogram feature to get a sense of outlier latency.


> We just need one column family with a long key

Take time to tune your key caches and bloom filters. They use memory and have 
an impact on read performance.
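
Concretely, both are adjustable; the table name is a placeholder, and in 1.1+ 
the key cache is sized globally in cassandra.yaml:

    -- CQL3: applies to SSTables written after the change
    ALTER TABLE MyCF WITH bloom_filter_fp_chance = 0.01;

    # cassandra.yaml: leaving it empty means min(5% of heap, 100MB)
    key_cache_size_in_mb: 100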


> Given that cassandra provides off-heap row caching, in a 
> machine >32 gb RAM, would it be wise to have a >10 gb row 
> cache with 8 gb java heap? 

If you use the off-heap cache, allow enough room for the filesystem's own 
cache, i.e. don't give all of RAM over to the off-heap cache. Also, the off-heap 
cache can slow you down with wide rows due to serialisation overhead, or cache 
invalidation thrashing if you are update-heavy. If you use the on-heap cache, 
pay close attention to GC cycles and memory stability: if you are 
cycling/evicting through the cache at a high rate, that can leave too much 
garbage in memory for the garbage collector to keep up with. If the node 
doesn't have enough working memory after GC, it will _resize_ the key and row 
caches. This will lead to degraded read performance, and with some workloads 
can result in a vicious cycle.


>  For our SLAs, a read of max 15-20 rows at once(using multi slice), 
> should not take more than 4 ms.

If you control your own hardware (and you probably should/must for this kind of 
latency demand) consider SSDs. You might want to carefully control background 
repair/compaction operations if predictable performance is your goal. You might 
want to avoid storing strings and use byte representations. If you have an 
application tier on the path consider caching in that tier as well to avoid the 
overhead of network calls and thrift processing.

In a nutshell -

- Start with defaults and tune based on small discrete adjustments and leave 
time to see the effect of each change. No-one will know your workload better 
than you and the questions you are asking are workload sensitive.

- Allow time for tuning and spending time understanding the memory model and 
JVM GC.

- Be very careful with caches. Leave enough room in the OS for its own disk 
cache.

- Get an SSD


Bill


On 21 Feb 2013, at 19:03, amulya rattan  wrote:

> Dear All,
> 
> We are currently evaluating Cassandra for an application involving strict 
> SLAs(Service level agreements). We just need one column family with a long 
> key and approximately 70-80 bytes row. We are not concerned about write 
> performance but are primarily concerned about read. For our SLAs, a read of 
> max 15-20 rows at once(using multi slice), should not take more than 4 ms. 
> Till now, on a single-node setup, using Cassandra's stress tool, the numbers 
> are promising. But I am guessing that's because there is no network latency 
> involved there, and since we set the memtable around 2GB (4GB heap), we never 
> had to go to disk I/O.
> 
> Assuming our nodes having >32GB RAM, a couple of questions regarding read:
> 
> * To avoid disk I/Os, the best option we thought is to have data in memory. 
> Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size? 
> Obviously flushing will take a lot of time but would that hurt that node's 
> performance big time?
> 
> * Cassandra stress tool only gives out average read latency. Is there a way 
> to figure out max read-latency for a bunch of read operations?
> 
> * How big a row cache can one have? Given that cassandra provides off-heap 
> row caching, in a machine >32 gb RAM, would it be wise to have a >10 gb row 
> cache with 8 gb java heap? And how big should the corresponding key cache be 
> then?
> 
> Any response is appreciated.
> 
> ~Amulya 
> 



Re: "Heap is N.N full." Immediately on startup

2013-02-21 Thread Andras Szerdahelyi
Thank you, indeed my index interval is 64 with a CF of 300M rows, and the bloom 
filter false positive chance was at the default.
Raising the index interval to 512 didn't fix this alone, so I guess I'll have 
to set the bloom filter to some reasonable value and scrub.
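
For the archive, the sequence would look roughly like this (keyspace and CF 
names are placeholders; the new fp chance only applies to newly written 
SSTables, hence the scrub):

    # cassandra-cli: shrink the bloom filters by allowing more false positives
    update column family MyCF with bloom_filter_fp_chance = 0.1;

    # rewrite existing SSTables so the smaller filters take effect
    nodetool scrub MyKeyspace MyCF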

From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday 21 February 2013 17:58
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: "Heap is N.N full." Immediately on startup

My first guess would be the bloom filter and index sampling from lots-o-rows

Check the row count in cfstats
Check the bloom filter size in cfstats.

Background on memory requirements 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi 
<andras.szerdahe...@ignitionone.com> wrote:

Hey list,

Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap 
at startup in Cassandra 1.1.6 besides

  *   row cache : not persisted and is at 0 keys when this warning is produced
  *   Memtables : no write traffic at startup, my app's column families are 
durable_writes:false
  *   Pending tasks : no pending tasks, except for 928 compactions ( not sure 
where those are coming from )

I drew these conclusions from the StatusLogger output below:

INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 
8375238656
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) 
Pool NameActive   Pending   Blocked
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) 
ReadStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
RequestResponseStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
ReadRepairStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
MutationStage 0-1 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
ReplicateOnWriteStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
GossipStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
AntiEntropyStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
MigrationStage0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
StreamStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
MemtablePostFlusher   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
FlushWriter   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
MiscStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
commitlog_archiver0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) 
InternalResponseStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) 
CompactionManager 0   928
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) 
MessagingServicen/a   0,0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) 
Cache Type Size Capacity   
KeysToSave Provider
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) 
KeyCache 25   25
  all
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) 
RowCache  00
  all  org.apache.cassandra.cache.SerializingCacheProvider
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) 
ColumnFamilyMemtable ops,data
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
MYAPP_1.CF0,0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
MYAPP_2.CF 0,0
 INFO [Sche

Re: key cache size

2013-02-21 Thread aaron morton
This is the key cache entry 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cache/KeyCacheKey.java

Note that the Descriptor is re-used. 

If you want to see key cache metrics, including bytes used,  use nodetool info. 
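
e.g. (in 1.1/1.2 the Key Cache line of the output reports entries, size in 
bytes, capacity and hit rate):

    nodetool -h 127.0.0.1 info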

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/02/2013, at 3:45 AM, Kanwar Sangha  wrote:

> Hi – What is the approximate overhead of the key cache ? Say each key is 50 
> bytes. What would be the overhead for this key in the key cache ?
>  
> Thanks,
> Kanwar



Re: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread aaron morton
If you are lazy like me wolfram alpha can help 

http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbE&a=UnitClash_*TB.*Tebibytes--

10 hours 15 minutes 43.59 seconds
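
For the record, the arithmetic behind that figure (42 TB read as tebibytes, 
and the 10GbE line rate taken at face value):

    \frac{42 \times 2^{40} \text{ bytes}}{1.25 \times 10^{9} \text{ bytes/s}}
    \approx 36{,}944 \text{ s} \approx 10 \text{ h } 15 \text{ m}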

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 11:31 AM, Wojciech Meler  wrote:

> you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link
> 
> On 19 Feb 2013 02:01, "Hiller, Dean"  wrote:
> I thought about this more, and even with a 10Gbit network, it would take 40 
> days to bring up a replacement node if mongodb did truly have a 42T / node 
> like I had heard.  I wrote the below email to the person I heard this from 
> going back to basics which really puts some perspective on it….(and a lot of 
> people don't even have a 10Gbit network like we do)
> 
> Nodes are hooked up by a 10G network at most right now where that is 
> 10gigabit.  We are talking about 10Terabytes on disk per node recently.
> 
> Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second  (yes I could 
> have divided by 8 in my head but eh…course when I saw the number, I went duh)
> 
> So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we are 
> bringing online to replace a dead node would take approximately 5 days???
> 
> This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1 
> second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more 
> likely 11 days if we only use 50% of the network.
> 
> So bringing a new node up to speed is more like 11 days once it is crashed.  
> I think this is the main reason the 1Terabyte exists to begin with, right?
> 
> From an ops perspective, this could sound like a nightmare scenario of 
> waiting 10 days…..maybe it is livable though.  Either way, I thought it would 
> be good to share the numbers.  ALSO, that is assuming the bus with its 10 
> disks can keep up with 10G. Can it?  What is the limit of throughput on a 
> bus / second on the computers we have as on wikipedia there is a huge 
> variance?
> 
> What is the rate of the disks too (multiplied by 10 of course)?  Will they 
> keep up with a 10G rate for bringing a new node online?
> 
> This all comes into play even more so when you want to double the size of 
> your cluster of course as all nodes have to transfer half of what they have 
> to all the new nodes that come online(cassandra actually has a very data 
> center/rack aware topology to transfer data correctly to not use up all 
> bandwidth unecessarily…I am not sure mongodb has that).  Anyways, just food 
> for thought.
> 
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, February 18, 2013 1:39 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Vegard Berget 
> <p...@fantasista.no>
> Subject: Re: cassandra vs. mongodb quick question
> 
> My experience is repair of 300GB compressed data takes longer than 300GB of 
> uncompressed, but I cannot point to an exact number. Calculating the 
> differences is mostly CPU bound and works on the non compressed data.
> 
> Streaming uses compression (after uncompressing the on disk data).
> 
> So if you have 300GB of compressed data, take a look at how long repair takes 
> and see if you are comfortable with that. You may also want to test replacing 
> a node so you can get the procedure documented and understand how long it 
> takes.
> 
> The idea of the soft 300GB to 500GB limit came about because of a number of 
> cases where people had 1 TB on a single node and they were surprised it took 
> days to repair or replace. If you know how long things may take, and that 
> fits in your operations then go with it.
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 18/02/2013, at 10:08 PM, Vegard Berget <p...@fantasista.no> wrote:
> 
> 
> 
> Just out of curiosity :
> 
> When using compression, does this affect this one way or another?  Is 300G 
> (compressed) SSTable size, or total size of data?
> 
> .vegard,
> 
> - Original Message -
> From: user@cassandra.apache.org
> To: <user@cassandra.apache.org>
> Cc:
> Sent: Mon, 18 Feb 2013 08:41:25 +1300
> Subject: Re: cassandra vs. mongodb quick question
> 
> 
> If you have spinning disk and 1G networking and no virtual nodes, I would 
> still say 300G to 500G is a soft limit.
> 
> If you are using virtual nodes, SSD, JBOD disk configuration or faster 
> networking you may go higher.
> 
> The limiting factors are the time it takes to repair, the time it takes to 
> replace a node, the memory considerations for 100's of millions of rows. If 
> the performance of those operations is acceptable

Re: Adding new nodes in a cluster with virtual nodes

2013-02-21 Thread aaron morton
> After the bootstrap is finished in the new node, no data has been moved 
> automatically from the old nodes to this new node.
What did nodetool ring say ? 

> Please, could you tell me if we need to perform a nodetool repair after the 
> bootstrap of the new node ? 
You should not need to; bootstrap brings the node up with the data it should have. 
If you have any doubt though, running nodetool repair -pr on the new node 
cannot hurt. 

> What happens if we perform a nodetool cleanup in the old nodes before doing 
> the nodetool repair ? (Is there a risk of losing some data ?)
Potentially yes. 
The simple thing would be to wait for a full repair cycle (on all nodes) to 
complete. 
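
As a concrete sequence (host names are placeholders), once the new node has 
finished bootstrapping:

    # optional sanity check on the new node; -pr repairs only its primary ranges
    nodetool -h new-node repair -pr

    # only after repair completes cluster-wide: drop ranges the old nodes no longer own
    nodetool -h old-node-1 cleanup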

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/02/2013, at 3:44 AM, Jean-Armel Luce  wrote:

> Hello,
> 
> We are using Cassandra 1.2.0.
> We have a cluster of 16 physical nodes, each node has 256 virtual nodes.
> We want to add 2 new nodes in our cluster : we follow the procedure as 
> explained here : 
> http://www.datastax.com/docs/1.2/operations/add_replace_nodes.
> 
> After starting 1 of the new nodes, we can see that this new node has 256 
> tokens ==> looks good
> We can see that this node is in the ring (using nodetool status) ==> looks 
> good
> After the bootstrap is finished in the new node, no data has been moved 
> automatically from the old nodes to this new node.
> 
> 
> However, when we send insert queries to our cluster, the new node accepts 
> inserts of the new rows.
> 
> Please, could you tell me if we need to perform a nodetool repair after the 
> bootstrap of the new node ? 
> What happens if we perform a nodetool cleanup in the old nodes before doing 
> the nodetool repair ? (Is there a risk of losing some data ?)
> 
> Regards.
> 
> Jean Armel



RE: SSTable Num

2013-02-21 Thread Kanwar Sangha
No.
The default size-tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.

Ok. So for 10 TB, I could have at least 4 SSTable files, each of 2.5 TB?

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: 21 February 2013 11:01
To: user@cassandra.apache.org
Subject: Re: SSTable Num

Hi - I have around 6TB of data on 1 node
Unless you have SSD and 10GbE you probably have too much data on there.
Remember you need to run repair and that can take a long time with a lot of 
data. Also you may need to replace a node one day and moving 6TB will take a 
while.

 Or will the sstable compaction continue and eventually we will have 1 file ?
No.
The default size-tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 3:47 AM, Kanwar Sangha <kan...@mavenir.com> wrote:


Hi - I have around 6TB of data on 1 node and the cfstats show 32 sstables. 
There is no compaction job running in the background. Is there a limit on the 
size per sstable ? Or will the sstable compaction continue and eventually we 
will have 1 file ?

Thanks,
Kanwar




Re: very confused by jmap dump of cassandra

2013-02-21 Thread Mohit Anchlia
Roughly how much data do you have per node?

Sent from my iPhone

On Feb 20, 2013, at 10:49 AM, "Hiller, Dean"  wrote:

> I took this jmap dump of cassandra(in production).  Before I restarted the 
> whole production cluster, I had some nodes running compaction and it looked 
> like all memory had been consumed(kind of like cassandra is not clearing out 
> the caches or memtables fast enough).  I am trying to still debug compaction 
> causes slowness on the cluster since all cassandra.yaml files are pretty much 
> the defaults with size tiered compaction.
> 
> The weird thing is I dump and get a 5.4G heap.bin file, and load that into 
> Eclipse, which tells me the total is 142.8MB… what? So low; top was showing 
> 1.9G at the time (and I took this top snapshot later, 2 hours after). How is 
> the Eclipse profiler telling me the jmap dump showed 142.8MB in use instead 
> of 1.9G in use?
> 
> Tasks: 398 total,   1 running, 397 sleeping,   0 stopped,   0 zombie
> Cpu(s):  2.8%us,  0.5%sy,  0.0%ni, 96.5%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
> Mem:  32854680k total, 31910708k used,   943972k free,    89776k buffers
> Swap: 33554424k total,    18288k used, 33536136k free, 23428596k cached
> 
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 20909 cassandr  20   0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
> 22455 cassandr  20   0 15288 1340  824 R  3.9  0.0   0:00.02 top
> 
> It almost seems like cassandra is not being good about memory management here, 
> as we slowly get into a situation where compaction runs and takes out our 
> memory (configured for 8G).  I can easily go higher than 8G on these systems 
> as I have 32 GB on each node, but there were docs that said 8G is better for GC.  
> Has anyone else taken a jmap dump of cassandra?
> 
> Thanks,
> Dean
> 


Re: very confused by jmap dump of cassandra

2013-02-21 Thread aaron morton
Cannot comment too much on the jmap but I can add my general "compaction is 
hurting" strategy. 

Try any or all of the following to get to a stable setup, then increase until 
things go bang (a sketch of the corresponding settings follows the list). 

Set concurrent compactors to 2. 
Reduce compaction throughput by half. 
Reduce in_memory_compaction_limit. 
If you see compactions using a lot of sstables in the logs, reduce 
max_compaction_threshold. 
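
The cassandra.yaml names for the first three, with illustrative values only (a 
sketch, not recommendations):

    concurrent_compactors: 2
    compaction_throughput_mb_per_sec: 8      # default is 16, halved here
    in_memory_compaction_limit_in_mb: 32     # default is 64

max_compaction_threshold is a per-CF schema setting rather than a yaml one; in 
CQL 3 on 1.2 it would look roughly like this (keyspace/table names made up):

    ALTER TABLE ks.cf WITH compaction =
      {'class': 'SizeTieredCompactionStrategy', 'max_threshold': 16};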
 
>  I can easily go higher than 8G on these systems as I have 32 GB on each node, 
> but there were docs that said 8G is better for GC. 
More JVM memory is not the answer. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:49 AM, "Hiller, Dean"  wrote:

> I took this jmap dump of cassandra (in production).  Before I restarted the 
> whole production cluster, I had some nodes running compaction and it looked 
> like all memory had been consumed (kind of like cassandra is not clearing out 
> the caches or memtables fast enough).  I am still trying to debug why compaction 
> causes slowness on the cluster, since all cassandra.yaml files are pretty much 
> the defaults with size tiered compaction.
> 
> The weird thing is I dump and get a 5.4G heap.bin file, load that into 
> Eclipse, and it tells me the total is 142.8MB… what? So low? top was showing 
> 1.9G at the time (and I took this top snapshot 2 hours after)… how is the 
> Eclipse profiler telling me the jmap dump showed 142.8MB in use instead of 
> 1.9G in use?
> 
> Tasks: 398 total,   1 running, 397 sleeping,   0 stopped,   0 zombie
> Cpu(s):  2.8%us,  0.5%sy,  0.0%ni, 96.5%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
> Mem:  32854680k total, 31910708k used,   943972k free,    89776k buffers
> Swap: 33554424k total,    18288k used, 33536136k free, 23428596k cached
> 
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 20909 cassandr  20   0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
> 22455 cassandr  20   0 15288 1340  824 R  3.9  0.0   0:00.02 top
> 
> It almost seems like cassandra is not being good about memory management here, 
> as we slowly get into a situation where compaction runs and takes out our 
> memory (configured for 8G).  I can easily go higher than 8G on these systems 
> as I have 32 GB on each node, but there were docs that said 8G is better for GC.  
> Has anyone else taken a jmap dump of cassandra?
> 
> Thanks,
> Dean
> 



Re: Data Model - Additional Column Families or one CF?

2013-02-21 Thread aaron morton
If you have a limited / known number (say < 30)  of types, I would create a CF 
for each of them.

If the number of types is unknown or very large I would have one CF with the 
row key you described. 

Generally I avoid data models that require new CF's as the data grows. 
Additionally, having different CF's allows you to use different cache settings, 
compaction settings, and even storage media. 
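
For the single-CF option, a sketch of what the prefixed-key, wide-row layout 
could look like in CQL 3 (table and column names are made up for illustration):

    CREATE TABLE messages (
        key     text,       -- e.g. 'comments:<message_id>'
        created timeuuid,   -- comment creation time
        body    text,       -- the JSON payload
        PRIMARY KEY (key, created)
    ) WITH CLUSTERING ORDER BY (created DESC);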

Cheers
  
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:43 AM, Adam Venturella  wrote:

> My data needs only require me to store JSON, and I can handle this in 1 
> column family by prefixing row keys with a type, for example:
> 
> comments:{message_id}
> 
> Where comments: represents the prefix and {message_id} represents some row 
> key to a message object in the same column family.
> 
> In this case comments:{message_id} would be a wide row using comment creation 
> time and descending clustering order to sort the messages as they are added.
> 
> My question is, would I be better off splitting comments into their own 
> Column Family or is storing them in with the Messages Column Family 
> sufficient, they are all messages after all.
> 
> Or do Column Families really just provide a nice organizational front for 
> data? I'm just storing JSON.
> 
> 



Re: Mutation dropped

2013-02-21 Thread aaron morton
> What does rpc_timeout control? Only the reads/writes? 
Yes. 

> like data stream,
streaming_socket_timeout_in_ms in the yaml

> merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

> What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the 
message, and the time it takes the coordinator to do its thing. You can look 
at cfhistograms and proxyhistograms to get a better idea of how long a request 
takes in your system.  
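
For reference, a sketch of pulling those histograms (host and names are 
placeholders):

    nodetool -h <host> proxyhistograms
    nodetool -h <host> cfhistograms <keyspace> <column_family>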
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 6:56 AM, Wei Zhu  wrote:

> What does rpc_timeout control? Only the reads/writes? How about other 
> inter-node communication, like data stream, merkle tree request?  What is the 
> reasonable value for rpc_timeout? The default value of 10 seconds is way too 
> long. What is the side effect if it's set to a really small number, say 20ms?
> 
> Thanks.
> -Wei
> 
> From: aaron morton 
> To: user@cassandra.apache.org 
> Sent: Tuesday, February 19, 2013 7:32 PM
> Subject: Re: Mutation dropped
> 
>> Does the rpc_timeout not control the client timeout ?
> No, it is how long a node will wait for a response from other nodes before 
> raising a TimedOutException if less than CL nodes have responded. 
> Set the client side socket timeout using your preferred client. 
> 
>> Is there any param which is configurable to control the replication timeout 
>> between nodes ?
> There is no such thing.
> rpc_timeout is roughly like that, but it's not right to think about it that 
> way. 
> i.e. if a message to a replica times out and CL nodes have already responded 
> then we are happy to call the request complete. 
> 
> Cheers
> 
>  
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19/02/2013, at 1:48 AM, Kanwar Sangha  wrote:
> 
>> Thanks Aaron.
>>  
>> Does the rpc_timeout not control the client timeout ? Is there any param 
>> which is configurable to control the replication timeout between nodes ? Or 
>> the same param is used to control that since the other node is also like a 
>> client ?
>>  
>>  
>>  
>> From: aaron morton [mailto:aa...@thelastpickle.com] 
>> Sent: 17 February 2013 11:26
>> To: user@cassandra.apache.org
>> Subject: Re: Mutation dropped
>>  
>> You are hitting the maximum throughput on the cluster. 
>>  
>> The messages are dropped because the node fails to start processing them 
>> before rpc_timeout. 
>>  
>> However the request is still a success because the client requested CL was 
>> achieved. 
>>  
>> Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
>> Both nodes replicate each row, and writes are sent to each replica, so the 
>> only thing the client is waiting on is the local node to write to its 
>> commit log. 
>>  
>> Testing with (and running in prod) RF3 and CL QUORUM is a more real world 
>> scenario. 
>>  
>> Cheers
>>  
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>  
>> @aaronmorton
>> http://www.thelastpickle.com
>>  
>> On 15/02/2013, at 9:42 AM, Kanwar Sangha  wrote:
>> 
>> 
>> Hi – Is there a parameter which can be tuned to prevent the mutations from 
>> being dropped ? Is this logic correct ?
>>  
>> Node A and B with RF=2, CL =1. Load balanced between the two.
>>  
>> --  Address   Load   Tokens  Owns (effective)  Host ID   
>> Rack
>> UN  10.x.x.x   746.78 GB  256 100.0%
>> dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
>> UN  10.x.x.x   880.77 GB  256 100.0%
>> 95d59054-be99-455f-90d1-f43981d3d778  rack1
>>  
>> Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
>> falling behind and we see the mutation dropped messages. But there are no 
>> failures on the client. Does that mean the other node is not able to persist the 
>> replicated data? Is there some timeout associated with replicated data 
>> persistence ?
>>  
>> Thanks,
>> Kanwar
>>  
>>  
>>  
>>  
>>  
>>  
>>  
>> From: Kanwar Sangha [mailto:kan...@mavenir.com] 
>> Sent: 14 February 2013 09:08
>> To: user@cassandra.apache.org
>> Subject: Mutation dropped
>>  
>> Hi – I am doing a load test using YCSB across 2 nodes in a cluster and 
>> seeing a lot of mutation dropped messages.  I understand that this is due to 
>> the replica not being written to the
>> other node? RF = 2, CL = 1.
>>  
>> From the wiki -
>> For MUTATION messages this means that the mutation was not applied to all 
>> replicas it was sent to. The inconsistency will be repaired by Read Repair 
>> or Anti Entropy Repair
>>  
>> Thanks,
>> Kanwar
>>  
> 
> 
> 



Re: how to debug slowdowns from these log snippets-more info 2

2013-02-21 Thread aaron morton
Some things to consider: 

Check for contention around the switch lock. This can happen if you get a lot 
of tables flushing at the same time, or if you have a lot of secondary indexes. 
It shows up as a pattern in the logs: as soon as the writer starts flushing a 
memtable, another will be queued. Probably not happening here, but it can be a pain 
when a lot of memtables are flushed. 

I would turn on GC logging in cassandra-env.sh and watch that. After a full CMS 
flush, how full / empty is the tenured heap? If it still has a lot in it, 
then you are running with too much cache / bloom filter / index sampling. 

You can also experiment with the Max Tenuring Threshold, try turning it up to 4 
to start with. The GC logs will show you how much data is at each tenuring 
level. You can then see how much data is being tenured, and whether premature 
tenuring was an issue. I've seen premature tenuring cause issues with wide rows 
/ long reads. 
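
A sketch of the corresponding cassandra-env.sh lines (standard HotSpot flags; 
the log path and the threshold value are illustrative):

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"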

Hope that helps. 


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 4:35 AM, "Hiller, Dean"  wrote:

> Oh, and my startup command that cassandra logged was
> 
> a2.bigde.nrel.gov: xss =  -ea -javaagent:/opt/cassandra/lib/jamm-0.2.5.jar
> -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8021M -Xmx8021M
> -Xmn1600M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
> 
> And I remember from docs you don't want to go above 8G or java GC doesn't
> work out so well.  I am not sure why this is not working out though.
> 
> Dean
> 
> On 2/20/13 7:16 AM, "Hiller, Dean"  wrote:
> 
>> Here is the printout before that log, which is probably important as 
>> well… 
>> 
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 GCInspector.java (line
>> 122) GC for ConcurrentMarkSweep: 3618 ms for 2 collections, 7038159096
>> used; max is 8243904512
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line
>> 57) Pool Name                    Active   Pending   Blocked
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line
>> 72) ReadStage                        11       264         0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) RequestResponseStage  0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) ReadRepairStage   0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) MutationStage1288 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) ReplicateOnWriteStage 0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) GossipStage   1 7 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
>> 72) AntiEntropyStage  0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) MigrationStage0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) StreamStage   0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) MemtablePostFlusher   0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) FlushWriter   0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) MiscStage 0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
>> 72) commitlog_archiver0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 72) InternalResponseStage 0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 72) HintedHandoff 0 0 0
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 77) CompactionManager 4 5
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 89) MessagingService                 n/a    10,127
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 99) Cache Type Size Capacity
>>   KeysToSave
>> Provider
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 100) KeyCache                   1310719   1310719
>>   all
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 106) RowCache                         0         0
>>   all   
>> org.apache.cassandra.cache.SerializingCacheProvider
>> INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
>> 113) ColumnFamily   

Re: SSTable Num

2013-02-21 Thread aaron morton
> Hi – I have around 6TB of data on 1 node
Unless you have SSD and 10GbE you probably have too much data on there. 
Remember you need to run repair and that can take a long time with a lot of 
data. Also you may need to replace a node one day and moving 6TB will take a 
while.

>  Or will the sstable compaction continue and eventually we will have 1 file ?
No. 
The default size tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.
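
Those thresholds are per-CF settings; a sketch of adjusting them in CQL 3 on 
1.2 (keyspace/table names made up):

    ALTER TABLE ks.cf WITH compaction =
      {'class': 'SizeTieredCompactionStrategy',
       'min_threshold': 4, 'max_threshold': 32};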

Cheers
  
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 3:47 AM, Kanwar Sangha  wrote:

> Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. 
> There is no compaction job running in the background. Is there a limit on the 
> size per sstable ? Or will the sstable compaction continue and eventually we 
> will have 1 file ?
>  
> Thanks,
> Kanwar
>  



Re: "Heap is N.N full." Immediately on startup

2013-02-21 Thread aaron morton
My first guess would be the bloom filter and index sampling from lots-o-rows. 

Check the row count in cfstats
Check the bloom filter size in cfstats. 

Background on memory requirements 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
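
A sketch of where to look (field names roughly as 1.1's nodetool prints them):

    nodetool -h <host> cfstats
    # per column family, check:
    #   Number of Keys (estimate)
    #   Bloom Filter Space Used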

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi  wrote:

> Hey list,
> 
> Any ideas (before I take a heap dump) what might be consuming my 8GB JVM 
> heap at startup in Cassandra 1.1.6, besides:
> - row cache: not persisted and at 0 keys when this warning is produced
> - memtables: no write traffic at startup, my app's column families are 
>   durable_writes:false
> - pending tasks: none, except for 928 compactions (not sure where 
>   those are coming from)
> I drew these conclusions from the StatusLogger output below: 
> 
> INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) 
> GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max 
> is 8375238656
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) 
> Pool Name                    Active   Pending   Blocked
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) 
> ReadStage 0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
> RequestResponseStage  0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
> ReadRepairStage   0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
> MutationStage 0    -1     0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
> ReplicateOnWriteStage 0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
> GossipStage   0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
> AntiEntropyStage  0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
> MigrationStage0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
> StreamStage   0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
> MemtablePostFlusher   0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
> FlushWriter   0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
> MiscStage 0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
> commitlog_archiver0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) 
> InternalResponseStage 0 0 0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) 
> CompactionManager 0   928
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) 
> MessagingService n/a   0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) 
> Cache Type Size Capacity   
> KeysToSave Provider
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) 
> KeyCache 25   25  
> all 
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) 
> RowCache  0  0  
> all  org.apache.cassandra.cache.SerializingCacheProvider
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) 
> ColumnFamilyMemtable ops,data
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> MYAPP_1.CF 0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> MYAPP_2.CF 0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> HiveMetaStore.MetaStore   0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> system.NodeIdInfo 0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> system.IndexInfo  0,0
>  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
> system.LocationInfo  

Re: How to limit query results like "from row 50 to 100"

2013-02-21 Thread aaron morton
CQL does not support OFFSET but does have LIMIT. 

See 
http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit
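
A sketch of paging with LIMIT in place of an offset, assuming a table with a 
partition key 'key' and a clustering column 'created' (names made up):

    -- first page
    SELECT * FROM comments WHERE key = 'post1' LIMIT 50;
    -- next page: restart from the last clustering value seen
    SELECT * FROM comments WHERE key = 'post1' AND created < <last_seen> LIMIT 50;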

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas 
 wrote:

> With CQL or an API.



Re: Cassandra network latency & tuning

2013-02-21 Thread aaron morton
>  I would like to understand how we can capture network latencies between a 
> 1GbE and 10GbE, for example.
Cassandra reports two latencies.

The CF latencies reported by nodetool cfstats, nodetool cfhistograms and the CF 
MBeans cover the local time it takes to read or write the data. This does not 
include any local wait times, network latency or coordinator overhead. 

The Storage Proxy latency from nodetool proxyhistograms and the StorageProxy 
MBean is the total latency for a request on a coordinator.

Under load, with a consistent workload, the CF latency should not vary too 
much, while the request latency can increase as wait time becomes more of a 
factor. 

Additionally, streaming is throttled; you may want to raise the throttle, see the 
yaml file. 
   
> We will soon be adding SSDs and was wondering how Cassandra can utilize the 
> 10GbE and the SSDs, and if there is specific tuning that is required.
You may want to increase both the concurrent_writes and reads in the yaml file 
to take advantage of the extra IO. 
Same for the compaction settings, comments in the yaml file will help. 
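
The yaml knobs referred to above, as a sketch with illustrative values only 
(not recommendations):

    concurrent_reads: 64                    # default 32
    concurrent_writes: 64                   # default 32
    compaction_throughput_mb_per_sec: 32    # default 16
    stream_throughput_outbound_megabits_per_sec: 400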

With SSD and 10GbE you can easily hold more data on each node. Typically we 
advise 300GB to 500GB per node with HDD and 1GbE, because of the time repair 
and node replacement take. With SSD and 10GbE those operations will take 
considerably less time. 

If you feel like being thorough, add repair and node replacement (all under 
load) to your test lineup. 

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 1:44 PM, Brandon Walsh  wrote:

> I have a 5 node cluster and currently running ver 1.2. Prior to full scale 
> deployment, I'm running some benchmarks  using YCSB. From a hadoop cluster 
> deployment we saw an excellent improvement using higher speed networks. 
> However Cassandra does not include network latencies and I would like to 
> understand how we can capture network latencies between a 1GbE and 10GbE for 
> example. As of now all the graphs look the same. We will soon be adding SSDs and 
> was wondering how Cassandra can utilize the 10GbE and the SSDs, and if there 
> is specific tuning that is required.



Re: Read IO

2013-02-21 Thread Jouni Hartikainen

Hi,

On Feb 21, 2013, at 7:52 , Kanwar Sangha  wrote:
> Hi – Can someone explain the worst case IOPS for a read? No key cache, no 
> row cache, index sampling rate say 512.
>  
> 1)  Bloom filter will be checked to see if the key exists (in RAM)
> 2)  Index file sample (in RAM) will be checked to find the approx. location 
> in the index file on disk
> 3)  1 IOPS to read the actual index file on disk (DISK)
> 4)  1 IOPS to get the data from the location in the sstable (DISK)
>  
> Is this correct ?

As you were asking for the worst case, I would still add one step that would be 
a seek inside an SSTable from the row start to the queried columns using the 
column index.

However, this applies only if you are querying a subset of columns in the row 
(not all) and the total row size exceeds column_index_size_in_kb (defaults to 
64kB).

So, as far as I have understood, the worst case steps (without any caches) are:

1. Check the SSTable bloom filters (in memory)
2. Use the index samples to find the approximate place in the key index file (in 
memory)
3. Read the key index file until the correct key is found (1st disk seek & read)
4. Seek to the start of the row in the SSTable file and read the row headers 
(possibly including the column index) (2nd seek & read)
5. Using the column index, seek to the correct place inside the SSTable file to 
actually read the columns (3rd seek & read)

If the row is very wide and you are asking for a random bunch of columns from 
here and there, step 5 might even be needed multiple times. Also, if your 
row has spread over many SSTables, each of them needs to be accessed (at least 
steps 1. - 4.) to get the complete results for the query.

All this in mind, if your node has any reasonable amount of reads, I'd say that 
in practice key index files will be page cached by the OS very quickly and thus 
a normal read would end up being either one seek (for small rows without a 
column index) or two (for wider rows). Of course, as Peter already pointed out, 
the more columns you ask for, the more the disk needs to read. For a continuous set 
of columns the read should be linear, however.

-Jouni

RE: Read IO

2013-02-21 Thread Kanwar Sangha
Ok.. Cassandra's default block size is 256k? Now say my data in the column is 4 
MB, and the disk is giving me 4k-block random reads @ 100 IOPS. I can 
read at most 400k per second? Does that mean I would need multiple seeks to get 
the complete data?


-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: 21 February 2013 00:05
To: user@cassandra.apache.org
Subject: Re: Read IO

> Is this correct ?

Yes, at least under optimal conditions and assuming a reasonably sized row. 
Things like read-ahead (at the kernel level) will play into it; and if your 
read (even if assumed to be small) straddles two pages you might or might not 
take another read depending on your kernel settings (typically trading 
pollution of page cache vs. number of I/O:s).
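
For rough intuition, a back-of-the-envelope sketch using the numbers from the 
question above (illustrative only):

    100 IOPS x 4 KB = ~400 KB/s of truly random reads
    4 MB / 4 KB     = 1024 reads, i.e. ~10 s if every read were a seek

In practice a single 4 MB column value sits contiguously inside one sstable, 
so after the initial index and row seeks the transfer is largely sequential, 
and read-ahead makes it far cheaper than 1024 separate seeks.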

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Testing compaction strategies on a single production server?

2013-02-21 Thread Henrik Schröder
Thanks Aaron,

I hear you on the unchartered territory bit, we're definitely not gonna
risk our live data unless we know it's safe to do what we suggested. :-) Oh
well, I guess we'll have to set up a survey node instead.


/Henrik


On Thu, Feb 21, 2013 at 4:54 AM, aaron morton wrote:

> I *think* it will work. The steps in the blog post change the
> compaction strategy before RING_DELAY expires to ensure no sstables are
> created before the strategy is changed.
>
> But I think you will be venturing into uncharted territory where there
> might be dragons. And not the fun Disney kind.
>
> While it may be more work, I personally would use one node in write survey
> mode to test LCS.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/02/2013, at 6:28 AM, Henrik Schröder  wrote:
>
> Well, that answer didn't really help. I know how to make a survey node,
> and I know how to simulate reads to it, it's just that that's a lot of
> work, and I wouldn't be sure that the simulated load is the same as the
> production load.
>
> We gather a lot of metrics from our production servers, so we know exactly
> how they perform over long periods of time. Changing a single server to run
> a different compaction strategy would allow us to know in detail how a
> different strategy would impact the cluster.
>
> So, is it possible to modify org.apache.cassandra.db.[keyspace].[column
> family].CompactionStrategyClass through jmx on a production server without
> any ill effects? Or is this only possible to do on a survey node while it
> is in a specific state?
>
>
> /Henrik
>
>
> On Tue, Feb 19, 2013 at 3:09 PM, Viktor Jevdokimov <
> viktor.jevdoki...@adform.com> wrote:
>
>>  Just turn off the dynamic snitch on the survey node and make read requests to
>> it directly with CL.ONE, watch the histograms, and compare.
>>
>> Regarding switching compaction strategy, there's a lot of info already.
>>
>> Best regards / Pagarbiai
>> *Viktor Jevdokimov*
>> Senior Developer
>>
>> Email: viktor.jevdoki...@adform.com
>> Phone: +370 5 212 3063, Fax +370 5 261 0453
>> J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
>> Follow us on Twitter: @adforminsider
>> Take a ride with Adform's Rich Media Suite
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.
>>
>>  *From:* Henrik Schröder [mailto:skro...@gmail.com]
>> *Sent:* Tuesday, February 19, 2013 15:57
>> *To:* user
>> *Subject:* Testing compaction strategies on a single production server?
>>
>>
>> Hey,
>>
>>
>> Version 1.1 of Cassandra introduced live traffic sampling, which allows
>> you to measure the performance of a node without it really joining the
>> cluster:
>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling
>> 
>>
>> That page mentions that you can change the compaction strategy through
>> jmx if you want to test out a different strategy on your survey node.
>>
>> That's great, but it doesn't give you a complete view of how your
>> performance would change, since you're not doing reads from the survey
>> node. But what would happen if you used jmx to change the compaction
>> strategy of a column family on a single *production* node? Would that be a
>> safe way to test it out or are there side-effects of doing that live?
>>
>> And if you do that, would running a major compaction transform the entire
>> column family to the new format?
>>
>> Finally, if the test was a success, how do you proceed from there? Just
>> change the schema?
>>
>> 
>>
>> /Henrik
>>
>
>
>


key cache size

2013-02-21 Thread Kanwar Sangha
Hi - What is the approximate overhead of the key cache ? Say each key is 50 
bytes. What would be the overhead for this key in the key cache ?

Thanks,
Kanwar


Adding new nodes in a cluster with virtual nodes

2013-02-21 Thread Jean-Armel Luce
Hello,

We are using Cassandra 1.2.0.
We have a cluster of 16 physical nodes, each node has 256 virtual nodes.
We want to add 2 new nodes to our cluster; we followed the procedure as
explained here:
http://www.datastax.com/docs/1.2/operations/add_replace_nodes.

After starting one of the new nodes, we can see that this new node has 256
tokens ==> looks good
We can see that this node is in the ring (using nodetool status) ==> looks
good
After the bootstrap finished on the new node, no data had been moved
automatically from the old nodes to this new node.


However, when we send insert queries to our cluster, the new node accepts
inserts for the new rows.

Please, could you tell me if we need to perform a nodetool repair after the
bootstrap of the new node?
What happens if we perform a nodetool cleanup on the old nodes before doing
the nodetool repair? (Is there a risk of losing some data?)
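
For reference, the commands in question would be run roughly like this (a 
sketch; host names are placeholders):

    nodetool -h <new_node> repair    # on the new node
    nodetool -h <old_node> cleanup   # on each old node; removes data for
                                     # ranges the node no longer owns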

Regards.

Jean Armel


Re: File Store

2013-02-21 Thread Sergey Leschenko
On Wed, Feb 20, 2013 at 6:47 PM, Kanwar Sangha  wrote:
> Hi – I am looking for some inputs on the file storage in Cassandra.  Each
> file size can range from 200kb – 3MB.  I don’t see any limitation on the
> column size. But would it be a good idea to store these files as binary in
> the columns ?

We do the same, keeping a lot of small files (up to 15 MB).
The limitation comes from the Thrift side: its bindings require loading the
whole file into memory, but that is affordable in our case.
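
A sketch of a blob-per-row layout in CQL 3 (names made up; note each value is 
read and written as a whole):

    CREATE TABLE files (
        file_id  text PRIMARY KEY,
        content  blob    -- the whole file as one value
    );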

-- 
Sergey