Re: File Store

2013-02-21 Thread Sergey Leschenko
On Wed, Feb 20, 2013 at 6:47 PM, Kanwar Sangha kan...@mavenir.com wrote:
 Hi – I am looking for some input on file storage in Cassandra.  Each
 file can range from 200KB – 3MB in size.  I don't see any limitation on the
 column size, but would it be a good idea to store these files as binary in
 the columns ?

We do the same and keep a lot of small files this way (up to 15MB each).
The limitation comes from the Thrift side - its bindings require loading the
whole file into memory, but that is affordable in our case.
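Because the Thrift path holds the whole value in memory, large blobs are often split across several chunk columns of one row so no single read or write handles the full file. A minimal sketch of the chunking logic only (the 512KB chunk size and the helper names are illustrative assumptions, not Cassandra APIs):

```python
def chunk_blob(data, chunk_size=512 * 1024):
    """Split a blob into (chunk_index, bytes) pairs, suitable for storing
    each piece as a separate column of one row (chunk size is an assumed
    tunable, not a Cassandra default)."""
    return [(i // chunk_size, data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    """Rejoin chunks fetched back, in index order."""
    return b"".join(part for _, part in sorted(chunks))
```

Reading then becomes a column slice over the chunk columns instead of one multi-megabyte column fetch.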

-- 
Sergey


Adding new nodes in a cluster with virtual nodes

2013-02-21 Thread Jean-Armel Luce
Hello,

We are using Cassandra 1.2.0.
We have a cluster of 16 physical nodes, each node has 256 virtual nodes.
We want to add 2 new nodes in our cluster : we follow the procedure as
explained here :
http://www.datastax.com/docs/1.2/operations/add_replace_nodes.

After starting one of the new nodes, we can see that this new node has 256
tokens == looks good
We can see that this node is in the ring (using nodetool status) == looks
good
After the bootstrap finished on the new node, no data had been moved
automatically from the old nodes to this new node.


However, when we send insert queries to our cluster, the new node accepts
the new rows.

Please, could you tell me if we need to perform a nodetool repair after the
bootstrap of the new node ?
What happens if we perform a nodetool cleanup on the old nodes before doing
the nodetool repair ? (Is there a risk of losing some data ?)

Regards.

Jean Armel


key cache size

2013-02-21 Thread Kanwar Sangha
Hi - What is the approximate overhead of the key cache ? Say each key is 50 
bytes. What would be the overhead for this key in the key cache ?

Thanks,
Kanwar
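A rough way to answer this: each key cache entry holds the key bytes plus fixed JVM object overhead for the cache entry, the key wrapper, and the cached sstable position. The per-entry overhead used below is an illustrative assumption (the real figure varies by Cassandra version and JVM), not a measured number:

```python
def key_cache_entry_bytes(key_len, per_entry_overhead=100):
    # Assumed model: key bytes + a fixed amount of JVM bookkeeping
    # per cached entry (entry object, key wrapper, position value).
    return key_len + per_entry_overhead

def key_cache_total_mb(n_keys, key_len, per_entry_overhead=100):
    # Total heap consumed by n_keys cached entries, in MB.
    total = n_keys * key_cache_entry_bytes(key_len, per_entry_overhead)
    return total / (1024.0 * 1024.0)
```

Under these assumptions a 50-byte key costs on the order of 150 bytes cached, i.e. roughly 143MB of heap per million cached keys.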


Re: Testing compaction strategies on a single production server?

2013-02-21 Thread Henrik Schröder
Thanks Aaron,

I hear you on the uncharted territory bit, we're definitely not gonna
risk our live data unless we know it's safe to do what we suggested. :-) Oh
well, I guess we'll have to set up a survey node instead.


/Henrik


On Thu, Feb 21, 2013 at 4:54 AM, aaron morton aa...@thelastpickle.comwrote:

 I *think* it will work. The steps in the blog post to change the
 compaction strategy before RING_DELAY expires are there to ensure no sstables are
 created before the strategy is changed.

 But I think you will be venturing into uncharted territory where there
 might be dragons. And not the fun Disney kind.

 While it may be more work, I personally would use one node in write survey
 mode to test LCS.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 20/02/2013, at 6:28 AM, Henrik Schröder skro...@gmail.com wrote:

 Well, that answer didn't really help. I know how to make a survey node,
 and I know how to simulate reads to it, it's just that that's a lot of
 work, and I wouldn't be sure that the simulated load is the same as the
 production load.

 We gather a lot of metrics from our production servers, so we know exactly
 how they perform over long periods of time. Changing a single server to run
 a different compaction strategy would allow us to know in detail how a
 different strategy would impact the cluster.

 So, is it possible to modify org.apache.cassandra.db.[keyspace].[column
 family].CompactionStrategyClass through jmx on a production server without
 any ill effects? Or is this only possible to do on a survey node while it
 is in a specific state?


 /Henrik


 On Tue, Feb 19, 2013 at 3:09 PM, Viktor Jevdokimov 
 viktor.jevdoki...@adform.com wrote:

  Just turn off dynamic snitch on the survey node and make read requests from
 it directly with CL.ONE, watch histograms, compare.

 Regarding switching compaction strategies, there's a lot of info available already.
Best regards / Pagarbiai
 *Viktor Jevdokimov*
 Senior Developer

 Email: viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063, Fax +370 5 261 0453
 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania


  *From:* Henrik Schröder [mailto:skro...@gmail.com]
 *Sent:* Tuesday, February 19, 2013 15:57
 *To:* user
 *Subject:* Testing compaction strategies on a single production server?

 Hey,


 Version 1.1 of Cassandra introduced live traffic sampling, which allows
 you to measure the performance of a node without it really joining the
 cluster:
 http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling
 

 That page mentions that you can change the compaction strategy through
 jmx if you want to test out a different strategy on your survey node.

 That's great, but it doesn't give you a complete view of how your
 performance would change, since you're not doing reads from the survey
 node. But what would happen if you used jmx to change the compaction
 strategy of a column family on a single *production* node? Would that be a
 safe way to test it out or are there side-effects of doing that live?

 And if you do that, would running a major compaction transform the entire
 column family to the new format?

 Finally, if the test was a success, how do you proceed from there? Just
 change the schema?

 

 /Henrik






RE: Read IO

2013-02-21 Thread Kanwar Sangha
Ok.. Cassandra's default block size is 256k ? Now say my data in the column is 4 
MB, and the disk gives me 4k-block random reads @ 100 IOPS. I can read at most 
400k per second ? Does that mean I would need multiple seeks to get 
the complete data ?
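The worst case in the question can be put into numbers. Assume every 4KB block costs its own random I/O, i.e. no readahead and no contiguity; this is pessimistic, since a 4MB column is laid out sequentially on disk, but it bounds the answer:

```python
def random_reads_needed(value_bytes, block_bytes=4096):
    # Ceiling division: number of block-sized random reads
    # required to cover the whole value.
    return -(-value_bytes // block_bytes)

def worst_case_seconds(value_bytes, block_bytes=4096, iops=100):
    # Time to fetch the value if every block costs one random I/O.
    return random_reads_needed(value_bytes, block_bytes) / float(iops)
```

4MB at 4KB per I/O is 1024 reads, or about 10 seconds at 100 IOPS; in practice kernel readahead and the sequential layout make the real cost far lower.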


-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: 21 February 2013 00:05
To: user@cassandra.apache.org
Subject: Re: Read IO

 Is this correct ?

Yes, at least under optimal conditions and assuming a reasonably sized row. 
Things like read-ahead (at the kernel level) will play into it; and if your 
read (even if assumed to be small) straddles two pages you might or might not 
take another read depending on your kernel settings (typically trading 
pollution of page cache vs. number of I/O:s).

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Read IO

2013-02-21 Thread Jouni Hartikainen

Hi,

On Feb 21, 2013, at 7:52 , Kanwar Sangha kan...@mavenir.com wrote:
 Hi – Can someone explain the worst case IOPS for a read ? No key cache, no 
 row cache, sampling rate say 512.
  
 1)  Bloom filter will be checked to see existence of key (In RAM)
 2)  Index file sample (in RAM) will be checked to find the approx. location 
 in the index file on disk
 3)  1 IOPS to read the actual index file on disk (DISK)
 4)  1 IOPS to get the data from the location in the sstable (DISK)
  
 Is this correct ?

As you were asking for the worst case, I would still add one step that would be 
a seek inside an SSTable from the row start to the queried columns using column 
index.

However, this applies only if you are querying a subset of columns in the row 
(not all) and the total row size exceeds column_index_size_in_kb (defaults to 
64kB).

So, as far as I have understood, the worst case steps (without any caches) are:

1. Check the SSTable bloom filters (in memory)
2. Use index samples to find the approx. correct place in the key index file (in 
memory)
3. Read the key index file until the correct key is found (1st disk seek & read)
4. Seek to the start of the row in the SSTable file and read the row headers 
(possibly including the column index) (2nd seek & read)
5. Using the column index, seek to the correct place inside the SSTable file to 
actually read the columns (3rd seek & read)

If the row is very wide and you are asking for a random bunch of columns from 
here and there, step 5 might even be needed multiple times. Also, if your 
row has spread over many SSTables, each of them needs to be accessed (at least 
steps 1-4) to get the complete results for the query.

All this in mind, if your node has any reasonable amount of reads, I'd say that 
in practice key index files will be page cached by the OS very quickly and thus 
normal read would end up being either one seek (for small rows without the 
column index) or two (for wider rows). Of course, as Peter already pointed out, 
the more columns you ask for, the more disk needs to read. For a continuous set 
of columns the read should be linear, however.
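The read path described above can be condensed into a small seek-count model. The 64KB threshold mirrors the column_index_size_in_kb default mentioned earlier; treat this as a back-of-the-envelope sketch, not an exact accounting:

```python
def worst_case_seeks(row_bytes, column_index_bytes=64 * 1024, sstables=1):
    # Per sstable: one seek+read for the key index file,
    # one for the row start / row headers.
    seeks_per_table = 2
    # Rows larger than column_index_size_in_kb also need a
    # column-index-guided seek when only a subset of columns is read.
    if row_bytes > column_index_bytes:
        seeks_per_table += 1
    return seeks_per_table * sstables
```

With the key index files page-cached, the same model drops to one or two seeks per sstable, matching the "one seek for small rows, two for wider rows" rule of thumb above.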

-Jouni

Re: Cassandra network latency tuning

2013-02-21 Thread aaron morton
  I would like to understand how we can capture network latencies between a 
 1GbE and 10GbE for ex.
Cassandra reports two latencies.

The CF latencies reported by nodetool cfstats, nodetool cfhistograms and the CF 
MBeans cover the local time it takes to read or write the data. This does not 
include any local wait times, network latency or coordinator overhead. 

The Storage Proxy latency from nodetool proxyhistograms and the StorageProxy 
MBean is the total latency for a request on a coordinator.

Under load, with a consistent workload, the CF latency should not vary too 
much, while the request latency can increase as wait time becomes more of a 
factor. 

Additionally, streaming is throttled, which you may want to increase; see the 
yaml file. 
   
 We will soon be adding SSDs and was wondering how Cassandra can utilize the 
 10GbE and the SSDs, and if there is specific tuning required.
You may want to increase both the concurrent_writes and reads in the yaml file 
to take advantage of the extra IO. 
Same for the compaction settings, comments in the yaml file will help. 
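As a concrete starting point, the knobs mentioned live in cassandra.yaml. The values below are illustrative assumptions for an SSD / 10GbE node, not recommendations; the defaults noted in the comments are the 1.2 defaults, and you should benchmark rather than trust the numbers:

```yaml
concurrent_reads: 64                 # default 32; SSDs tolerate more parallel reads
concurrent_writes: 64                # default 32
compaction_throughput_mb_per_sec: 64   # default 16; raise if the SSDs keep up
stream_throughput_outbound_megabits_per_sec: 400   # default 200; affects repair/bootstrap streaming
```

Raise one knob at a time and watch the histograms, so you can tell which change moved the latency.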

With SSD and 10GbE you can easily hold more data on each node. Typically we 
advise 300GB to 500GB per node with HDD and 1GbE, because of the time repair 
and node replacement take. With SSD and 10GbE those operations will take much 
less time, so you can go higher. 

If you feel like being thorough, add repair and node replacement (all under 
load) to your test lineup. 

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 1:44 PM, Brandon Walsh brandon_9021...@yahoo.com wrote:

 I have a 5 node cluster and currently running ver 1.2. Prior to full scale 
 deployment, I'm running some benchmarks  using YCSB. From a hadoop cluster 
 deployment we saw an excellent improvement using higher speed networks. 
 However Cassandra does not include network latencies and I would like to 
 understand how we can capture network latencies between a 1GbE and 10GbE for 
 ex. As of now all the graphs look the same. We will soon be adding SSD's and 
 was wondering how Cassandra can utilize the 10GbE and the SSD's and if there 
 are specific tuning that is required.



Re: How to limit query results like from row 50 to 100

2013-02-21 Thread aaron morton
CQL does not support offset but does have limit. 

See 
http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit
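The usual workaround for the missing OFFSET is key-based paging: remember the last key of the previous page and start the next page after it. A sketch in CQL 3 (the table and column names are made up for illustration):

```sql
-- First page:
SELECT k, v FROM items LIMIT 50;

-- Next page: resume after the last partition key seen. token() keeps the
-- comparison in partitioner order, which is how partitions are actually sorted.
SELECT k, v FROM items WHERE token(k) > token('last_key_seen') LIMIT 50;
```

This stays cheap no matter how deep you page, whereas a true OFFSET would force the server to read and discard all the skipped rows.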

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas 
mateus.ffrei...@hotmail.com wrote:

 With CQL or an API.



Re: Heap is N.N full. Immediately on startup

2013-02-21 Thread aaron morton
My first guess would be the bloom filter and index sampling from lots-o-rows 

Check the row count in cfstats
Check the bloom filter size in cfstats. 

Background on memory requirements 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi 
andras.szerdahe...@ignitionone.com wrote:

 Hey list,
 
 Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM 
 heap at startup in Cassandra 1.1.6 besides
 row cache : not persisted and is at 0 keys when this warning is produced
 Memtables : no write traffic at startup, my app's column families are 
 durable_writes:false
 Pending tasks : no pending tasks, except for 928 compactions ( not sure where 
 those are coming from )
 I drew these conclusions from the StatusLogger output below: 
 
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) 
 GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max 
 is 8375238656
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) 
 Pool NameActive   Pending   Blocked
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) 
 ReadStage 0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
 RequestResponseStage  0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
 ReadRepairStage   0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
 MutationStage 0-1 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
 ReplicateOnWriteStage 0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
 GossipStage   0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
 AntiEntropyStage  0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
 MigrationStage0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
 StreamStage   0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
 MemtablePostFlusher   0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
 FlushWriter   0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
 MiscStage 0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
 commitlog_archiver0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) 
 InternalResponseStage 0 0 0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) 
 CompactionManager 0   928
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) 
 MessagingServicen/a   0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) 
 Cache Type Size Capacity   
 KeysToSave Provider
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) 
 KeyCache 25   25  
 all 
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) 
 RowCache  00  
 all  org.apache.cassandra.cache.SerializingCacheProvider
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) 
 ColumnFamilyMemtable ops,data
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 MYAPP_1.CF0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 MYAPP_2.CF 0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 HiveMetaStore.MetaStore   0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 system.NodeIdInfo 0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 system.IndexInfo  0,0
  INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
 system.LocationInfo   0,0
  INFO 

Re: SSTable Num

2013-02-21 Thread aaron morton
 Hi – I have around 6TB of data on 1 node
Unless you have SSD and 10GbE you probably have too much data on there. 
Remember you need to run repair and that can take a long time with a lot of 
data. Also you may need to replace a node one day, and moving 6TB will take a 
while.

  Or will the sstable compaction continue and eventually we will have 1 file ?
No. 
The default size tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.
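That behaviour can be sketched as a simulation of the bucketing rule: sstables whose size falls within a band around a bucket's average are grouped, and a bucket only becomes eligible for compaction once it has min_threshold members. The band bounds below mirror the strategy's documented defaults but are used here as assumptions for illustration:

```python
def eligible_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5,
                     min_threshold=4):
    # Group sstables whose size falls within [low, high] x bucket average.
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / float(len(bucket))
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Only buckets with at least min_threshold members get compacted.
    return [b for b in buckets if len(b) >= min_threshold]
```

This is why 32 sstables of widely varying sizes can sit untouched indefinitely: no bucket ever collects four similar-sized files, so no compaction is triggered.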

Cheers
  
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote:

 Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. 
 There is no compaction job running in the background. Is there a limit on the 
 size per sstable ? Or will the sstable compaction continue and eventually we 
 will have 1 file ?
  
 Thanks,
 Kanwar
  



Re: how to debug slowdowns from these log snippets-more info 2

2013-02-21 Thread aaron morton
Some things to consider: 

Check for contention around the switch lock. This can happen if you get a lot 
of tables flushing at the same time, or if you have a lot of secondary indexes. 
It shows up as a pattern in the logs: as soon as the writer starts flushing a 
memtable, another will be queued. Probably not happening here, but it can be a pain 
when a lot of memtables are flushed. 

I would turn on GC logging in cassandra-env.sh and watch that. After a full CMS 
collection, how full / empty is the tenured heap ? If it still has a lot in it, 
then you are running with too much cache / bloom filter / index sampling. 

You can also experiment with the Max Tenuring Threshold; try turning it up to 4 
to start with. The GC logs will show you how much data is at each tenuring 
level. You can then see how much data is being tenured, and whether premature 
tenuring was an issue. I've seen premature tenuring cause issues with wide rows 
/ long reads. 

Hope that helps. 


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Oh, and my startup command that cassandra logged was
 
 a2.bigde.nrel.gov: xss =  -ea -javaagent:/opt/cassandra/lib/jamm-0.2.5.jar
 -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8021M -Xmx8021M
 -Xmn1600M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
 
 And I remember from docs you don't want to go above 8G or java GC doesn't
 work out so well.  I am not sure why this is not working out though.
 
 Dean
 
 On 2/20/13 7:16 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
 
 Here is the printout before that log which is probably important as
 well…
 
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 GCInspector.java (line
 122) GC for ConcurrentMarkSweep: 3618 ms for 2 collections, 7038159096
 used; max is 8243904512
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line
 57) Pool NameActive   Pending   Blocked
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line
 72) ReadStage11   264 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) RequestResponseStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) ReadRepairStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) MutationStage1288 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) ReplicateOnWriteStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) GossipStage   1 7 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line
 72) AntiEntropyStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) MigrationStage0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) StreamStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) MemtablePostFlusher   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) FlushWriter   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) MiscStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line
 72) commitlog_archiver0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 72) InternalResponseStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 72) HintedHandoff 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 77) CompactionManager 4 5
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 89) MessagingServicen/a10,127
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 99) Cache Type Size Capacity
   KeysToSave
 Provider
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 100) KeyCache1310719  1310719
   all
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 106) RowCache  00
   all   
 org.apache.cassandra.cache.SerializingCacheProvider
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line
 113) ColumnFamilyMemtable ops,data
 INFO [ScheduledTasks:1] 2013-02-20 07:14:00,379 StatusLogger.java 

Re: Mutation dropped

2013-02-21 Thread aaron morton
 What does rpc_timeout control? Only the reads/writes? 
Yes. 

 like data stream,
streaming_socket_timeout_in_ms in the yaml

 merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

 What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the 
message, plus the time it takes the coordinator to do its thing. You can look 
at cfhistograms and proxyhistograms to get a better idea of how long a request 
takes in your system.  
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote:

 What does rpc_timeout control? Only the reads/writes? How about other 
 inter-node communication, like data streams and merkle tree requests?  What is a 
 reasonable value for rpc_timeout? The default value of 10 seconds is way too 
 long. What is the side effect if it's set to a really small number, say 20ms?
 
 Thanks.
 -Wei
 
 From: aaron morton aa...@thelastpickle.com
 To: user@cassandra.apache.org 
 Sent: Tuesday, February 19, 2013 7:32 PM
 Subject: Re: Mutation dropped
 
 Does the rpc_timeout not control the client timeout ?
 No it is how long a node will wait for a response from other nodes before 
 raising a TimedOutException if less than CL nodes have responded. 
 Set the client side socket timeout using your preferred client. 
 
 Is there any param which is configurable to control the replication timeout 
 between nodes ?
 There is no such thing.
 rpc_timeout is roughly like that, but it's not right to think about it that 
 way. 
 i.e. if a message to a replica times out and CL nodes have already responded 
 then we are happy to call the request complete. 
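The coordinator's rule can be condensed to: a request succeeds once CL replicas respond, regardless of what the remaining replicas do. A minimal model (the function names are mine, not Cassandra's):

```python
def quorum(rf):
    # QUORUM consistency level: a strict majority of the replicas.
    return rf // 2 + 1

def request_succeeds(replica_acks, cl_required):
    # The coordinator calls the request complete once cl_required replicas
    # respond; slower replicas may still drop the mutation, to be fixed
    # later by read repair or anti-entropy repair.
    return replica_acks >= cl_required
```

This is why RF=2, CL=ONE under heavy load shows dropped-mutation warnings but no client failures: one ack is enough for the client to see success.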
 
 Cheers
 
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 Thanks Aaron.
  
 Does the rpc_timeout not control the client timeout ? Is there any param 
 which is configurable to control the replication timeout between nodes ? Or 
 the same param is used to control that since the other node is also like a 
 client ?
  
  
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: 17 February 2013 11:26
 To: user@cassandra.apache.org
 Subject: Re: Mutation dropped
  
 You are hitting the maximum throughput on the cluster. 
  
 The messages are dropped because the node fails to start processing them 
 before rpc_timeout. 
  
 However the request is still a success because the client requested CL was 
 achieved. 
  
 Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
 Both nodes replicate each row, and writes are sent to each replica, so the 
 only thing the client is waiting on is the local node writing to its 
 commit log. 
  
 Testing with (and running in prod) RF3 and CL QUORUM is a more real-world 
 scenario. 
  
 Cheers
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
  
 @aaronmorton
 http://www.thelastpickle.com
  
 On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 
 Hi – Is there a parameter which can be tuned to prevent the mutations from 
 being dropped ? Is this logic correct ?
  
 Node A and B with RF=2, CL =1. Load balanced between the two.
  
 --  Address   Load   Tokens  Owns (effective)  Host ID   
 Rack
 UN  10.x.x.x   746.78 GB  256 100.0%
 dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
 UN  10.x.x.x   880.77 GB  256 100.0%
 95d59054-be99-455f-90d1-f43981d3d778  rack1
  
 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
 falling behind and we see the mutation dropped messages. But there are no 
 failures on the client. Does that mean the other node is not able to persist 
 the replicated data ? Is there some timeout associated with replicated data 
 persistence ?
  
 Thanks,
 Kanwar
  
  
  
  
  
  
  
 From: Kanwar Sangha [mailto:kan...@mavenir.com] 
 Sent: 14 February 2013 09:08
 To: user@cassandra.apache.org
 Subject: Mutation dropped
  
 Hi – I am doing a load test using YCSB across 2 nodes in a cluster and 
 seeing a lot of mutation dropped messages.  I understand that this is due to 
 the replica not being written to the
 other node ? RF = 2, CL = 1.
  
 From the wiki -
 For MUTATION messages this means that the mutation was not applied to all 
 replicas it was sent to. The inconsistency will be repaired by Read Repair 
 or Anti Entropy Repair
  
 Thanks,
 Kanwar
  
 
 
 



Re: very confused by jmap dump of cassandra

2013-02-21 Thread aaron morton
Cannot comment too much on the jmap but I can add my general compaction is 
hurting strategy. 

Try any or all of the following to get to a stable setup, then increase until 
things go bang. 

Set concurrent compactors to 2. 
Reduce compaction throughput by half. 
Reduce in_memory_compaction_limit. 
If you see compactions using a lot of sstables in the logs, reduce 
max_compaction_threshold. 
 
  I can easily go higher than 8G on these systems as I have 32gig each node, 
 but there were docs that said 8G is better for GC. 
More JVM memory is not the answer. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 I took this jmap dump of cassandra (in production).  Before I restarted the 
 whole production cluster, I had some nodes running compaction and it looked 
 like all memory had been consumed (kind of like cassandra is not clearing out 
 the caches or memtables fast enough).  I am still trying to debug why compaction 
 causes slowness on the cluster, since all cassandra.yaml files are pretty much 
 the defaults with size tiered compaction.
 
 The weird thing is I dump and get a 5.4G heap.bin file and load that into 
 eclipse, which tells me the total is 142.8MB… what? So low? top was showing 
 1.9G at the time (and I took this top snapshot later, 2 hours after)… how is 
 the eclipse profiler telling me the jmap showed 142.8MB in use instead of 
 1.9G in use?
 
 Tasks: 398 total,   1 running, 397 sleeping,   0 stopped,   0 zombie
 Cpu(s):  2.8%us,  0.5%sy,  0.0%ni, 96.5%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
 Mem:  32854680k total, 31910708k used,   943972k free,89776k buffers
 Swap: 33554424k total,18288k used, 33536136k free, 23428596k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 20909 cassandr  20   0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
 22455 cassandr  20   0 15288 1340  824 R  3.9  0.0   0:00.02 top
 
 It almost seems like cassandra is not being good about memory management here, 
 as we slowly get into a situation where compaction runs and takes out our 
 memory (configured for 8G).  I can easily go higher than 8G on these systems 
 as I have 32gig each node, but there were docs that said 8G is better for GC.  
 Has anyone else taken a jmap dump of cassandra?
 
 Thanks,
 Dean
 



Re: very confused by jmap dump of cassandra

2013-02-21 Thread Mohit Anchlia
Roughly how much data do you have per node?

Sent from my iPhone

On Feb 20, 2013, at 10:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 I took this jmap dump of cassandra (in production).  Before I restarted the 
 whole production cluster, I had some nodes running compaction and it looked 
 like all memory had been consumed (kind of like cassandra is not clearing out 
 the caches or memtables fast enough).  I am still trying to debug why compaction 
 causes slowness on the cluster, since all cassandra.yaml files are pretty much 
 the defaults with size tiered compaction.
 
 The weird thing is I dump and get a 5.4G heap.bin file and load that into 
 eclipse, which tells me the total is 142.8MB… what? So low? top was showing 
 1.9G at the time (and I took this top snapshot later, 2 hours after)… how is 
 the eclipse profiler telling me the jmap showed 142.8MB in use instead of 
 1.9G in use?
 
 Tasks: 398 total,   1 running, 397 sleeping,   0 stopped,   0 zombie
 Cpu(s):  2.8%us,  0.5%sy,  0.0%ni, 96.5%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
 Mem:  32854680k total, 31910708k used,   943972k free,89776k buffers
 Swap: 33554424k total,18288k used, 33536136k free, 23428596k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 20909 cassandr  20   0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
 22455 cassandr  20   0 15288 1340  824 R  3.9  0.0   0:00.02 top
 
 It almost seems like cassandra is not being good about memory management here, 
 as we slowly get into a situation where compaction runs and takes out our 
 memory (configured for 8G).  I can easily go higher than 8G on these systems 
 as I have 32gig each node, but there were docs that said 8G is better for GC.  
 Has anyone else taken a jmap dump of cassandra?
 
 Thanks,
 Dean
 


RE: SSTable Num

2013-02-21 Thread Kanwar Sangha
No.
The default size tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.

Ok. So for 10 TB, I could have at least 4 SSTable files, each of 2.5 TB ?

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: 21 February 2013 11:01
To: user@cassandra.apache.org
Subject: Re: SSTable Num

Hi - I have around 6TB of data on 1 node
Unless you have SSD and 10GbE you probably have too much data on there.
Remember you need to run repair and that can take a long time with a lot of 
data. Also you may need to replace a node one day and moving 6TB will take a 
while.

 Or will the sstable compaction continue and eventually we will have 1 file ?
No.
The default size tiered strategy compacts files that are roughly the same size, 
and only when there are more than 4 (default) of them.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 3:47 AM, Kanwar Sangha 
kan...@mavenir.com wrote:


Hi - I have around 6TB of data on 1 node and the cfstats show 32 sstables. 
There is no compaction job running in the background. Is there a limit on the 
size per sstable ? Or will the sstable compaction continue and eventually we 
will have 1 file ?

Thanks,
Kanwar




Re: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread aaron morton
If you are lazy like me wolfram alpha can help 

http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbEa=UnitClash_*TB.*Tebibytes--

10 hours 15 minutes 43.59 seconds

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote:

 you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link
 
 On 19 Feb 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote:
 I thought about this more, and even with a 10Gbit network, it would take 40 
 days to bring up a replacement node if mongodb did truly have a 42T / node 
 like I had heard.  I wrote the below email to the person I heard this from 
 going back to basics which really puts some perspective on it….(and a lot of 
 people don't even have a 10Gbit network like we do)
 
 Nodes are hooked up by a 10G network at most right now where that is 
 10gigabit.  We are talking about 10Terabytes on disk per node recently.
 
 Google 10 gigabit in gigabytes gives me 1.25 gigabytes/second  (yes I could 
 have divided by 8 in my head but eh…course when I saw the number, I went duh)
 
 So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we are 
 bringing online to replace a dead node would take approximately 5 days???
 
 This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1 
 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more 
 likely 11 days if we only use 50% of the network.
 
 So bringing a new node up to speed is more like 11 days once it is crashed.  
 I think this is the main reason the 1Terabyte exists to begin with, right?
 
 From an ops perspective, this could sound like a nightmare scenario of 
 waiting 10 days…..maybe it is livable though.  Either way, I thought it would 
 be good to share the numbers.  ALSO, that is assuming the bus with it's 10 
 disk can keep up with 10G  Can it?  What is the limit of throughput on a 
 bus / second on the computers we have as on wikipedia there is a huge 
 variance?
 
 What is the rate of the disks too (multiplied by 10 of course)?  Will they 
 keep up with a 10G rate for bringing a new node online?
 
 This all comes into play even more so when you want to double the size of 
 your cluster of course as all nodes have to transfer half of what they have 
 to all the new nodes that come online(cassandra actually has a very data 
 center/rack aware topology to transfer data correctly to not use up all 
 bandwidth unecessarily…I am not sure mongodb has that).  Anyways, just food 
 for thought.
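
The line-rate arithmetic above is easy to sanity-check (a sketch at theoretical line rate only; it ignores protocol overhead and contention, so real-world throughput will be lower):

```python
def transfer_time_seconds(size_bytes, link_gbps):
    # 10 GbE moves 10e9 bits/s = 1.25e9 bytes/s at theoretical line rate.
    bytes_per_second = link_gbps * 1e9 / 8
    return size_bytes / bytes_per_second

tib = 1024 ** 4  # tebibyte
secs = transfer_time_seconds(42 * tib, 10)
h, rem = divmod(secs, 3600)
m, s = divmod(rem, 60)
print(f"{int(h)}h {int(m)}m {s:.2f}s")  # 10h 15m 43.59s for 42 TiB at 10 Gb/s
```

This reproduces the 10 hours 15 minutes figure quoted elsewhere in the thread for 42 TiB; 10 TB at the same rate is on the order of hours, not days, at full line rate.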
 
From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 18, 2013 1:39 PM
To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no
 Subject: Re: cassandra vs. mongodb quick question
 
 My experience is repair of 300GB compressed data takes longer than 300GB of 
 uncompressed, but I cannot point to an exact number. Calculating the 
 differences is mostly CPU bound and works on the non compressed data.
 
 Streaming uses compression (after uncompressing the on disk data).
 
 So if you have 300GB of compressed data, take a look at how long repair takes 
 and see if you are comfortable with that. You may also want to test replacing 
 a node so you can get the procedure documented and understand how long it 
 takes.
 
 The idea of the soft 300GB to 500GB limit came about because of a number of 
 cases where people had 1 TB on a single node and they were surprised it took 
 days to repair or replace. If you know how long things may take, and that 
 fits in your operations then go with it.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote:
 
 
 
 Just out of curiosity :
 
 When using compression, does this affect this one way or another?  Is 300G 
 (compressed) SSTable size, or total size of data?
 
 .vegard,
 
 - Original Message -
 From:
 user@cassandra.apache.org
 
 To:
 user@cassandra.apache.org
 Cc:
 
 Sent:
 Mon, 18 Feb 2013 08:41:25 +1300
 Subject:
 Re: cassandra vs. mongodb quick question
 
 
 If you have spinning disk and 1G networking and no virtual nodes, I would 
 still say 300G to 500G is a soft limit.
 
 If you are using virtual nodes, SSD, JBOD disk configuration or faster 
 networking you may go higher.
 
 The limiting factors are the time it takes to repair, the time it takes to 
 replace a node, and the memory considerations for 100's of millions of rows. If 
 the performance of those operations is acceptable to you, then go crazy.

Re: key cache size

2013-02-21 Thread aaron morton
This is the key cache entry 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cache/KeyCacheKey.java

Note that the Descriptor is re-used. 

If you want to see key cache metrics, including bytes used,  use nodetool info. 
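
As a rough back-of-envelope (the 64-byte per-entry JVM overhead figure below is an assumption for illustration, not a measured number; trust nodetool info on your own cluster over any estimate):

```python
def key_cache_entry_bytes(key_len, jvm_overhead=64):
    # key bytes + cached position (a long, 8 bytes) + assumed object/map
    # overhead for the KeyCacheKey entry. jvm_overhead is a guess; real
    # numbers depend on JVM, pointer compression and Cassandra version.
    return key_len + 8 + jvm_overhead

print(key_cache_entry_bytes(50))  # 122 bytes per 50-byte key, under these assumptions
```

So a 50-byte key costs a small multiple of its own size, not just 50 bytes.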

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/02/2013, at 3:45 AM, Kanwar Sangha kan...@mavenir.com wrote:

 Hi – What is the approximate overhead of the key cache ? Say each key is 50 
 bytes. What would be the overhead for this key in the key cache ?
  
 Thanks,
 Kanwar



Re: Heap is N.N full. Immediately on startup

2013-02-21 Thread Andras Szerdahelyi
Thank you - indeed my index interval is 64 with a CF of 300M rows, and the bloom 
filter false positive chance was at its default.
Raising the index interval to 512 didn't fix this alone, so I guess I'll have 
to set the bloom filter to some reasonable value and scrub.

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Thursday 21 February 2013 17:58
To: user@cassandra.apache.org
Subject: Re: Heap is N.N full. Immediately on startup

My first guess would be the bloom filter and index sampling from lots-o-rows

Check the row count in cfstats
Check the bloom filter size in cfstats.

Background on memory requirements 
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
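
To see why the row count dominates, here is a sketch using the standard Bloom filter sizing formula, bits per key = -ln(p) / (ln 2)^2 (the fp chances below are illustrative; 0.000744 is roughly the historical default):

```python
import math

def bloom_bytes(n_keys, fp_chance):
    # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

rows = 300_000_000
for p in (0.000744, 0.01, 0.1):
    print(f"fp={p}: {bloom_bytes(rows, p) / 1024**3:.2f} GiB")
```

Per node and per CF that is hundreds of megabytes of heap at 300M rows, before index samples are counted, which is why relaxing the fp chance (and scrubbing to rebuild filters) recovers memory.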

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi 
andras.szerdahe...@ignitionone.com 
wrote:

Hey list,

Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap 
at startup in Cassandra 1.1.6 besides

  *   row cache : not persisted and is at 0 keys when this warning is produced
  *   Memtables : no write traffic at startup, my app's column families are 
durable_writes:false
  *   Pending tasks : no pending tasks, except for 928 compactions ( not sure 
where those are coming from )

I drew these conclusions from the StatusLogger output below:

INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 
8375238656
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) 
Pool NameActive   Pending   Blocked
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) 
ReadStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
RequestResponseStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
ReadRepairStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) 
MutationStage 0-1 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
ReplicateOnWriteStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
GossipStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
AntiEntropyStage  0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
MigrationStage0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) 
StreamStage   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
MemtablePostFlusher   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
FlushWriter   0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
MiscStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) 
commitlog_archiver0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) 
InternalResponseStage 0 0 0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) 
CompactionManager 0   928
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) 
MessagingService                n/a       0,0
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) 
Cache Type Size Capacity   
KeysToSave Provider
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) 
KeyCache 25   25
  all
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) 
RowCache  00
  all  org.apache.cassandra.cache.SerializingCacheProvider
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) 
ColumnFamilyMemtable ops,data
 INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) 
MYAPP_1.CF                          0,0
 INFO [ScheduledTasks:1] 2013-02-20 

Re: Using Cassandra for read operations

2013-02-21 Thread Bill de hÓra
 To avoid disk I/Os, the best option we thought is to have data in memory. 
 Is it a good idea to have memtable setup around 1/2 or 3/4 of 
 heap size? Obviously flushing will take a lot of time but would 
 that hurt that node's performance big time?

Start with the defaults and test your workload. If memtables start flushing 
aggressively (because of write load or bad settings), that can cause compaction 
work on the disk, which might impair read I/O. 


  Is there a way to figure out max read-latency for a bunch of read operations?

Use nodetool's histogram feature to get a sense of outlier latency.


 We just need one column family with a long key

Take time to tune your key caches and bloom filters. They use memory and have 
an impact on read performance.
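
For example, the global cache knobs live in cassandra.yaml in 1.1/1.2 (the sizes below are placeholders to measure against, not recommendations):

```yaml
# cassandra.yaml -- global cache sizes; values are illustrative only
key_cache_size_in_mb: 512     # default: min(5% of heap, 100MB)
row_cache_size_in_mb: 0       # keep off until read patterns are measured
```

Bloom filters are tuned per CF, e.g. `ALTER TABLE ks.cf WITH bloom_filter_fp_chance = 0.01;` in CQL3, followed by a scrub to rebuild the filters of existing sstables.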


 Given that cassandra provides off-heap row caching, in a 
 machine 32 gb RAM, would it be wise to have a 10 gb row 
 cache with 8 gb java heap? 

If you use the off heap cache, allow enough room for the filesystems' own 
cache, i.e. don't give over all of ram to the off heap cache. Also the off heap 
cache can slow you down with wide rows due to serialisation overhead, or cache 
invalidation thrashing if you are update-heavy. If you use the on-heap cache, 
pay close attention to GC cycles and memory stability - if you are 
cycling/evicting through the cache at a high rate that can leave too much 
garbage in memory such that the garbage collector can't keep up. If the node 
doesn't have enough working memory after GC, it will _resize_ key and row 
caches. This will lead to degraded read performance and with some workloads can 
result in a vicious cycle.


  For our SLAs, a read of max 15-20 rows at once(using multi slice), 
 should not take more than 4 ms.

If you control your own hardware (and you probably should/must for this kind of 
latency demand) consider SSDs. You might want to carefully control background 
repair/compaction operations if predictable performance is your goal. You might 
want to avoid storing strings and use byte representations. If you have an 
application tier on the path consider caching in that tier as well to avoid the 
overhead of network calls and thrift processing.

In a nutshell -

- Start with defaults and tune based on small discrete adjustments and leave 
time to see the effect of each change. No-one will know your workload better 
than you and the questions you are asking are workload sensitive.

- Allow time for tuning and spending time understanding the memory model and 
JVM GC.

- Be very careful with caches. Leave enough room in the OS for its own disk 
cache.

- Get an SSD


Bill


On 21 Feb 2013, at 19:03, amulya rattan talk2amu...@gmail.com wrote:

 Dear All,
 
 We are currently evaluating Cassandra for an application involving strict 
 SLAs(Service level agreements). We just need one column family with a long 
 key and approximately 70-80 bytes row. We are not concerned about write 
 performance but are primarily concerned about read. For our SLAs, a read of 
 max 15-20 rows at once(using multi slice), should not take more than 4 ms. 
 Till now, on a single-node setup, using Cassandra's stress tool, the numbers 
 are promising. But I am guessing that's because there is no network latency 
 involved there and since we set memtable around 2gb(4 gb heap), we never had 
 to get to Disk I/O.
 
 Assuming our nodes having 32GB RAM, a couple of questions regarding read:
 
 * To avoid disk I/Os, the best option we thought is to have data in memory. 
 Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size? 
 Obviously flushing will take a lot of time but would that hurt that node's 
 performance big time?
 
 * Cassandra stress tool only gives out average read latency. Is there a way 
 to figure out max read-latency for a bunch of read operations?
 
 * How big a row cache can one have? Given that cassandra provides off-heap 
 row caching, in a machine 32 gb RAM, would it be wise to have a 10 gb row 
 cache with 8 gb java heap? And how big should the corresponding key cache be 
 then?
 
 Any response is appreciated.
 
 ~Amulya 
 



RE: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread Kanwar Sangha
“The limiting factors are the time it takes to repair, the time it takes to 
replace a node, and the memory considerations for 100's of millions of rows. If 
the performance of those operations is acceptable to you, then go crazy”



If I have a node which is attached to a RAID and the node crashes but the data 
is still good on the drives, it would just mean bringing up the node using the 
same storage ? would this not be fast…?




From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: 21 February 2013 11:46
To: user@cassandra.apache.org
Subject: Re: cassandra vs. mongodb quick question(good additional info)

If you are lazy like me, Wolfram Alpha can help

http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbE&a=UnitClash_*TB.*Tebibytes--

10 hours 15 minutes 43.59 seconds

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote:



you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb link
On 19 Feb 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote:
I thought about this more, and even with a 10Gbit network, it would take 40 
days to bring up a replacement node if mongodb did truly have a 42T / node like 
I had heard.  I wrote the below email to the person I heard this from going 
back to basics which really puts some perspective on it….(and a lot of people 
don't even have a 10Gbit network like we do)

Nodes are hooked up by a 10G network at most right now where that is 10gigabit. 
 We are talking about 10Terabytes on disk per node recently.

Google 10 gigabit in gigabytes gives me 1.25 gigabytes/second  (yes I could 
have divided by 8 in my head but eh…course when I saw the number, I went duh)

So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we are 
bringing online to replace a dead node would take approximately 5 days???

This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1 
second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more likely 
11 days if we only use 50% of the network.

So bringing a new node up to speed is more like 11 days once it is crashed.  I 
think this is the main reason the 1Terabyte exists to begin with, right?

From an ops perspective, this could sound like a nightmare scenario of waiting 
10 days…..maybe it is livable though.  Either way, I thought it would be good 
to share the numbers.  ALSO, that is assuming the bus with it's 10 disk can 
keep up with 10G  Can it?  What is the limit of throughput on a bus / 
second on the computers we have as on wikipedia there is a huge variance?

What is the rate of the disks too (multiplied by 10 of course)?  Will they keep 
up with a 10G rate for bringing a new node online?

This all comes into play even more so when you want to double the size of your 
cluster of course as all nodes have to transfer half of what they have to all 
the new nodes that come online(cassandra actually has a very data center/rack 
aware topology to transfer data correctly to not use up all bandwidth 
unecessarily…I am not sure mongodb has that).  Anyways, just food for thought.

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 18, 2013 1:39 PM
To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no
Subject: Re: cassandra vs. mongodb quick question

My experience is repair of 300GB compressed data takes longer than 300GB of 
uncompressed, but I cannot point to an exact number. Calculating the 
differences is mostly CPU bound and works on the non compressed data.

Streaming uses compression (after uncompressing the on disk data).

So if you have 300GB of compressed data, take a look at how long repair takes 
and see if you are comfortable with that. You may also want to test replacing a 
node so you can get the procedure documented and understand how long it takes.

The idea of the soft 300GB to 500GB limit came about because of a number of 
cases where people had 1 TB on a single node and they were surprised it took 
days to repair or replace. If you know how long things may take, and that fits 
in your operations then go with it.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand


Cassandra with SAN

2013-02-21 Thread Kanwar Sangha
Hi - Is it a good idea to use Cassandra with a SAN? Say a SAN which provides me 
8 petabytes of storage. Would I not be I/O bound irrespective of the number of 
Cassandra machines, so that scaling by adding machines won't help?

Thanks
Kanwar


Re: Cassandra with SAN

2013-02-21 Thread Michael Kjellman
No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Thursday, February 21, 2013 2:56 PM
To: user@cassandra.apache.org
Subject: Cassandra with SAN

Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won’t help ?

Thanks
Kanwar

Copy, by Barracuda, helps you store, protect, and share all your amazing
things. Start today: www.copy.com.


Re: cassandra vs. mongodb quick question(good additional info)

2013-02-21 Thread Edward Capriolo
The theoretical maximum of 10G is not even close to what you actually get.

http://download.intel.com/support/network/sb/fedexcasestudyfinal.pdf


On Thu, Feb 21, 2013 at 12:45 PM, aaron morton aa...@thelastpickle.com wrote:
 If you are lazy like me, Wolfram Alpha can help

 http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbE&a=UnitClash_*TB.*Tebibytes--

 10 hours 15 minutes 43.59 seconds

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote:

 you have 86400 seconds a day so 42T could take less than 12 hours on 10Gb
 link

 On 19 Feb 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote:

 I thought about this more, and even with a 10Gbit network, it would take
 40 days to bring up a replacement node if mongodb did truly have a 42T /
 node like I had heard.  I wrote the below email to the person I heard this
 from going back to basics which really puts some perspective on it….(and a
 lot of people don't even have a 10Gbit network like we do)

 Nodes are hooked up by a 10G network at most right now where that is
 10gigabit.  We are talking about 10Terabytes on disk per node recently.

 Google 10 gigabit in gigabytes gives me 1.25 gigabytes/second  (yes I
 could have divided by 8 in my head but eh…course when I saw the number, I
 went duh)

 So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we
 are bringing online to replace a dead node would take approximately 5
 days???

 This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1
 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more
 likely 11 days if we only use 50% of the network.

 So bringing a new node up to speed is more like 11 days once it is
 crashed.  I think this is the main reason the 1Terabyte exists to begin
 with, right?

 From an ops perspective, this could sound like a nightmare scenario of
 waiting 10 days…..maybe it is livable though.  Either way, I thought it
 would be good to share the numbers.  ALSO, that is assuming the bus with
 it's 10 disk can keep up with 10G  Can it?  What is the limit of
 throughput on a bus / second on the computers we have as on wikipedia there
 is a huge variance?

 What is the rate of the disks too (multiplied by 10 of course)?  Will they
 keep up with a 10G rate for bringing a new node online?

 This all comes into play even more so when you want to double the size of
 your cluster of course as all nodes have to transfer half of what they have
 to all the new nodes that come online(cassandra actually has a very data
 center/rack aware topology to transfer data correctly to not use up all
 bandwidth unecessarily…I am not sure mongodb has that).  Anyways, just food
 for thought.

 From: aaron morton aa...@thelastpickle.com
 Reply-To: user@cassandra.apache.org
 Date: Monday, February 18, 2013 1:39 PM
 To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no
 Subject: Re: cassandra vs. mongodb quick question

 My experience is repair of 300GB compressed data takes longer than 300GB
 of uncompressed, but I cannot point to an exact number. Calculating the
 differences is mostly CPU bound and works on the non compressed data.

 Streaming uses compression (after uncompressing the on disk data).

 So if you have 300GB of compressed data, take a look at how long repair
 takes and see if you are comfortable with that. You may also want to test
 replacing a node so you can get the procedure documented and understand how
 long it takes.

 The idea of the soft 300GB to 500GB limit came about because of a number of
 cases where people had 1 TB on a single node and they were surprised it took
 days to repair or replace. If you know how long things may take, and that
 fits in your operations then go with it.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote:



 Just out of curiosity :

 When using compression, does this affect this one way or another?  Is 300G
 (compressed) SSTable size, or total size of data?

 .vegard,

 - Original Message -
 From:
 user@cassandra.apache.org

 To:
 user@cassandra.apache.org
 Cc:

 Sent:
 Mon, 18 Feb 2013 08:41:25 +1300
 Subject:
 Re: 

RE: Cassandra with SAN

2013-02-21 Thread Kanwar Sangha
Ok. What would be the drawbacks :)

From: Michael Kjellman [mailto:mkjell...@barracuda.com]
Sent: 21 February 2013 17:12
To: user@cassandra.apache.org
Subject: Re: Cassandra with SAN

No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Thursday, February 21, 2013 2:56 PM
To: user@cassandra.apache.org
Subject: Cassandra with SAN

Hi - Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won't help ?

Thanks
Kanwar



Re: Cassandra with SAN

2013-02-21 Thread David Schairer
Who breaks a butterfly upon a wheel?

It will work, but you'd have a distributed database running on a single point 
of failure storage fabric, thus destroying much of your benefits, unless you 
have enough discrete SAN units that you treat them as racks in your cassandra 
topology to ensure that you have data replicated across redundant SAN 
shelves|controllers|etc.

You also would end up with redundancy at cross purposes in that the SAN will be 
striping data that Cassandra is already distributing efficiently.

If the SAN is free and unused, it'll be fine as a Cassandra test platform.  But 
I wouldn't spend a penny on SAN hardware instead of a much larger distributed 
cluster with commodity hardware.  Derive your redundancy and performance from 
lots of hardware in lots of places, not expensive hardware in one place.  

--DRS

On Feb 21, 2013, at 3:42 PM, Kanwar Sangha kan...@mavenir.com wrote:

 Ok. What would be the drawbacks :)
  
 From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
 Sent: 21 February 2013 17:12
 To: user@cassandra.apache.org
 Subject: Re: Cassandra with SAN
  
 No, this is a really really bad idea and C* was not designed for this, in 
 fact, it was designed so you don't need to have a large expensive SAN.
  
 Don't be tempted by the shiny expensive SAN. :)
  
 If money is no object instead throw SSD's in your nodes and run 10G between 
 racks
  
 From: Kanwar Sangha kan...@mavenir.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Thursday, February 21, 2013 2:56 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Cassandra with SAN
  
 Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
 me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
 Cassandra machines and scaling by adding
 machines won’t help ?
  
 Thanks
 Kanwar
  



Re: Cassandra with SAN

2013-02-21 Thread Michael Kjellman
You'd be adding a single point of failure to a database you presumably chose 
because it is distributed. I'd also think you'd be tempted to have multiple 
terabytes per node (so you're even more cost-inefficient, because you'll still 
need to buy the same number of nodes everyone else does even though you have 
the SAN). Then any operations (repair, cleanup) are going to be unbearable. 
Also, if you want to be multi-DC, you'll now need two SANs.

I can't think of one good reason to run C* with a SAN.

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Thursday, February 21, 2013 3:42 PM
To: user@cassandra.apache.org
Subject: RE: Cassandra with SAN

Ok. What would be the drawbacks :)

From: Michael Kjellman [mailto:mkjell...@barracuda.com]
Sent: 21 February 2013 17:12
To: user@cassandra.apache.orgmailto:user@cassandra.apache.org
Subject: Re: Cassandra with SAN

No, this is a really really bad idea and C* was not designed for this, in fact, 
it was designed so you don't need to have a large expensive SAN.

Don't be tempted by the shiny expensive SAN. :)

If money is no object instead throw SSD's in your nodes and run 10G between 
racks

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Thursday, February 21, 2013 2:56 PM
To: user@cassandra.apache.org
Subject: Cassandra with SAN

Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides me 
8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
Cassandra machines and scaling by adding
machines won’t help ?

Thanks
Kanwar



Re: Cassandra with SAN

2013-02-21 Thread P. Taylor Goetz
Cassandra is designed to write and read data in a way that is optimized for 
physical spinning disks.

Running C* on a SAN introduces a layer of abstraction that, at best negates 
those optimizations, and at worst introduces additional overhead.

Sent from my iPhone

On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote:

 Ok. What would be the drawbacks :)
  
 From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
 Sent: 21 February 2013 17:12
 To: user@cassandra.apache.org
 Subject: Re: Cassandra with SAN
  
 No, this is a really really bad idea and C* was not designed for this, in 
 fact, it was designed so you don't need to have a large expensive SAN.
  
 Don't be tempted by the shiny expensive SAN. :)
  
 If money is no object instead throw SSD's in your nodes and run 10G between 
 racks
  
 From: Kanwar Sangha kan...@mavenir.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Thursday, February 21, 2013 2:56 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Cassandra with SAN
  
 Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
 me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of 
 Cassandra machines and scaling by adding
 machines won’t help ?
  
 Thanks
 Kanwar
  
 -- 
 Copy, by Barracuda, helps you store, protect, and share all your amazing 
 things. Start today: www.copy.com.


Re: Cassandra with SAN

2013-02-21 Thread P. Taylor Goetz
I shouldn't have used the word spinning... SSDs are a great option as well.

I also agree with all the expensive SPOF points others have made.

Sent from my iPhone

On Feb 21, 2013, at 6:56 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 Cassandra is designed to write and read data in a way that is optimized for 
 physical spinning disks.
 
 Running C* on a SAN introduces a layer of abstraction that, at best negates 
 those optimizations, and at worst introduces additional overhead.
 
 Sent from my iPhone
 
 On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote:
 
 Ok. What would be the drawbacks :)
  
 From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
 Sent: 21 February 2013 17:12
 To: user@cassandra.apache.org
 Subject: Re: Cassandra with SAN
  
 No, this is a really really bad idea and C* was not designed for this, in 
 fact, it was designed so you don't need to have a large expensive SAN.
  
 Don't be tempted by the shiny expensive SAN. :)
  
 If money is no object instead throw SSD's in your nodes and run 10G between 
 racks
  
 From: Kanwar Sangha kan...@mavenir.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Thursday, February 21, 2013 2:56 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Cassandra with SAN
  
 Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which provides 
 me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no 
 of Cassandra machines and scaling by adding
 machines won’t help ?
  
 Thanks
 Kanwar
  
 -- 
 Copy, by Barracuda, helps you store, protect, and share all your amazing 
 things. Start today: www.copy.com.


Re: Counting problem

2013-02-21 Thread Jason Wee
There is a limit option, find it in the doc.


On Fri, Feb 22, 2013 at 3:41 AM, Sri Ramya ramya.1...@gmail.com wrote:

 hi,,
 Cassandra can display maximum 100 rows in a Columnfamily. can i increase
 it. If it is possible please mention here.
   Thank you
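[For the archive: the 100-row cap mentioned above is just the default of the cassandra-cli `list` command, and both cassandra-cli and CQL accept an explicit limit. A sketch — column family names here are placeholders, check the docs for your version:

```sql
-- cassandra-cli: list shows 100 rows by default; raise it explicitly
list MyColumnFamily limit 1000;

-- CQL (cqlsh): use the LIMIT clause on the query
SELECT * FROM my_column_family LIMIT 1000;
```
]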



Re: unsubscribe

2013-02-21 Thread Eric Evans
On Tue, Feb 19, 2013 at 1:02 PM, Anurag Gujral anurag.guj...@gmail.com wrote:

 Unsubscribe me please.
 Thanks

Could I interest you in a picture of a lemur instead?

http://goo.gl/RZw3e


--
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: Cassandra with SAN

2013-02-21 Thread Jared Biel
 As a counter argument though, anyone running a C* cluster on the Amazon
cloud is going to be using SAN storage (or some kind of proprietary storage
array) at the lowest  layers...Amazon isn't going to have a bunch of JBOD
running their cloud infrastructure.  However, they've invested in the
infrastructure to do it right.

This is certainly true when using EBS, however it's generally not
recommended to use EBS when running Cassandra. EBS has proven to be
unreliable in the past and it's a bit of a SPOF. Instead, it's recommended
to use the instance store disks that come with most instances (handy
chart here: http://www.ec2instances.info/). These are the rough equivalent
of local disks (probably host-level RAID 10 storage, if I had to guess).

-Jared

On 22 February 2013 00:40, Michael Morris michael.m.mor...@gmail.comwrote:

 I'm running a 27 node cassandra cluster on SAN without issue.  I will be
 perfectly clear though, the hosts are multi-homed to different
 switches/fabrics in the SAN, we have an _expensive_ EMC array, and other
 than a datacenter-wide power outage, there's no SPOF for the SAN.  We use
 it because it's there, and it's already a sunk cost.

 I certainly would not go out of my way to purchase SAN infrastructure for
 a C* cluster, it just doesn't make sense (for all the reasons others have
 mentioned).  Any more, you can load up a single 2U server with multi-TB
 worth of disk, so the aggregate storage capacity of your C* cluster could
 potentially be as much as a SAN you would purchase (and a lot less hassle
 too).

 As a counter argument though, anyone running a C* cluster on the Amazon
 cloud is going to be using SAN storage (or some kind of proprietary storage
 array) at the lowest layers...Amazon isn't going to have a bunch of JBOD
 running their cloud infrastructure.  However, they've invested in the
 infrastructure to do it right.

 - Mike


 On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz ptgo...@gmail.comwrote:

 I shouldn't have used the word spinning... SSDs are a great option as
 well.

 I also agree with all the expensive SPOF points others have made.

 Sent from my iPhone

 On Feb 21, 2013, at 6:56 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 Cassandra is designed to write and read data in a way that is optimized
 for physical spinning disks.

 Running C* on a SAN introduces a layer of abstraction that, at best
 negates those optimizations, and at worst introduces additional overhead.

 Sent from my iPhone

 On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote:

 Ok. What would be the drawbacks :)

 From: Michael Kjellman [mailto:mkjell...@barracuda.com]
 Sent: 21 February 2013 17:12
 To: user@cassandra.apache.org
 Subject: Re: Cassandra with SAN

 No, this is a really really bad idea and C* was not designed for this, in
 fact, it was designed so you don't need to have a large expensive SAN.

 Don't be tempted by the shiny expensive SAN. :)

 If money is no object instead throw SSD's in your nodes and run 10G
 between racks

 From: Kanwar Sangha kan...@mavenir.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, February 21, 2013 2:56 PM
 To: user@cassandra.apache.org
 Subject: Cassandra with SAN

 Hi – Is it a good idea to use Cassandra with SAN ?  Say a SAN which
 provides me 8 Petabytes of storage. Would I not be I/O bound irrespective
 of the no of Cassandra machines and scaling by adding
 machines won’t help ?

 Thanks
 Kanwar

 --
 Copy, by Barracuda, helps you store, protect, and share all your amazing
 things. Start today: www.copy.com.





Re: Cassandra with SAN

2013-02-21 Thread Ben Gambley
On Friday, February 22, 2013, Jared Biel wrote:

  As a counter argument though, anyone running a C* cluster on the Amazon
 cloud is going to be using SAN storage (or some kind of proprietary storage
 array) at the lowest  layers...Amazon isn't going to have a bunch of JBOD
 running their cloud infrastructure.  However, they've invested in the
 infrastructure to do it right.

 This is certainly true when using EBS, however it's generally not
 recommended to use EBS when running Cassandra. EBS has proven to be
 unreliable in the past and it's a bit of a SPOF. Instead, it's recommended
 to use the instance store disks that come with most instances (handy
 chart here: http://www.ec2instances.info/). These are the rough
 equivalent of local disks (probably host level RAID 10 storage if I'd have
 to guess.)

 -Jared

 On 22 February 2013 00:40, Michael Morris michael.m.mor...@gmail.comwrote:

 I'm running a 27 node cassandra cluster on SAN without issue.  I will be
 perfectly clear though, the hosts are multi-homed to different
 switches/fabrics in the SAN, we have an _expensive_ EMC array, and other
 than a datacenter-wide power outage, there's no SPOF for the SAN.  We use
 it because it's there, and it's already a sunk cost.

 I certainly would not go out of my way to purchase SAN infrastructure for
 a C* cluster, it just doesn't make sense (for all the reasons others have
 mentioned).  Any more, you can load up a single 2U server with multi-TB
 worth of disk, so the aggregate storage capacity of your C* cluster could
 potentially be as much as a SAN you would purchase (and a lot less hassle
 too).

 As a counter argument though, anyone running a C* cluster on the Amazon
 cloud is going to be using SAN storage (or some kind of proprietary storage
 array) at the lowest layers...Amazon isn't going to have a bunch of JBOD
 running their cloud infrastructure.  However, they've invested in the
 infrastructure to do it right.

 - Mike


 On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz ptgo...@gmail.comwrote:

 I shouldn't have used the word spinning... SSDs are a great option as
 well.

 I also agree with all the expensive SPOF points others have made.

 Sent from my iPhone

 On Feb 21, 2013, at 6:56 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 Cassandra is designed to write and read data in a way that is optimized
 for physical spinning disks.

 Running C* on a SAN introduces a layer of abstraction that, at best
 negates those optimizations, and at worst introduces additional overhead.

 Sent from my iPhone

 On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote:

  Ok. What would be the drawbacks :)

  From: Michael Kjellman [mailto:mkjell...@barracuda.com]
  Sent: 21 February 2013 17:12
  To: user@cassandra.apache.org
  Subject: Re: Cassandra with SAN

  No, this is a really really bad idea and C* was not designed for this, in
  fact, it was designed so you don't need to have a large expensive SAN.

  Don't be tempted by the shiny expensive SAN. :)

  If money is no object instead throw SSD's in your nodes and run 10G
  between racks

  From: Kanwar Sangha kan...@mavenir.com
  Reply-To: user@cassandra.apache.org




RE: Using Cassandra for read operations

2013-02-21 Thread Viktor Jevdokimov
Bill de hÓra already answered, I'd like to add:

To achieve ~4ms reads (from the client's standpoint):
1. You can't use multi-slice, since different keys may land on different nodes,
which requires internode communication. Design your data and reads to use one
key/row.
2. Use ConsistencyLevel.ONE to avoid waiting for other nodes.
3. Use a smart client that selects endpoints by token (key) so the request goes
to the appropriate node; use Astyanax (Java) or write such a client yourself.
4. Turn off the dynamic snitch. While the coordinator may read locally, the
dynamic snitch may redirect the read to another replica.
5. Use SSD's to avoid re-cache issue when sstables are compacted.
6. If you do writes, the remaining issue is GC. If you're not on the Azul Zing
JVM, which I can't confirm to be better than Oracle HotSpot or JRockit (both
have GC issues), you can't tune the JVM to keep young-gen GC pauses as low as
you need; you will be trading pause frequency against pause length.
So if you can afford Zing, also look at Aerospike (ex-CitrusLeaf), an
alternative to Cassandra that is written in C and has no GC issues.
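[Point 3 above can be illustrated with a toy token-aware endpoint picker, assuming a RandomPartitioner-style MD5 token ring. The ring layout and hosts below are invented for the example; a real client such as Astyanax discovers the ring from the cluster:

```python
import bisect
import hashlib

def md5_token(key: bytes) -> int:
    # RandomPartitioner-style token: the MD5 of the key as a 128-bit integer
    return int(hashlib.md5(key).hexdigest(), 16)

def pick_endpoint(key: bytes, ring):
    # ring: sorted list of (token, host) pairs describing the token ring.
    # The primary replica is the first node whose token is >= the key's
    # token, wrapping around to the start of the ring.
    tokens = [token for token, _ in ring]
    idx = bisect.bisect_left(tokens, md5_token(key)) % len(ring)
    return ring[idx][1]

# Made-up two-node ring for illustration
ring = [(2**125, "10.0.0.1"), (2**127, "10.0.0.2")]
print(pick_endpoint(b"some-row-key", ring))
```

A real smart client also has to track ring changes and replica placement (RF > 1); this sketch only finds the primary replica.]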


 From: Bill de hÓra [mailto:b...@dehora.net]
 Sent: Thursday, February 21, 2013 22:07
 To: user@cassandra.apache.org
 Subject: Re: Using Cassandra for read operations

 In a nutshell -

 - Start with defaults and tune based on small discrete adjustments and leave
 time to see the effect of each change. No-one will know your workload
 better than you and the questions you are asking are workload sensitive.

 - Allow time for tuning and spending time understanding the memory model
 and JVM GC.

 - Be very careful with caches. Leave enough room in the OS for its own disk
 cache.

 - Get an SSD


 Bill


 On 21 Feb 2013, at 19:03, amulya rattan talk2amu...@gmail.com wrote:

  Dear All,
 
  We are currently evaluating Cassandra for an application involving strict
SLAs (service level agreements). We just need one column family with a long
key and a row of approximately 70-80 bytes. We are not concerned about write
performance but are primarily concerned about reads. For our SLAs, a read of
at most 15-20 rows at once (using multi slice) should not take more than 4 ms.
Till now, on a single-node setup, using Cassandra's stress tool, the numbers are
promising. But I am guessing that's because there is no network latency
involved there, and since we set the memtable around 2 GB (4 GB heap), we never
had to get to disk I/O.
 had to get to Disk I/O.
 
  Assuming our nodes having 32GB RAM, a couple of questions regarding
 read:
 
  * To avoid disk I/Os, the best option we thought is to have data in memory.
 Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size?
 Obviously flushing will take a lot of time but would that hurt that node's
 performance big time?
 
  * Cassandra stress tool only gives out average read latency. Is there a way
 to figure out max read-latency for a bunch of read operations?
 
  * How big a row cache can one have? Given that Cassandra provides
 off-heap row caching, on a machine with 32 GB RAM, would it be wise to have a
 10 GB row cache with an 8 GB Java heap? And how big should the corresponding
 key cache be then?
 
  Any response is appreciated.
 
  ~Amulya
 


Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania



Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.


Re: Mutation dropped

2013-02-21 Thread Wei Zhu
Thanks Aaron for the great information as always. I just checked cfhistograms 
and only a handful of read latencies are bigger than 100ms, but in 
proxyhistograms about ten times as many are greater than 100ms. We are using 
QUORUM for reads with RF=3, and I understand the coordinator needs to get the 
digest from other nodes and read-repair on a mismatch etc. But is it normal 
to see the latency from proxyhistograms go beyond 100ms? Is there any way to 
improve that? 
We are tracking metrics on the client side and we see the 95th percentile 
response time averaging 40ms, which is a bit high. Our 50th percentile is 
great, under 3ms. 

Any suggestion is very much appreciated.

Thanks.
-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: Cassandra User user@cassandra.apache.org
Sent: Thursday, February 21, 2013 9:20:49 AM
Subject: Re: Mutation dropped

 What does rpc_timeout control? Only the reads/writes? 
Yes. 

 like data stream,
streaming_socket_timeout_in_ms in the yaml

 merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

 What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the 
message, and the time it takes the coordinator to do its thing. You can look 
at cfhistograms and proxyhistograms to get a better idea of how long a request 
takes in your system.  
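[For reference, these knobs live in cassandra.yaml; a rough sketch of the relevant entries, with shipped defaults shown for illustration only, not as recommendations:

```yaml
# cassandra.yaml -- illustrative defaults, not tuning advice
# Up to 1.1 there is a single request timeout:
rpc_timeout_in_ms: 10000
# From 1.2 the timeout is split per operation, e.g.:
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
# Streaming sockets are governed separately (0 = never time out):
streaming_socket_timeout_in_ms: 0
```
]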
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote:

 What does rpc_timeout control? Only the reads/writes? How about other 
 inter-node communication, like data stream, merkle tree request?  What is the 
 reasonable value for rpc_timeout? The default value of 10 seconds is way too 
 long. What is the side effect if it's set to a really small number, say 20ms?
 
 Thanks.
 -Wei
 
 From: aaron morton aa...@thelastpickle.com
 To: user@cassandra.apache.org 
 Sent: Tuesday, February 19, 2013 7:32 PM
 Subject: Re: Mutation dropped
 
 Does the rpc_timeout not control the client timeout ?
 No it is how long a node will wait for a response from other nodes before 
 raising a TimedOutException if less than CL nodes have responded. 
 Set the client side socket timeout using your preferred client. 
 
 Is there any param which is configurable to control the replication timeout 
 between nodes ?
 There is no such thing.
 rpc_timeout is roughly like that, but it's not right to think about it that 
 way. 
 i.e. if a message to a replica times out and CL nodes have already responded 
 then we are happy to call the request complete. 
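[That coordinator rule can be sketched as follows — a toy model only, with a hypothetical `send` callable standing in for the replica RPC; the real code path is far more involved:

```python
import concurrent.futures

def quorum_write(replicas, cl, send):
    """Send a mutation to every replica; succeed once `cl` acks arrive.

    Stragglers keep running in the background: once CL replicas have
    responded, a late timeout on the remaining replicas no longer fails
    the request.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send, replica) for replica in replicas]
    acks = 0
    for done in concurrent.futures.as_completed(futures):
        if done.exception() is None:
            acks += 1
            if acks >= cl:
                pool.shutdown(wait=False)  # don't wait on stragglers
                return True
    pool.shutdown(wait=False)
    raise TimeoutError("fewer than CL replicas acknowledged the write")
```
]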
 
 Cheers
 
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 Thanks Aaron.
  
 Does the rpc_timeout not control the client timeout ? Is there any param 
 which is configurable to control the replication timeout between nodes ? Or 
 the same param is used to control that since the other node is also like a 
 client ?
  
  
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: 17 February 2013 11:26
 To: user@cassandra.apache.org
 Subject: Re: Mutation dropped
  
 You are hitting the maximum throughput on the cluster. 
  
 The messages are dropped because the node fails to start processing them 
 before rpc_timeout. 
  
 However the request is still a success because the client requested CL was 
 achieved. 
  
 Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
 Both nodes replicate each row, and writes are sent to each replica, so the 
 only thing the client is waiting on is the local node to write to its 
 commit log. 
  
 Testing with (and running in prod) RF 3 and CL QUORUM is a more real world 
 scenario. 
  
 Cheers
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
  
 @aaronmorton
 http://www.thelastpickle.com
  
 On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 
 Hi – Is there a parameter which can be tuned to prevent the mutations from 
 being dropped ? Is this logic correct ?
  
 Node A and B with RF=2, CL =1. Load balanced between the two.
  
 --  Address   Load   Tokens  Owns (effective)  Host ID   
 Rack
 UN  10.x.x.x   746.78 GB  256 100.0%
 dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
 UN  10.x.x.x   880.77 GB  256 100.0%
 95d59054-be99-455f-90d1-f43981d3d778  rack1
  
 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
 falling behind and we see the mutation dropped messages. But there are no 
 failures on the client. Does that mean other node is not able to persist the 
 replicated data ? Is there some