Re: Any experience of 20 node mini-itx cassandra cluster
I know the SSDs are a bit small, but they should be enough for our application. Our test data is 1.6 TB (including replication at RF=3). Can't we use LCS? That would give us more space at the expense of more I/O, but SSDs have plenty of I/O to spare.

Thanks

Jabbar Azam

On 14 April 2013 20:20, Jabbar Azam aja...@gmail.com wrote:

Thanks Aaron.

Jabbar Azam

On 14 April 2013 19:39, aaron morton aa...@thelastpickle.com wrote:

That's better. The SSD size is a bit small, and be warned that you will want to leave 50 GB to 100 GB free to allow room for compaction (using the default size-tiered strategy). On the RAM side you will want to allocate about 4 GB (assuming Cassandra 1.2) for the JVM; the rest can be used by off-heap Cassandra structures. This may not leave much free space for the OS page cache, but the SSDs may help there.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 13/04/2013, at 4:47 PM, Jabbar Azam aja...@gmail.com wrote:

What about using a quad-core Athlon X4 740 3.2 GHz with 8 GB of RAM and 256 GB SSDs? I know it will depend on our workload, but I think it will be better than a dual-core CPU.

Jabbar Azam

On 13 Apr 2013 01:05, Edward Capriolo edlinuxg...@gmail.com wrote:

Dual core is not the greatest; you might run into GC issues before you run out of I/O from your SSD devices. Also, Cassandra has other concurrency settings that are tuned roughly around the number of processors/cores. It is not uncommon to see 4-6 cores of CPU (600% in top) busy dealing with young-gen garbage, managing lots of sockets, and so on.

On Fri, Apr 12, 2013 at 12:02 PM, Jabbar Azam aja...@gmail.com wrote:

That's my guess. My colleague is still looking at CPUs, so I'm hoping he can get quad-core CPUs for the servers.

Thanks

Jabbar Azam

On 12 April 2013 16:48, Colin Blower cblo...@barracuda.com wrote:

If you have not seen it already, check out the Netflix blog post on their performance testing of AWS SSD instances:
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

My guess, based on very little experience, is that you will be CPU bound.

On 04/12/2013 03:05 AM, Jabbar Azam wrote:

Hello,

I'm going to be building a 20 node Cassandra cluster in one datacentre. The spec of the servers will roughly be a dual-core Celeron CPU, 256 GB SSD, 16 GB RAM and two NICs. Has anybody done any performance testing with this setup, or are there any gotchas I should be aware of with regard to the hardware? I do realise the CPU has fairly low computational power, but I'm going to assume the system will be I/O bound, hence the RAM and SSDs.

Thanks

Jabbar Azam

--
*Colin Blower*
*Software Engineer*
Barracuda Networks Inc.
+1 408-342-5576 (o)
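The compaction headroom Aaron mentions can be checked with a quick back-of-envelope sketch. This is illustrative only: the 100% and 10% headroom factors are common rules of thumb for size-tiered and levelled compaction, not exact guarantees, and the per-node figure assumes data is spread evenly across the 20 nodes.

```python
# Back-of-envelope compaction headroom for the 20-node / 256 GB SSD
# cluster discussed above. The 100% (STCS) and 10% (LCS) factors are
# rules of thumb, not exact guarantees.
def required_free_gb(live_data_gb, strategy):
    """Free space to reserve per node for compaction."""
    if strategy == "stcs":        # size-tiered: worst case rewrites everything at once
        return live_data_gb * 1.0
    if strategy == "lcs":         # levelled: many small, fixed-size compactions
        return live_data_gb * 0.10
    raise ValueError(strategy)

disk_gb = 256
live_gb = 1600 / 20               # 1.6 TB cluster-wide (already incl. RF=3) over 20 nodes
for strategy in ("stcs", "lcs"):
    needed = live_gb + required_free_gb(live_gb, strategy)
    print(f"{strategy}: need ~{needed:.0f} GB of the {disk_gb} GB disk")
```

On these numbers either strategy fits on a 256 GB SSD, which matches the point in the thread that the real trade-off here is I/O, not space.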
Re: running cassandra on 8 GB servers
Just a small update here: currently running on one node with a 7 GB heap and no JNA, all defaults except the heap, and everything looks OK.

On Sun, Apr 14, 2013 at 9:10 PM, aaron morton aa...@thelastpickle.com wrote:

Hmmm, what is the recommendation for a 10G network if 1G was 300GB to 500GB? I am guessing I can't do 10 times that, correct? But maybe I could squeak out 600GB to 1TB?

Best thing to do would be to run a test on how long it takes to repair or bootstrap a node. The 300GB to 500GB was just a guideline.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 13/04/2013, at 12:02 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Hmmm, what is the recommendation for a 10G network if 1G was 300GB to 500GB? I am guessing I can't do 10 times that, correct? But maybe I could squeak out 600GB to 1TB?

Thanks,
Dean

On 4/11/13 2:26 PM, aaron morton aa...@thelastpickle.com wrote:

The data will be huge, I am estimating 4-6 TB per server. I know this is not the best, but those are my resources.

You will have a very unhappy time. The general rule of thumb / guideline for an HDD-based system with 1G networking is 300GB to 500GB per node. See previous discussions on this topic for reasons.

ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164) Exception in thread Thread[Thrift:641,5,main] ... INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915 ThriftServer.java (line 116) Stop listening to thrift clients

What was the error? What version are you using? If you have changed any defaults for memory in cassandra-env.sh or cassandra.yaml, revert them. Generally C* will do the right thing and not OOM, unless you are trying to store a lot of data on a node that does not have enough memory.

See this thread for background: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 12/04/2013, at 7:35 AM, Nikolay Mihaylov n...@nmmm.nu wrote:

For one project I will need to run Cassandra on the following dedicated servers: single-CPU XEON, 4 cores, no hyper-threading, 8 GB RAM, 12 TB of locally attached HDDs in some kind of RAID, visible as a single HDD. I can do a cluster of 20-30 such servers, maybe even more. The data will be huge, I am estimating 4-6 TB per server. I know this is not the best, but those are my resources.

Currently I am testing with one such server, except the HDD is 300 GB. Every 15-20 hours I run out of heap memory, e.g. something like:

ERROR [Thrift:641] 2013-04-11 11:25:19,563 CassandraDaemon.java (line 164) Exception in thread Thread[Thrift:641,5,main]
...
INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,915 ThriftServer.java (line 116) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2013-04-11 11:25:39,943 Gossiper.java (line 1077) Announcing shutdown
INFO [StorageServiceShutdownHook] 2013-04-11 11:26:08,613 MessagingService.java (line 682) Waiting for messaging service to quiesce
INFO [ACCEPT-/208.94.232.37] 2013-04-11 11:26:08,655 MessagingService.java (line 888) MessagingService shutting down server thread.
ERROR [Thrift:721] 2013-04-11 11:26:37,709 CustomTThreadPoolServer.java (line 217) Error occurred during processing of message.
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down

Does anyone have advice about better utilization of such servers?

Nick.
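Aaron's suggestion to time a repair or bootstrap can be anticipated with a rough estimate of how long streaming a node's worth of data takes, which is what the 300-500 GB guideline is really about. A sketch; the 50% link-efficiency figure is an assumption for illustration, not a measured value:

```python
# Rough streaming-time estimate for bootstrap/repair of a node.
# Assumes the transfer runs at `efficiency` of the nominal link rate;
# real repairs also pay CPU and compaction costs on top of this.
def transfer_hours(data_tb, link_gbps=1.0, efficiency=0.5):
    """Hours to stream data_tb terabytes over a network link."""
    bytes_total = data_tb * 1e12
    rate_bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / rate_bytes_per_sec / 3600

print(f"0.5 TB node over 1 GbE:  ~{transfer_hours(0.5):.0f} h")
print(f"5 TB node over 1 GbE:    ~{transfer_hours(5):.0f} h")
print(f"5 TB node over 10 GbE:   ~{transfer_hours(5, link_gbps=10):.0f} h")
```

At 4-6 TB per node on 1 GbE, rebuilding a single failed node is roughly a full day of saturated network transfer, which is why the guideline caps data per node rather than total cluster size.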
Re: CQL3 And ReversedTypes Question
Added: https://issues.apache.org/jira/browse/CASSANDRA-5472

thanks,
Gareth

On Sun, Apr 14, 2013 at 2:33 PM, aaron morton aa...@thelastpickle.com wrote:

Bad Request: Type error: org.apache.cassandra.cql3.statements.Selection$SimpleSelector@1e7318 cannot be passed as argument 0 of function dateof of type timeuuid

Is there something I am missing here or should I open a new ticket?

Yes please.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 13/04/2013, at 4:40 PM, Gareth Collins gareth.o.coll...@gmail.com wrote:

OK, trying out 1.2.4. The previous issue seems to be fine, but I am experiencing a new one:

cqlsh:location> create table test_y (message_id timeuuid, name text, PRIMARY KEY (name, message_id));
cqlsh:location> insert into test_y (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_y (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_y (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_y (message_id,name) VALUES (now(),'foo');
cqlsh:location> select dateOf(message_id) from test_y;

 dateOf(message_id)
--------------------------
 2013-04-13 00:33:42-0400
 2013-04-13 00:33:43-0400
 2013-04-13 00:33:43-0400
 2013-04-13 00:33:44-0400

cqlsh:location> create table test_x (message_id timeuuid, name text, PRIMARY KEY (name, message_id)) WITH CLUSTERING ORDER BY (message_id DESC);
cqlsh:location> insert into test_x (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_x (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_x (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_x (message_id,name) VALUES (now(),'foo');
cqlsh:location> insert into test_x (message_id,name) VALUES (now(),'foo');
cqlsh:location> select dateOf(message_id) from test_x;
Bad Request: Type error: org.apache.cassandra.cql3.statements.Selection$SimpleSelector@1e7318 cannot be passed as argument 0 of function dateof of type timeuuid

Is there something I am missing here or should I open a new ticket?

thanks in advance,
Gareth

On Tue, Mar 26, 2013 at 3:30 PM, Gareth Collins gareth.o.coll...@gmail.com wrote:

Added: https://issues.apache.org/jira/browse/CASSANDRA-5386

Thanks very much for the quick answer!

regards,
Gareth

On Tue, Mar 26, 2013 at 3:55 AM, Sylvain Lebresne sylv...@datastax.com wrote:

You aren't missing anything obvious. That's a bug really. Would you mind opening a ticket on https://issues.apache.org/jira/browse/CASSANDRA?

--
Sylvain

On Tue, Mar 26, 2013 at 2:48 AM, Gareth Collins gareth.o.coll...@gmail.com wrote:

Hi,

I created a table with the following structure in cqlsh (Cassandra 1.2.3 - CQL 3):

CREATE TABLE mytable (
  column1 text,
  column2 text,
  messageId timeuuid,
  message blob,
  PRIMARY KEY ((column1, column2), messageId));

I can quite happily add values to this table, e.g.:

insert into client_queue (column1,column2,messageId,message) VALUES ('string1','string2',now(),'ABCCDCC123');

Yet if I decide I want to set the clustering order on messageId DESC:

CREATE TABLE mytable (
  column1 text,
  column2 text,
  messageId timeuuid,
  message blob,
  PRIMARY KEY ((column1, column2), messageId))
WITH CLUSTERING ORDER BY (messageId DESC);

and try to do an insert:

insert into client_queue2 (column1,column2,messageId,message) VALUES ('string1','string2',now(),'ABCCDCC123');

I get the following error:

Bad Request: Type error: cannot assign result of function now (type timeuuid) to messageid (type 'org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.TimeUUIDType)')

I am sure I am missing something obvious here, but I don't understand. Why am I getting an error? What do I need to do to be able to add an entry to this table?

thanks in advance,
Gareth
Re: Problems with shuffle
On 14 April 2013 00:56, Rustam Aliyev rustam.li...@code.az wrote:

Just a follow-up on this issue. Due to the cost of shuffle, we decided not to do it. Recently we added a new node and ended up with a not-well-balanced cluster:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
UN  10.0.1.11  55.21 GB  256      9.4%  7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1

It looks like the new node took an equal number of vnodes from each of the other nodes, which is good. However, it's not clear why it ended up owning about half as much as the other nodes.

I think this is expected behaviour when adding a node to a cluster that has been upgraded to vnodes without shuffling. The old nodes have equally spaced contiguous tokens. The new node will choose 256 random new tokens, which will on average bisect the old ranges. This means each token the new node has will only cover half the range (on average) of the old ones.

However, the thing that really matters is the load, which is surprisingly well balanced at 55 GB. This isn't guaranteed though - it could be about half, or it could be significantly more.

The problem with not doing the shuffle is that the vnode after all the contiguous vnodes of a certain node will be the target for the second replica of *all* the vnodes of that node. E.g. if node A has tokens 10, 20, 30, 40, node B has tokens 50, 60, 70, 80 and node C (the new node) chooses token 45, it will store a replica of all data stored in A's tokens. This is exactly the same reason why tokens in a multi-DC deployment need to be interleaved rather than contiguous.

If shuffle isn't going to work, you could instead decommission each node and then bootstrap it back in. In principle that will copy your data twice as much as required (shuffle is optimal in terms of data transfer), but some implementation details might make it more efficient.

Richard.
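Richard's token example can be checked with a toy ring simulation. This is a sketch assuming SimpleStrategy-style placement (walk the ring clockwise from each token and take the next RF distinct nodes), with RF=2 and the exact tokens from his example:

```python
# Toy ring: node A and B each hold a contiguous block of tokens (as after
# a vnode upgrade without shuffle); new node C picks token 45, landing
# just after A's block.
ring = sorted([(10, 'A'), (20, 'A'), (30, 'A'), (40, 'A'),
               (45, 'C'),                                    # the new node
               (50, 'B'), (60, 'B'), (70, 'B'), (80, 'B')])

def replicas(token_index, rf=2):
    """Walk clockwise from a token, collecting the first rf distinct nodes."""
    owners = []
    i = token_index
    while len(owners) < rf:
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

for i, (token, node) in enumerate(ring):
    if node == 'A':
        print(f"range ending at token {token}: replicas {replicas(i)}")
# Every range owned by A gets its second replica on C, so C carries a
# full extra copy of A's data.
```

This is why contiguous token blocks defeat the point of vnodes for replica placement, exactly as with non-interleaved tokens in a multi-DC layout.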
Re: Extracting data from SSTable files with MapReduce
Hi Aaron,

I did try to upgrade to 1.2 but it did not work out. Maybe too many versions in between. Why do you think the later formats would make this easier?

Jasper

2013/4/14 aaron morton aa...@thelastpickle.com

The SSTable files are in the -f- format from 0.8.10.

If you can upgrade to the latest version it will make things easier. Start a node and use nodetool upgradesstables. The org.apache.cassandra.tools.SSTableExport class provides a blueprint for reading rows from disk.

Hope that helps.

- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 13/04/2013, at 7:58 PM, Jasper K. jasper.knu...@incentro.com wrote:

Hi,

Does anyone have any experience with running a MapReduce job directly against a CF's SSTable files? I have a use case where this seems to be an option: I want to export all data from a CF to a flat file format for statistical analysis.

Some factors that make it (more) doable in my case:
- The Cassandra instance is not 'on-line' (no writes, no reads).
- The .db files were exported from another instance; I have them all in one place now.

The SSTable files are in the -f- format from 0.8.10. Looking at http://wiki.apache.org/cassandra/ArchitectureSSTable it should be possible to write a Hadoop RecordReader for Cassandra row keys. But maybe I am not fully aware of what I am up against.

--
*Jasper*
Re: StatusLogger format?
99% sure it's in bytes.

On Mon, Apr 15, 2013 at 11:25 AM, William Oberman ober...@civicscience.com wrote:

Mainly the "ColumnFamily Memtable ops,data" section. Is data in bytes/KB/MB/etc.?

Example line:

StatusLogger.java (line 116) civicscience.sessions            4963,1799916

Thanks!
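Assuming the data figure is indeed bytes, a line like the one quoted can be pulled apart with a small parser. This is a hypothetical sketch (the column layout is inferred from the example in the thread, not from the StatusLogger source):

```python
import re

# Parse the "ColumnFamily  ops,data" portion of a StatusLogger line,
# treating the second figure as bytes (per the answer above).
def parse_memtable_line(line):
    """Return (column_family, ops, data_bytes)."""
    m = re.match(r"(\S+)\s+(\d+),(\d+)", line)
    return m.group(1), int(m.group(2)), int(m.group(3))

line = "civicscience.sessions            4963,1799916"
cf, ops, data_bytes = parse_memtable_line(line)
print(f"{cf}: {ops} ops, {data_bytes / 1024 / 1024:.2f} MB in memtable")
```

For the example line this reads as roughly 1.7 MB of memtable data for that column family.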
Re: Cassandra 1.2.2 cluster + raspberry
Hi Aaron,

Thank you for your support. It was my mistake indeed. The second node was still configured to compress internode communication. After I fixed it, I was able to start my cluster.

Cheers

On Thu, Apr 11, 2013 at 12:40 PM, aaron morton aa...@thelastpickle.com wrote:

I've already tried to set internode_compression: none in my yaml files.

What version are you on? If you've set internode_compression to none, have you restarted? Can you double-check? The stack trace shows Cassandra deciding that the connection should be compressed.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 12:54 PM, murat migdisoglu murat.migdiso...@gmail.com wrote:

Hi,

I'm trying to set up a Cassandra cluster for some experiments on my Raspberry Pis, but I'm still having trouble joining my nodes to the cluster. I started with two nodes (192.168.2.3 and 192.168.2.7), and when I start Cassandra I see the following exception on the node 192.168.2.7:

ERROR [WRITE-/192.168.2.3] 2013-04-10 02:10:24,524 CassandraDaemon.java (line 132) Exception in thread Thread[WRITE-/192.168.2.3,5,main]
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
        at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
        at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
        at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:322)
        at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)

I suspect that the lack of native Snappy libraries is causing this exception during internode communication. I have not tried to compile native Snappy for ARM yet, but I wonder whether it is possible to use Cassandra without Snappy. I've already tried to set internode_compression: none in my yaml files.

nodetool outputs:

nodetool -h pi1 ring
Datacenter: dc1
===============
Replicas: 1
Address      Rack  Status  State   Load      Owns     Token
192.168.2.7  RAC1  Up      Normal  92.35 KB  100.00%  0

nodetool -h pi2 ring
Datacenter: dc1
===============
Replicas: 1
Address      Rack  Status  State   Load      Owns     Token
192.168.2.3  RAC1  Up      Normal  92.42 KB  100.00%  85070591730234615865843651857942052864

Kind Regards

--
Find a job you enjoy, and you'll never work a day in your life.
Confucius
Re: Vnodes - HUNDRED of MapReduce jobs
Hi cem cayiro...@gmail.com,

In your previous reply, you mentioned that you have a simple solution. Can you share it with us? :) Thanks in advance.

On Sat, Mar 30, 2013 at 2:33 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It should be easy to control the number of map tasks: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you might run into a directory with 10,000 small files, and you do not want 10,000 map tasks. This is what the CombinedInputFormats do: they help you control the number of map tasks a job will generate. For example, imagine I have a multi-tenant cluster. If a job kicks up 10,000 map tasks, all those tasks can starve out other jobs. Being able to say "I only want 4 map tasks per C* node regardless of the number of vnodes" would be a meaningful and useful feature.

On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Yes, but my point is that with 50 map slots you can only be processing 50 at once. So it will take 1000/50 waves of mappers to complete the job.

On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis jbel...@gmail.com wrote:

My point is that if you have over 16MB of data per node, you're going to get thousands of map tasks (that is: hundreds per node) with or without vnodes.

On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Every map reduce task typically has a minimum Xmx of 256MB memory (see mapred.child.java.opts). So if you have a 10 node cluster with 256 vnodes, you will need to spawn 2,560 map tasks to complete a job. And a 10 node Hadoop cluster with 5 map slots a node gives you 50 map slots. Wouldn't it be better if the input format spawned 10 map tasks instead of 2,560?

On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis jbel...@gmail.com wrote:

I still don't see the hole in the following reasoning:

- Input splits are 64k by default. At this size, map processing time dominates job creation.
- Therefore, if job creation time dominates, you have a toy data set (64K * 256 vnodes = 16 MB).

Adding complexity to our input format to improve performance for this niche does not sound like a good idea to me.

On Thu, Mar 28, 2013 at 8:40 AM, cem cayiro...@gmail.com wrote:

Hi Alicia,

The Cassandra input format creates as many mappers as there are vnodes. It is a known issue. You need to lower the number of vnodes :( I have a simple solution for that and am ready to write a patch. Should I create a ticket for it? I don't know the procedure.

Regards,
Cem

On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong lccali...@gmail.com wrote:

Hi All,

I have 3 nodes of Cassandra 1.2.3 and have edited cassandra.yaml for vnodes. When I execute an M/R job, the console shows HUNDREDS of map tasks. May I know, is this normal with vnodes? If yes, it slows the M/R job's completion considerably.

Thanks

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
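The arithmetic both sides are using can be made explicit. A sketch of the split-count math under the thread's assumptions (one input split per vnode, as the 1.2-era Cassandra input format behaves, and a fixed number of map slots per Hadoop node):

```python
import math

# One map task per vnode vs. the cluster's concurrent map capacity.
def map_task_waves(nodes, vnodes_per_node, map_slots_per_node):
    """Return (total splits, waves of mappers needed to drain them)."""
    splits = nodes * vnodes_per_node          # one map task per vnode
    slots = nodes * map_slots_per_node        # tasks that can run at once
    return splits, math.ceil(splits / slots)

splits, waves = map_task_waves(nodes=10, vnodes_per_node=256, map_slots_per_node=5)
print(f"{splits} map tasks, {waves} waves on {10 * 5} slots")
```

With Edward's numbers this gives 2,560 tasks draining through 50 slots, so per-task startup overhead (JVM launch, 256 MB Xmx each) is paid 2,560 times; Jonathan's counterpoint is that at 64K rows per split, only toy data sets are dominated by that overhead.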
Re: Does Memtable resides in Heap?
Thanks Viktor,

So as per the recommendation it's only efficient when the heap size is below 8GB. What about when we have more RAM: can the rest of the RAM be left for the OS to make use of? And what about the bloom filters and index samples: are they off-heap?

Thank you for your response.

Regards,
Jay

On Thu, Apr 11, 2013 at 10:35 PM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote:

Memtables reside in the heap. The write rate impacts GC: more writes mean more frequent and longer ParNew GC pauses.

From: Jay Svc [mailto:jaytechg...@gmail.com]
Sent: Friday, April 12, 2013 01:03
To: user@cassandra.apache.org
Subject: Does Memtable resides in Heap?

Hi Team,

I have got 8GB of RAM, out of which 4GB is allocated to the Java heap. My question is: does the size of the Memtable contribute to the heap size, or is it off-heap? Would a bigger Memtable have an impact on GC and overall memory management?

I am using DSE 3.0 / Cassandra 1.1.9.

Thanks,
Jay

Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer
Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
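For context on the heap-size question, the default heap Cassandra picks can be sketched. This mirrors, to the best of my recollection, the heuristic in 1.x-era cassandra-env.sh (max of "half the RAM capped at 1 GB" and "a quarter of the RAM capped at 8 GB"); check your own cassandra-env.sh, as this is an approximation for illustration:

```python
# Approximate default max-heap heuristic from Cassandra 1.x
# cassandra-env.sh. RAM beyond the heap is left to off-heap structures
# and the OS page cache, which Cassandra relies on heavily for reads.
def default_max_heap_mb(system_ram_mb):
    half = min(system_ram_mb // 2, 1024)
    quarter = min(system_ram_mb // 4, 8192)
    return max(half, quarter)

for ram_mb in (8192, 16384, 65536):
    print(f"{ram_mb // 1024} GB RAM -> {default_max_heap_mb(ram_mb)} MB max heap")
```

On an 8 GB box this yields a 2 GB default heap, which is why on-heap memtable pressure shows up in ParNew pauses so quickly there, and why heaps are capped around 8 GB even on large-RAM machines.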