Re: Heavy writes ok for single node, but failed for cluster
Thank you for your advice. RF=2 is a good workaround. I was using 0.7.4 and have updated to the latest 0.7 branch, which includes the 2554 patch, but it doesn't help. I still get lots of UnavailableExceptions after the following logs:

    INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228) InetAddress /192.168.125.49 is now dead.
    INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609) InetAddress /192.168.125.49 is now UP
    INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java (line 304) Started hinted handoff for endpoint /192.168.125.49
    INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49

It seems that the gossip failure detection is too sensitive. Is there any configuration for it?

2011/4/27 Sylvain Lebresne sylv...@datastax.com

On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen chensheng2...@gmail.com wrote:

I succeeded in inserting 1 billion records into a single-node Cassandra:
bin/stress -d cas01 -o insert -n 10 -c 5 -S 34 -C5 -t 20
Inserts finished in about 14 hours at a speed of 20k/sec. But when I added another node, tests always failed with UnavailableException within an hour.
bin/stress -d cas01,cas02 -o insert -n 10 -c 5 -S 34 -C5 -t 20
Write speed is also 20k/sec because of the bottleneck in the client, so the pressure on each server node should be 50% of the single-node test. Why couldn't they handle it? By default, rf=1, consistency=ONE. Some information that may be helpful:
1. no warn/error in the log file; the cluster is still alive after those exceptions
2. the last logs on both nodes happen to be a compaction-complete info
3. the gossip log shows one node is dead and then up again in 3 seconds

That's your problem. Once marked down (and since rf=1), when an update for cas02 reaches cas01 and cas01 has marked cas02 down, it will throw the UnavailableException. Now, it shouldn't have been marked down, and I suspect this is due to https://issues.apache.org/jira/browse/CASSANDRA-2554 (even though you didn't say which version you're using, I suppose it is a 0.7.*). If you apply this patch or use the current svn 0.7 branch, that should hopefully not happen again. Note that if you had rf=2, the node would still have been wrongly marked down for 3 seconds, but that would have been transparent to the stress test.

4. I set hinted_handoff_enabled: false, but still see lots of handoff logs

What are those saying? -- Sylvain
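A note on the configuration question: Cassandra's failure detector is phi-accrual based, and the conviction threshold (default 8) controls how long a silent node is tolerated before being declared dead. Later releases expose it in cassandra.yaml; whether the 0.7 branch reads it from the yaml is an assumption here, so treat this as a sketch:

    # cassandra.yaml (availability of this knob in 0.7 is an assumption)
    # Higher phi = less sensitive failure detection; the default is 8.
    phi_convict_threshold: 12

Raising it trades slower detection of genuinely dead nodes for fewer spurious down/up flaps.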
Re: Heavy writes ok for single node, but failed for cluster
    n/a 4067
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 82) MessagingService n/a 0,0
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 86) ColumnFamily Memtable ops,data Row cache size/cap Key cache size/cap
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Super1 0,0 0/0 0/20
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Standard1 410835,20744126 0/0 36172/20
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.IndexInfo 0,0 0/0 0/1
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.LocationInfo 1,20 0/0 2/2
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Migrations 0,0 0/0 0/2
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.HintsColumnFamily 206378,8667876 0/0 0/26
    INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Schema 0,0 0/0 0/2
    INFO [ScheduledTasks:1] 2011-04-29 07:10:06,872 GCInspector.java (line 128) GC for ParNew: 219 ms, 589477312 reclaimed leaving 4204479728 used; max is 8466202624
    --
    INFO [GossipTasks:1] 2011-04-29 07:03:43,262 Gossiper.java (line 228) InetAddress /192.168.125.51 is now dead.
    INFO [GossipStage:1] 2011-04-29 07:03:43,270 Gossiper.java (line 609) InetAddress /192.168.125.51 is now UP
    --
    INFO [FlushWriter:1] 2011-04-29 07:03:41,061 Memtable.java (line 157) Writing Memtable-Standard1@595540423(61557029 bytes, 1218975 operations)
    INFO [GossipStage:1] 2011-04-29 07:03:43,538 Gossiper.java (line 609) InetAddress /192.168.125.49 is now UP
    INFO [FlushWriter:1] 2011-04-29 07:03:46,550 Memtable.java (line 172) Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-3991-Data.db (76184729 bytes)

2011/4/28 Jonathan Ellis jbel...@gmail.com

This means a node was too busy with something else to send out its heartbeat. Sometimes this is STW GC. Other times it is a bug (one was fixed for 0.7.6 in https://issues.apache.org/jira/browse/CASSANDRA-2554).
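If STW GC is the suspect, one way to confirm is to log every collection and line the pause times up against the down/up flaps. A sketch with the standard HotSpot flags, added to the JVM options (conf/cassandra-env.sh is assumed to be where the 0.7 packaging keeps them):

    # conf/cassandra-env.sh (path assumed); standard HotSpot GC-logging flags
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"   # log path is arbitrary

A ParNew of ~200 ms like the one logged above should not trip the failure detector on its own; a multi-second CMS or promotion-failure pause would.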
Heavy writes ok for single node, but failed for cluster
I succeeded in inserting 1 billion records into a single-node Cassandra:

    bin/stress -d cas01 -o insert -n 10 -c 5 -S 34 -C5 -t 20

Inserts finished in about 14 hours at a speed of 20k/sec. But when I added another node, tests always failed with UnavailableException within an hour.

    bin/stress -d cas01,cas02 -o insert -n 10 -c 5 -S 34 -C5 -t 20

Write speed is also 20k/sec because of the bottleneck in the client, so the pressure on each server node should be 50% of the single-node test. Why couldn't they handle it? By default, rf=1, consistency=ONE.

Some information that may be helpful:
1. no warn/error in the log file; the cluster is still alive after those exceptions
2. the last logs on both nodes happen to be a compaction-complete info
3. the gossip log shows one node is dead and then up again in 3 seconds
4. I set hinted_handoff_enabled: false, but still see lots of handoff logs
Re: Test idea on cassandra
The stress tools in the contrib directory use multiple threads/processes.

2011/4/7 Mengchen Yu yum...@umail.iu.edu

I'm trying to simulate a multi-user scenario. The reason why I want to use MPJ is to create different processes acting like individual users. Does anyone have an idea how to do this cleanly? Sorry for duplicated mails, if any.
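If you would rather roll your own than use contrib/stress, a minimal multi-process skeleton with Python's standard multiprocessing module is sketched below; insert_one is a stub standing in for whatever client call you use (pycassa, raw Thrift, etc.). Note that stress.py itself uses multiprocessing, as its Inserter-NN process names elsewhere in this digest show.

    import multiprocessing

    def insert_one(user_id, op):
        # Stub: replace with a real insert (e.g. batch_mutate via your client).
        pass

    def user(user_id, num_ops):
        # Each process acts as one independent user with its own connection.
        for op in range(num_ops):
            insert_one(user_id, op)

    if __name__ == '__main__':
        users = [multiprocessing.Process(target=user, args=(u, 10000))
                 for u in range(8)]  # 8 simulated users
        for p in users:
            p.start()
        for p in users:
            p.join()

Processes rather than threads also sidestep the CPython GIL.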
Re: Compaction threshold does not save with nodetool
Thanks. I think this feature could be clarified on the wiki.

2011/4/6 Dan Hendry dan.hendry.j...@gmail.com

There are two layers of settings: the default, cluster-wide settings that are part of the schema and exposed/modifiable via the cli, and individual settings exposed/modifiable via JMX and nodetool. Using nodetool, you are only modifying the in-memory settings for a single node; changes to those settings are not persisted or reflected on other nodes. If you want a particular setting to be persisted across a restart (and applied to other nodes when they restart), you have to use the cli and the 'update column family <name> with min_compaction_threshold=<min> and max_compaction_threshold=<max>' command.

Dan
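Concretely, the persistent route Dan describes looks like this in cassandra-cli, using the same Keyspace1/Standard1 column family from the nodetool commands in this thread (the values are illustrative):

    update column family Standard1 with min_compaction_threshold=4 and max_compaction_threshold=32;

Schema changes made this way propagate to the rest of the cluster as migrations, which is why they survive restarts while nodetool's per-node, in-memory change does not.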
Re: Stress tests failed with secondary index
Thank you Aaron. It does not seem to be an overload problem. I have 16 cores and 48G RAM on the single node, and I reduced the concurrent threads to 1. Still, it just suddenly dies of a timeout, while the CPU, RAM, and disk load are below 10% and the write latency is about 0.5ms for the past 10 minutes, which is really fast. No logs of dropped messages are found.

2011/4/7 aaron morton aa...@thelastpickle.com

TimedOutException means that fewer than CL nodes responded to the coordinator before the rpc_timeout. So it was overloaded, which makes sense when you say it only happens with secondary indexes. Consider things like:
- reducing the throughput
- reducing the number of clients
- ensuring the clients are connecting to all nodes in the cluster
You will probably find some logs about dropped messages on some nodes.

Aaron

On 6 Apr 2011, at 20:39, Sheng Chen wrote:

I used the py_stress module to insert 10m test data with a secondary index, and got the following exceptions.

    # python stress.py -d xxx -o insert -n 1000 -c 5 -s 34 -C 5 -x keys
    total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
    265322,26532,26541,0.00186140829433,10
    630300,36497,36502,0.00129331431204,20
    986781,35648,35640,0.0013310986218,30
    1332190,34540,34534,0.00135942295893,40
    1473578,14138,14138,0.00142941070007,50

    Process Inserter-38:
    Traceback (most recent call last):
      File "/usr/lib64/python2.4/site-packages/multiprocessing/process.py", line 237, in _bootstrap
        self.run()
      File "stress.py", line 242, in run
        self.cclient.batch_mutate(cfmap, consistency)
      File "/root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py", line 784, in batch_mutate
        self.recv_batch_mutate()
      File "/root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py", line 810, in recv_batch_mutate
        raise result.te
    TimedOutException: TimedOutException(args=())

Tests without the secondary index are OK at about 40k ops/sec. There is a `GC for ParNew` of about 200ms taking place every second. Does it matter? The same GC of about 400ms happens every 2 seconds, which does not hurt the inserts without the secondary index. Thanks in advance for any advice.

Sheng
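One cheap thing to watch while such a run is going is the thread-pool backlog on the node:

    nodetool -h localhost tpstats

This is just a suggestion for narrowing it down: if index maintenance is what stalls writes, pending counts on the write- and flush-related stages should climb just before the timeout; if everything stays near zero right up to the failure, that points away from plain overload, which would match what you are seeing.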
Compaction threshold does not save with nodetool
Cassandra 0.7.4:

    # nodetool -h localhost getcompactionthreshold Keyspace1 Standard1
    min=4 max=32
    # nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 0 0
    # nodetool -h localhost getcompactionthreshold Keyspace1 Standard1
    min=0 max=0

Now the thresholds have changed on the JMX panel, but in the cassandra-cli `show keyspaces`, it is still 4/32. After I restart Cassandra, the threshold shown by nodetool is 4/32 again; the setting is lost. I tried nodetool flush to save the change, but it doesn't work.
Re: Endless minor compactions after heavy inserts
I think if I can keep each SSTable file at a proper size, the hot data/index files may be able to fit into memory, at least on some occasions.

In my use case, I want to use Cassandra for storage of a large amount of log data. There will be multiple nodes, and each node has 10 x 2TB disks to hold as much data as possible, ideally 20TB (about 100 billion rows) per node. Read operations will be far rarer than writes, and a read latency within 1 second is acceptable. Is it possible? Do you have advice on this design?

Thank you.
Sheng

2011/4/3 aaron morton aa...@thelastpickle.com

With only one data file, your reads would use the least amount of IO to find the data. Most people have multiple nodes and probably fewer disks, so each node may have a TB or two of data. How much capacity do your 10 disks give? Will you be running multiple nodes in production?

Aaron
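A quick sanity check on those numbers: 20TB over roughly 100 billion rows averages about 200 bytes per row. At that row count the per-key overheads dominate the planning; every key costs bloom-filter bits in memory and an index entry in each SSTable it appears in, so per-node memory and index I/O scale with the number of rows rather than with raw terabytes. That, rather than disk capacity, is usually the first limit a design like this hits.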
Re: Endless minor compactions after heavy inserts
Thank you very much.

The major compaction will merge everything into one big file, which would be very large. Is there any way to control the number or size of the files created by major compaction? Or is there a recommended number or size of files for Cassandra to handle?

Thanks. I see the trigger of my minor compaction is OperationsInMillions. It is a number of operations in total, which I had thought was per second.

Cheers,
Sheng

2011/4/1 aaron morton aa...@thelastpickle.com

If you are doing some sort of bulk load, you can disable minor compactions by setting the min_compaction_threshold and max_compaction_threshold to 0. Then, once your insert is complete, run a major compaction via nodetool before turning minor compaction back on. You can also reduce the compaction thread's priority; see compaction_thread_priority in the yaml file.

The memtable will be flushed when either the MB or the ops threshold is triggered. If you are seeing a lot of memtables smaller than the MB threshold, then the ops threshold is probably being triggered. Look for a log message at INFO level starting with "Enqueuing flush of Memtable"; it will tell you how many bytes and ops the memtable had when it was flushed. Try increasing the ops threshold and see what happens.

Your change to the compaction threshold may not have had an effect because the compaction process was already running.

AFAIK the best way to get the best out of your 10 disks will be to use a dedicated mirror for the commit log and a stripe set for the data.

Hope that helps.
Aaron
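For the concrete change, the flush thresholds are per-CF schema settings in 0.7, so they take the same 'update column family' route as the compaction thresholds. The exact attribute names in the 0.7 CLI are an assumption here, so check the CLI help before relying on this sketch:

    update column family Standard1 with memtable_operations=14 and memtable_throughput=1499;

(memtable_operations is in millions, so 14 would double the 7.0 read from JMX.)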
Re: newbie question: how do I know the total number of rows of a cf?
I just found an estimateKeys() method on the ColumnFamilyStoreMBean. Is there any documentation on how it works?

Sheng

2011/3/28 Sheng Chen chensheng2...@gmail.com

Hi all, I want to know how many records I am holding in Cassandra, just like count(*) in SQL. What can I do? Thank you.

Sheng
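On how it works: my understanding, worth verifying against the source, is that estimateKeys() derives its number from each SSTable's index samples plus the memtable, so it is fast but approximate and can double-count rows that live in several SSTables. For an exact count you still have to iterate, e.g. with pycassa (API names assume pycassa 1.x; rows deleted but not yet compacted away come back empty, hence the emptiness check):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = ColumnFamily(pool, 'Standard1')
    # Fetch at most one column per row; count only rows that still hold data.
    count = sum(1 for key, cols in cf.get_range(column_count=1) if cols)
    print count

This walks the entire column family, so on large data sets it is a long-running job rather than a query.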
Endless minor compactions after heavy inserts
I've got a single node of Cassandra 0.7.4, and I used the Java stress tool to insert about 100 million records. The inserts took about 6 hours (45k inserts/sec), but the subsequent minor compactions have lasted for 2 days and the pending compaction jobs are still increasing.

From jconsole I can read MemtableThroughputInMB=1499 and MemtableOperationsInMillions=7.0, but in my data directory I get hundreds of 438MB data files, which must be what drives the minor compactions. I tried to set the compaction threshold by nodetool, but it didn't seem to take effect (no change in pending compaction tasks), and after restarting the node my setting is lost.

I want to distribute the read load across my disks (10 disks in xfs, LVM), so I don't want to do a major compaction. What can I do to keep the SSTable files at a reasonable size, or to make the minor compactions faster?

Thank you in advance.
Sheng
Compaction doubles disk space
I used the 'nodetool compact' command to start a compaction. I can understand that extra disk space is required during the compaction, but after the compaction the extra space is not released.

Before compaction:
    SSTable count: 10
    Space used (live): 19G
    Space used (total): 21G

After compaction:
    SSTable count: 1
    Space used (live): 19G
    Space used (total): 42G

BTW, given that compaction requires double the disk space, does it mean that I should never fill more than half of my total disk space? E.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data at all.
Re: Compaction doubles disk space
Following a previous thread on the same topic, I forced a GC and the extra space was released. What about my second question?
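On the second question, a rough worked example, under the usual assumption that compacting a column family temporarily needs up to its live size again: with 505GB live on a 1TB disk there is only about 495GB free, so a full major compaction of that CF can no longer run; and since tombstones are purged only by compaction, deletes will not reclaim space either until smaller compactions can squeeze through. Staying under roughly 50% disk usage per node is the conservative rule of thumb for exactly this reason; minor compactions need proportionally less headroom because they rewrite only a few SSTables at a time.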
Re: Compaction doubles disk space
Yes. I think at least we could remove the tombstones within each SSTable first, and then do the merge.

2011/3/29 Karl Hiramoto k...@hiramoto.org

Would it be possible to improve the current compaction disk-space issue by compacting only a few SSTables at a time, then immediately deleting the old ones? Looking at the logs, it seems like deletions of old SSTables are taking longer than necessary.

--
Karl
Re: stress.py bug?
I am just wondering: why do the stress test tools (Python, Java) need more threads? Is the bottleneck for a single thread in the client, or in the server? Thanks.

Sean

2011/3/22 Ryan King r...@twitter.com

On Mon, Mar 21, 2011 at 4:02 AM, pob peterob...@gmail.com wrote:

Hi, I'm inserting data from a client node with stress.py into a cluster of 6 nodes. They are all on a 1Gbps network; the maximum real throughput of the network is 930Mbps (measured).

    python stress.py -c 1 -S 17 -d{6nodes} -l3 -e QUORUM --operation=insert -i 1 -n 50 -t100

The problem is that stress.py shows it doing avg ~750 ops/sec, which would be 127MB/s, but the real throughput of the network is ~116MB/s.

You may need more concurrency in order to saturate your network.

-ryan
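A back-of-the-envelope way to see why: with synchronous requests, client throughput is roughly the number of requests in flight divided by per-request latency (Little's law). A single thread over a 1 ms round trip tops out near 1,000 ops/sec no matter how much bandwidth the network has, so the limit is usually request latency as seen by the client rather than server capacity; raising the thread or process count raises the in-flight count until the network or the server actually saturates.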