Re: 0.8.1 Vs 1.0.7
When creating a new CF, defaults are now in fact compression enabled. On Sat, Mar 17, 2012 at 5:50 AM, R. Verlangen ro...@us2.nl wrote: Check your log for messages about rebuilding indices: that might grow your dataset some. One thing is for sure: the data import removed all the crap that lingered in the 0.8.1 cluster (duplicates, tombstones etc). The decrease is fairly dramatic but not illogical at all. 2012/3/16 Jeremiah Jordan jeremiah.jor...@morningstar.com I would guess more aggressive compaction settings; did you update rows or insert some twice? If you run major compaction a couple of times on the 0.8.1 cluster, does the data size get smaller? You can use the describe command to check if compression got turned on. -Jeremiah -- *From:* Ravikumar Govindarajan [ravikumar.govindara...@gmail.com] *Sent:* Thursday, March 15, 2012 4:41 AM *To:* user@cassandra.apache.org *Subject:* 0.8.1 Vs 1.0.7 Hi, I ran some data import tests for Cassandra 0.8.1 and 1.0.7. The results were a little bit surprising.

0.8.1, SimpleStrategy, Rep_Factor=3, QUORUM writes, RP, SimpleSnitch
XXX.XXX.XXX.A datacenter1 rack1 Up Normal 140.61 GB 12.50%
XXX.XXX.XXX.B datacenter1 rack1 Up Normal 139.92 GB 12.50%
XXX.XXX.XXX.C datacenter1 rack1 Up Normal 138.81 GB 12.50%
XXX.XXX.XXX.D datacenter1 rack1 Up Normal 139.78 GB 12.50%
XXX.XXX.XXX.E datacenter1 rack1 Up Normal 137.44 GB 12.50%
XXX.XXX.XXX.F datacenter1 rack1 Up Normal 138.48 GB 12.50%
XXX.XXX.XXX.G datacenter1 rack1 Up Normal 140.52 GB 12.50%
XXX.XXX.XXX.H datacenter1 rack1 Up Normal 145.24 GB 12.50%

1.0.7, NTS, Rep_Factor {DC1:3, DC2:2}, LOCAL_QUORUM writes, RP [DC2 machines yet to join ring], PropertyFileSnitch
XXX.XXX.XXX.A DC1 RAC1 Up Normal 48.72 GB 12.50%
XXX.XXX.XXX.B DC1 RAC1 Up Normal 51.23 GB 12.50%
XXX.XXX.XXX.C DC1 RAC1 Up Normal 52.4 GB 12.50%
XXX.XXX.XXX.D DC1 RAC1 Up Normal 49.64 GB 12.50%
XXX.XXX.XXX.E DC1 RAC1 Up Normal 48.5 GB 12.50%
XXX.XXX.XXX.F DC1 RAC1 Up Normal 53.38 GB 12.50%
XXX.XXX.XXX.G DC1 RAC1 Up Normal 51.11 GB 12.50%
XXX.XXX.XXX.H DC1 RAC1 Up Normal 53.36 GB 12.50%

There seems to be a 3x saving in size for the same dataset running 1.0.7. I have not enabled compression for any of the CFs. Will it be enabled by default when creating a new CF in 1.0.7? cassandra.yaml is also mostly identical. Thanks and Regards, Ravi
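For reference, Jeremiah's describe check looks roughly like the following in cassandra-cli; a hedged sketch, with keyspace and CF names as placeholders and the exact output wording varying by 1.0.x release:

$ cassandra-cli -h localhost
[default@unknown] use MyKeyspace;
[default@MyKeyspace] describe;
...
    ColumnFamily: MyCF
      ...
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

If the Compression Options block is absent (or shows no sstable_compression), the CF is uncompressed and the 3x saving has to come from somewhere else, e.g. the removed duplicates and tombstones mentioned above.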
Re: Cassandra cluster HW spec (commit log directory vs data file directory)
On Sun, Oct 30, 2011 at 3:34 PM, Sorin Julean sorin.jul...@gmail.com wrote: Hey Chris, Thanks for sharing all the info. I have a few questions: 1. What are you doing with so much memory :) ? How much of it do you allocate for the heap? Max heap is 12 GB. We use the rest for cache: we run memcache on each node and allocate the remaining memory to that. 2. What's your network speed? Do you use trunks? Do you have a dedicated VLAN for gossip/store traffic? No dedicated VLAN for gossip. We run at 2 Gb/s. We have bonded NICs. Cheers, Sorin On Sun, Oct 30, 2011 at 5:00 AM, Chris Goffinet c...@chrisgoffinet.com wrote: RE: RAID0 Recommendation Cassandra supports multiple data file directories. Because we do compactions, it's just much easier to deal with one data file directory that is striped across all disks as a single volume (RAID0). There are other ways to accomplish this, though. At Twitter we use software RAID (RAID0 and RAID10). We own the physical hardware and have found that even with hardware RAID available, software RAID in Linux is actually faster. The reason is: http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10 We have found that using far-copies is much faster than near-copies. We set the I/O scheduler to noop at the moment. We might move back to CFQ with more tuning in the future. We use RAID10 for cases where we need better disk performance if we are hitting the disk often, sacrificing storage. We initially thought RAID0 would be faster than RAID10 until we found out about the near vs. far layouts. RE: Hardware This is going to depend on how good your automated infrastructure is, but we chose the path of finding the cheapest servers we could get from Dell/HP/etc.: 8/12 cores, 72 GB of memory per node, 2TB/3TB 2.5" drives. We are in the process of making changes to our servers; I'll report back when we have more details to share. I wouldn't recommend 75 CFs. It could work, but it just seems too complex. Another recommendation for clusters: always go big. You will be thankful for this in the future. Even if you can do this on 3-6 nodes, go much larger for future expansion. If you own your hardware and racks, I recommend making sure to size out the rack diversity and the number of nodes per rack. Also take the replication factor into account when doing this. With RF=3 you should have a minimum of 3 racks, and the number of nodes per rack should be divisible by the replication factor. This has worked out pretty well for us. Our biggest problems today are adding 100s of nodes to existing clusters at once. I'm not sure how many other companies are having this problem, but it's certainly on our radar to improve, if you get to that point :) On Tue, Oct 25, 2011 at 5:23 AM, Alexandru Sicoe adsi...@gmail.com wrote: Hi everyone, I am currently in the process of writing a hardware proposal for a Cassandra cluster for storing a lot of monitoring time series data. My workload is write-intensive and my data set is extremely varied in the types of variables and their insertion rates (I will have to handle on the order of 2 million variables coming in, each at very different rates - the majority will come at very low rates, but many will come at higher, constant rates, and a few come in with huge spikes in rate). These variables correspond to all basic C++ types and arrays of these types. The highest insertion rates are received for the basic types, of which U32 variables seem to be the most prevalent (e.g.
I recorded 2 million U32 vars inserted in 8 minutes of operation, while 600,000 doubles and 170,000 strings were inserted during the same time. Note this measurement was only for a subset of the total data currently taken in). At the moment I am partitioning the data in Cassandra into 75 CFs (each CF corresponds to a logical partitioning of the set of variables mentioned before - but this partitioning is not related to the amount of data or the rates... it is somewhat random). These 75 CFs account for ~1 million of the variables I need to store. I have a 3-node Cassandra 0.8.5 cluster (each node is a 4-real-core machine with 4 GB RAM, with the commit log directory and data file directory split between two RAID arrays with HDDs). I can handle the load in this configuration, but the average CPU usage of the Cassandra nodes is slightly above 50%. As I will need to add 12 more CFs (corresponding to another ~1 million variables) plus potentially other data later, it is clear that I need better hardware (also for the retrieval part). I am looking at Dell servers (PowerEdge etc.) Questions: 1. Is anyone using Dell HW for their Cassandra clusters? How do they behave? Anybody care to share their configurations or tips for buying, what to avoid, etc.? 2. Obviously I am going to keep to the advice on http://wiki.apache.org/cassandra/CassandraHardware and split the commitlog and data onto separate disks. I was going to use SSD for commitlog
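For anyone wanting to try the layout Chris describes, here is a minimal sketch of a Linux md RAID10 far-layout array; the device names and the filesystem choice are assumptions, not part of the original thread:

# four-disk software RAID10 with far copies ('f2'), per the Wikipedia link above
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.xfs /dev/md0                              # filesystem is an assumption
echo noop > /sys/block/sdb/queue/scheduler     # noop scheduler, as mentioned (repeat per disk)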
Re: Cassandra cluster HW spec (commit log directory vs data file directory)
No. We built a pluggable cache provider for memcache. On Sun, Oct 30, 2011 at 7:31 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Sun, Oct 30, 2011 at 6:53 PM, Chris Goffinet c...@chrisgoffinet.com wrote: On Sun, Oct 30, 2011 at 3:34 PM, Sorin Julean sorin.jul...@gmail.com wrote: Hey Chris, Thanks for sharing all the info. I have a few questions: 1. What are you doing with so much memory :) ? How much of it do you allocate for the heap? Max heap is 12 GB. We use the rest for cache: we run memcache on each node and allocate the remaining memory to that. Is this using the off-heap cache of Cassandra? 2. What's your network speed? Do you use trunks? Do you have a dedicated VLAN for gossip/store traffic? No dedicated VLAN for gossip. We run at 2 Gb/s. We have bonded NICs. Cheers, Sorin On Sun, Oct 30, 2011 at 5:00 AM, Chris Goffinet c...@chrisgoffinet.com wrote: RE: RAID0 Recommendation Cassandra supports multiple data file directories. Because we do compactions, it's just much easier to deal with one data file directory that is striped across all disks as a single volume (RAID0). There are other ways to accomplish this, though. At Twitter we use software RAID (RAID0 and RAID10). We own the physical hardware and have found that even with hardware RAID available, software RAID in Linux is actually faster. The reason is: http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10 We have found that using far-copies is much faster than near-copies. We set the I/O scheduler to noop at the moment. We might move back to CFQ with more tuning in the future. We use RAID10 for cases where we need better disk performance if we are hitting the disk often, sacrificing storage. We initially thought RAID0 would be faster than RAID10 until we found out about the near vs. far layouts. RE: Hardware This is going to depend on how good your automated infrastructure is, but we chose the path of finding the cheapest servers we could get from Dell/HP/etc.: 8/12 cores, 72 GB of memory per node, 2TB/3TB 2.5" drives. We are in the process of making changes to our servers; I'll report back when we have more details to share. I wouldn't recommend 75 CFs. It could work, but it just seems too complex. Another recommendation for clusters: always go big. You will be thankful for this in the future. Even if you can do this on 3-6 nodes, go much larger for future expansion. If you own your hardware and racks, I recommend making sure to size out the rack diversity and the number of nodes per rack. Also take the replication factor into account when doing this. With RF=3 you should have a minimum of 3 racks, and the number of nodes per rack should be divisible by the replication factor. This has worked out pretty well for us. Our biggest problems today are adding 100s of nodes to existing clusters at once. I'm not sure how many other companies are having this problem, but it's certainly on our radar to improve, if you get to that point :) On Tue, Oct 25, 2011 at 5:23 AM, Alexandru Sicoe adsi...@gmail.com wrote: Hi everyone, I am currently in the process of writing a hardware proposal for a Cassandra cluster for storing a lot of monitoring time series data. My workload is write-intensive and my data set is extremely varied in the types of variables and their insertion rates (I will have to handle on the order of 2 million variables coming in, each at very different rates - the majority will come at very low rates, but many will come at higher, constant rates, and a few come in with huge spikes in rate). These variables correspond to all basic C++ types and arrays of these types. The highest insertion rates are received for the basic types, of which U32 variables seem to be the most prevalent (e.g. I recorded 2 million U32 vars inserted in 8 minutes of operation, while 600,000 doubles and 170,000 strings were inserted during the same time. Note this measurement was only for a subset of the total data currently taken in). At the moment I am partitioning the data in Cassandra into 75 CFs (each CF corresponds to a logical partitioning of the set of variables mentioned before - but this partitioning is not related to the amount of data or the rates... it is somewhat random). These 75 CFs account for ~1 million of the variables I need to store. I have a 3-node Cassandra 0.8.5 cluster (each node is a 4-real-core machine with 4 GB RAM, with the commit log directory and data file directory split between two RAID arrays with HDDs). I can handle the load in this configuration, but the average CPU usage of the Cassandra nodes is slightly above 50%. As I will need to add 12 more CFs (corresponding to another ~1 million variables) plus potentially other data later, it is clear that I need better hardware (also for the retrieval part). I am looking at Dell servers (PowerEdge etc
Re: Cassandra cluster HW spec (commit log directory vs data file directory)
RE: RAID0 Recommendation Cassandra supports multiple data file directories. Because we do compactions, it's just much easier to deal with one data file directory that is striped across all disks as a single volume (RAID0). There are other ways to accomplish this, though. At Twitter we use software RAID (RAID0 and RAID10). We own the physical hardware and have found that even with hardware RAID available, software RAID in Linux is actually faster. The reason is: http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10 We have found that using far-copies is much faster than near-copies. We set the I/O scheduler to noop at the moment. We might move back to CFQ with more tuning in the future. We use RAID10 for cases where we need better disk performance if we are hitting the disk often, sacrificing storage. We initially thought RAID0 would be faster than RAID10 until we found out about the near vs. far layouts. RE: Hardware This is going to depend on how good your automated infrastructure is, but we chose the path of finding the cheapest servers we could get from Dell/HP/etc.: 8/12 cores, 72 GB of memory per node, 2TB/3TB 2.5" drives. We are in the process of making changes to our servers; I'll report back when we have more details to share. I wouldn't recommend 75 CFs. It could work, but it just seems too complex. Another recommendation for clusters: always go big. You will be thankful for this in the future. Even if you can do this on 3-6 nodes, go much larger for future expansion. If you own your hardware and racks, I recommend making sure to size out the rack diversity and the number of nodes per rack. Also take the replication factor into account when doing this. With RF=3 you should have a minimum of 3 racks, and the number of nodes per rack should be divisible by the replication factor. This has worked out pretty well for us. Our biggest problems today are adding 100s of nodes to existing clusters at once. I'm not sure how many other companies are having this problem, but it's certainly on our radar to improve, if you get to that point :) On Tue, Oct 25, 2011 at 5:23 AM, Alexandru Sicoe adsi...@gmail.com wrote: Hi everyone, I am currently in the process of writing a hardware proposal for a Cassandra cluster for storing a lot of monitoring time series data. My workload is write-intensive and my data set is extremely varied in the types of variables and their insertion rates (I will have to handle on the order of 2 million variables coming in, each at very different rates - the majority will come at very low rates, but many will come at higher, constant rates, and a few come in with huge spikes in rate). These variables correspond to all basic C++ types and arrays of these types. The highest insertion rates are received for the basic types, of which U32 variables seem to be the most prevalent (e.g. I recorded 2 million U32 vars inserted in 8 minutes of operation, while 600,000 doubles and 170,000 strings were inserted during the same time. Note this measurement was only for a subset of the total data currently taken in). At the moment I am partitioning the data in Cassandra into 75 CFs (each CF corresponds to a logical partitioning of the set of variables mentioned before - but this partitioning is not related to the amount of data or the rates... it is somewhat random). These 75 CFs account for ~1 million of the variables I need to store. I have a 3-node Cassandra 0.8.5 cluster (each node is a 4-real-core machine with 4 GB RAM, with the commit log directory and data file directory split between two RAID arrays with HDDs). I can handle the load in this configuration, but the average CPU usage of the Cassandra nodes is slightly above 50%. As I will need to add 12 more CFs (corresponding to another ~1 million variables) plus potentially other data later, it is clear that I need better hardware (also for the retrieval part). I am looking at Dell servers (PowerEdge etc.) Questions: 1. Is anyone using Dell HW for their Cassandra clusters? How do they behave? Anybody care to share their configurations or tips for buying, what to avoid, etc.? 2. Obviously I am going to keep to the advice on http://wiki.apache.org/cassandra/CassandraHardware and split the commitlog and data onto separate disks. I was going to use SSD for the commitlog, but then did some more research and found out that it doesn't make sense to use SSDs for sequential appends, because they won't have a performance advantage over rotational media. So I am going to use a rotational disk for the commit log and an SSD for data. Does this make sense? 3. What's the best way to find out how big my commitlog disk and my data disk have to be? The Cassandra hardware page says the commitlog disk shouldn't be big, but I still need to choose a size! 4. I also noticed a RAID 0 configuration is recommended for the data file directory. Can anyone explain why? Sorry for the huge email. Cheers, Alex
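Hedged sketch of the cassandra.yaml keys behind question 2 and the split named in this thread's subject; the paths are placeholders:

# commit log on its own spindle; data on the striped (RAID0) volume
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /mnt/raid0/cassandra/data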
Re: Size calculations for off heap caching
My best advice on this is: insert a bit of data into the tree, and then do a heap dump to calculate the extra overhead. It's unfortunately more than you would like, from our testing. On Tue, Oct 18, 2011 at 8:14 PM, Todd Nine t...@spidertracks.com wrote: Hi guys, We've just built a K-tree implementation in Cassandra. We're going for relatively wide nodes in our tree to minimize our tree depth and improve our search times. Most of the links between parent/child nodes are longs. We're ready to start tuning the size of K so that the most accessed paths in our tree will be row-cached in Cassandra. We're on Cassandra 0.8.7, and I can't find any documentation regarding the actual memory size of the off-heap row cache. Can someone explain how much additional space will be used when caching rows? For instance, if our links between nodes are all Longs and we have 100 children (cols), that gives us 900 bytes with a 0-byte placeholder value. What is the additional overhead when using the off-heap storage? Thanks, Todd
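Chris's measure-don't-guess approach, sketched with stock JDK tooling; the pgrep pattern is a hypothetical way to find the pid:

# insert representative rows first, then dump the heap and inspect it
# (e.g. with jhat or Eclipse MAT) to see the real per-row overhead
PID=$(pgrep -f CassandraDaemon)
jmap -dump:format=b,file=/tmp/cassandra-heap.hprof $PID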
Re: Does anybody know why Twitter stop integrate Cassandra as Twitter store?
At the time of that project, there weren't enough resources and no dedicated team. Since then we changed that (based on the presentation I gave). We decided to focus on other areas and newer projects. We spent a lot of time with the community improving failure conditions, performance, etc. We chose to focus on projects with lower-tier SLAs at first, and work our way up. Now we have Cassandra running on our highest-tier SLA (Cuckoo -- our monitoring and alerting infrastructure for Twitter). On Tue, Oct 4, 2011 at 1:37 PM, aaron morton aa...@thelastpickle.com wrote: If you want to see just how much Twitter uses Cassandra, watch Chris Goffinet's awesome presentation at this year's Cassandra SF meeting: http://www.datastax.com/events/cassandrasf2011/presentations Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 5/10/2011, at 8:16 AM, Paul Loy wrote: yup, and again it gives a perfectly adequate reason: Twitter is busy fighting other fires http://engineering.twitter.com/2010/06/perfect-stormof-whales.html and they don't have the time to retrofit something that is (more or less) working, namely their MySQL-based tweet storage, with a completely new technology based on Cassandra. If I was in charge of platform at Twitter I'd have probably made the same call. If it ain't broke, don't spend $100ks fixing it. Push out new features that help keep you ahead of the competition. I really don't see anything in the closet here. It's just a simple resource management issue. On Tue, Oct 4, 2011 at 11:43 AM, ruslan usifov ruslan.usi...@gmail.com wrote: Hello 2011/10/4 Paul Loy ketera...@gmail.com Did you read the article you posted? Yes. *We believe that this isn't the time to make a large scale migration to a new technology.* We will focus our Cassandra work on new projects that we wouldn't be able to ship without a large-scale data store. There was a big buzz on the net that Twitter would migrate their tweets to Cassandra, but then they rejected those plans. This explanation sounds very vague. Why did they change their minds? I found only one article about this: http://highscalability.com/blog/2010/7/11/so-why-is-twitter-really-not-using-cassandra-to-store-tweets.html -- - Paul Loy p...@keteracel.com http://uk.linkedin.com/in/paulloy
Re: cassandra performance degrades after 12 hours
Most likely what could be happening is that you are running single-threaded compaction. Look at cassandra.yaml for how to enable multi-threaded compaction. As more data comes into the system, bigger files get created during compaction. You could be in a situation where you might be compacting at a higher bucket level N, and compactions build up at lower buckets. Run nodetool -host localhost compactionstats to get an idea of what's going on. On Mon, Oct 3, 2011 at 12:05 PM, Mohit Anchlia mohitanch...@gmail.com wrote: In order to understand what's going on you might want to first do just the write test, look at the results, then do just the read tests, and then do both read/write tests. Since you mentioned high updates/deletes I should also ask: what is your CL for writes/reads? With high updates/deletes + high CL I think one should expect reads to slow down when sstables have not been compacted. You have 20G of space and 17G is used by your process, and I also see 36G VIRT, which I don't really understand why it's that high when swap is disabled. Look at sar -r output too to make sure no swapping is occurring. Also, verify jna.jar is installed. On Mon, Oct 3, 2011 at 11:52 AM, Ramesh Natarajan rames...@gmail.com wrote: I will start another test run to collect these stats. Our test model is in the neighborhood of 4,500 inserts, 8,000 updates/deletes and 1,500 reads every second across 6 servers. Can you elaborate more on reducing the heap space? Do you think 17G RSS is a problem? thanks Ramesh On Mon, Oct 3, 2011 at 1:33 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am wondering if you are seeing issues because of more frequent compactions kicking in. Is this primarily write ops, or reads too? During the period of the test gather data like: 1. cfstats 2. tpstats 3. compactionstats 4. netstats 5. iostat You have RSS memory close to 17 GB. Maybe someone can give further advice on whether that could be because of mmap. You might want to lower your heap size to 6-8G and see if that helps. Also, check that you have jna.jar deployed and that you see the malloc successful message in the logs. On Mon, Oct 3, 2011 at 10:36 AM, Ramesh Natarajan rames...@gmail.com wrote: We have 5 CFs. Attached is the output from the describe command. We don't have row cache enabled.
Thanks, Ramesh

Keyspace: MSA:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:3]
  Column Families:
    ColumnFamily: admin
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 20.0/14400
      Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
      GC grace seconds: 3600
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Built indexes: []
    ColumnFamily: modseq
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 50.0/14400
      Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
      GC grace seconds: 3600
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Built indexes: []
    ColumnFamily: msgid
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 50.0/14400
      Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Built indexes: []
    ColumnFamily: participants
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 50.0/14400
      Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
      GC grace seconds: 3600
      Compaction min/max thresholds: 4/32
      Read repair
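A hedged sketch for gathering the five data points Mohit lists, assuming nodetool is on the PATH and the node is local:

# snapshot the relevant stats during the test window
for c in cfstats tpstats compactionstats netstats; do
    nodetool -host localhost $c > /tmp/$c.$(date +%s).txt
done
iostat -x 5 3 > /tmp/iostat.$(date +%s).txt   # extended device utilization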
Re: cfstats - check Read Count per minute
If he puts the mx4j jar (http://mx4j.sourceforge.net/) in his lib/ folder, he can fetch stats out over HTTP. mx4j is a JMX-to-HTTP bridge. On Mon, Oct 3, 2011 at 2:53 AM, aaron morton aa...@thelastpickle.com wrote: Other than manually pulling them from JMX, not really. Most monitoring templates will grab those stats per CF (and perhaps per KS). Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 3/10/2011, at 3:41 PM, Marcus Both wrote: Hi, I am checking how many reads are made per minute as follows:

$ ./nodetool -h 127.0.0.1 cfstats
Keyspace: KSTEST Read Count: 412303 (...)
sleep 60
$ ./nodetool -h 127.0.0.1 cfstats
Keyspace: KSTEST Read Count: 462555 (...)
$ echo 462555 - 412303 | bc
50252

And then I can see how many reads are made per minute: 50,252! Is there another way to check this? Tks. -- Marcus Both
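Marcus's manual diff, scripted; a small sketch that assumes the cfstats output format shown above (it grabs the first Read Count line, i.e. the keyspace total):

a=$(nodetool -h 127.0.0.1 cfstats | awk '/Read Count/ {print $3; exit}')
sleep 60
b=$(nodetool -h 127.0.0.1 cfstats | awk '/Read Count/ {print $3; exit}')
echo $((b - a))   # reads in the last minute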
Re: cassandra performance degrades after 12 hours
Yes, look at cassandra.yaml; there is a section about throttling compaction. You still *want* multi-threaded compaction; throttling will occur across all threads. The reason is that you don't want to get stuck compacting bigger files while the smaller ones build up waiting for the bigger compaction to finish. That will slowly degrade read performance. On Mon, Oct 3, 2011 at 1:19 PM, Ramesh Natarajan rames...@gmail.com wrote: Thanks for the pointers. I checked the system and iostat showed that we are saturating the disk to 100%. The disk is a SCSI device exposed by ESXi, and it is running on a dedicated LUN as RAID10 (4 x 600 GB 15k drives) connected to the ESX host via iSCSI. When I run compactionstats I see we are compacting a column family which has about 10 GB of data. During this time I also see dropped messages in the system.log file. Since my I/O rates are constant in my tests, I think the compaction is throwing things off. Is there a way I can throttle compaction in Cassandra? Rather than running multiple compactions at the same time, I would like to throttle it by I/O rate. Is that possible? If, instead of having 5 big column families, I create say 1,000 each (5,000 total), do you think it will help in this case? (Smaller files and so a smaller load on compaction.) Is it normal to have 5,000 column families? thanks Ramesh On Mon, Oct 3, 2011 at 2:50 PM, Chris Goffinet c...@chrisgoffinet.com wrote: Most likely what could be happening is that you are running single-threaded compaction. Look at cassandra.yaml for how to enable multi-threaded compaction. As more data comes into the system, bigger files get created during compaction. You could be in a situation where you might be compacting at a higher bucket level N, and compactions build up at lower buckets. Run nodetool -host localhost compactionstats to get an idea of what's going on. On Mon, Oct 3, 2011 at 12:05 PM, Mohit Anchlia mohitanch...@gmail.com wrote: In order to understand what's going on you might want to first do just the write test, look at the results, then do just the read tests, and then do both read/write tests. Since you mentioned high updates/deletes I should also ask: what is your CL for writes/reads? With high updates/deletes + high CL I think one should expect reads to slow down when sstables have not been compacted. You have 20G of space and 17G is used by your process, and I also see 36G VIRT, which I don't really understand why it's that high when swap is disabled. Look at sar -r output too to make sure no swapping is occurring. Also, verify jna.jar is installed. On Mon, Oct 3, 2011 at 11:52 AM, Ramesh Natarajan rames...@gmail.com wrote: I will start another test run to collect these stats. Our test model is in the neighborhood of 4,500 inserts, 8,000 updates/deletes and 1,500 reads every second across 6 servers. Can you elaborate more on reducing the heap space? Do you think 17G RSS is a problem? thanks Ramesh On Mon, Oct 3, 2011 at 1:33 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am wondering if you are seeing issues because of more frequent compactions kicking in. Is this primarily write ops, or reads too? During the period of the test gather data like: 1. cfstats 2. tpstats 3. compactionstats 4. netstats 5. iostat You have RSS memory close to 17 GB. Maybe someone can give further advice on whether that could be because of mmap. You might want to lower your heap size to 6-8G and see if that helps. Also, check that you have jna.jar deployed and that you see the malloc successful message in the logs.
On Mon, Oct 3, 2011 at 10:36 AM, Ramesh Natarajan rames...@gmail.com wrote: We have 5 CFs. Attached is the output from the describe command. We don't have row cache enabled. Thanks, Ramesh
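The throttle Chris refers to; a hedged cassandra.yaml excerpt using the 0.8-era key name:

# caps total compaction I/O across all compaction threads;
# 16 is the shipped default, 0 disables throttling entirely
compaction_throughput_mb_per_sec: 16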
Re: shutdown by KILL
For things like rolling restarts, we do:

disablethrift
disablegossip
(...wait for all nodes to see this node go down...)
drain

2011/9/10 Radim Kolar h...@sendmail.cz What is the recommended node stop method: drain, or kill the Java process? I haven't seen anybody using drain in stop scripts yet. If I kill the Java process and the log flush time is set to 10 seconds, will the last 10 seconds be lost?
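Chris's sequence as a hedged shell sketch; $NODE is a placeholder, and the sleep stands in for actually confirming (e.g. via nodetool ring from another node) that the rest of the cluster has seen this node go down:

nodetool -h $NODE disablethrift   # stop accepting client requests
nodetool -h $NODE disablegossip   # announce the node as down
sleep 30                          # placeholder for the 'wait' step above
nodetool -h $NODE drain           # flush memtables; commit log ends up empty
# now it is safe to stop the process and restart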
Re: Massive writes when only reading from Cassandra
You could tail the commit log with `strings` to see what keys are being inserted. On Sat, Sep 10, 2011 at 2:24 PM, Jonathan Ellis jbel...@gmail.com wrote: Two possibilities: 1) Hinted handoff (this will show up in the logs on the sending machine; on the receiving one it will just look like any other write) 2) You have something doing writes that you're not aware of. I guess you could track that down using wireshark to see where the write messages are coming from. On Sat, Sep 10, 2011 at 3:56 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Oh, and we're running 0.8.4 and the RF is 3. On Sep 10, 2011, at 3:49 PM, Jeremy Hanna wrote: In addition, the mutation stage and the read stage are backed up, like:

Pool Name Active Pending Blocked
ReadStage 32 773 0
RequestResponseStage 0 0 0
ReadRepairStage 0 0 0
MutationStage 158525918 0
ReplicateOnWriteStage 0 0 0
GossipStage 0 0 0
AntiEntropyStage 0 0 0
MigrationStage 0 0 0
StreamStage 0 0 0
MemtablePostFlusher 1 5 0
FILEUTILS-DELETE-POOL 0 0 0
FlushWriter 2 5 0
MiscStage 0 0 0
FlushSorter 0 0 0
InternalResponseStage 0 0 0
HintedHandoff 0 0 0
CompactionManager n/a 29
MessagingService n/a 0,34

On Sep 10, 2011, at 3:38 PM, Jeremy Hanna wrote: We are experiencing massive writes to column families when only doing reads from Cassandra. A set of 5 Hadoop jobs is reading from Cassandra and then writing out to HDFS. That is the only thing operating on the cluster. We are reading at CL.QUORUM with Hadoop and have written with CL.QUORUM. Read repair chance is set to 0.0 on all column families. However, in the logs I'm seeing flush after flush of memtables and compactions taking place. Is there something else that would be writing, based on the above description? Jeremy -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
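Chris's commit log trick, spelled out; the path is a Debian-style default and the file name pattern an assumption:

# row keys show up as readable strings amid the binary mutation records
strings /var/lib/cassandra/commitlog/CommitLog-*.log | less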
Re: Anybody out there using 0.8 in production
Twitter runs 0.8 in production, close to trunk. No big issues for us. On Thu, Sep 8, 2011 at 8:53 PM, Eric Czech e...@nextbigsound.com wrote: We just migrated from 0.7.5 to 0.8.4 in our production environment and it was definitely the least painful transition yet (coming all the way from the 0.4 release series). It's been about a week for us, but so far so good. On Thu, Sep 8, 2011 at 9:25 PM, Dominic Williams dwilli...@fightmymonster.com wrote: Hi, I've just migrated to 0.8.5 and from first looks it is a giant leap forward - better use of CPU and memory, able to scrub files previously unfixable on 0.7.6-2, etc. On 9 September 2011 01:45, Anthony Ikeda anthony.ikeda@gmail.com wrote: We plan to, and have been using it in Dev and QA. There are some bugs that have been fixed that we are looking forward to in 0.8.5, and probably that would be the better build for production (there is a quorum bug fix that we will need). Otherwise no other 0.8 issues that we are aware of. We did go through a long analysis of properly configuring connection options and getting the server setup right, but DataStax were able to support us and get us sorted. Anthony On Thu, Sep 8, 2011 at 2:44 PM, Anand Somani meatfor...@gmail.com wrote: Hi, Currently we are using 0.7.4 and I was wondering if I should upgrade to 0.7.8/9 or move to 0.8? Is anybody using 0.8 in production, and what is their experience? Thanks
Re: commodity server spec
It will also depend on how long a recovery time you can handle. So imagine this case: 3 nodes w/ RF of 3. Each node has 30TB of space used (you never want to fill up an entire node). If one node fails and you must recover, that will take over 3.6 days just transferring the data alone. That's with a sustained 800 megabit/s (100 MB/s). In the real world it's going to fluctuate, so add some padding. Also, since you will be saturating one of the other nodes, your network latency suffers and you only have 1 machine to handle the remaining traffic while you're recovering. And if you want to expand the cluster in the future (more nodes), the amount of data to transfer is going to be very large and it will most likely take days to add machines. From my experience it's much better to have a larger cluster set up upfront for future growth than getting by with 6-12 nodes at the start. You will feel less pain, and it is easier to manage node failures (bad disks, memory, etc.). 3 nodes with an RF of 1 wouldn't make sense. On Sat, Sep 3, 2011 at 4:05 AM, China Stoffen chinastof...@yahoo.com wrote: Many small servers would drive the hosting cost way too high, so we want to avoid this solution if we can. - Original Message - From: Radim Kolar h...@sendmail.cz To: user@cassandra.apache.org Cc: Sent: Saturday, September 3, 2011 9:37 AM Subject: Re: commodity server spec many smaller servers are way better
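Chris's 3.6-day figure checks out if the 30TB is read as binary terabytes; a small sketch of the arithmetic:

# time to restream one failed node's data at a sustained 100 MB/s
data_mb = 30 * 1024 * 1024    # 30 TB (binary) expressed in MB
rate_mb_per_s = 100.0         # sustained transfer rate
days = data_mb / rate_mb_per_s / 86400
print(round(days, 2))         # -> 3.64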
Re: how large cassandra could scale when it need to do manual operation?
As mentioned by Aaron, yes, we run hundreds of Cassandra nodes across multiple clusters. We run with RFs of 2 and 3 (most common). We use commodity hardware and see failures all the time at this scale. We've never had 3 nodes that were in the same replica set fail all at once. We mitigate risk by being rack diverse, using different vendors for our hard drives, designing workflows to make sure machines get serviced in certain time windows, and having an extensive automated burn-in process (disk, memory, drives) so as not to roll out nodes/clusters that could fail right away. On Sat, Jul 9, 2011 at 12:17 AM, Yan Chunlu springri...@gmail.com wrote: Thank you very much for the reply, which gives me more confidence in Cassandra. I will try the automation tools; the examples you've listed seem quite promising! About the decommission problem, here is the link: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html I am also trying to deploy Cassandra across two datacenters (with 20 ms latency), so I am worried that the network latency will make things even worse. Maybe I misunderstood the replication factor: doesn't RF=3 mean I could lose two nodes and still have one available (with 100% of the keys), once Nodes=3? Besides, I am not sure what Twitter's RF setting is, but it is possible to lose 3 nodes at the same time (Facebook once lost photos because their RAID broke, though that rarely happens). I have a strong urge to set RF to a very high value... Thanks! On Sat, Jul 9, 2011 at 5:22 AM, aaron morton aa...@thelastpickle.com wrote: AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long time ago. Twitter is a vocal supporter with a large Apache Cassandra install, e.g. Twitter currently runs a couple hundred Cassandra nodes across a half dozen clusters. http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011 If you are working with a 3-node cluster, removing/rebuilding/whatever one node will affect 33% of your capacity. When you scale up, the contribution from each individual node goes down, and the impact of one node going down is less. Problems that happen with a few nodes will go away at scale, to be replaced by a whole set of new ones. 1) The load balance needs to be manually performed on every node, according to: Yes. 2) When adding new nodes, need to perform node repair and cleanup on every node: You only need to run cleanup, see http://wiki.apache.org/cassandra/Operations#Bootstrap 3) When decommissioning a node, there is a chance that it slows down the entire cluster (not sure why, but I saw people ask around about it), and the only way to fix it is to shut down the entire cluster, rsync the data, and start all nodes without the decommissioned one: I cannot remember any specific cases where decommission requires a full cluster stop, do you have a link? With regard to slowing down, the decommission process will stream data from the node you are removing onto the other nodes; this can slow down the target node (I think it's more intelligent now about what is moved). This will be exaggerated in a 3-node cluster, as you are removing 33% of the processing and adding some (temporary) extra load to the remaining nodes. After all, I think there is a lot of human work to do to maintain the cluster, which makes it impossible to scale to thousands of nodes: Automation, Automation, Automation is the only way to go. Chef, Puppet, CFEngine for general config and deployment; CloudKick, Munin, Ganglia, etc. for monitoring. And OpsCenter (http://www.datastax.com/products/opscenter) for Cassandra-specific management. Maybe I am totally wrong about all of this; currently I am serving 1 million page views every day with Cassandra and it makes me feel unsafe. I am afraid that one day a node crash will corrupt the data and the whole cluster will go wrong: With RF 3 and a 3-node cluster you have room to lose one node and the cluster will be up for 100% of the keys. While better than having to worry about *the* database server, it's still entry-level fault tolerance. With RF 3 in a 6-node cluster you can lose up to 2 nodes and still be up for 100% of the keys. Is there something you are specifically concerned about with your current installation? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 8 Jul 2011, at 08:50, Yan Chunlu wrote: hi, all: I am curious about how large Cassandra can scale. From the information I can get, the largest usage is at Facebook, which is about 150 nodes. In the meantime they are using 2000+ nodes with Hadoop, and Yahoo is even using 4000 Hadoop nodes. I don't understand why that is the situation; I have only a little knowledge of Cassandra and even no knowledge
Re: nodetool move hammers the next node in the ring
We also have a ticket open at https://issues.apache.org/jira/browse/CASSANDRA-2399 We have observed in production the impact of streaming data to new nodes being added. We actually have our entire dataset in page cache in one of our clusters; our 99th percentiles go from 20 ms to 1 second on streaming nodes when bootstrapping in new nodes, because the page cache gets blown out during the process. We are hoping to have this addressed soon. I think throttling of streaming would be good too, to help avoid saturating the network card on the streaming node. Dynamic snitch should help with this; we'll try to report back our results very soon on what it looks like for that case. -Chris On Apr 8, 2011, at 7:35 PM, aaron morton wrote: My brain just started working. The streaming for the move may need to be throttled, but once the file has been received, the bloom filters, row indexes and secondary indexes are built. That will also take some effort. Do you have any secondary indexes? If you are doing a move again, could you try turning up logging to DEBUG on one of the neighbour nodes? Once the file has been received you will see a message saying Finished {file_name}. Sending ack to {remote_ip}. After this log message the rebuilds will start; it would be interesting to see which is more heavyweight. I'm guessing the rebuilds. This is similar to https://issues.apache.org/jira/browse/CASSANDRA-2156 but that ticket will not cover this case. I've added this use case to the comments; please check there if you want to follow along. Cheers Aaron On 6 Apr 2011, at 16:26, Jonathan Colby wrote: Thanks for the response Aaron. Our cluster has 6 nodes with 10 GB load on each. RF=3. AMD 64-bit blades, quad core, 8 GB RAM, running Debian Linux. Swap off. Cassandra 0.7.4. On Apr 6, 2011, at 2:40 AM, aaron morton wrote: Not that I know of; it may be useful to be able to throttle things. But if the receiving node has little headroom it may still be overwhelmed. Currently there is a single thread for streaming. If we were to throttle, it may be best to make it multi-threaded with a single concurrent stream per endpoint. Out of interest, how many nodes do you have and what's the RF? Aaron On 6 Apr 2011, at 01:16, Jonathan Colby wrote: When doing a move, decommission, loadbalance, etc., data is streamed to the next node in such a way that it really strains the receiving node - to the point where it has a problem serving requests. Any way to throttle the streaming of data?
Re: How to use join_ring=false?
-Dcassandra.join_ring=false -Chris On Mar 21, 2011, at 10:32 PM, Jason Harvey wrote: I set join_ring=false in my java opts: -Djoin_ring=false However, when the node started up, it joined the ring. Is there something I am missing? Using 0.7.4 Thanks, Jason
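The working form Chris gives, shown in context; a hedged sketch of where it would go in the 0.7-era packaging (conf/cassandra-env.sh):

# note the 'cassandra.' prefix, which the plain -Djoin_ring=false lacks
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"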
Re: Nodes frozen in GC
Can you tell me how many SSTables are on disk when you see GC pauses? In your 3-node cluster, what's the RF? On Mon, Mar 7, 2011 at 1:50 PM, ruslan usifov ruslan.usi...@gmail.com wrote: 2011/3/8 Jonathan Ellis jbel...@gmail.com It sounds like you're complaining that the JVM sometimes does stop-the-world GC. You can mitigate this, but not (for most workloads) eliminate it, with GC option tuning. That's simply the state of the art for Java garbage collection right now. Hm, but what to do in these cases? In those moments the throughput of the cluster degrades, and I don't understand what workaround I can use to prevent these situations.
Re: Nodes frozen in GC
The rows you are inserting - what is your update ratio to those rows? On Mon, Mar 7, 2011 at 4:03 PM, ruslan usifov ruslan.usi...@gmail.com wrote: 2011/3/8 Chris Goffinet c...@chrisgoffinet.com Can you tell me how many SSTables are on disk when you see GC pauses? In your 3-node cluster, what's the RF? About 30-40, and I use RF=2 and insert rows with QUORUM consistency level
Re: Nodes frozen in GC
How large are your SSTables on disk? My thought was that because you have so many on disk, we have to store the bloom filter plus every 128th key from the index in memory. On Mon, Mar 7, 2011 at 4:35 PM, ruslan usifov ruslan.usi...@gmail.com wrote: 2011/3/8 Chris Goffinet c...@chrisgoffinet.com The rows you are inserting - what is your update ratio to those rows? I don't update them, only insert, at a speed of 16,000 per second
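To get a feel for the memory cost Chris describes, a hedged back-of-envelope sketch; every number below (keys per SSTable, bloom filter bytes per key, sampled index entry size) is an assumption for illustration only:

# resident memory ~= bloom filters + a 1-in-128 sample of each index
sstables = 35                  # "about 30-40" from this thread
keys_per_sstable = 5_000_000   # hypothetical
bloom_bytes_per_key = 2        # order-of-magnitude assumption (~15 bits/key)
sample_entry_bytes = 64        # key + offset + object overhead, assumed
bloom = sstables * keys_per_sstable * bloom_bytes_per_key
sample = sstables * keys_per_sstable // 128 * sample_entry_bytes
print((bloom + sample) // 10**6, "MB")   # -> 437 MB under these assumptions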
Re: Subscribe
I would like to subscribe to your newsletter. On Tue, Feb 15, 2011 at 8:04 AM, A J s5a...@gmail.com wrote:
Re: [RELEASE] 0.6.11
+1 On Fri, Jan 28, 2011 at 3:13 PM, Eric Evans eev...@rackspace.com wrote: It seems like it was just earlier this week that we announced the release of 0.6.10. Oh wait, it was. In the time since though, CASSANDRA-2058[1] was found and fixed, and that seemed like reason enough to fast-track a new release. Source and binary archives are available from the Downloads page[3], and packages for Debian-based systems are available from the project repository[4]. Thanks! [1]: https://issues.apache.org/jira/browse/CASSANDRA-2058 [2]: http://goo.gl/0bC9M (CHANGES.txt) [3]: http://cassandra.apache.org/download [4]: http://wiki.apache.org/cassandra/DebianPackaging -- Eric Evans eev...@rackspace.com
Re: [RELEASE] 0.6.11
Err. I meant, thanks Evan for getting this released so fast :) On Fri, Jan 28, 2011 at 3:18 PM, Chris Goffinet c...@chrisgoffinet.com wrote: +1 On Fri, Jan 28, 2011 at 3:13 PM, Eric Evans eev...@rackspace.com wrote: It seems like it was just earlier this week that we announced the release of 0.6.10. Oh wait, it was. In the time since though, CASSANDRA-2058[1] was found and fixed, and that seemed like reason enough to fast-track a new release. Source and binary archives are available from the Downloads page[3], and packages for Debian-based systems are available from the project repository[4]. Thanks! [1]: https://issues.apache.org/jira/browse/CASSANDRA-2058 [2]: http://goo.gl/0bC9M (CHANGES.txt) [3]: http://cassandra.apache.org/download [4]: http://wiki.apache.org/cassandra/DebianPackaging -- Eric Evans eev...@rackspace.com
Re: Cassandra and -XX:+UseLargePages
I've seen about a 13% improvement in practice. -Chris On Jan 16, 2011, at 4:01 PM, David Dabbs wrote: Hello. Can anyone comment on the performance impact (positive or negative) of running Cassandra configured to use large pages under Linux? Yes, YMMV applies, but I thought I'd ask before enlisting sysadmin Fu, etc. Thanks! David
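For anyone trying this, a hedged sketch of the two halves of the setup; the page count is illustrative (the reserved pool must be at least as large as the heap), and group permissions for hugetlb may also be needed depending on the distro:

# 1) reserve 2 MB huge pages in the kernel (6144 pages ~= 12 GB)
sysctl -w vm.nr_hugepages=6144
# 2) pass the flag to the JVM, e.g. via cassandra-env.sh / cassandra.in.sh
JVM_OPTS="$JVM_OPTS -XX:+UseLargePages"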
Re: Read Latency
If you are using Python and raw Thrift, use the following: protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport) The serialization/deserialization is then done directly in C. On Wed, Oct 20, 2010 at 11:53 AM, Wayne wav...@gmail.com wrote: We did some testing and the object is 23 MB; it takes more than 3 seconds for Thrift to return it as a Python object. We also tested pickling this object to/from a string: pickling takes 1.5 s and converting the pickled string back to a Python object takes 0.75 s. Added together they still take less than the 3 seconds Thrift is taking to create a Python object. I think our 1 s before was also an actual deep copy. We are definitely going to a streaming model and getting small batches of data at a time, per the recommendation. The bigger question of why Thrift takes more time than Cassandra itself is still open, though. Thrift is taking too much time to convert to a Python object, and there is no explanation we can find for why it takes so long. We have also tested with smaller and larger data requests and they all seem to show the same math - Thrift takes a little more time to convert than Cassandra itself takes to respond. Is this specific to Python accessing Thrift? Would it be faster to get the data into C and write our own Python wrapper around it? On Tue, Oct 19, 2010 at 7:16 PM, Aaron Morton aa...@thelastpickle.com wrote: Not sure how pycassa does it, but it's a simple case of: - get_slice with start=, finish= and count=100,001 - pop the last column and store its name - get_slice with start as the last column name, finish= and count=100,001 - repeat. A On 20 Oct, 2010, at 03:08 PM, Wayne wav...@gmail.com wrote: Thanks for all of the feedback. I may very well not be doing a deep copy, so my numbers might not be accurate. I will test with writing to/from disk to verify how long native Python takes. I will also check how large the data coming from Cassandra is, for comparison. Our high expectations are based on actual MySQL times, which are in the range of 3-4 seconds for the exact same data. I will also try to work with getting the data in batches. Not as easy of course in Cassandra, which is probably why we have not tried that yet. Thanks for all of the feedback! On Tue, Oct 19, 2010 at 8:51 PM, Aaron Morton aa...@thelastpickle.com wrote: Hard to say why your code performs that way; it may not be creating as many objects - for example, strings may not be re-created, just referenced. Are you creating new objects for every column returned? Bringing 600,000 to 10M columns back at once is always going to take time. I think any Python database client would take a while to create objects for 600,000 rows. Do you have an example of pulling 600,000 rows through MySQL into Python to compare against? Is it possible to break up the get_slice into chunks of 10,000 or 100,000? IMHO you will get more consistent performance if you bound the requests, so you have an idea of the upper level of latency for each request and create a more consistent memory footprint. For example, in the rough test below, 100,000 objects takes 0.75 secs but 600,000 takes 13. As an example of reprocessing the results, I called go2 with the output of go below.
def go2(buffer):
    start = time.time()
    buffer2 = [{'name': csc.column.name, 'value': csc.column.value} for csc in buffer]
    print "Done2 in %s" % (time.time() - start)

{977} python decode_test.py 10
Done in 0.75460100174
Done2 in 0.314303874969
{978} python decode_test.py 60
Done in 13.2945489883
Done2 in 7.32861185074

My general advice is to pull back less data in a single request. Aaron On 20 Oct, 2010, at 11:30 AM, Wayne wav...@gmail.com wrote: I am not sure how many bytes, but we do convert the Cassandra object that is returned in 3 s into a dictionary in ~1 s, and then again into a custom Python object in ~1.5 s. Expectations are based on this timing. If we can convert what Thrift returns into a completely new Python object in 1 s, why does Thrift need 3 s to give it to us? To us it is like the MySQL client we use in Python: it is really C wrapped in Python and adds almost zero overhead to the time it takes MySQL to return the data. That is the expectation we have and the performance we are looking to get to: disk I/O + 20%. We are returning one big row, and this is not our normal use case but a requirement for us to use Cassandra. We need to get all data for a specific value, as this is a secondary index. It is like getting all users in the state of CA: CA is the key and there is a column for every user ID. We are testing with 600,000 but this will grow to 10+ million in the future. We cannot test 0.7 as we are only using 0.6.6. We are trying to evaluate Cassandra, and stability is one concern, so 0.7 is definitely not for us at this point. Thanks. On Tue, Oct 19,
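Aaron's pagination recipe from this thread, as a hedged Python sketch against the raw Thrift API; the 0.6.x get_slice signature (keyspace as the first argument) is shown, and the generated-module import layout varies by setup:

from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                              SlicePredicate, SliceRange)

def iter_wide_row(client, keyspace, key, cf, batch=100000):
    # Fetch batch+1 columns, yield batch, and restart from the name of
    # the extra column (start is inclusive, so it is yielded next round).
    start = ''
    while True:
        pred = SlicePredicate(slice_range=SliceRange(
            start=start, finish='', reversed=False, count=batch + 1))
        cols = client.get_slice(keyspace, key, ColumnParent(column_family=cf),
                                pred, ConsistencyLevel.QUORUM)
        for csc in cols[:batch]:
            yield csc.column
        if len(cols) <= batch:
            return
        start = cols[-1].column.name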
Re: what causes MESSAGE-DESERIALIZER-POOL to spike
When you can't get the number of threads, that usually means you have way too many running (8,000+). Try running `ps -eLf | grep cassandra`. How many threads? -Chris On Jul 29, 2010, at 8:40 PM, Dathan Pattishall wrote: To follow up on this thread: I blew away the data for my entire cluster, waited a few days of user activity, and within 3 days the server hangs requests in the same way.

Background info: we make around 60 million requests per day, 70% reads / 30% writes, behind an F5 load balancer (BIG-IP LTM) in a round-robin config.
iostat info: 3 MB a second of writing data @ 13% iowait.
vmstat info: still shows a lot of blocked procs at low CPU utilization.
Data size: 6 GB of data per node, and there are 4 nodes.

cass01: Pool Name Active Pending Completed
cass01: FILEUTILS-DELETE-POOL 0 0 27
cass01: STREAM-STAGE 0 0 8
cass01: RESPONSE-STAGE 0 0 66439845
cass01: ROW-READ-STAGE 8 4098 77243463
cass01: LB-OPERATIONS 0 0 0
cass01: MESSAGE-DESERIALIZER-POOL 1 14223148 139627123
cass01: GMFD 0 0 772032
cass01: LB-TARGET 0 0 0
cass01: CONSISTENCY-MANAGER 0 0 35518593
cass01: ROW-MUTATION-STAGE 0 0 19809347
cass01: MESSAGE-STREAMING-POOL 0 0 24
cass01: LOAD-BALANCER-STAGE 0 0 0
cass01: FLUSH-SORTER-POOL 0 0 0
cass01: MEMTABLE-POST-FLUSHER 0 0 74
cass01: FLUSH-WRITER-POOL 0 0 74
cass01: AE-SERVICE-STAGE 0 0 0
cass01: HINTED-HANDOFF-POOL 0 0 9

Keyspace: TimeFrameClicks
Read Count: 42686
Read Latency: 47.21777100220213 ms.
Write Count: 18398
Write Latency: 0.17457457332318732 ms.
Pending Tasks: 0
Column Family: Standard2
SSTable count: 9
Space used (live): 6561033040
Space used (total): 6561033040
Memtable Columns Count: 6711
Memtable Data Size: 241596
Memtable Switch Count: 1
Read Count: 42552
Read Latency: 41.851 ms.
Write Count: 18398
Write Latency: 0.031 ms.
Pending Tasks: 0
Key cache capacity: 20
Key cache size: 81499
Key cache hit rate: 0.2495154675604193
Row cache: disabled
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0

Attached is jconsole memory use. I would attach the thread use, but I could not get any info from JMX on the threads, and clicking detect deadlock just hangs; I do not see the expected No deadlock detected. Based on feedback from this list by jbellis, I'm hitting Cassandra too hard. So I removed the offending server from the LB, waited about 20 minutes, and the pending queue did not clear at all. After killing Cassandra and restarting it, this box recovered. So from my point of view I think there is a bug in Cassandra. Do you agree? Possibly a deadlock in the SEDA implementation of the ROW-READ-STAGE? On Tue, Jul 27, 2010 at 12:28 AM, Peter Schuller peter.schul...@infidyne.com wrote: average queue size column too. But given the vmstat output I doubt this is the case, since you should either be seeing a lot more wait time or a lot less idle time. Hmm, another thing: you mention 16 i7 cores. I presume that's 16 in total, counting hyper-threading? Because that means 8 threads should be able to saturate 50% (as perceived by the operating system). If you have 32 (can you get this yet anyway?) virtual cores then I'd say that your vmstat output could be consistent with ROW-READ-STAGE being CPU bound rather than disk bound (presumably with data fitting in cache and not having to go down to disk). If this is the case, increasing read concurrency should at least make the actual problem more obvious (i.e., achieving CPU saturation), though it probably won't increase throughput much unless Cassandra is very friendly to hyperthreading. -- / Peter Schuller memory_use.PNG
Re: Digg 4 Preview on TWiT
Digg is not forking Cassandra. We use 0.6 for production, with a few in-house patches (related to our infrastructure). The biggest difference between our branch and the Apache 0.6 branch is that we have the work Kelvin and Twitter have done in regards to Vector Clocks + Distributed Counters. This will never go into 0.6, but should hopefully hit 0.7 soon. We will start to move to 0.7 once it gets more stable. -Chris On Jun 28, 2010, at 7:53 AM, Kochheiser, Todd W - TOK-DITT-1 wrote: On yesterday’s “This Week in Tech” (TWiT) podcast with Leo Laporte (Wiki: http://wiki.twit.tv/wiki/TWiT_254), Kevin Rose of Digg fame was a guest. He gave a public preview of the new Digg 4; it looks very nice and should be released in the next month or two. He also mentioned that Digg 4 is using Cassandra and that it is an Apache open source project. He mentioned Twitter and how the Twitter and Digg engineers have been working closely on Cassandra-related issues. There was a passing reference to Digg also working with Facebook engineers, but I could be wrong on that point. On a related but separate note: while I am fairly new to Cassandra and have only been following the mailing lists for a few months, the conversation with Kevin Rose on TWiT made me curious whether the versions of Cassandra that Digg, Twitter, and Facebook are using may end up being forks of the Apache project or old versions. As the Apache Cassandra project moves forward with new features, are these large and very public installations of Cassandra going to be able to continue contributing patches and features and/or accepting patches and features from the Apache project? While most recent commits appear to come from Eric Evans and Jonathan Ellis, the committers list for Cassandra does include, among many others, Facebook, Twitter, and Digg. My apologies if anyone feels this is an inappropriate post to this list. Todd
Re: Why Cassandra is space inefficient compared to MySQL?
My money is on the fact that the serializer is just horribly verbose. It's using a basic form of the Java serializer. -Chris
On Tue, May 25, 2010 at 10:02 AM, Ryan King r...@twitter.com wrote: Also, timestamps for each column. -ryan
On Tue, May 25, 2010 at 5:41 AM, Jonathan Ellis jbel...@gmail.com wrote: That's true. But fundamentally Cassandra is expected to use more space than mysql for a few reasons; usually the biggest factor is that Cassandra has to write out each column name in each row, since column names are dynamic, unlike in mysql where you declare the columns once for the whole table.
2010/5/25 Peter Schüller sc...@spotify.com: Could you please tell me why? There might be pending sstable removals on disk, which won't happen until GC or restart. If you just did a bulk insert and checked disk space immediately afterwards, I think this is a possible explanation. (See "Write path" on http://wiki.apache.org/cassandra/ArchitectureInternals) -- / Peter Schuller aka scode -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
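A rough back-of-the-envelope for Jonathan's and Ryan's point — the exact byte layout below is an assumption for illustration, not the real 0.6 on-disk format — is that each stored column carries its own name and timestamp, so small values drown in metadata:

    # Assumed per-column layout: 2-byte name length + name bytes +
    # 1-byte flags + 8-byte timestamp + 4-byte value length + value.
    # A 4-byte integer under a 10-character column name then costs about:
    echo $(( 2 + 10 + 1 + 8 + 4 + 4 ))   # ~29 bytes for 4 bytes of payload
    # mysql declares the column name once per table and stores no
    # per-column timestamp, hence the large on-disk gap for narrow rows.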
Re: zookeeper, how do you feed the pets?
If you are running multiple datacenters and intend to have a lot of writes for counters, I highly advise against it. We got rid of ZK because of that. -Chris
On May 16, 2010, at 7:04 PM, S Ahmed wrote: Can someone quickly go over how you go about using zookeeper if you want to store counts and have those counts be accurate? E.g. in Digg's case, I believe they are using zookeeper so they can keep track of diggs for a particular story. Is it a backend-only change, so that the storage API calls are unaffected? Is it a config issue? What are the ramifications of using this add-on? Are writes slower because you have to wait for the write to propagate to all the servers?
Re: Cassandra cluster runs into OOM when bulk loading data
Upgrade to b20 of Sun's version of the JVM. This OOM might be related to LinkedBlockingQueue issues that were fixed. -Chris
2010/4/26 Roland Hänel rol...@haenel.me: Cassandra version 0.6.1, OpenJDK Server VM (build 14.0-b16, mixed mode). Import speed is about 10MB/s for the full cluster; if a compaction is going on, the individual node is I/O limited. tpstats: caught me, didn't know this. I will set up a test and try to catch a node during the critical time. Thanks, Roland
2010/4/26 Chris Goffinet goffi...@digg.com: Which version of Cassandra? Which version of the Java JVM are you using? What do your I/O stats look like when bulk importing? When you run `nodeprobe -host tpstats`, is any thread pool backing up during the import? -Chris
2010/4/26 Roland Hänel rol...@haenel.me: I have a cluster of 5 machines building a Cassandra datastore, and I load bulk data into this using the Java Thrift API. The first ~250GB runs fine; then one of the nodes starts to throw OutOfMemory exceptions. I'm not using any row or index caches, and since I only have 5 CFs and some 2.5 GB of RAM allocated to the JVM (-Xmx2500M), in theory, that shouldn't happen. All inserts are done with consistency level ALL. I hope with this I have avoided all the 'usual dummy errors' that lead to OOMs. I have begun to troubleshoot the issue with JMX; however, it's difficult to catch the JVM at the right moment because it runs well for several hours before this thing happens. One thing comes to mind; maybe one of the experts could confirm or reject this idea for me: is it possible that when one machine slows down a little bit (for example because a big compaction is going on), the memtables don't get flushed to disk as fast as they are building up under the continuing bulk import? That would result in a downward spiral: the system gets slower and slower on disk I/O, but since more and more data arrives over Thrift, finally OOM. I'm using the periodic commit log sync; maybe this could also create a situation where the commit log writer is too slow to catch up with the data intake, resulting in ever-growing memory usage? Maybe these thoughts are just bullshit. Let me know if so... ;-)
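One way to catch the node in the act, following Chris's tpstats suggestion — a minimal sketch; the host is a placeholder and the tool name matches the 0.6-era `nodeprobe` used upthread:

    # Poll thread-pool backlog every 5s during the bulk import; steadily
    # growing Pending counts on ROW-MUTATION-STAGE or FLUSH-WRITER-POOL
    # would mean memtable flushes are falling behind the Thrift intake.
    watch -n 5 'bin/nodeprobe -host <node> tpstats'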
Re: [RELEASE] 0.6.0
I wonder if that might be related to this: https://issues.apache.org/jira/browse/CASSANDRA-896 We switched from a Concurrent structure to LinkedBlockingQueue in 0.6. -Chris
On Apr 17, 2010, at 9:26 PM, Schubert Zhang wrote: We are testing 0.6.0, comparing with 0.5.1, and it seems: 1. 0.6.0 needs more memory/heap. 2. After inserting billions of columns (tens of millions of keys), the inserting operation becomes very slow and jammed. TimeoutException and UnavailableException are thrown sometimes. I added more logging, such as:

WARN [pool-1-thread-4] 2010-04-18 00:00:00,534 CassandraServer.java (line 460) UnavailableException()
UnavailableException()
    at org.apache.cassandra.service.StorageProxy.assureSufficientLiveNodes(StorageProxy.java:298)
    at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:208)
    at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:452)
    at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:362)
    at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:1484)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1125)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

WARN [pool-1-thread-5] 2010-04-18 12:20:03,614 CassandraServer.java (line 456) java.util.concurrent.TimeoutException: Operation timed out - received only 0 responses
java.util.concurrent.TimeoutException: Operation timed out - received only 0 responses
    at org.apache.cassandra.service.WriteResponseHandler.get(WriteResponseHandler.java:77)
    at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:262)
    at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:452)
    at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:362)
    at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:1484)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1125)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

and

INFO [Timer-1] 2010-04-18 11:09:13,928 Gossiper.java (line 179) InetAddress /10.24.1.16 is now dead.
INFO [Timer-1] 2010-04-18 11:09:14,930 Gossiper.java (line 179) InetAddress /10.24.1.14 is now dead.
INFO [Timer-1] 2010-04-18 11:09:14,930 Gossiper.java (line 179) InetAddress /10.24.1.18 is now dead.

In fact, these nodes are alive.
2010/4/15 Ted Zlatanov t...@lifelogs.com: On Wed, 14 Apr 2010 12:23:19 -0500, Eric Evans eev...@rackspace.com wrote: EE On Wed, 2010-04-14 at 10:16 -0500, Ted Zlatanov wrote: Can it support a non-root user through /etc/default/cassandra? I've been patching the init script myself but was hoping this would be standard. EE It's the first item on debian/TODO, but, you know, patches welcome and EE all that. The appended patch has been sufficient for me. I have to override the PIDFILE too, but that's a system issue.
So my /etc/default/cassandra, for example, is:

JAVA_HOME=/usr/lib/jvm/java-6-sun
USER=cassandra
PIDFILE=/var/tmp/$NAME.pid

Ted

--- debian/init 2010-04-14 12:57:30.0 -0500
+++ /etc/init.d/cassandra 2010-04-14 13:00:25.0 -0500
@@ -21,6 +21,7 @@
 JSVC=/usr/bin/jsvc
 JVM_MAX_MEM=1G
 JVM_START_MEM=128M
+USER=root
 [ -e /usr/share/cassandra/apache-cassandra.jar ] || exit 0
 [ -e /etc/cassandra/storage-conf.xml ] || exit 0
@@ -75,6 +76,7 @@
 is_running || return 1
 $JSVC \
+    -user $USER \
     -home $JAVA_HOME \
     -pidfile $PIDFILE \
     -errfile 1 \
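A quick check that the patch took effect — a minimal sketch; the pid-file path follows Ted's /etc/default/cassandra example above, assuming $NAME expands to "cassandra":

    # Should print "cassandra" (the configured USER), not "root",
    # once jsvc is started with the new -user flag.
    ps -o user= -p "$(cat /var/tmp/cassandra.pid)"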
Re: if cassandra isn't ideal for keep track of counts, how does digg count diggs?
http://issues.apache.org/jira/browse/CASSANDRA-704 http://issues.apache.org/jira/browse/CASSANDRA-721 We have our own internal codebase of Cassandra at Digg, but we are using the above patches until we have the vector clock work cleaned up; that patch will also go to jira. Most likely the vector clock work will go into 0.7, but since we run 0.6 and built it for that version, we will share that patch too. -Chris
On Apr 6, 2010, at 10:17 AM, S Ahmed wrote: Chris, when you say patch, does that mean for Cassandra or your own internal codebase? Sounds interesting, thanks!
On Tue, Apr 6, 2010 at 12:54 PM, Chris Goffinet goffi...@digg.com wrote: That's not true. We have been using the Zookeeper work we posted on jira. That's what we are using internally and have been for months. We are now just wrapping up our vector clocks + distributed counter patch so we can begin transitioning away from the Zookeeper approach, because there are problems with it long-term. -Chris
On Apr 6, 2010, at 9:50 AM, Ryan King wrote: They don't use cassandra for it yet. -ryan
On Tue, Apr 6, 2010 at 9:00 AM, S Ahmed sahmed1...@gmail.com wrote: From what I read in another thread, Cassandra isn't 'ideal' for keeping track of counts. For example, I would understand this to mean keeping track of which stories were dugg. If this is true, how would a site like Digg keep track of the 'dugg' counter? Also, I am assuming with eventual consistency the number *may* not be 100% accurate. If you wanted it to be accurate, would you just use the Quorum flag? (I believe quorum is to ensure all writes are written to disk)
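On the accuracy question at the end: quorum is about replica overlap, not about forcing writes to disk. A QUORUM read is guaranteed to see the latest QUORUM write because the two replica sets must intersect. A minimal sketch of the arithmetic for RF=3:

    # General rule: reads observe the latest write whenever R + W > RF.
    RF=3
    W=$(( RF / 2 + 1 ))   # quorum writes: 2 of 3 replicas
    R=$(( RF / 2 + 1 ))   # quorum reads:  2 of 3 replicas
    echo "R + W = $(( R + W )) > RF = $RF, so read and write sets overlap"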
Re: Stalled Bootstrapping Process
+1
On Fri, Apr 2, 2010 at 3:49 PM, Jonathan Ellis jbel...@gmail.com wrote: Ah, right. That's confusing for everyone. I think the best solution there is to just get http://issues.apache.org/jira/browse/CASSANDRA-579 done so it can start streaming immediately.
On Fri, Apr 2, 2010 at 5:45 PM, Dan Di Spaltro dan.dispal...@gmail.com wrote: It did once it was actually done anti-compacting. The biggest question mark (for us) was what was happening during the anti-compaction phase.
On Fri, Apr 2, 2010 at 3:39 PM, Jonathan Ellis jbel...@gmail.com wrote: Great, glad it worked. Sounds like we do have a bug, though, if the destination node never showed anything in the Streaming mbean. :(
On Fri, Apr 2, 2010 at 5:11 PM, Dan Di Spaltro dan.dispal...@gmail.com wrote: To close the loop on this, the node finished bootstrapping. The source node rebooting definitely halted the process. Visibility-wise, watching the anti-compactions is the best way to tell how much progress is being made on the bootstrapping process. The CompactionManager mbean gives you insight into the progress of each anti-compaction as well. Thanks for the help,
On Thu, Apr 1, 2010 at 4:23 PM, Jonathan Ellis jbel...@gmail.com wrote: I would turn debug logging on globally on the new node; that will answer more questions than just the streaming package. -- Dan Di Spaltro -- Dan Di Spaltro -- Chris Goffinet
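For visibility while a bootstrap appears stalled — a sketch under assumptions: 8080 as the default JMX port of this era, and the mbeans referred to only by the names used in the thread above:

    # Attach jconsole to the source node and watch the CompactionManager
    # mbean (anti-compaction progress) and the Streaming mbean mentioned
    # above (outbound transfer status).
    jconsole service:jmx:rmi:///jndi/rmi://<source-node>:8080/jmxrmi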
Re: Hackathon?!?
Awesome! 2 tickets left. -Chris On Mar 27, 2010, at 11:42 PM, Evan Weaver wrote: Me too. On Tue, Mar 23, 2010 at 12:48 PM, Jeff Hodges jhod...@twitter.com wrote: I'll be there. -- Jeff On Mon, Mar 22, 2010 at 8:40 PM, Eric Florenzano flo...@gmail.com wrote: Nice, I'll go! -Eric Florenzano -- Evan Weaver
Re: Nodes Timing Out
what's the ulimit set to? -Chris
On Mar 27, 2010, at 10:29 AM, James Golick wrote: Hey, I put our first cluster into production (writing but not reading) a couple of days ago. Right now, it's got two pretty sizeable nodes taking about 200 writes per second each and virtually no reads. Eventually, though (and this has happened twice), both nodes seem to start timing out. If I run nodetool cfstats, I get:

[ja...@cassandra1 ~]# /opt/cassandra/bin/nodetool -h cassandra1.fetlife.com cfstats
Keyspace: system
  Read Count: 39
  Read Latency: 0.35925641025641025 ms.
  Write Count: 3
  Write Latency: 0.166 ms.
  Pending Tasks: 66
  Column Family: HintsColumnFamily
    SSTable count: 0
    Space used (live): 0
    Space used (total): 0

and then it just hangs there. Any ideas? - James
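To answer Chris's question for the running daemon rather than for your login shell — a minimal sketch; the pgrep pattern is an assumption about how the process shows up in the process list:

    # The limits applied when the JVM started are what matter, not your
    # current shell's `ulimit -n`.
    CASS_PID=$(pgrep -f cassandra | head -1)
    grep 'open files' /proc/$CASS_PID/limits
    # A low nofile limit (e.g. the old 1024 default) can make nodes hang
    # once client sockets plus sstable file handles exhaust it.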
Re: CASSANDRA-721
https://issues.apache.org/jira/browse/CASSANDRA-580 On Mon, Mar 22, 2010 at 10:00 AM, Toby DiPasquale t...@cbcg.net wrote: Hi all, CASSANDRA-721 (https://issues.apache.org/jira/browse/CASSANDRA-721) contains the following statement: Edit: This is only a temporary solution for atomic increment/decrement operation. What is the more permanent solution? Is this known yet? Seems like a reliance on ZooKeeper would not be optimal in comparison with having some kind of atomic inc/dec native to Cassandra itself? Does anyone know what's up with this? Thanks! -- Toby DiPasquale -- Chris Goffinet