Re: Strange delay in query
Can it be that you have tons and tons of tombstoned columns in the middle of these two? I've seen plenty of performance issues with wide rows littered with column tombstones (you could check by dumping the sstables...a sketch of such a probe follows at the end of this thread). Just a thought...

Josep M.

On Thu, Nov 8, 2012 at 12:23 PM, André Cruz andre.c...@co.sapo.pt wrote:

These are the two columns in question:

=> (super_column=13957152-234b-11e2-92bc-e0db550199f4,
     (column=attributes, value=, timestamp=1351681613263657)
     (column=blocks, value=A4edo5MhHvojv3Ihx_JkFMsF3ypthtBvAZkoRHsjulw06pez86OHch3K3OpmISnDjHODPoCf69bKcuAZSJj-4Q, timestamp=1351681613263657)
     (column=hash, value=8_p2QaeRaX_QwJbUWQ07ZqlNHei7ixu0MHxgu9oennfYOGfyH6EsEe_LYO8V8EC_1NPL44Gx8B7UhYV9VSb7Lg, timestamp=1351681613263657)
     (column=icon, value=image_jpg, timestamp=1351681613263657)
     (column=is_deleted, value=true, timestamp=1351681613263657)
     (column=is_dir, value=false, timestamp=1351681613263657)
     (column=mime_type, value=image/jpeg, timestamp=1351681613263657)
     (column=mtime, value=1351646803, timestamp=1351681613263657)
     (column=name, value=/Mobile Photos/Photo 2012-10-28 17_13_50.jpeg, timestamp=1351681613263657)
     (column=revision, value=13957152-234b-11e2-92bc-e0db550199f4, timestamp=1351681613263657)
     (column=size, value=1379001, timestamp=1351681613263657)
     (column=thumb_exists, value=true, timestamp=1351681613263657))
=> (super_column=40b7ae4e-2449-11e2-8610-e0db550199f4,
     (column=attributes, value={posix: 420}, timestamp=1351790781154800)
     (column=blocks, value=9UCDkHNb8-8LuKr2bv9PjKcWCT0v7FCZa0ebNSflES4-o7QD6eYschVaweCKSbR29Dq2IeGl_Cu7BVnYJYphTQ, timestamp=1351790781154800)
     (column=hash, value=kao2EV8jw_wN4EBoMkCXZWCwg3qQ0X6m9_X9JIGkEkiGKJE_JeKgkdoTAkAefXgGtyhChuhWPlWMxl_tX7VZUw, timestamp=1351790781154800)
     (column=icon, value=text_txt, timestamp=1351790781154800)
     (column=is_dir, value=false, timestamp=1351790781154800)
     (column=mime_type, value=text/plain, timestamp=1351790781154800)
     (column=mtime, value=1351378576, timestamp=1351790781154800)
     (column=name, value=/Documents/VIMDocument.txt, timestamp=1351790781154800)
     (column=revision, value=40b7ae4e-2449-11e2-8610-e0db550199f4, timestamp=1351790781154800)
     (column=size, value=13, timestamp=1351790781154800)
     (column=thumb_exists, value=false, timestamp=1351790781154800))

I don't think their size is an issue here.

André

On Nov 8, 2012, at 6:04 PM, Andrey Ilinykh ailin...@gmail.com wrote:

What is the size of the columns? Probably those two are huge.

On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:

On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

This error also happens on my application that uses pycassa, so I don't think this is the same bug. I have narrowed it down to a slice between two consecutive columns. Observe this behaviour using pycassa:

DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'), column_count=2, column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()

DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849 139928791262976 Connection 52905488 (xxx:9160) was checked out from pool 51715344
DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849 139928791262976 Connection 52905488 (xxx:9160) was checked in to pool 51715344

[UUID('13957152-234b-11e2-92bc-e0db550199f4'), UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

A two-column slice took more than 2s to return.

If I request the next 2-column slice:

DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'), column_count=2, column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()

DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849 139928791262976 Connection 52904912 (xxx:9160) was checked out from pool 51715344
DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849 139928791262976 Connection 52904912 (xxx:9160) was checked in to pool 51715344

[UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'), UUID('a364b028-2449-11e2-8882-e0db550199f4')]

This takes 20msec...

Is there a rational explanation for this different behaviour? Is there some threshold that I'm running into? Is there any way to obtain more debugging information about this problem?

Thanks,
André
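A minimal sketch of such a probe, as referenced above: it walks the row in fixed-size slices and times each one, so a slice that suddenly takes seconds instead of milliseconds brackets the region worth dumping. pycassa assumed; the keyspace, CF and host names are hypothetical stand-ins for the ones in the thread:

import time
import uuid
import pycassa

# Hypothetical names; the CF plays the role of col_fam_nsrev above.
pool = pycassa.ConnectionPool('Disco', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'NamespaceRevision')

row_key = uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e')
start = ''  # empty start column means "beginning of the row"
while True:
    t0 = time.time()
    cols = cf.get(row_key, column_start=start, column_count=100)
    elapsed = time.time() - t0
    names = list(cols.keys())
    if not names:
        break
    print('%3d columns up to %s in %.3fs' % (len(names), names[-1], elapsed))
    if len(names) < 100:
        break
    start = names[-1]  # resume at the last column seen (it gets re-fetched)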
Re: minor compaction and delete expired column-tombstones
We've run into exactly the same problem recently. Some specific keys in a couple of CFs accumulate a fair amount of column churn over time. Pre-Cassandra 1.x we scheduled full compactions often to purge them. However, when we moved to 1.x we adopted the recommended practice of avoiding full compactions. The problem took a while to manifest itself, but over the course of several weeks (a few months) of not doing full compactions, the load on those services slowly increased...and despite having everything monitored, it was not trivial to find out that it was the accumulation of tombstones on 'some' keys, for 'some' CFs in the cluster, that was really causing the long latencies and CPU spikes (high CPU is a typical signature of having a fair amount of tombstones in the SSTables).

Is there any JIRA ticket or enhancement that would help detect when certain column tombstones can be deleted in minor compactions? Might the newly introduced SSTable min-max timestamps help? Or perhaps there are new ones coming up that I'm not aware of.

I'm saying this because there is absolutely no way (that I know of) to find out or monitor when Cassandra encounters many column tombstones when doing searches. That alone could help detect these cases so one can change the data model and/or realize that full compactions are needed. For example: a new metric at the CF level that tracks the % of tombstones read per row (ideally a histogram based on row size), or perhaps spitting something out in the logs (a la the MySQL slow query log) when a wide row is read and a certain % of tombstone columns is encountered...this alone could be a huge help in at least detecting the latent problem.

...what we had to do to fully debug and understand the issue was to build some tools that scanned SSTables and provided some of those stats (a sketch of that kind of tool follows at the end of this thread). In a large cluster that is painful to do.

Anyway, just wanted to chime in on the thread to provide our input on the matter.

Cheers,

Josep M.

On Mon, Sep 17, 2012 at 2:01 AM, Rene Kochen rene.koc...@emea.schange.com wrote:

OK, thanks! So a column tombstone will only be removed if all row fragments are present in the tables being compacted.

I have a row called "Index" which contains columns like page0, page1, page2, etc. Every several minutes, new columns are created and old ones deleted. The problem is that I now have an Index row in several SSTables, but the column tombstones are never deleted. And reading the Index row (and all its column tombstones) takes longer and longer. If I do a major compaction, all tombstones are deleted and reading the Index row takes one millisecond again (and all the garbage-collection issues caused by this go away).

Is it not advised to use rows with many new column creates/deletes (because of how minor compactions work)?

Thanks!

Rene

2012/9/17 aaron morton aa...@thelastpickle.com:

> Does minor compaction delete expired column-tombstones when the row is also present in another table which is not subject to the minor compaction?

No. Compaction is per Column Family. Tombstones will be expired by Minor Compaction if all fragments of the row are contained in the SSTables being compacted.

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2012, at 6:32 AM, Rene Kochen rene.koc...@schange.com wrote:

Hi all,

Does minor compaction delete expired column-tombstones when the row is also present in another table which is not subject to the minor compaction?

Example: Say there are 5 SSTables:

- Customers_0 (10 MB)
- Customers_1 (10 MB)
- Customers_2 (10 MB)
- Customers_3 (10 MB)
- Customers_4 (30 MB)

A minor compaction is triggered which will compact the similar-sized tables 0 to 3. In these tables is a customer record with key C1 with an expired column tombstone. Customer C1 is also present in table 4. Will the minor compaction delete the column tombstone (i.e., will the tombstone be absent from the newly created table)?

Thanks,

Rene
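A rough sketch of the kind of SSTable-scanning tool mentioned above, piggy-backing on Cassandra's sstable2json; the per-row tombstone percentage is exactly the stat that proved useful. It assumes the 1.x sstable2json output format for standard CFs, where each column is a [name, value, timestamp, ...] list and deleted columns carry a "d" flag as the fourth element -- verify that against your version before trusting the numbers:

import json
import subprocess
import sys

def tombstone_stats(sstable_path):
    # sstable2json prints one JSON object mapping row keys to column lists.
    out = subprocess.check_output(['sstable2json', sstable_path])
    for key, columns in json.loads(out).items():
        dead = sum(1 for c in columns if len(c) > 3 and c[3] == 'd')
        if columns:
            print('%s: %d/%d columns are tombstones (%.0f%%)' %
                  (key, dead, len(columns), 100.0 * dead / len(columns)))

if __name__ == '__main__':
    tombstone_stats(sys.argv[1])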
Re: CQL3 and column slices
Thanks, Sylvain.

The main argument for this is pagination. Let me try to explain the use cases, and compare them to an RDBMS for better illustration:

1- Right now, Cassandra doesn't stream the requests, so large resultsets are a royal pain in the neck to deal with. I.e., if I have a range_slice, or even a slice query, that cuts across 1 million columns...I have to completely eat it all in the client receiving the response. That is, I'll need to store 1 million results in the client no matter what, and that can be quite prohibitive.

2- In an effort to alleviate that, one can be smarter in the client and play the pagination game...i.e., start slicing at some column and get the next N results, then start the slice at the last column seen and get N more, etc. (a sketch of that pattern follows at the end of this thread). That results in many more queries from the smart client, but at least it allows you to handle large result sets. (That's what the need for the CQL query in my original email was about.)

3- There's another important factor related to this problem, in my opinion: the LIMIT clause in Cassandra (in both CQL and Thrift) is a required field. What I mean by required is that Cassandra requires an explicit count to operate underneath. So it is really different from RDBMS semantics, where no LIMIT means you'll get all the results (instead of the high, yet still bounded, count of 10K or 20K max resultset rows Cassandra enforces by default)...and I cannot tell you how many problems we've had with developers forgetting about these default counts in queries, and realizing that some had results truncated because of that...in my mind, LIMIT should only be used to restrict results...queries with no LIMIT should always return all results (much like an RDBMS)...otherwise the query looks the same but is semantically different.

So, all in all, I think that the main problem/use case I'm facing is that Cassandra cannot stream resultsets. If it did, I believe that the need for my pagination use case would basically disappear, since it'd be the transport/client that would throttle how many results are stored in the client buffer at any point in time. At the same time, I believe that with a streaming protocol you could simply change Cassandra internals to have infinite default limits...since there would be no reason to stop scanning (unless an explicit LIMIT clause was specified by the client). That would give you not only the SQL-equivalent syntax, but also the equivalent semantics of most current DBs.

I hope that makes sense. That being said, are there any plans for streaming results? I believe that without that (and especially with the new CQL restrictions) it becomes much more difficult to use Cassandra with wide rows and large resultsets (which, in my mind, is one of its sweet spots). I believe that if that doesn't happen it would a) force the clients to be built in a much more complex and inefficient way to handle wide rows, or b) force users to use different, less efficient data models for their data. Both seem like bad propositions to me, as they wouldn't be taking advantage of Cassandra's power, therefore diminishing its value.

Cheers,

Josep M.

On Tue, Jul 24, 2012 at 3:11 AM, Sylvain Lebresne sylv...@datastax.com wrote:

On Tue, Jul 24, 2012 at 12:09 AM, Josep Blanquer blanq...@rightscale.com wrote:

> is there some way to express that in CQL3? something logically equivalent to SELECT * FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2?

No, there isn't. Not currently at least. But feel free of course to open a ticket/request on https://issues.apache.org/jira/browse/CASSANDRA.

I note that I would be curious to know the concrete use case you have for this type of query. It would also help as an argument to add such facilities more quickly (or at all). Typically, "we should support it in CQL3 because it was possible with Thrift" is definitely an argument, but a much weaker one without concrete examples of why it might be useful in the first place.

--
Sylvain
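A minimal sketch of the "pagination game" from point 2 above, via pycassa (keyspace/CF names hypothetical). Each page restarts the slice at the last column seen and drops the duplicated first result:

import pycassa

pool = pycassa.ConnectionPool('AppKS', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'WideRows')

def iter_columns(key, page_size=1000):
    """Yield every (name, value) in a wide row, one page at a time."""
    start = ''          # empty start means "beginning of the row"
    seen_start = False
    while True:
        cols = cf.get(key, column_start=start, column_count=page_size)
        names = list(cols.keys())
        for name in (names[1:] if seen_start else names):
            yield name, cols[name]
        if len(names) < page_size:
            break       # short page: we've reached the end of the row
        start = names[-1]
        seen_start = True

Note the client only ever holds page_size results in memory, which is the throttling a streaming transport would otherwise provide.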
CQL3 and column slices
Hi,

I am confused as to what is the way to specify column slices for composite-type CFs using CQL3. I first thought that the way to do so was to use the very ugly and unintuitive syntax of constructing the PK prefix with equalities, except for the last part of the composite type. But now, after seeing https://issues.apache.org/jira/browse/CASSANDRA-4372, and realizing that the ugly/unintuitive way to specify that has been taken away (i.e., fixed)...I don't know how to express it anymore.

In particular, and following the example of 4372...if you have this table with 6 columns, 5 of them forming the composite:

CREATE TABLE bug_test (a int, b int, c int, d int, e int, f text, PRIMARY KEY (a, b, c, d, e));

with some data in it:

SELECT * FROM bug_test;

Results:

 a | b | c | d | e | f
---+---+---+---+---+---
 1 | 1 | 1 | 1 | 1 | 1
 1 | 1 | 1 | 1 | 2 | 2
 1 | 1 | 1 | 1 | 3 | 3
 1 | 1 | 1 | 1 | 5 | 5
 1 | 1 | 1 | 2 | 1 | 1

how can I do a slice starting after 1:1:1:1:2, to the end? I thought that the (very ugly) way was:

SELECT a, b, c, d, e, f FROM bug_test WHERE a = 1 AND b = 1 AND c = 1 AND d = 1 AND e > 2;

(despite the fact that it felt completely wrong, since these conditions need to be considered together, not as 5 independent ones...otherwise one realizes that the result will contain rows that don't match them, for example rows that contain d=2 in this case)

is there some way to express that in CQL3? Something logically equivalent to SELECT * FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2?

Cheers,

Josep M.
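As a hedged aside for anyone reading the archive: later CQL3 versions added multi-column relations on clustering columns, which -- assuming a Cassandra version new enough to have them, well after this thread -- express exactly this composite slice:

SELECT a, b, c, d, e, f FROM bug_test WHERE a = 1 AND (b, c, d, e) > (1, 1, 1, 2);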
Re: poor Memtable performance on column slices?
Excellent Sylvain! Yes, that seems to remove the linear-scan component of slice read times.

FYI, I still see some interesting differences in some aspects, though.

If I do a slice without a start (i.e., get me the first column)...it seems to fly:

GET(K, :count => 1)  --> 4.832877

Very fast; in this case I actually see the reading client being the bottleneck, not Cassandra (which sits at only about 20% CPU).

If I do the same, but actually specify the start column with the first existing value:

GET(K, :start => '144abe16-416c-11e1-9e23-2cbae9ddfe8b', :count => 1)  --> 11.084275

Half as fast, and using twice the CPU...hovering at about 50% or more. (Again, Cassandra is not the bottleneck, but the significant datum is that the initial seek seems to double the time/CPU.)

If I do the same, starting in the middle:

GET(K, :start => '9c13c644-416c-11e1-81dd-4ba530dc83d0', :count => 1)  --> 11.038187

As expensive as starting from the beginning.

The same, starting at the last one:

GET(K, :start => '1c1b9b32-416d-11e1-83ff-dd2796c3abd7', :count => 1)  --> 6.489683

Much faster than any other slice, although not quite as fast as not using a start column.

I can see that not having to seek into whatever backing map/structure is obviously faster...although I'm surprised that seeking to an initial value makes reads half as fast. Wouldn't this mostly imply following some links/pointers in memory to start reading ordered columns? What is the backing store used for Memtables when column slices are performed? I am also not sure why starting at the end (without reversing or anything) yields much better performance.

Cheers,

Josep M.

On Wed, Jan 18, 2012 at 12:57 AM, Sylvain Lebresne sylv...@datastax.com wrote:

On Wed, Jan 18, 2012 at 2:44 AM, Josep Blanquer blanq...@rightscale.com wrote:

> Hi,
>
> I've been doing some tests using wide rows recently, and I've seen some odd performance problems that I'd like to understand.
>
> In particular, I've seen that the time it takes for Cassandra to perform a column slice of a single key, solely in a Memtable, seems to be very expensive and, most importantly, proportional to the ordered position where the start column of the slice lives. In other words:
>
> 1- if I start Cassandra fresh (with an empty ColumnFamily with TimeUUID comparator)
> 2- I create a single row with key K
> 3- then add 200K TimeUUID columns to key K
> 4- (and make sure nothing is flushed to SSTables...so it's all in the Memtable)
>
> ...I observe the following timings (seconds to perform 1000 reads) while performing multiget slices on it (pardon the pseudo-code, but you'll get the gist):
>
> a) simply a get of the first column:
> GET(K, :count => 1)  --> 2.351226
>
> b) a slice get, starting from the first column:
> GET(K, :start => '144abe16-416c-11e1-9e23-2cbae9ddfe8b', :count => 1)  --> 2.189224
> -> so with or without a start column doesn't seem to make much of a difference
>
> c) a slice get, starting from the middle of the ordered columns...approx. starting at item number 100K:
> GET(K, :start => '9c13c644-416c-11e1-81dd-4ba530dc83d0', :count => 1)  --> 11.849326
> -> 5 times more expensive if the start of the slice is 100K positions away
>
> d) a slice get, starting from the last of the ordered columns...approx. position 200K:
> GET(K, :start => '1c1b9b32-416d-11e1-83ff-dd2796c3abd7', :count => 1)  --> 19.889741
> -> almost twice as expensive as starting the slice at position 100K, and 10 times more expensive than starting from the first one
>
> This behavior leads me to believe that there's a clear Memtable column scan for the columns of the key.

You may want to retry your experiments on current trunk. We did have an inefficiency in our memtable search that was fixed.
Re: poor Memtable performance on column slices?
On Wed, Jan 18, 2012 at 12:44 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Wed, Jan 18, 2012 at 12:31 PM, Josep Blanquer blanq...@rightscale.com wrote:

> If I do a slice without a start (i.e., get me the first column)...it seems to fly. GET(K, :count => 1)

Yep, that's a totally different code path (SimpleSliceReader instead of IndexedSliceReader) that we've done to optimize this common case.

Thanks Jonathan, yup, that makes sense. It was surprising to me that avoiding the seek was that much faster...but I guess if it's a completely different code path, there might be many other things in play.

> The same starting at the last one. GET(K, :start => '1c1b9b32-416d-11e1-83ff-dd2796c3abd7', :count => 1)  --> 6.489683 -> Much faster than any other slice... although not quite as fast as not using a start column

That's not a special code path, but I'd guess that the last column is more likely to be still in memory instead of on disk.

Well, no need to prolong the thread, but my tests are exclusively Memtable reads (the data has not been flushed)...so there's no SSTable read involved here...which is exactly why it felt a bit funny to have that case be considerably faster. I just wanted to bring it up to you guys, in case you can think of some cause and/or potential issue.

Thanks for the responses!

Josep M.
poor Memtable performance on column slices?
Hi,

I've been doing some tests using wide rows recently, and I've seen some odd performance problems that I'd like to understand.

In particular, I've seen that the time it takes for Cassandra to perform a column slice of a single key, solely in a Memtable, seems to be very expensive and, most importantly, proportional to the ordered position where the start column of the slice lives. In other words:

1- if I start Cassandra fresh (with an empty ColumnFamily with TimeUUID comparator)
2- I create a single row with key K
3- then add 200K TimeUUID columns to key K
4- (and make sure nothing is flushed to SSTables...so it's all in the Memtable)

...I observe the following timings (seconds to perform 1000 reads) while performing multiget slices on it (pardon the pseudo-code, but you'll get the gist):

a) simply a get of the first column:
GET(K, :count => 1)  --> 2.351226

b) a slice get, starting from the first column:
GET(K, :start => '144abe16-416c-11e1-9e23-2cbae9ddfe8b', :count => 1)  --> 2.189224
-> so with or without a start column doesn't seem to make much of a difference

c) a slice get, starting from the middle of the ordered columns...approx. starting at item number 100K:
GET(K, :start => '9c13c644-416c-11e1-81dd-4ba530dc83d0', :count => 1)  --> 11.849326
-> 5 times more expensive if the start of the slice is 100K positions away

d) a slice get, starting from the last of the ordered columns...approx. position 200K:
GET(K, :start => '1c1b9b32-416d-11e1-83ff-dd2796c3abd7', :count => 1)  --> 19.889741
-> almost twice as expensive as starting the slice at position 100K, and 10 times more expensive than starting from the first one

This behavior leads me to believe that there's a clear Memtable column scan for the columns of the key.

If one does a column-name read at those positions (i.e., not a slice), the performance is constant. I.e., GET(K, '144abe16-416c-11e1-9e23-2cbae9ddfe8b'). Retrieving the first, middle or last TimeUUID is done in the same amount of time.

Having increasingly worse performance for column slices in Memtables seems to be a bit of a problem...aren't Memtables backed by a structure that has some sort of column-name indexing?...so that landing on the start column can be efficient? I'm definitely observing very high CPU utilization on those scans...By the way, with wide columns like this, slicing SSTables is quite a bit faster than slicing Memtables...I'm attributing that to the sampled index of the SSTables, hence why I'm wondering whether Memtables lack such column indexing built in and resort to linked lists of sorts.

Note that the actual timings shown are not important (it's on my laptop and I have a small amount of debugging enabled)...what is important is the difference between them. I'm using Cassandra trunk as of Dec 1st, but I believe I've done experiments with the 0.8 series too, leading to the same issue.

Cheers,

Josep M.
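A minimal sketch of the benchmark described above, via pycassa (keyspace/CF names and host are hypothetical; assumes a CF with a TimeUUIDType comparator and enough memtable headroom that nothing flushes during the run):

import time
import uuid
import pycassa

pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'WideRows')  # comparator: TimeUUIDType

key = 'K'
cols = [uuid.uuid1() for _ in range(200000)]  # ordered TimeUUID column names
for i in range(0, len(cols), 1000):           # insert in modest batches
    cf.insert(key, dict((c, 'x') for c in cols[i:i + 1000]))

def bench(label, **kwargs):
    t0 = time.time()
    for _ in range(1000):                     # 1000 reads, as in the post
        cf.get(key, column_count=1, **kwargs)
    print('%-10s %.6f' % (label, time.time() - t0))

bench('no start')                                   # case a
bench('first', column_start=cols[0])                # case b
bench('middle', column_start=cols[len(cols) // 2])  # case c
bench('last', column_start=cols[-1])                # case d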
Live migrating data from 2 separate cassandra clusters
Hi,

I am looking for an efficient way to migrate a portion of the data existing in one Cassandra cluster to another, separate Cassandra cluster. What I need is to solve the typical live migration problem that appears in any DB sharding, where you need to transfer ownership of certain rows from DB1 to DB2...but in a way that clients see no (or almost no) disruption when you actually do the cutover to DB2 for those writes. I mean doing something as typical as:

loop (until almost no rows have been modified):
    rows = SELECT * FROM T WHERE criteria matches (i.e., shard_id = 1) AND updated_at > last_time
    last_time = now
    insert(rows) elsewhere
end
...
lock modifications to original DB
do one last SELECT to get the last few modified rows
cut over the ownership (change and ensure the clients know that the new home for that data is in the other DB)
unlock modifications

So, anyway, I thought I'd be able to apply the same principles by passing a timestamp of sorts to the get_slices call, so I could further restrict it to getting only matching columns that have timestamps newer than the one passed. Now, looking at the Thrift interface, I see that there is no timestamp parameter at all...which makes me wonder how people are doing this, and if there are any well-known practices for it.

Setting up a full new replicating DC within the same cluster doesn't work, as there are some clear cases where you want to have completely separate Cassandra rings.

Cheers,

Josep M.
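A minimal sketch of the brute-force variant of the copy phase via pycassa -- brute-force because, as noted above, Thrift exposes no timestamp filter, so every row the shard owns gets copied. Cluster/keyspace/CF names and the ownership predicate are hypothetical:

import pycassa

src = pycassa.ColumnFamily(pycassa.ConnectionPool('AppKS', ['old-node:9160']), 'T')
dst = pycassa.ColumnFamily(pycassa.ConnectionPool('AppKS', ['new-node:9160']), 'T')

def owned_by_shard(key):
    # Hypothetical ownership test (e.g., the shard_id = 1 criterion above).
    return True

batch = dst.batch(queue_size=100)   # buffer writes into batch mutations
for key, columns in src.get_range():
    if owned_by_shard(key):
        batch.insert(key, columns)
batch.send()                        # flush whatever is left in the buffer

One caveat with this sketch: pycassa regenerates write timestamps on insert, which matters if clients may still be writing to the source rows during the copy.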
Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?
On Thu, Jun 23, 2011 at 5:04 AM, Peter Schuller peter.schul...@infidyne.com wrote:

> 1. Is it feasible to run directly against a Cassandra data directory restored from an EBS snapshot? (as opposed to nodetool snapshots restored from an EBS snapshot)

Assuming EBS is not buggy, including honoring write barriers, including the Linux guest kernel, etc., then yes. EBS snapshots of a single volume are promised to be atomic. As such, a restore from an EBS snapshot should be semantically identical to recovery after a power outage or sudden reboot of the node. I make no claims as to how well EBS snapshot atomicity is actually tested in practice.

EBS volume atomicity is good. We've had tons of experience since EBS came out almost 4 years ago, using it to back all kinds of things, including large DBs. One important thing to keep in mind, though, is that EBS snapshots are done at the block level, not at the filesystem level. So depending on the filesystem you have on top of the drives, you might need to tell the filesystem to "make sure this is consistent or recoverable now". For example, if you use the log-based XFS, you might need to do xfs_freeze, snapshot the disk/s, then xfs_unfreeze, to make sure that the restored filesystem data (and not only the low-level disk blocks) is recoverable when you restore it.

Snapshotting volume stripes works in exactly the same way; you just have to keep track of what position each snapshot has in the stripe, so you can recreate the stripe correctly later.

The freezing of the filesystem might cause a quick/mini hiccup, which is usually not noticeable unless you have very stringent requirements on the box (or if you have a very large stripe, and/or some sort of network issue where the calls to the Amazon endpoint are very slow...and therefore you're locking the FS a tad longer than you'd want to).

Cheers,

Josep M.
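A minimal sketch of the freeze/snapshot/unfreeze dance described above, for a single XFS-backed EBS volume, using boto plus the xfs_freeze utility. Region, volume ID and mount point are hypothetical:

import subprocess
import boto.ec2

MOUNT = '/var/lib/cassandra'   # hypothetical XFS mount point
VOLUME = 'vol-12345678'        # hypothetical EBS volume ID

conn = boto.ec2.connect_to_region('us-east-1')
subprocess.check_call(['xfs_freeze', '-f', MOUNT])      # quiesce the filesystem
try:
    # Returns almost immediately; the actual copy to S3 proceeds in the background.
    snap = conn.create_snapshot(VOLUME, 'cassandra data volume')
finally:
    subprocess.check_call(['xfs_freeze', '-u', MOUNT])  # thaw; keep the freeze short

For a stripe, freeze once, snapshot every member volume, then thaw, recording each volume's position so the stripe can be recreated correctly.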
Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?
On Thu, Jun 23, 2011 at 7:30 AM, Peter Schuller peter.schul...@infidyne.com wrote:

> EBS volume atomicity is good. We've had tons of experience since EBS came out almost 4 years ago, using it to back all kinds of things, including large DBs. One important thing to keep in mind, though, is that EBS snapshots are done at the block level, not at the filesystem level. So depending on the filesystem you have on top of the drives, you might need to tell the filesystem to "make sure this is consistent or recoverable now". For example, if you use the log-based XFS, you might need to do xfs_freeze, snapshot the disk/s, then xfs_unfreeze, to make sure that the restored filesystem data (and not only the low-level disk blocks) is recoverable when you restore it.

No. That is only required if you're doing multi-volume EBS snapshots (e.g., XFS on LVM). The entire point of an atomic snapshot is that atomicity gives a consistent snapshot; a modern filesystem which is already crash-consistent will be consistent in an atomic snapshot without additional action taken.

That said, of course, exercising those code paths regularly, rather than just on crashes, may mean that you have an elevated chance of triggering a bug that you would normally see very rarely. In that way, xfs_freeze might actually help probabilistically; however, strictly speaking, discounting bugs, a crash-consistent fs will be snapshot-consistent as well (it is logically implied).

Actually, I'm afraid that's not true (unless I'm missing something). Even if you have only one drive, you still need to stop writes to the disk for the short time it takes the low-level drivers to snapshot it (i.e., marking all blocks as clean so you can do copy-on-write later). I.e., you need to give LVM, or the EBS low-level 'modules' in the hypervisor (or whatever you use underneath...), a chance to have exclusive control of the drive for a moment. Now, that being said, some systems (like LVM) will do a freeze themselves, so technically speaking you don't need to explicitly do a freeze yourself...but that's not to say that a freeze is not required for snapshotting.

> But this all assumes the entire stack is correct, and that e.g. an fsync() propagates correctly (i.e., is not eaten by some LVM or mount option to the fs) in order to bring that consistency up to the application level.
>
> --
> / Peter Schuller
Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?
On Thu, Jun 23, 2011 at 8:02 AM, William Oberman ober...@civicscience.com wrote:

> I've been doing EBS snapshots for mysql for some time now, and was using a similar pattern as Josep (XFS with freeze, snap, unfreeze), with the extra complication that I was actually using 8 EBSs in RAID-0 (and the extra extra complication that I had to lock the MyISAM tables... glad to be moving away from that). For cassandra I switched to ephemeral disks, as per recommendations from this forum.

Yes, if you want to consistently snap MySQL you need to get it into a consistent state, so you need to do the whole FLUSH TABLES WITH READ LOCK yadda yadda, on top of the rest. Otherwise you might snapshot something that is not correct/consistent...and it's a bit more tricky when snapshotting slaves, since you need to know where they are in the replication stream...etc.

> One note on EBS snapshots though: the last time I checked (which was some time ago) I noticed degraded IO performance on the box during the snapshotting process, even though the "take snapshot" command returns almost immediately. My theory back then was that amazon does the delta/compress/store outside of the VM, but it obviously has an effect on resources on the box the VM runs on. I was doing this on a mysql slave that no one talked to, so I didn't care/bother looking into it further.

Yes, that is correct. The underlying copy-on-write-and-ship-to-EBS/S3 does have some performance impact on the running box. For the most part it's never presented a problem for us or many of our customers, although you're right, it's something you want to know about and keep in mind when designing your system (for example: snapshot slaves much more often than masters, snapshot masters when traffic is low, stagger Cassandra snaps...yadda yadda).

If you think about it, this effect is not that different from using LVM snaps on the ephemeral disks and then moving the data from the snap to another disk or to remote storage...moving those blocks would have an impact on the original LVM volume, since it's reading the same physical (ephemeral) disk/s underneath (list of clean and dirty blocks).

One case where I could see the slightly reduced IO performance being problematic is if your DB/storage is already at the edge of its I/O capacity...but in that case, the small overhead of a snapshot is probably the least of your problems :)

EBS slowness or malfunction can also impact the instance, obviously, although that is not only related to snapshots, since it can impact the actual volume regardless.

Josep M.
Re: CFHistograms?
I believe the offset values for Writes and Reads are in *micro*seconds, right? (That page talks about *milli*seconds.) Also, are any timeouts or errors reflected in those times, or just successful operations? If not, is there any JMX or other tool to keep track of them?

Josep M.

On Fri, May 6, 2011 at 9:09 AM, Jonathan Ellis jbel...@gmail.com wrote:

Those are updated at compaction time.

On Thu, May 5, 2011 at 11:38 PM, Xaero S xaeros...@gmail.com wrote:

Can someone point me to a document that explains how to interpret CFHistograms output? I went through http://narendrasharma.blogspot.com/2011/04/cassandra-07x-understanding-output-of.html which is a good beginning, but was wondering if there was anything more detailed. For example, when I run CFHistograms, I see the rowsize and columncount items in the table always at 0 (which can't be right?).

-Xaero

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Remove call vs. delete mutation
Is there anybody else that might see a problem with just using delete mutations instead of remove calls? I'm thinking about changing a Cassandra client to always use delete mutations when removing objects; that way the delete/remove call interface can be kept the same:

1- the delete/remove client call would always support all features: single-key/column, multi-column and slice-range deletes.
2- it could be used in the same way regardless of whether the calls are embedded in batch mutations or remove a single column/key.

I'd like to hear some more thoughts about this change not causing the Cassandra server to take a much higher CPU toll just because decoding mutations is much less optimized than straight removes, or something like that...(I don't think so, but...). In other words, if I do 1000 inserts or 1000 single-delete mutations, would the Cassandra server see much of a difference?

Cheers,

Josep M.

On Mon, Apr 11, 2011 at 3:49 PM, aaron morton aa...@thelastpickle.com wrote:

AFAIK both follow the same path internally.

Aaron

On 12 Apr 2011, at 06:47, Josep Blanquer wrote:

All,

From a Thrift client perspective using Cassandra, there are currently 2 options for deleting keys/columns/subcolumns:

1- One can use the remove call, which only takes a column path, so you can only delete 'one thing' at a time (an entire key, an entire supercolumn, a column or a subcolumn).
2- A delete mutation, which is more flexible as it allows deleting a list of columns, and even a slice range of them, within a single call.

The question I have is: is there a noticeable difference in performance between issuing a remove call and a mutation with a single delete? In other words, why would I use the remove call if it's much less flexible than the mutation? ...or, another way to put it: is the remove call just there for backwards compatibility, and will it be superseded by delete mutations in the future?

Cheers,

Josep M.
Remove call vs. delete mutation
All,

From a Thrift client perspective using Cassandra, there are currently 2 options for deleting keys/columns/subcolumns:

1- One can use the remove call, which only takes a column path, so you can only delete 'one thing' at a time (an entire key, an entire supercolumn, a column or a subcolumn).
2- A delete mutation, which is more flexible as it allows deleting a list of columns, and even a slice range of them, within a single call.

The question I have is: is there a noticeable difference in performance between issuing a remove call and a mutation with a single delete? In other words, why would I use the remove call if it's much less flexible than the mutation? ...or, another way to put it: is the remove call just there for backwards compatibility, and will it be superseded by delete mutations in the future?

Cheers,

Josep M.
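A minimal sketch of the two paths as they look from pycassa (keyspace/CF names hypothetical):

import pycassa

pool = pycassa.ConnectionPool('AppKS', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'Objects')

# 1- the straight remove call: one column path at a time
cf.remove('key1', columns=['col1'])

# 2- the same deletes expressed as batch mutations; several deletes
#    (even across keys) can ride in a single round trip
b = cf.batch()
b.remove('key1', columns=['col1', 'col2'])
b.remove('key2')   # no columns given: delete the entire row
b.send()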