Re: Mass deletion -- slowing down
I think what he means is... do you know what day the 'oldest' day is? E.g. if you have a rolling window of, say, 2 weeks, structure your query so that your slice range only goes back 2 weeks, rather than to the beginning of time. This would avoid iterating over all the tombstones from before the 2-week window. It wouldn't work if you are deleting arbitrary days in the middle of your date range.

On 14/11/2011 02:02, Maxim Potekhin wrote:
> Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that the jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly. Also, I note that queries on different dates (i.e. not contaminated with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far. Theoretically -- would compaction or cleanup help? Thanks, Maxim
>
> On 11/13/2011 8:39 PM, Peter Schuller wrote:
>>> I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine.
>> Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?
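One way this rolling-window idea could look in pycassa, as a hedged sketch: it assumes a hypothetical time-ordered row ('activity_by_day', keyed by facility, with date-string column names), none of which comes from the actual schema under discussion.

    from datetime import date, timedelta
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    activity = ColumnFamily(pool, 'activity_by_day')

    # Start the slice at the edge of the rolling window instead of at the
    # beginning of time, so tombstones older than the window are never iterated.
    window_start = (date.today() - timedelta(days=14)).strftime('%Y-%m-%d')
    recent = activity.get('facility_1', column_start=window_start,
                          column_count=10000)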
Re: Mass deletion -- slowing down
Thanks for the note. Ideally I would not like to keep track of what the oldest indexed date is, because this means that I'm already creating a bit of infrastructure on top of my database, with attendant referential integrity problems. But I suppose I'll be forced to do that. In addition, I'll have to wait until the grace period is over and compact, removing the tombstones and finally clearing the disk (which is what I need to do in the first place). Frankly, this whole situation for me illustrates a very real deficiency in Cassandra -- one would think that deleting less than one percent of the data shouldn't really lead to complete failures in certain indexed queries. That's bad. Maxim

On 11/14/2011 3:01 AM, Guy Incognito wrote:
> I think what he means is... do you know what day the 'oldest' day is? E.g. if you have a rolling window of, say, 2 weeks, structure your query so that your slice range only goes back 2 weeks, rather than to the beginning of time. This would avoid iterating over all the tombstones from before the 2-week window. It wouldn't work if you are deleting arbitrary days in the middle of your date range.
>
> On 14/11/2011 02:02, Maxim Potekhin wrote:
>> Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that the jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly. Also, I note that queries on different dates (i.e. not contaminated with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far. Theoretically -- would compaction or cleanup help? Thanks, Maxim
>>
>> On 11/13/2011 8:39 PM, Peter Schuller wrote:
>>>> I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine.
>>> Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?
Re: Mass deletion -- slowing down
I've done more experimentation and the behavior persists: I start with a normal dataset which is searchable by a secondary index. I select by that index the entries that match a certain criterion, then delete those. I tried two methods of deletion -- individual cf.remove() calls as well as batch removal in Pycassa. What happens after that is as follows: attempts to read the same CF, using the same index values, start to time out in the Pycassa client (there is a Thrift message about the timeout). The entries not touched by such attempted deletion are still read just fine. Has anyone seen such behavior? Thanks, Maxim

On 11/10/2011 8:30 PM, Maxim Potekhin wrote:
> Hello. My data load comes in batches, each representing one day in the life of a large computing facility. I index the data by the day it was produced, to be able to quickly pull data for a specific day within the last year or two. There are 6 other indexes. When it comes to retiring the data, I intend to delete it for the oldest date and after that add a fresh batch of data, so I control the disk space. Therein lies a problem -- and it may be Pycassa related, so I also filed an issue on GitHub -- when I select by 'DATE=blah' and then do a batch remove, it works fine for a while, and then after a few thousand deletions (done in batches of 1000) it grinds to a halt, i.e. I can no longer iterate the result, which manifests as a timeout error. Is that a behavior seen before? Cassandra version is 0.8.6, Pycassa 1.3.0. TIA, Maxim
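For concreteness, a minimal pycassa sketch of the select-then-batch-remove pattern described above; the keyspace 'MyKeyspace', the column family 'jobs', and the literal date value are assumptions made for illustration, not the actual schema.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.index import create_index_expression, create_index_clause

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    jobs = ColumnFamily(pool, 'jobs')

    # Find the rows for the day being retired via the secondary index on DATE.
    clause = create_index_clause([create_index_expression('DATE', '2011-11-10')],
                                 count=1000)
    keys_to_delete = [key for key, _ in jobs.get_indexed_slices(clause)]

    # Queue the row deletions and send them as one batch mutation.
    batch = jobs.batch(queue_size=1000)
    for key in keys_to_delete:
        batch.remove(key)
    batch.send()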
Re: Mass deletion -- slowing down
On Sun, Nov 13, 2011 at 5:57 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
> I've done more experimentation and the behavior persists: I start with a normal dataset which is searchable by a secondary index. I select by that index the entries that match a certain criterion, then delete those. I tried two methods of deletion -- individual cf.remove() calls as well as batch removal in Pycassa. What happens after that is as follows: attempts to read the same CF, using the same index values, start to time out in the Pycassa client (there is a Thrift message about the timeout). The entries not touched by such attempted deletion are still read just fine. Has anyone seen such behavior?

What you're probably running into is a huge amount of tombstone filtering on the read (see http://wiki.apache.org/cassandra/DistributedDeletes). Since you're dealing with time-series data, using a row-bucketing technique like http://rubyscale.com/2011/basic-time-series-with-cassandra/ might help by eliminating the need for an index.

-Brandon
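A rough sketch of what row bucketing along the lines of that link could look like in pycassa, assuming a hypothetical 'jobs_by_day' column family with the day as the row key and the job ID in the column name (not part of the actual schema being discussed):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    jobs_by_day = ColumnFamily(pool, 'jobs_by_day')

    # Write path: the day is the row key, so no secondary index on DATE is
    # needed to find everything belonging to a given day.
    jobs_by_day.insert('2011-11-10', {'job:0001234': 'finished'})

    # Read path: pulling back one day's data is a plain row slice.
    one_day = jobs_by_day.get('2011-11-10', column_count=10000)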
Re: Mass deletion -- slowing down
Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes) and under some circumstances reads can turn O(n) with respect to the number of columns deleted, depending on the access pattern. It sounds like this is what you're seeing.

For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra, the row now looks like: tomb, tomb, tomb, tomb, actual data. If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones (for the PostgreSQL-aware: this is similar to the effect you can get when implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum - improved in modern versions).

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
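To make that pattern concrete, a small pycassa illustration of the insert/delete cycle being described; the 'wide_rows' column family, the row key, and the column naming are invented for the example.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, 'wide_rows')

    # Insert 100 non-overlapping batches of 1000 columns into one row, deleting
    # every batch except the last. The row ends up as ~99,000 tombstones
    # followed by 1,000 live columns.
    for batch_no in range(100):
        cols = dict(('c%08d' % (batch_no * 1000 + i), 'x') for i in range(1000))
        cf.insert('queue_row', cols)
        if batch_no < 99:
            cf.remove('queue_row', columns=list(cols))

    # A slice starting at the head of the row has to walk all those tombstones
    # before it finds live data -- the O(n) effect described above.
    head = cf.get('queue_row', column_count=10)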
Re: Mass deletion -- slowing down
Thanks to all for the valuable insight! Two comments: a) this is not actually time-series data, but yes, each item has a timestamp and thus chronological attribution; b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operational procedure? For some reason I can still select on the index in the CLI; it's the Pycassa module that gives me trouble, but I need it as this is my platform and we are a Python shop. Maxim

On 11/13/2011 7:22 PM, Peter Schuller wrote:
> Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes) and under some circumstances reads can turn O(n) with respect to the number of columns deleted, depending on the access pattern. It sounds like this is what you're seeing.
>
> For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra, the row now looks like: tomb, tomb, tomb, tomb, actual data. If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones (for the PostgreSQL-aware: this is similar to the effect you can get when implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum - improved in modern versions).
Re: Mass deletion -- slowing down
On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
> Thanks to all for the valuable insight! Two comments: a) this is not actually time-series data, but yes, each item has a timestamp and thus chronological attribution; b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operational procedure?

I'd have to know more about what your access pattern is like to give you a fully informed answer.

> For some reason I can still select on the index in the CLI; it's the Pycassa module that gives me trouble, but I need it as this is my platform and we are a Python shop.

This seems odd, since the rpc_timeout is the same for all clients. Maybe pycassa is asking for more data than the CLI?

-Brandon
Re: Mass deletion -- slowing down
Brandon, thanks for the note. Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into buckets, each representing one day of the system's activity. I create the DATE attribute and add it to each row, e.g. it's a column {'DATE','2013'}. I create an index on that column, along with a few others.

Now, I want to rotate the data out of my database on a daily basis. For that, I need to select on 'DATE' and then do a delete. I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine; it's just the indexed queries that start to time out. I changed the timeouts and the number of retries in the Pycassa pool, but that doesn't seem to help. Thanks, Maxim

On 11/13/2011 8:00 PM, Brandon Williams wrote:
> On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
>> Thanks to all for the valuable insight! Two comments: a) this is not actually time-series data, but yes, each item has a timestamp and thus chronological attribution; b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operational procedure?
>
> I'd have to know more about what your access pattern is like to give you a fully informed answer.
>
>> For some reason I can still select on the index in the CLI; it's the Pycassa module that gives me trouble, but I need it as this is my platform and we are a Python shop.
>
> This seems odd, since the rpc_timeout is the same for all clients. Maybe pycassa is asking for more data than the CLI?
>
> -Brandon
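For reference, a hedged sketch of how a DATE index like the one described could be set up through pycassa's SystemManager; the keyspace, column family, and index names are assumptions made for illustration. Querying it would then go through get_indexed_slices as in the earlier sketch.

    from pycassa.system_manager import SystemManager, UTF8_TYPE, KEYS_INDEX

    sys_mgr = SystemManager('localhost:9160')
    # KEYS indexes are the only secondary index type available in Cassandra 0.8.x.
    sys_mgr.create_index('MyKeyspace', 'jobs', 'DATE', UTF8_TYPE,
                         index_type=KEYS_INDEX, index_name='jobs_date_idx')
    sys_mgr.close()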
Re: Mass deletion -- slowing down
On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
> Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into buckets, each representing one day of the system's activity. I create the DATE attribute and add it to each row, e.g. it's a column {'DATE','2013'}.

Hmm, so why is pushing this into the row key and then deleting the entire row not acceptable? (This is what the link I gave would prescribe.) In other words, you bucket at the row level instead of relying on a column attribute that needs an index.

-Brandon
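As a sketch of what this suggests, retiring a day under row-level bucketing would reduce to a single remove on the (hypothetical) per-day row, rather than an indexed lookup followed by hundreds of thousands of individual deletions:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    jobs_by_day = ColumnFamily(pool, 'jobs_by_day')

    # One row-level deletion retires the whole day's bucket.
    jobs_by_day.remove('2011-10-10')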
Re: Mass deletion -- slowing down
> I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine.

Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
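One possible pycassa rendering of this suggestion, with heavy caveats: 'STATUS' stands in for whichever indexed attribute is actually being queried, the oldest-kept value would have to be tracked by the application itself, and Cassandra 0.8 still requires at least one equality expression on an indexed column in every index clause.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.index import create_index_expression, create_index_clause, EQ, GTE

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    jobs = ColumnFamily(pool, 'jobs')

    oldest_kept = '2011-11-01'   # bookkeeping value maintained outside Cassandra
    exprs = [
        create_index_expression('STATUS', 'finished', op=EQ),  # the indexed lookup
        create_index_expression('DATE', oldest_kept, op=GTE),  # lower-bound filter
    ]
    clause = create_index_clause(exprs, count=1000)
    for key, cols in jobs.get_indexed_slices(clause):
        pass  # only rows at or after the retained window are returned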
Re: Mass deletion -- slowing down
Brandon, it won't work in my application, as I need a few indexes on attributes of the job. In addition, a large portion of the queries is based on key-value lookup, and that key is the unique job ID. I really can't have the data packed into one row per day. Thanks, Maxim

On 11/13/2011 8:34 PM, Brandon Williams wrote:
> On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
>> Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into buckets, each representing one day of the system's activity. I create the DATE attribute and add it to each row, e.g. it's a column {'DATE','2013'}.
>
> Hmm, so why is pushing this into the row key and then deleting the entire row not acceptable? (This is what the link I gave would prescribe.) In other words, you bucket at the row level instead of relying on a column attribute that needs an index.
>
> -Brandon
Re: Mass deletion -- slowing down
Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that the jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly.

Also, I note that queries on different dates (i.e. not contaminated with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far. Theoretically -- would compaction or cleanup help? Thanks, Maxim

On 11/13/2011 8:39 PM, Peter Schuller wrote:
>> I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine.
> Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?
Re: Mass deletion -- slowing down
> I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that the jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly.

I was hoping there would be a wave of deletions that matched the order of the index (whatever is being read that is subject to the tombstones). If not, then my suggestion doesn't apply. Are you using Cassandra secondary indexes or maintaining your own index, btw?

> Theoretically -- would compaction or cleanup help?

Not directly. The only way to eliminate tombstones is for them to (1) expire according to gc grace seconds (again, see http://wiki.apache.org/cassandra/DistributedDeletes) and then (2) be removed by compaction. So while decreasing the gc grace period might mitigate the problem somewhat, I would advise against going that route, since it doesn't solve the fundamental problem and it can be dangerous: gc grace has the usual implications for how often anti-entropy/repair must be run, and a cluster which is super-sensitive to a small grace time becomes a lot more volatile if e.g. you have repair problems and must temporarily increase gc grace. It seems better to figure out some way of structuring the data so that the reads in question do not suffer from this problem.

Note that reading individual columns should still scale well despite tombstones, as should slicing, as long as the slices you're reading are reasonably dense (in terms of data vs. tombstone ratio), even if the surrounding data is sparse. How many entries are you reading per query? I have been presuming it's the index read that is causing the timeout, rather than the reading of the individual matching columns, since the maximum per-column penalty when reading individual columns is finite, regardless of the sparsity of the data.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
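For completeness, the gc grace knob referred to in step (1) can be changed through pycassa's SystemManager, though as noted above this mitigates rather than solves the problem; step (2) still needs a compaction (e.g. via nodetool) before the expired tombstones actually disappear. The keyspace and column family names below are assumptions.

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('localhost:9160')
    # 864000 seconds (10 days) is the default grace period; lowering it makes
    # tombstones eligible for removal sooner, at the cost of the tighter repair
    # requirements discussed above.
    sys_mgr.alter_column_family('MyKeyspace', 'jobs', gc_grace_seconds=864000)
    sys_mgr.close()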