Re: Mass deletion -- slowing down

Maxim Potekhin Mon, 14 Nov 2011 13:09:52 -0800

Thanks for the note. Ideally I would not like to keep track of what is the oldest indexed date, because this means that I'm already creating a bit of infrastructure on top of my database, with attendant referential integrity problems.

But I suppose I'll be forced to do that. In addition, I'll have to wait until the grace period is over and compact, removing the tombstones and finally clearing the disk (which is what I need to do in the first place).

Frankly, this whole situation for me illustrates a very real deficiency in Cassandra -- one would think that deleting less than one percent of data shouldn't really lead to complete failures in certain indexed queries.

That's bad.

Maxim



On 11/14/2011 3:01 AM, Guy Incognito wrote:

i think what he means is... do you know what day the 'oldest' day is? e.g. if you have a rolling window of say 2 weeks, structure your query so that your slice range only goes back 2 weeks, rather than to the beginning of time. this would avoid iterating over all the tombstones from prior to the 2 week window. this wouldn't work if you are deleting arbitrary days in the middle of your date range.
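
The rolling-window idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `days_in_window` is hypothetical, the window is assumed to include today, and the data is assumed to be indexed by calendar day. The application would then issue one indexed query per returned day instead of scanning back to the beginning of time.

```python
from datetime import date, timedelta


def days_in_window(today, window_days=14):
    """Return the calendar days in the rolling window, oldest first.

    Hypothetical helper: the window ends at `today` (inclusive) and
    covers `window_days` days in total.
    """
    start = today - timedelta(days=window_days - 1)
    return [start + timedelta(days=i) for i in range(window_days)]


# Days that an indexed query would be restricted to; anything older
# (and its tombstones) is never asked for.
window = days_in_window(date(2011, 11, 14), window_days=14)
```

As noted in the message, this only helps when deletions happen at the old end of the range; days deleted in the middle of the window would still be queried.
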
On 14/11/2011 02:02, Maxim Potekhin wrote:
Thanks Peter,

I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate, in the sense that chronologically "newer" entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that the jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly.

Also, I note that queries on different dates (i.e. not "contaminated" with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far.

Theoretically -- would compaction or cleanup help?

Thanks

Maxim




On 11/13/2011 8:39 PM, Peter Schuller wrote:
I do limit the number of rows I'm asking for in Pycassa. Queries on primary
keys still work fine,
Is it feasible in your situation to keep track of the oldest possible
data (for example, if there is a single sequential writer that rotates
old entries away it could keep a record of what the oldest might be)
so that you can bound your index lookup to >= that value (and avoid the
tombstones)?
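
A minimal sketch of the bookkeeping suggested above, assuming a single writer and some comparable value (a date, for example) that orders the entries. The class and method names are hypothetical, and in practice the marker would have to be persisted somewhere rather than held in memory.

```python
class OldestLiveMarker:
    """Records the oldest value that might still hold live data.

    Hypothetical helper for a single sequential writer: each time old
    entries are rotated away, the writer advances the marker, and
    readers use it as the lower bound of their index lookups.
    """

    def __init__(self, initial):
        self._oldest = initial

    def rotated_away(self, new_oldest):
        # Only move forward; a stale update must not lower the bound.
        if new_oldest > self._oldest:
            self._oldest = new_oldest

    def lower_bound(self):
        # Lookups bounded to >= this value skip the deleted range.
        return self._oldest
```

The marker is deliberately conservative: it may lag behind the true oldest live entry, which costs a few extra tombstones, but it never excludes live data.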
