Re: Mass deletion -- slowing down

2011-11-14 Thread Guy Incognito
I think what he means is: do you know what the 'oldest' day is? E.g.
if you have a rolling window of, say, 2 weeks, structure your query so
that your slice range only goes back 2 weeks rather than to the
beginning of time. This would avoid iterating over all the tombstones
from before the 2-week window. This wouldn't work if you are deleting
arbitrary days in the middle of your date range.
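
A rough pycassa sketch of such a bounded query (the keyspace, CF name,
and date format are assumptions, not from this thread):

    # Sketch: only touch index values inside the rolling window, so the
    # tombstone-laden index entries for purged days are never scanned.
    from datetime import date, timedelta

    import pycassa
    from pycassa.index import create_index_expression, create_index_clause

    pool = pycassa.ConnectionPool('MyKeyspace')   # assumed keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')     # assumed CF

    WINDOW_DAYS = 14
    live_days = [(date.today() - timedelta(days=n)).strftime('%y%m%d')
                 for n in range(WINDOW_DAYS)]

    for day in live_days:
        # one EQ lookup per day still inside the window; never reach
        # back to the beginning of time
        clause = create_index_clause(
            [create_index_expression('DATE', day)], count=1000)
        for key, columns in jobs.get_indexed_slices(clause):
            pass  # process the row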


On 14/11/2011 02:02, Maxim Potekhin wrote:

Thanks Peter,

I'm not sure I entirely follow. By the oldest data, do you mean the
primary key corresponding to the limit of the time horizon? Unfortunately,
unique IDs and the timestamps do not correlate, in the sense that
chronologically newer entries might have a smaller sequential ID. That's
because the timestamp corresponds to the last update, which is stochastic
in the sense that the jobs can take from seconds to days to complete. As I
said, I'm not sure I understood you correctly.

Also, I note that queries on different dates (i.e. not contaminated with
lots of tombstones) work just fine, which is consistent with the picture
that has emerged so far.

Theoretically -- would compaction or cleanup help?

Thanks

Maxim




On 11/13/2011 8:39 PM, Peter Schuller wrote:
I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine,

Is it feasible in your situation to keep track of the oldest possible
data (for example, if there is a single sequential writer that rotates
old entries away, it could keep a record of what the oldest might be)
so that you can bound your index lookup >= that value (and avoid the
tombstones)?







Re: Mass deletion -- slowing down

2011-11-14 Thread Maxim Potekhin
Thanks for the note. Ideally I would not like to keep track of what the
oldest indexed date is, because this means that I'm already creating a
bit of infrastructure on top of my database, with attendant referential
integrity problems.

But I suppose I'll be forced to do that. In addition, I'll have to wait
until the grace period is over and compact, removing the tombstones and
finally clearing the disk (which is what I need to do in the first place).


Frankly, this whole situation for me illustrates a very real deficiency
in Cassandra -- one would think that deleting less than one percent of
the data shouldn't lead to complete failures in certain indexed queries.

That's bad.

Maxim



On 11/14/2011 3:01 AM, Guy Incognito wrote:
I think what he means is: do you know what the 'oldest' day is? E.g.
if you have a rolling window of, say, 2 weeks, structure your query so
that your slice range only goes back 2 weeks rather than to the
beginning of time. This would avoid iterating over all the tombstones
from before the 2-week window. This wouldn't work if you are deleting
arbitrary days in the middle of your date range.


On 14/11/2011 02:02, Maxim Potekhin wrote:

Thanks Peter,

I'm not sure I entirely follow. By the oldest data, do you mean the
primary key corresponding to the limit of the time horizon? Unfortunately,
unique IDs and the timestamps do not correlate, in the sense that
chronologically newer entries might have a smaller sequential ID. That's
because the timestamp corresponds to the last update, which is stochastic
in the sense that the jobs can take from seconds to days to complete. As I
said, I'm not sure I understood you correctly.

Also, I note that queries on different dates (i.e. not contaminated with
lots of tombstones) work just fine, which is consistent with the picture
that has emerged so far.

Theoretically -- would compaction or cleanup help?

Thanks

Maxim




On 11/13/2011 8:39 PM, Peter Schuller wrote:
I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine,

Is it feasible in your situation to keep track of the oldest possible
data (for example, if there is a single sequential writer that rotates
old entries away, it could keep a record of what the oldest might be)
so that you can bound your index lookup >= that value (and avoid the
tombstones)?







Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
I've done more experimentation and the behavior persists: I start with a
normal dataset which is searchable by a secondary index. I select by
that index the entries that match a certain criterion, then delete
those. I tried two methods of deletion -- individual cf.remove() as well
as batch removal in Pycassa.
What happens after that is as follows: attempts to read the same CF,
using the same index values, start to time out in the Pycassa client
(there is a thrift message about the timeout). The entries not touched
by such attempted deletion are still read just fine.
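
For reference, the two deletion paths look roughly like this in pycassa
(the keyspace, CF name, and date value here are assumptions):

    # Sketch of the delete pass: find matching rows via the secondary
    # index, then remove them either one by one or through a batch mutator.
    import pycassa
    from pycassa.index import create_index_expression, create_index_clause

    pool = pycassa.ConnectionPool('MyKeyspace')
    jobs = pycassa.ColumnFamily(pool, 'jobs')

    clause = create_index_clause(
        [create_index_expression('DATE', '111113')], count=1000)

    # method 1: individual removal
    for key, _ in jobs.get_indexed_slices(clause):
        jobs.remove(key)                  # whole-row tombstone

    # method 2: batch removal, flushed every 1000 mutations
    batch = jobs.batch(queue_size=1000)
    for key, _ in jobs.get_indexed_slices(clause):
        batch.remove(key)
    batch.send()                          # flush the remainder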


Has anyone seen such behavior?

Thanks,
Maxim

On 11/10/2011 8:30 PM, Maxim Potekhin wrote:

Hello,

My data load comes in batches representing one day in the life of a
large computing facility. I index the data by the day it was produced,
to be able to quickly pull data for a specific day within the last year
or two. There are 6 other indexes.

When it comes to retiring the data, I intend to delete it for the
oldest date and after that add a fresh batch of data, so that I control
the disk space. Therein lies a problem -- and it may be Pycassa related,
so I also filed an issue on github -- when I select by 'DATE=blah' and
then do a batch remove, it works fine for a while, and then after a few
thousand deletions (done in batches of 1000) it grinds to a halt, i.e. I
can no longer iterate over the result, which manifests in a timeout error.

Is that a behavior seen before? Cassandra version is 0.8.6, Pycassa 1.3.0.


TIA,

Maxim




Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 5:57 PM, Maxim Potekhin potek...@bnl.gov wrote:
 I've done more experimentation and the behavior persists: I start with a
 normal dataset which is searchable by a secondary index. I select by that
 index the entries that match a certain criterion, then delete those. I tried
 two methods of deletion -- individual cf.remove() as well as batch removal
 in Pycassa.
 What happens after that is as follows: attempts to read the same CF, using
 the same index values, start to time out in the Pycassa client (there is a
 thrift message about the timeout). The entries not touched by such attempted
 deletion are still read just fine.

 Has anyone seen such behavior?

What you're probably running into is a huge amount of tombstone
filtering on the read (see
http://wiki.apache.org/cassandra/DistributedDeletes).

Since you're dealing with timeseries data, using a row-bucketing
technique like http://rubyscale.com/2011/basic-time-series-with-cassandra/
might help by eliminating the need for an index.

-Brandon


Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
Deletions in Cassandra imply the use of tombstones (see
http://wiki.apache.org/cassandra/DistributedDeletes), and under some
circumstances reads can turn O(n) with respect to the number of
columns deleted. It sounds like this is what you're seeing.

For example, suppose you're inserting a range of columns into a row,
deleting it, and inserting another non-overlapping subsequent range.
Repeat that a bunch of times. In terms of what's stored in Cassandra
for the row you now have:

  tomb
  tomb
  tomb
  tomb
  ...
  actual data

If you then do something like a slice on that row with the end-points
being such that they include all the tombstones, Cassandra essentially
has to read through and process all those tombstones (for the
PostgreSQL-aware: this is similar to the effect you can get when
implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect
to the number of deleted entries until the last vacuum -- improved in
modern versions).
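
To make the effect concrete, a pycassa-flavored sketch (the row key,
CF name, and column bounds are assumptions):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    cf = pycassa.ColumnFamily(pool, 'timeline')

    # Unbounded slice: the read must walk every tombstone sitting at
    # the head of the row before it finds live data.
    cf.get('some_row', column_count=1000)

    # Bounded slice: starting at the first column known to be live keeps
    # the read proportional to live data, not to deleted history.
    cf.get('some_row', column_start='20111101', column_count=1000)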


-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin

Thanks to all for valuable insight!

Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.

b) so, what do you practically recommend? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operation procedure?

For some reason I can still select on the index in the CLI; it's
the Pycassa module that gives me trouble, but I need it, as this
is my platform and we are a Python shop.

Maxim



On 11/13/2011 7:22 PM, Peter Schuller wrote:

Deletions in Cassandra imply the use of tombstones (see
http://wiki.apache.org/cassandra/DistributedDeletes), and under some
circumstances reads can turn O(n) with respect to the number of
columns deleted. It sounds like this is what you're seeing.

For example, suppose you're inserting a range of columns into a row,
deleting it, and inserting another non-overlapping subsequent range.
Repeat that a bunch of times. In terms of what's stored in Cassandra
for the row you now have:

   tomb
   tomb
   tomb
   tomb
   ...
   actual data

If you then do something like a slice on that row with the end-points
being such that they include all the tombstones, Cassandra essentially
has to read through and process all those tombstones (for the
PostgreSQL-aware: this is similar to the effect you can get when
implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect
to the number of deleted entries until the last vacuum -- improved in
modern versions).






Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin potek...@bnl.gov wrote:
 Thanks to all for valuable insight!

 Two comments:
 a) this is not actually time series data, but yes, each item has
 a timestamp and thus chronological attribution.

 b) so, what do you practically recommend? I need to delete
 half a million to a million entries daily, then insert fresh data.
 What's the right operation procedure?

I'd have to know more about what your access pattern is like to give
you a fully informed answer.

 For some reason I can still select on the index in the CLI, it's
 the Pycassa module that gives me trouble, but I need it as this
 is my platform and we are a Python shop.

This seems odd, since the rpc_timeout is the same for all clients.
Maybe pycassa is asking for more data than the cli?

-Brandon


Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin

Brandon,

thanks for the note.

Each row represents a computational task (a job) executed on the grid or
in the cloud. It naturally has a timestamp as one of its attributes,
representing the time of the last update. This timestamp is used to group
the data into buckets, each representing one day in the system's activity.
I create the DATE attribute and add it to each row, e.g. it's a column
{'DATE','2013'}.

I create an index on that column, along with a few others.
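
In pycassa terms, that layout looks roughly like this (keyspace/CF names
and the value format are assumptions):

    import pycassa
    from pycassa.system_manager import SystemManager, UTF8_TYPE

    # declare a secondary index on the DATE column (one-time setup)
    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_index('MyKeyspace', 'jobs', 'DATE', UTF8_TYPE)
    sys_mgr.close()

    # each job row carries its day bucket as a plain column
    pool = pycassa.ConnectionPool('MyKeyspace')
    jobs = pycassa.ColumnFamily(pool, 'jobs')
    jobs.insert('job_12345', {'DATE': '111113', 'STATUS': 'finished'})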

Now, I want to rotate the data out of my database on a daily basis. For
that, I need to select on 'DATE' and then do a delete.

I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine; it's just the indexed queries that start
to time out. I changed the timeouts and number of retries in the Pycassa
pool, but that doesn't seem to help.
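
For concreteness, those pool settings in pycassa look roughly like this
(the numbers and names are assumptions):

    import pycassa

    pool = pycassa.ConnectionPool(
        'MyKeyspace',
        server_list=['localhost:9160'],
        timeout=30,       # seconds per request; the default is 0.5
        max_retries=10)   # per-operation retries before giving up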

Thanks,
Maxim

On 11/13/2011 8:00 PM, Brandon Williams wrote:

On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin potek...@bnl.gov wrote:

Thanks to all for valuable insight!

Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.

b) so, what do you practically recommend? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operation procedure?

I'd have to know more about what your access pattern is like to give
you a fully informed answer.


For some reason I can still select on the index in the CLI, it's
the Pycassa module that gives me trouble, but I need it as this
is my platform and we are a Python shop.

This seems odd, since the rpc_timeout is the same for all clients.
Maybe pycassa is asking for more data than the cli?

-Brandon




Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin potek...@bnl.gov wrote:
 Each row represents a computational task (a job) executed on the grid or in
 the cloud. It naturally has a timestamp as one of its attributes,
 representing the time of the last update. This timestamp
 is used to group the data into buckets each representing one day in the
 system's activity.
 I create the DATE attribute and add it to each row, e.g. it's a column
 {'DATE','2013'}.

Hmm, so why is pushing this into the row key and then deleting the
entire row not acceptable? (This is what the link I gave would
prescribe.) In other words, you bucket at the row level instead of
relying on a column attribute that needs an index.
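
A minimal sketch of that row-level bucketing in pycassa (all names here
are assumptions):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    by_day = pycassa.ColumnFamily(pool, 'jobs_by_day')

    # write: each day's jobs live in one wide row keyed by the date
    by_day.insert('20111113', {'job_12345': 'serialized job payload'})

    # read: slice a single day's row, no secondary index needed
    by_day.get('20111113', column_count=1000)

    # retire: dropping the whole day is one row deletion, not a million
    # column tombstones sprinkled through an index
    by_day.remove('20111113')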

-Brandon


Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
 I do limit the number of rows I'm asking for in Pycassa. Queries on primary
 keys still work fine,

Is it feasible in your situation to keep track of the oldest possible
data (for example, if there is a single sequential writer that rotates
old entries away, it could keep a record of what the oldest might be)
so that you can bound your index lookup >= that value (and avoid the
tombstones)?
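
A sketch of that bookkeeping in pycassa (the metadata CF and all names
are assumptions):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    meta = pycassa.ColumnFamily(pool, 'meta')

    # the rotating writer records the oldest surviving day after each purge
    meta.insert('retention', {'oldest_day': '111101'})

    # readers bound their lookups at that value and skip anything older
    oldest = meta.get('retention')['oldest_day']
    requested_day = '111030'
    if requested_day >= oldest:
        pass  # safe to query the index for requested_day
    # else: the day is already purged; don't walk its tombstones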

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin

Brandon,

it won't work in my application, as I need a few indexes on attributes
of the job. In addition, a large portion of queries is based on key-value
lookup, and that key is the unique job ID. I really can't have data packed
in one row per day.


Thanks,
Maxim

On 11/13/2011 8:34 PM, Brandon Williams wrote:

On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin potek...@bnl.gov wrote:

Each row represents a computational task (a job) executed on the grid or in
the cloud. It naturally has a timestamp as one of its attributes,
representing the time of the last update. This timestamp
is used to group the data into buckets each representing one day in the
system's activity.
I create the DATE attribute and add it to each row, e.g. it's a column
{'DATE','2013'}.

Hmm, so why is pushing this into the row key and then deleting the
entire row not acceptable? (this is what the link I gave would
prescribe)  In other words, you bucket at the row level, instead of
relying on a column attribute that needs an index.

-Brandon




Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin

Thanks Peter,

I'm not sure I entirely follow. By the oldest data, do you mean the
primary key corresponding to the limit of the time horizon? Unfortunately,
unique IDs and the timestamps do not correlate, in the sense that
chronologically newer entries might have a smaller sequential ID. That's
because the timestamp corresponds to the last update, which is stochastic
in the sense that the jobs can take from seconds to days to complete. As I
said, I'm not sure I understood you correctly.

Also, I note that queries on different dates (i.e. not contaminated with
lots of tombstones) work just fine, which is consistent with the picture
that has emerged so far.

Theoretically -- would compaction or cleanup help?

Thanks

Maxim




On 11/13/2011 8:39 PM, Peter Schuller wrote:

I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine,

Is it feasible in your situation to keep track of the oldest possible
data (for example, if there is a single sequential writer that rotates
old entries away, it could keep a record of what the oldest might be)
so that you can bound your index lookup >= that value (and avoid the
tombstones)?





Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
 I'm not sure I entirely follow. By the oldest data, do you mean the
 primary key corresponding to the limit of the time horizon? Unfortunately,
 unique IDs and the timestamps do not correlate, in the sense that
 chronologically newer entries might have a smaller sequential ID. That's
 because the timestamp corresponds to the last update, which is stochastic
 in the sense that the jobs can take from seconds to days to complete. As I
 said, I'm not sure I understood you correctly.

I was hoping there would be a wave of deletions that matched the
order of the index (whatever is being read that is subject to the
tombstones). If not, then my suggestion doesn't apply. Are you using
Cassandra secondary indexes or maintaining your own index, btw?

 Theoretically -- would compaction or cleanup help?

Not directly. The only way to eliminate tombstones is for them to (1)
expire according to gc grace seconds (again, see
http://wiki.apache.org/cassandra/DistributedDeletes) and then (2) be
removed by compaction.

So while decreasing the gc grace period might mitigate it somewhat, I
would advise against going that route since it doesn't solve the
fundamental problem, and it can be dangerous: gc grace has the usual
implications for how often anti-entropy/repair must be run, and a
cluster that is sensitive to a small grace time becomes a lot more
fragile if e.g. you have repair problems and must temporarily
increase gc grace.

It seems better to figure out some way of structuring the data so that
the reads in question do not suffer from this problem.

Note that reading individual columns should still scale well despite
tombstones, as should slicing, as long as the slices you're reading are
reasonably dense (in terms of the data vs. tombstone ratio), even if
surrounding data is sparse.

How many entries are you reading per query? I have been presuming it's
the index read that is causing the timeout rather than the reading of
the individual matching columns, since the maximum per-column penalty
when reading individual columns is finite, regardless of the sparsity
of the data.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)