Re: [web2py] Re: delete on GAE

2012-11-01 Thread howesc
so it turns out that GAE itself fails when i pass an iterator over a large 
list to gae.delete().  so i've tweaked the implementation to not call 
count, but to still count the number of entries deleted and it seems to be 
working.
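
A minimal sketch of that idea, not the patch attached to the issue: consume the keys iterator in fixed-size batches, delete each batch, and keep a running total. `delete_in_batches`, `batch_size` (an arbitrary illustrative value) and the in-memory `store` are hypothetical stand-ins; on GAE, `keys` would come from the keys-only query and `delete_fn` would be the datastore delete call.

```python
from itertools import islice


def delete_in_batches(keys, delete_fn, batch_size=500):
    """Delete keys in batches and return how many were deleted.

    keys      -- any iterable of keys (e.g. a keys-only query iterator)
    delete_fn -- callable taking a list of keys (stand-in for gae.delete)
    """
    iterator = iter(keys)
    deleted = 0
    while True:
        # Pull at most batch_size keys so no single delete call is huge.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        delete_fn(batch)
        deleted += len(batch)
    return deleted


# In-memory stand-in for the datastore, for illustration only.
store = set(range(2500))


def fake_delete(batch):
    store.difference_update(batch)


count = delete_in_batches(iter(sorted(store)), fake_delete)
```

The count falls out of the batch sizes, so no separate count() query is needed.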

suggested patch included 
in http://code.google.com/p/web2py/issues/detail?id=1134

thanks!

cfh

-- 





[web2py] Re: delete on GAE

2012-10-20 Thread Massimo Di Pierro
Delete should return the number of deleted records. What is your proposal?

On Wednesday, 17 October 2012 17:30:22 UTC-5, howesc wrote:

 Hi all,

 I'm trying to clean up old expired sessions, but i waited a long time 
 to get to this and now my GAE delete is just timing out.  Reading the GAE 
 docs, there appear to be some improvements that we can make to the query 
 delete method on GAE that will make it faster and cheaper.  what we lose 
 then is the count of the number of rows deleted.

 my question is, does having a db(db.table.something==True).delete() that 
 does not return a count break the web2py API contract, or break anyone's 
 applications?

 thanks,

 christian


-- 





[web2py] Re: delete on GAE

2012-10-20 Thread howesc
It appears that the most efficient way to delete on app engine is to:
 - build a query object, like we are doing now
 - call run with keys_only=True, which returns an iterator
   (https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_run)
 - pass that iterator to the datastore delete method
   (https://developers.google.com/appengine/docs/python/datastore/functions#delete)

this avoids the cost of loading the rows into memory, decreases the 
likelihood of timeout, and has the cost of 1 datastore small operation per 
row.  but it does prevent us from getting a count of rows deleted.
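
a rough sketch of the steps above, using an in-memory stand-in so it runs anywhere. `FakeQuery`, `delete_matching` and `fake_delete` are hypothetical names for illustration; on GAE the query object and the delete function would come from the datastore API linked above.

```python
def delete_matching(query, delete_fn):
    """Pass a keys-only iterator straight to the delete call.

    No entities are loaded and nothing is counted, so there is no
    return value.
    """
    delete_fn(query.run(keys_only=True))


class FakeQuery(object):
    """In-memory stand-in for a datastore query (illustration only)."""

    def __init__(self, rows, predicate):
        self.rows = rows            # dict: key -> entity
        self.predicate = predicate

    def run(self, keys_only=False):
        # Snapshot the items so deleting while iterating is safe here.
        for key, entity in list(self.rows.items()):
            if self.predicate(entity):
                yield key if keys_only else entity


rows = {1: {'expired': True}, 2: {'expired': False}, 3: {'expired': True}}
query = FakeQuery(rows, lambda entity: entity['expired'])


def fake_delete(keys):
    for key in keys:
        del rows[key]


delete_matching(query, fake_delete)
```

after the call only the non-expired row is left, and the caller has no count to report.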

the way we do it now:
 - run count() on the query.  this has a cost (time and money) of iterating 
over all the rows that match the query on GAE (1 datastore small operation 
per row)
 - run fetch(limit=1000) and call delete() successively until no more rows. 
 this has the cost of running a full query (at least 1 datastore read 
operation per row) and loading the result set into memory and then deleting 
the results.
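
to make the comparison concrete, here is a tiny cost model built only from the per-row figures quoted in this message. `operation_costs` is a hypothetical helper; it counts query-side operations only and assumes the delete writes cost the same either way.

```python
def operation_costs(matching_rows):
    """Query-side datastore operations for each approach.

    Uses the per-row figures from the message: count() and keys-only
    queries cost one small operation per row, a full fetch costs at
    least one read operation per row. Delete writes are not modelled.
    """
    current = {
        'small_ops': matching_rows,   # count() walks every matching row
        'read_ops': matching_rows,    # fetch() reads every row at least once
    }
    keys_only = {
        'small_ops': matching_rows,   # keys-only query
        'read_ops': 0,                # no entities are loaded
    }
    return current, keys_only


current, keys_only = operation_costs(100000)
```

for 100000 matching rows the current approach adds at least 100000 read operations on top of the small operations that both approaches pay.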

in my case i'm timing out on the count() call so i don't even start the 
delete.  from an efficiency standpoint i'd rather have more rows deleted 
for less cost than get a count, but this may not be acceptable for all. 
at a minimum i think we should switch to use keys_only=True for the fetch, 
skip the leading count() call, and just sum the number of keys each fetch 
returns.  we may also consider catching the datastore timeout error and 
trying to handle a partial delete more gracefully (or continue to let the 
user catch the error).
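
a sketch of that minimum change, under stated assumptions: `fetch_keys` stands in for a keys-only fetch with a limit, `delete_fn` for the datastore delete, and `DatastoreTimeout` / `PartialDelete` are hypothetical exception classes, not the real GAE error types.

```python
class DatastoreTimeout(Exception):
    """Stand-in for the datastore timeout error (illustration only)."""


class PartialDelete(Exception):
    """Raised when a timeout interrupts the delete; carries the count so far."""

    def __init__(self, deleted):
        Exception.__init__(self, '%d rows deleted before timeout' % deleted)
        self.deleted = deleted


def delete_counting(fetch_keys, delete_fn, limit=1000):
    """Fetch keys in pages, delete each page, and sum the page sizes."""
    deleted = 0
    try:
        while True:
            keys = fetch_keys(limit)
            if not keys:
                return deleted
            delete_fn(keys)
            deleted += len(keys)
    except DatastoreTimeout:
        # Report how far we got instead of losing the partial count.
        raise PartialDelete(deleted)


# In-memory stand-ins for the query and the datastore.
remaining = list(range(2300))


def fetch_keys(limit):
    return remaining[:limit]


def fake_delete(keys):
    del remaining[:len(keys)]


total = delete_counting(fetch_keys, fake_delete)
```

there is no leading count() call, the return value is still the number of rows deleted, and a timeout part way through tells the caller how many rows are already gone.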

what is the right approach for web2py?  if the approach with count is 
correct, could i propose a gae bulk_delete method that does not return 
count but uses my first method?

thanks for the input!

cfh

-- 





[web2py] Re: delete on GAE

2012-10-20 Thread Massimo Di Pierro
How about adding a gae only parameter to the gae adapter_args that tells it 
to skip fetch?

-- 





[web2py] Re: delete on GAE

2012-10-20 Thread Massimo Di Pierro
I meant to skip count.
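
A sketch of how such an opt-in could look, purely as an illustration: the option name `skip_count` and the function `gae_delete` are hypothetical, and returning None when the count is skipped is an assumption, not something decided in this thread.

```python
def gae_delete(count_fn, delete_all_fn, adapter_args=None):
    """Return the number of deleted rows, or None when counting is skipped."""
    adapter_args = adapter_args or {}
    if adapter_args.get('skip_count', False):   # hypothetical option name
        delete_all_fn()
        return None
    counter = count_fn()    # default behaviour: count first, then delete
    delete_all_fn()
    return counter


# In-memory stand-ins, for illustration only.
rows = ['a', 'b', 'c']


def count_rows():
    return len(rows)


def delete_rows():
    del rows[:]


with_count = gae_delete(count_rows, delete_rows)
rows.extend(['d', 'e'])
without_count = gae_delete(count_rows, delete_rows, {'skip_count': True})
```

Existing applications keep getting a count by default; only code that passes the option gives it up.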

-- 





Re: [web2py] Re: delete on GAE

2012-10-20 Thread Christian Foster Howes

sure.  i'll make a patch soon...

thanks for the input!

cfh

--