Actually, I came to the same answer reading through the docs again - good 
timing. Disabling expire_on_commit seems to have done the trick.

Not entirely sure why it was attempting to access previously expired 
objects, though. My understanding is that it works like this:

    for row in windowed_query(query, Item.id, 10000):  # query 10k rows at a time and iterate through them
        if count >= 10000:  # after 10k, save/delete
            commit()  # commit/expire; after this the next 10k get grabbed
The rows in each batch of 10k are fully processed by the end of that batch, 
so they shouldn't be accessed after that - the loop should just move on to 
the next 10k (which don't reference the previous 10k). It seems like it's 
grabbing, processing, and saving/deleting 10k rows, but expiring most of 
the next 10k as well. Not a huge issue, anyway - the logic used dictates 
that it shouldn't do any damage or go over any old rows, so no biggie.

Thanks for the help!

On Sunday, 26 October 2014 17:48:01 UTC+8, Michael Bayer wrote:
>
>
> On Oct 26, 2014, at 5:25 AM, James Meneghello <[email protected]> wrote:
>
> The application I'm working on operates over extremely large datasets, so 
> I'm using the query windowing from here (
> https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery) 
> to break it into manageable chunks. The query window is usually around 10k 
> rows, after which it updates/deletes some rows and continues on. Simple 
> breakdown is like this:
>
> query = session.query(Item).filter(...several filters)
> total_items = query.count() # used for logging
> for row in windowed_query(query, Item.id, 10000):
>     count += 1
>
>     # process, determine whether to keep (and update) or delete (put in a 
> list for batch-deletion)
>     # one such example is:
>     if row.group_name != regex.group_name:
>         continue
>
>     if count >= 10000:
>         save(items) # items to be kept, issues updates
>         deleted = db.query(Item).filter(Item.id.in_(dead_items)).delete(synchronize_session='fetch')
>         session.commit()
>         count = 0
>
> This works fine until it's gone through a save/delete cycle. Once it's 
> saved, it goes back to access the windowed query again and pull the next 
> 10k rows. This works until the following line:
>
> if row.group_name != regex.group_name:
>
> At which point sqla will emit a SELECT for the item of that specific ID, 
> presumably because the group_name wasn't available and it had to fetch it. 
> This only occurs after the commit - so I assume that committing the session 
> is breaking the query. Hence, for the next 10k rows, it emits 10k queries 
> (one per row).
>
>
> commit() will expire everything in the Session by default, meaning all the 
> attributes on objects get cleared and will emit a SELECT on next access. 
> It seems like you definitely want to disable this, which is the 
> expire_on_commit setting of the Session itself; or otherwise, rearrange the 
> logic here so that when you call session.commit(), you are definitely done 
> with the objects in that batch of 10000. Usually with batching, things are 
> organized this way so that you only have the 10K objects at most in memory 
> at once; the commit() closes out dealing with those 10K objects. In this 
> case it seems like you are accessing the full range of 10K objects after 
> that commit(). If it can't be avoided then you'd have to create a Session 
> with expire_on_commit=False; right now that setting is per-Session only. 
> It may be possible to do it just for a range if you just set the attribute 
> on the Session, but I haven't tried that.
>
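
On that last suggestion - toggling the flag per-range rather than 
per-Session - a rough sketch of what it might look like, with an 
illustrative model and data (this is the part Michael flags as untried, so 
treat it accordingly):

```python
# Sketch: flip Session.expire_on_commit off around just the batch commit,
# then restore the default. expire_on_commit is a plain attribute on the
# Session instance, so assignment takes effect for subsequent commits.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = 'items'
    id = Column(Integer, primary_key=True)
    group_name = Column(String)

engine = create_engine('sqlite://')  # in-memory DB for illustration
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()  # default expire_on_commit=True

session.add(Item(id=1, group_name='example.group'))
session.commit()

item = session.query(Item).first()

session.expire_on_commit = False  # suspend expiry for this batch's commit
session.commit()                  # item's loaded state survives this commit
session.expire_on_commit = True   # restore the default afterwards

print('group_name' in item.__dict__)  # True: attribute was not expired
```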

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.