Actually, I came to the same answer reading through the docs again - good
timing. Disabling expire_on_commit seems to have done the trick. I'm not
entirely sure why it was attempting to access previously expired objects, though.
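For reference, the fix amounted to configuring the session factory roughly like this (the engine URL and names here are just for illustration):

```python
# Minimal sketch: a Session factory with expire_on_commit disabled, so
# objects keep their loaded attribute values after commit() instead of
# being expired and re-fetched on next access.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite://")  # placeholder URL for illustration
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()
print(session.expire_on_commit)  # -> False
```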
My understanding is that it works like this:
for row in windowed_query(query, Item.id, 10000):  # query 10k rows at a time and iterate through them
    if count >= 10000:  # after 10k, save/delete
        commit()  # commit/expire, after this the next 10k get grabbed
The rows are processed and complete at the end of each batch of 10k, so
they shouldn't be accessed after that - it should just move on to the next
10k (which don't reference the previous 10k). Instead, it seems to be
grabbing, processing and saving/deleting 10k rows, but expiring most of the
next 10k as well. Not a huge issue, anyway - the logic dictates that it
shouldn't do any damage or go over any old rows, so no biggie.
Thanks for the help!
On Sunday, 26 October 2014 17:48:01 UTC+8, Michael Bayer wrote:
>
>
> On Oct 26, 2014, at 5:25 AM, James Meneghello <[email protected]> wrote:
>
> The application I'm working on operates over extremely large datasets, so
> I'm using the query windowing from here (
> https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery)
> to break it into manageable chunks. The query window is usually around 10k
> rows, after which it updates/deletes some rows and continues on. Simple
> breakdown is like this:
>
> query = session.query(Item).filter(...several filters)
> total_items = query.count()  # used for logging
> for row in windowed_query(query, Item.id, 10000):
>     count += 1
>
>     # process, determine whether to keep (and update) or delete (put in
>     # a list for batch-deletion); one such example is:
>     if row.group_name != regex.group_name:
>         continue
>
>     if count >= 10000:
>         save(items)  # items to be kept, issues updates
>         deleted = db.query(Item).filter(Item.id.in_(dead_items)).delete(synchronize_session='fetch')
>         session.commit()
>         count = 0
>
> This works fine until it's gone through a save/delete cycle. Once it's
> saved, it goes back to access the windowed query again and pull the next
> 10k rows. This works until the following line:
>
> if row.group_name != regex.group_name:
>
> At which point sqla will emit a SELECT for the item of that specific ID,
> presumably because the group_name wasn't available and it had to fetch it.
> This only occurs after the commit - so I assume that committing the session
> is breaking the query. Hence, for the next 10k rows, it emits 10k queries
> (one per row).
>
>
> commit() will expire everything in the Session by default, meaning all the
> attributes on objects get cleared, and a SELECT will be emitted on next
> access. It seems like you definitely want to disable this, which is the
> expire_on_commit setting of the Session itself; or otherwise, rearrange the
> logic here such that when you call session.commit(), you are definitely
> done with the objects in that batch of 10000. Usually with batching,
> things are organized this way so that you only have the 10K objects at most
> at once in memory; the commit() closes out dealing with those 10K objects.
> In this case it seems like you are accessing the full range of 10K
> objects after that commit(). If it can't be avoided then you'd have to
> create a Session with expire_on_commit=False; right now that setting is
> per-Session only. It may be possible to do it just for a range if you
> just set the attribute on the Session, but I haven't tried that.
>
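A sketch of the "just set the attribute on the Session" idea mentioned above (untested, per the caveat in the thread; the engine URL is a placeholder):

```python
# Sketch: temporarily disabling expire_on_commit on an existing Session
# for one batch, then restoring the default afterwards. This follows the
# untested suggestion from the thread above.
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

session = Session(bind=create_engine("sqlite://"))  # expire_on_commit=True by default

session.expire_on_commit = False   # this commit won't expire the batch
session.commit()
session.expire_on_commit = True    # back to the default for later commits
```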
--
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.