On Jan 2, 2008, at 4:41 PM, Artur Siekielski wrote:
>
> for inst in session.query(Bar):
>     foo(inst)
>
> because all instances are loaded into memory before 'foo' is called -
> out of memory error or poor performance. Using offsets and limits
> doesn't work well either, because doing e.g. OFFSET 500000 causes very
> poor performance (memory usage is low, but executing the SELECT takes
> a long time; I'm using PostgreSQL). SQL Expression sql.select supports
> lazily yielding records, but the only way I can automatically
> transform a fetched tuple into a mapped class' object is calling the
> Query.instances(cursor) method, which loads all instances again (no
> possibility to load a part of the result set). The only solution I
> have found is this:
>
I *really* tried in a recent branch to get yielding behavior to work
in Query. And it just cannot be done, without enormous complexity, in
such a way that would not lead towards subtle unexpected behaviors
that would drive everyone insane. The current code in 0.4.2 has all
been rearranged such that it's quite easy to drop a "yield" into
query.instances(), do the requisite bookkeeping on individual
instances before they are yielded, and send them out. But
unfortunately, our Query object does not just return single objects;
it returns collections of heterogeneous objects all at once. As soon
as a single join is added to the criteria, or any kind of related-object
load is thrown in, you are now returning either heterogeneous objects,
or homogeneous objects in different contexts. And that means the same
object can be present in the result set multiple times. In a lot of
cases, the multiple rows for an object are received grouped together
and are "uniqued" away into just one instance, but if at any time the
ordering changes such that an object gets delivered, and then later on
gets delivered again, your code can break.
Why is that? Because when yielding, you can stop reading the iterator
at any time, start modifying attributes and collections on your
instances, and even try to flush them. Then, when you go back and get
some more yielded instances, the *same* instance comes back, and SA
would need to jump through enormous, time-consuming hoops to figure
out that you've already received that instance and modified some of
it, and to not overwrite what you've already changed. There are other
conditions that arise as well along similar lines - I can show you
examples where the same instance is loaded once as the eager-loaded
child of a parent object, in which case its own child collection is
unpopulated, but then the same instance shows up as a parent in its
own right, and its child collection *is* populated. If
query.instances() let these instances out into the wild before the
full result is fetched, it becomes very hairy trying to guess what we
should do when we come across the instance a second time and possibly
have new collection items to populate... and all of that adds overhead
and complexity to the instances() process overall.
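To make the duplication concrete, here is a small standalone sketch
(plain sqlite3 rather than SQLAlchemy, and the table and column names
are made up) showing how a one-to-many join hands back the same parent
row once per child - exactly the repetition a yielding instances()
would have to "unique" away across fetch boundaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO parent VALUES (1, 'p1');
    INSERT INTO child VALUES (1, 1, 'c1'), (2, 1, 'c2'), (3, 1, 'c3');
""")

# An eager load of parent + children is a single joined SELECT; the one
# parent row comes back three times, once per child row.
rows = conn.execute("""
    SELECT parent.id, parent.name, child.name
    FROM parent JOIN child ON child.parent_id = parent.id
""").fetchall()

parent_ids = [row[0] for row in rows]
print(parent_ids)  # the same parent id repeats for every child
```

An ORM consuming this result has to collapse those three rows into one
parent instance with a three-element child collection - trivial when
the whole result is in hand, hairy when rows trickle out of a yield.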
> def generateAll(domainCls, table, session=sa.Session, minId=None,
>                 maxId=None):
>     if minId is None:
>         minId = session.query(domainCls).min(table.c.id)
>     if maxId is None:
>         maxId = session.query(domainCls).max(table.c.id)
>     for id in xrange(minId, maxId + 1):
>         inst = session.query(domainCls).get(id)
>         if inst is not None:
>             yield inst
>
> Am I missing something, or is there no better way? I understand that
> in the general case loading all instances at once is required, but I
> think that in cases like mine there should be some possibility of
> lazily loading instances.
The above is probably the most inefficient way possible of fetching
many instances, since you are issuing a get() individually for every
row. Better would be to create an object which has a fetchall() method
on it, and send it to query.instances():
    class FakeResult(object):
        def __init__(self, result):
            self.result = result

        def fetchall(self):
            # hand back the next batch of 50 rows; an empty
            # list ends the loop below
            return self.result.fetchmany(50)

    result = FakeResult(table.select().execute())
    while True:
        r = query.instances(result)
        if not r:
            break
        for instance in r:
            # do whatever
            pass
Now, with the above, it's *you* who has to know that you don't have
any joined collections which span across result-set boundaries, or
that you are getting the same object twice in two fetches. So if
something breaks with the above, it's not a subtle mysterious behavior
in SA's query object, it's something that you can clearly see on your
end.
I have considered adding a yielding version of instances(), but I
worry that it would be difficult for an end user to decide when the
hard-to-track side effects of this style of loading are going to
happen and when they are not (or if they're going to happen just once
every 1000 loads). If it's really killing you I can show you a quick
patch to provide this, but I'm still skeptical that this feature would
more often than not lead to subtly broken code. Perhaps I could add a
highly "beware!"-marked method to Query which does the above, such as
"yield_per()"; instances() could then iterate in a manner like the
above. It's more or less this which I considered doing.
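To illustrate the hazard with a toy sketch (made-up row data, nothing
SQLAlchemy-specific): uniquing within each batch, which is all a
yielding loader could do without remembering every identity it ever
sent out, lets the same identity escape twice as soon as its rows
straddle a batch boundary:

```python
def unique_per_batch(batches):
    # de-duplicate only within each batch, forgetting what was
    # already yielded from earlier batches
    for batch in batches:
        seen = set()
        for ident in batch:
            if ident not in seen:
                seen.add(ident)
                yield ident

# identity 2's rows straddle the batch boundary, so it comes out twice
batches = [[1, 1, 2], [2, 3, 3]]
out = list(unique_per_batch(batches))
print(out)  # [1, 2, 2, 3]
```

That second delivery of identity 2 is precisely the "same instance
comes back" case described above, and it's invisible to the caller.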
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en
-~----------~----~----~----~------~----~------~--~---