On Jan 2, 2008, at 4:41 PM, Artur Siekielski wrote:
>
> for inst in session.query(Bar):
>     foo(inst)
>
> because all instances are loaded into memory before 'foo' is called -
> out of memory error or poor performance. Using offsets and limits
> doesn't work well either, because doing e.g. OFFSET 500000 causes very
> poor performance (memory usage is low, but executing the SELECT takes
> a long time; I'm using PostgreSQL). SQL Expression sql.select supports
> lazily yielding records, but the only way I can automatically
> transform a fetched tuple into a mapped class' object is calling the
> Query.instances(cursor) method, which loads all instances again (no
> possibility to load a part of the result set). The only solution I
> have found is this:
>
I *really* tried in a recent branch to get yielding behavior to work
in Query. And it just cannot be done, without enormous complexity, in
such a way that would not lead towards subtle unexpected behaviors
that would drive everyone insane. The current code in 0.4.2 has all
been rearranged such that it's quite easy to drop a "yield" into
query.instances(), do the requisite bookkeeping on individual
instances before they are yielded, and send them out. But
unfortunately, our Query object does not just return single objects;
it returns collections of heterogeneous objects all at once. As soon
as a single join is added to the criteria, or any kind of related-object
load is thrown in, you are now returning either heterogeneous objects,
or homogeneous objects in different contexts. And that means the same
object can be present in the result set multiple times. In a lot of
cases, the multiple rows for an object are received grouped together
and are "uniqued" away into just one instance, but if at any time the
ordering changes such that an object gets delivered, and then later on
gets delivered again, your code can break.
Why is that? Because when yielding, you can stop reading the iterator
at any time, start modifying attributes and collections on your
instances, and even try to flush them. Then, when you go back and get
some more yielded instances, the *same* instance comes back, and SA
would need to jump through enormous, time-consuming hoops to figure
out that you've already received that instance and modified some of
it, and to not overwrite what you've already changed. There are other
conditions that arise as well along similar lines - I can show you
examples where the same instance is loaded once as the eager-loaded
child of a parent object, in which case its own child collection is
unpopulated, but then the same instance shows up as a parent in its
own right, and its child collection *is* populated. If
query.instances() let these instances out into the wild before the
full result is fetched, it becomes very hairy trying to guess what we
should do when we come across the instance a second time and possibly
have new collection items to populate... and all of that adds overhead
and complexity to the instances() process overall.
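To make the duplication concrete, here is a small standalone sketch
(plain sqlite3 rather than SQLAlchemy, and the table and column names
are made up) showing how a one-to-many join hands back the same parent
row once per child - exactly the repetition a yielding instances()
would have to "unique" away across fetch boundaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO parent VALUES (1, 'p1');
    INSERT INTO child VALUES (1, 1, 'c1'), (2, 1, 'c2'), (3, 1, 'c3');
""")

# An eager load of parent + children is a single joined SELECT; the one
# parent row comes back three times, once per child row.
rows = conn.execute("""
    SELECT parent.id, parent.name, child.name
    FROM parent JOIN child ON child.parent_id = parent.id
""").fetchall()

parent_ids = [row[0] for row in rows]
print(parent_ids)  # the same parent id repeats for every child
```

An ORM consuming this result has to collapse those three rows into one
parent instance with a three-element child collection - trivial when
the whole result is in hand, hairy when rows trickle out of a yield.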
> def generateAll(domainCls, table, session=sa.Session, minId=None,
>                 maxId=None):
>     if minId is None:
>         minId = session.query(domainCls).min(table.c.id)
>     if maxId is None:
>         maxId = session.query(domainCls).max(table.c.id)
>     for id in xrange(minId, maxId + 1):
>         inst = session.query(domainCls).get(id)
>         if inst is not None:
>             yield inst
>
> Am I missing something, or is there no better way? I understand that
> in the general case loading all instances at once is required, but I
> think that in cases like mine there should be some possibility of
> lazily loading instances.
The above is probably the most inefficient way possible of fetching
many instances, since you are issuing a get() individually for every
row. Better would be to create an object which has a fetchall() method
on it, and send it to query.instances():
    class FakeResult(object):
        def __init__(self, result):
            self.result = result

        def fetchall(self):
            # hand back the next batch of 50 rows; an empty
            # list ends the loop below
            return self.result.fetchmany(50)

    result = FakeResult(table.select().execute())
    while True:
        r = query.instances(result)
        if not r:
            break
        for instance in r:
            # do whatever
            pass
Now, with the above, it's *you* who has to know that you don't have
any joined collections which span across result-set boundaries, or
that you are getting the same object twice in two fetches. So if
something breaks with the above, it's not a subtle mysterious behavior
in SA's query object, it's something that you can clearly see on your
end.
I have considered adding a yielding version of instances(), but I
worry that it would be difficult for an end user to decide when the
hard-to-track side effects of this style of loading are going to
happen and when they are not (or if they're going to happen just once
every 1000 loads). If it's really killing you I can show you a quick
patch to provide this, but I'm still skeptical that this feature would
more often than not lead to subtly broken code. Perhaps I could add a
highly "beware!"-marked method to Query which does the above, such as
"yield_per()"; instances() could then iterate in a manner like the
above. It's more or less this which I considered doing.
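To illustrate the hazard with a toy sketch (made-up row data, nothing
SQLAlchemy-specific): uniquing within each batch, which is all a
yielding loader could do without remembering every identity it ever
sent out, lets the same identity escape twice as soon as its rows
straddle a batch boundary:

```python
def unique_per_batch(batches):
    # de-duplicate only within each batch, forgetting what was
    # already yielded from earlier batches
    for batch in batches:
        seen = set()
        for ident in batch:
            if ident not in seen:
                seen.add(ident)
                yield ident

# identity 2's rows straddle the batch boundary, so it comes out twice
batches = [[1, 1, 2], [2, 3, 3]]
out = list(unique_per_batch(batches))
print(out)  # [1, 2, 2, 3]
```

That second delivery of identity 2 is precisely the "same instance
comes back" case described above, and it's invisible to the caller.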
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en
-~----------~----~----~----~------~----~------~--~---