On 21 Feb 2013, at 22:44, Victor Ng <[email protected]> wrote:

>> On Thursday, February 21, 2013 1:03:49 PM UTC-8, A.M. wrote:
>> On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng <[email protected]> 
>> wrote: 
>> > I do a lot of processing on large amounts of data. 
>> > 
>> > The common pattern we follow is: 
>> > 
>> > 1. Iterate through a large data set 
>> > 2. Do some sort of processing (e.g. NLP processing like tokenization, 
>> > capitalization, regex parsing, ...) 
>> > 3. Insert the new result in another table. 
>> > 
>> > Right now we are doing something like this: 
>> > 
>> > for x in session.query(Foo).yield_per(10000): 
>> >   bar = Bar() 
>> >   bar.hello = x.world.lower() 
>> >   session.add(bar) 
>> >   session.flush() 
>> > session.commit() 
>> 
>> Do you really need to flush after making each new Bar? That implies a 
>> database round-trip and state sync with SQLAlchemy. 
>> 
>> In any case, you should gather a profile to see where the time is being 
>> spent. SQLAlchemy is a complex framework, so whatever performance 
>> assumptions are implied in the code may be wrong. 
>> 
>> Cheers, 
>> M 
>> 
> Um, sure. 
> 
> That still doesn't answer my question. 
> 
> I am interested in persisting changes to my DB as I iterate through 
> yield_per. 
> 


Do your Foo objects have an ordering that you can use, such as a numeric ID? If 
so, you could query for the first batch of a few hundred objects, process them, 
commit, then issue a new query for the next batch, and so on. That lets you 
persist changes as you go and should keep the memory usage of the process under 
control.
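
Something like this is what I mean. It's only a rough sketch and untested; it 
assumes Foo has an integer primary key column called "id" (I'm guessing at the 
name) and that committing once per batch is acceptable for you:

BATCH_SIZE = 1000   # hypothetical batch size; tune to taste
last_id = 0
while True:
    foos = (session.query(Foo)
            .filter(Foo.id > last_id)   # assumes an integer PK "id"
            .order_by(Foo.id)
            .limit(BATCH_SIZE)
            .all())
    if not foos:
        break
    for x in foos:
        bar = Bar()
        bar.hello = x.world.lower()
        session.add(bar)
    last_id = foos[-1].id   # remember position before commit expires it
    session.commit()        # one round-trip per batch, not per row

Filtering on Foo.id > last_id rather than paging with OFFSET keeps each query 
cheap even deep into the table, and committing once per batch instead of 
flushing per row should also cut down on the round-trips M mentioned.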

Hope that helps,

Simon

