On 21 Feb 2013, at 22:44, Victor Ng <[email protected]> wrote:

>> On Thursday, February 21, 2013 1:03:49 PM UTC-8, A.M. wrote:
>> On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng <[email protected]>
>> wrote:
>> > I do a lot of processing on large amounts of data.
>> >
>> > The common pattern we follow is:
>> >
>> > 1. Iterate through a large data set
>> > 2. Do some sort of processing (e.g. NLP processing like tokenization,
>> > capitalization, regex parsing, ...)
>> > 3. Insert the new result in another table.
>> >
>> > Right now we are doing something like this:
>> >
>> > for x in session.query(Foo).yield_per(10000):
>> >     bar = Bar()
>> >     bar.hello = x.world.lower()
>> >     session.add(bar)
>> >     session.flush()
>> > session.commit()
>>
>> Do you really need to flush after making each new Bar? That implies a
>> database round-trip and a state sync with SQLAlchemy.
>>
>> In any case, you should gather a profile to see where/how time is getting
>> spent. SQLAlchemy is a complex framework, so whatever performance
>> assumptions are implied in the code may be wrong.
>>
>> Cheers,
>> M
>
> Um, sure.
>
> That still doesn't answer my question.
>
> I am interested in persisting changes in my db as I am iterating through
> yield_per.
Do your Foo objects have an ordering that you can use, such as a numeric ID? If so, you could query for the first few hundred objects, process them, then issue a new query for the next few hundred, and so on. This should at least keep the memory usage of the process under control.

Hope that helps,

Simon
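For what it's worth, a minimal sketch of that windowed approach might look like the following. It assumes Foo has an integer primary key `id`, and that `session`, `Foo`, and `Bar` are set up as in Victor's snippet; `BATCH_SIZE` and `process_in_windows` are illustrative names, not SQLAlchemy API:

    BATCH_SIZE = 500  # illustrative window size; tune for your workload

    def process_in_windows(session):
        last_id = 0
        while True:
            # Fetch the next window of Foo rows, ordered by primary key.
            batch = (session.query(Foo)
                     .filter(Foo.id > last_id)
                     .order_by(Foo.id)
                     .limit(BATCH_SIZE)
                     .all())
            if not batch:
                break
            for x in batch:
                bar = Bar()
                bar.hello = x.world.lower()
                session.add(bar)
            last_id = batch[-1].id
            # One commit per window persists progress as you go,
            # without a flush (and round-trip) per row.
            session.commit()

Committing once per window also addresses Victor's wish to persist changes mid-iteration, while avoiding the per-row flush overhead that A.M. pointed out.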
