On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:

> I'm using SA (with SQLite) with a schema like:
>
> A -< B -< C -< D
>
> where -< means that the tables have a one-to-many relationship
>
> I'm populating a sample data set where there are 25 rows in A, 25 rows in B
> for each row in A, 25 rows in C for each row in B and 25 rows in D for each
> row in C. This results in about 390k rows in D. The database itself is only
> about 12 MB, but it takes a long time (several minutes) to write the data to
> the file.
>
> I'm taking the approach of appending items to the table's relationship
> collection.
>
> for i in range(25):
>     x = A()
>     session.add(x)
>     for j in range(25):
>         y = B()
>         x.b.append(y)
>         for k in range(25):
>             z = C()
>             y.c.append(z)
>             for l in range(25):
>                 xx = D()
>                 z.d.append(xx)
> session.flush()
The biggest speed variable in a mass INSERT operation is whether individual cursor.execute() calls are used, or a small handful of cursor.executemany() calls each carrying thousands of rows. With the ORM, a mass executemany() is used for the INSERT only when primary key values are already present on the given objects. If not, the ORM has to INSERT each A() row one at a time, fetch the new primary key value, and then later populate the 25*25 B() objects' foreign keys with the A.id value; the same pattern then repeats at the B()->C() level, and so on. So if A(), B(), C(), D() can be created with "id=5", "id=6", etc., assuming "id" is the primary key, the ORM should be able to batch lots of rows together into one cursor.executemany() call and you'll see an immediate, dramatic speedup.

The next level would be to populate the "a_id", "b_id" foreign-key-to-parent columns directly instead of using append(). If you profiled your script you'd see time taken in many places, but all those append() operations would be one of them, along with the event firing and bookkeeping SQLAlchemy has to do when they occur, both at append() time and within the flush() (populating the primary key values out to the foreign key attributes).

By far the fastest way to do this would be to use session.execute(a_table.insert(), [<rows>]), session.execute(b_table.insert(), [<rows>]), etc. That is, assemble the whole set of A, B, C, D rows directly in terms of the mapped tables, or better yet do it in chunks: perhaps drill down through B, C, D for a single A, then insert everything, and repeat. That way you construct the rows in Python in exactly the pattern that corresponds directly to the database structure, instead of having SQLAlchemy decode the database structure from an object hierarchy. An insertmany is documented at http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements. The ORM Session has an execute() method just like Connection does.
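The Core-level approach described above can be sketched like this, shown for just the first two levels (A and B) of the hierarchy. The table names, columns, and in-memory SQLite URL are illustrative assumptions, not taken from the thread, and the syntax is modern SQLAlchemy rather than the 0.7-era API:

```python
# Sketch: bulk INSERT via Table metadata and executemany()-style execute().
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        create_engine)

engine = create_engine("sqlite://")  # in-memory DB for illustration
metadata = MetaData()

a_table = Table("a", metadata, Column("id", Integer, primary_key=True))
b_table = Table(
    "b", metadata,
    Column("id", Integer, primary_key=True),
    Column("a_id", Integer, ForeignKey("a.id")),
)
metadata.create_all(engine)

# Assign primary keys up front so each INSERT can be one executemany()
# call, and point each child at its parent via the foreign key column
# directly rather than via relationship append().
a_rows = [{"id": i} for i in range(1, 26)]
b_rows = [
    {"id": (i - 1) * 25 + j, "a_id": i}
    for i in range(1, 26)
    for j in range(1, 26)
]

with engine.begin() as conn:  # one transaction for the whole load
    conn.execute(a_table.insert(), a_rows)  # list of dicts -> executemany()
    conn.execute(b_table.insert(), b_rows)

with engine.connect() as conn:
    a_count = len(conn.execute(a_table.select()).fetchall())
    b_count = len(conn.execute(b_table.select()).fetchall())
```

The C and D levels would follow the same pattern, either as two more full row lists or built up in per-A chunks to keep memory bounded.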
The range of speedups here would be between 30% and 80%, with direct usage of connection/session execute() with Table metadata giving you the 80%. I'm not sure what "transaction" is in transaction.begin(); if you're using a regular SQLAlchemy Session, it is always "in a transaction" in that it uses a single connection until rollback() or commit() is called.

-- 
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
