On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:

> I'm using SA (with SQLite) with a schema like:
> 
> A -< B -< C -< D
> 
> where -< means that the tables have a one to many relationship
> 
> I'm populating  a sample data set where there are 25 rows in A, 25 rows in B 
> for each row in A, 25 rows in C for each row in B and 25 rows in D for each 
> row in C.  This results in about 390k rows in D.  The database itself is only 
> about 12 MB, but it takes a long time (several minutes) to write the data to 
> the file.
> 
> I'm taking the approach of appending items to the table's relationship column.
> 
> for i in range(25):
>     x = A()
>     session.add(x)
>     for j in range(25):
>         y = B()
>         x.b.append(y)
>         for k in range(25):
>             z = C()
>             y.c.append(z)
>             for l in range(25):
>                 xx = D()
>                 z.d.append(xx)
> session.flush()

The biggest speed variable in a mass INSERT operation is whether the rows are 
sent as individual cursor.execute() calls, or as a small handful of 
cursor.executemany() calls, each carrying thousands of rows.
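To illustrate the difference at the DB-API level, here's a minimal sketch using the stdlib sqlite3 driver with an in-memory database and a made-up one-column table; the table name and row layout are just for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE d (id INTEGER PRIMARY KEY, c_id INTEGER)")

rows = [(i, i // 25) for i in range(10000)]

# one pass through the DB-API with all 10000 rows, instead of
# 10000 separate cursor.execute() calls
cur.executemany("INSERT INTO d (id, c_id) VALUES (?, ?)", rows)
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM d").fetchone()[0]
```

The per-statement overhead (statement preparation, Python-to-C crossings) is paid once per executemany() rather than once per row.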

With the ORM, a mass executemany() is used for INSERT when primary key values 
are already present on the given objects.   If not, the ORM has to INSERT each 
A() row one at a time, fetch the newly generated primary key value, and only 
then populate the foreign keys of the 25*25 dependent B() objects with the 
A.id value; the same pattern repeats at the B()->C() level, and so on.

So if A(), B(), C(), D() can be created with "id=5", "id=6", etc., assuming 
"id" is the primary key, the ORM should be able to bunch lots of rows together 
into one cursor.executemany() call and you'll see an immediate, dramatic 
speedup.
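A sketch of what pre-assigned primary keys look like, using two hypothetical declarative models (the names Parent/Child and the simple id-counter scheme are mine, not from the original schema), against in-memory SQLite:

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Parent(Base):
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)
    children = relationship("Child")

class Child(Base):
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("parent.id"))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    next_child_id = 1
    for i in range(1, 26):
        p = Parent(id=i)              # primary key assigned up front
        session.add(p)
        for _ in range(25):
            p.children.append(Child(id=next_child_id))
            next_child_id += 1
    # with no server-generated keys to fetch back, the flush can
    # batch the INSERT statements together
    session.commit()
```

Since no row needs to wait on a server-generated key, the flush doesn't have to interleave per-row INSERTs with primary-key fetches.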

The next level would be to populate the "a_id", "b_id" 
foreign-key-to-parent columns directly instead of using append().    If you 
profiled your script you'd see time taken in many places, but all those 
append() operations would be one of them, along with the event firing and 
bookkeeping SQLAlchemy has to do when they occur, both at append() time and 
within the flush() (copying primary key values out to foreign key 
attributes).
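A sketch of that approach, again with hypothetical Parent/Child models against in-memory SQLite; note there is no relationship() at all, so no collection events fire and the flush has no foreign keys to back-populate:

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Parent(Base):
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)

class Child(Base):
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("parent.id"))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    child_id = 1
    for i in range(1, 26):
        session.add(Parent(id=i))
        for _ in range(25):
            # foreign key populated directly; no append(), no
            # collection bookkeeping, nothing to copy during flush()
            session.add(Child(id=child_id, parent_id=i))
            child_id += 1
    session.commit()
```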

By far the fastest way to do this would be to use 
session.execute(a_table.insert(), [<rows>]), session.execute(b_table.insert(), 
[<rows>]), etc.   That is, assemble the whole set of A, B, C, D rows directly 
in terms of the mapped tables, or better yet do it in chunks, perhaps drilling 
down through B, C, D for a single A and then inserting everything.    That way 
you construct these rows in Python in exactly the pattern that corresponds 
directly to the database structure, instead of having SQLAlchemy decode the 
database structure from an object hierarchy.  An insertmany is documented at 
http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements.
   The ORM Session has an execute() method just like Connection does.
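A sketch of the Core approach with Table metadata, using hypothetical two-level a/b tables and a made-up id scheme (the real schema has four levels; the chunking strategy is up to you):

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        create_engine)

metadata = MetaData()
a_table = Table("a", metadata,
                Column("id", Integer, primary_key=True))
b_table = Table("b", metadata,
                Column("id", Integer, primary_key=True),
                Column("a_id", Integer, ForeignKey("a.id")))

engine = create_engine("sqlite://")
metadata.create_all(engine)

# build plain dictionaries shaped exactly like the table rows
a_rows = [{"id": i} for i in range(1, 26)]
b_rows = [{"id": 25 * (i - 1) + j + 1, "a_id": i}
          for i in range(1, 26) for j in range(25)]

with engine.begin() as conn:
    # passing a list of dicts makes each call an executemany()
    conn.execute(a_table.insert(), a_rows)
    conn.execute(b_table.insert(), b_rows)
```

No ORM objects, no events, no flush bookkeeping; just lists of dicts handed straight to executemany().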

The range of speedups here would be between 30% and 80%, with direct usage of 
connection/session .execute() with Table metadata giving you the 80%.

I'm not sure what "transaction" is in transaction.begin(); if you're using a 
regular SQLAlchemy Session, it is always "in a transaction" in the sense that 
it uses a single connection until rollback() or commit() is called.

