I ran some tests using raw psycopg2 on my local computer, and also ran some
tests through SQLAlchemy. I misunderstood a few things in my earlier tests;
I've explained the new tests below and believe they are more accurate.
*Context:*
- Each test was run 5 times and averaged.
- Both the Python code and the database were running on the same machine
(2015 MacBook Pro, Postgres 9.5 running in Docker)
*Raw psycopg2:*
# Rough code:
extras.execute_values(cursor, 'INSERT INTO table_a (field1, field2) VALUES %s', [...])
extras.execute_batch(cursor, 'INSERT INTO table_a (field1, field2) VALUES (%s, %s)', [...])
cursor.executemany('INSERT INTO table_a (field1, field2) VALUES (%s, %s)', [...])
# Inserting 1000 rows
execute_values, elapsed time: 0.023967s
execute_batch, elapsed time: 0.051530s
executemany, elapsed time: 0.173563s
# Inserting 10000 rows
execute_values, elapsed time: 0.268656s
execute_batch, elapsed time: 0.488736s
executemany, elapsed time: 2.017565s
# Inserting 100000 rows
execute_values, elapsed time: 1.858675s
execute_batch, elapsed time: 4.062823s
executemany, elapsed time: 19.900875s
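In case it's useful, here's roughly the full harness around those calls (a sketch, not the exact script - the DSN, the table_a schema, and the synthetic row data are placeholders):

```python
import time


def make_rows(n):
    # Synthetic (field1, field2) pairs standing in for the real data.
    return [("f1_%d" % i, "f2_%d" % i) for i in range(n)]


def bench(cursor, rows):
    # psycopg2 is imported here so the rest of the file can be inspected
    # without it installed.
    from psycopg2 import extras

    values_sql = "INSERT INTO table_a (field1, field2) VALUES %s"
    many_sql = "INSERT INTO table_a (field1, field2) VALUES (%s, %s)"
    options = [
        ("execute_values", lambda: extras.execute_values(cursor, values_sql, rows)),
        ("execute_batch", lambda: extras.execute_batch(cursor, many_sql, rows)),
        ("executemany", lambda: cursor.executemany(many_sql, rows)),
    ]
    # Note: each option inserts the same rows again; truncate table_a
    # between runs if you want strictly comparable numbers.
    for label, run in options:
        start = time.time()
        run()
        print("%s, elapsed time: %fs" % (label, time.time() - start))


# Usage (requires a running Postgres with a two-column table_a):
# conn = psycopg2.connect("dbname=test")  # placeholder DSN
# with conn, conn.cursor() as cursor:
#     bench(cursor, make_rows(1000))
```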
*SQLAlchemy layer:*
for field1, field2 in genome_infos(rows):
    db.session.add(TableA(field1, field2))

# Time this part:
db.session.flush()
I used the code you provided above (instead of the flag you recently
pushed), which allowed me to test both execute_batch and execute_values.
Here's what I used:
@event.listens_for(Engine, "do_executemany")
def do_executemany(cursor, statement, parameters, context):
    context.dialect.supports_sane_multi_rowcount = False

    # Option 1: executemany
    cursor.executemany(statement, parameters)

    # Option 2: execute_batch
    extras.execute_batch(cursor, statement, parameters)

    # Option 3: execute_values
    statement = re.sub('VALUES.*', 'VALUES %s', statement)
    parameters = [(info['field1'], info['field2']) for info in parameters]
    extras.execute_values(cursor, statement, parameters)

    return True
Obviously only one option was used at a time :)
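To be explicit about the rewrite in the execute_values option: execute_values expects a single `VALUES %s` placeholder, so the executemany-style statement SQLAlchemy hands to the listener has its VALUES clause replaced. A standalone illustration (the statement shown is representative, not the exact one SQLAlchemy generates):

```python
import re

# A statement of the shape SQLAlchemy passes to the handler; the real one
# uses the mapped table's actual columns.
statement = "INSERT INTO table_a (field1, field2) VALUES (%(field1)s, %(field2)s)"
rewritten = re.sub(r"VALUES.*", "VALUES %s", statement)
print(rewritten)  # -> INSERT INTO table_a (field1, field2) VALUES %s
```

The parameters also have to be converted from dicts to tuples in column order, which is what the list comprehension in the listener does.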
*SQLAlchemy Results:*
Inserting 1000 rows:
execute_values: 0.083958s
execute_batch: 0.110223s
executemany: 0.276129s
Inserting 10000 rows:
execute_values: 1.243230s
execute_batch: 1.388278s
executemany: 3.131808s
Inserting 100000 rows:
execute_values: 13.909975s
execute_batch: 14.942507s
executemany: 29.671092s
*Conclusions:*
- execute_batch is a significant improvement over executemany (10x at
the psycopg2 layer)
- Subtracting the ~10-12 seconds of SQLAlchemy/Python overhead for
inserting 100,000 rows gives roughly the raw psycopg2 times for each
execute_* option.
- The 5000% improvements stated
in https://github.com/psycopg/psycopg2/issues/491#issuecomment-276551038
are probably exaggerated for most users - that was for "transatlantic
network connections", whereas I imagine most users have databases much
closer to their servers. However, it's still a 10x speed-up over
executemany at the psycopg2 layer, and a ~2-3x speed-up including
SQLAlchemy overhead.
- execute_values is still twice as fast as execute_batch at the psycopg2
layer (for this specific table, at least), so incorporating that would be
even better!
- execute_batch only helps with inserts into tables where all primary keys
are defined (as you noted). Thus, if users want to see improvements for
tables with auto-incrementing primary keys and relationships, they'll
likely need to combine this with something like the suggestion in
https://groups.google.com/forum/#!topic/sqlalchemy/GyAZTThJi2I
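For anyone following along, that suggestion boils down to pre-fetching primary keys from the Postgres sequence in one round trip so every row is fully defined and the ORM can batch. A rough sketch (the sequence name 'table_a_id_seq' and the session usage are my assumptions, not code from the linked thread):

```python
def nextval_query(sequence, n):
    """Build a query that draws n ids from a Postgres sequence in one round trip."""
    # sequence and n are trusted, developer-supplied values here;
    # don't interpolate user input this way.
    return "SELECT nextval('%s') FROM generate_series(1, %d)" % (sequence, n)


# Usage with a live session (assign ids up front so flush() can batch):
# objs = [TableA(field1, field2) for field1, field2 in genome_infos(rows)]
# ids = [r[0] for r in db.session.execute(nextval_query('table_a_id_seq', len(objs)))]
# for obj, id_ in zip(objs, ids):
#     obj.id = id_
# db.session.add_all(objs)
# db.session.flush()
```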
Hopefully that was helpful :) Happy to help test in any other ways as well!
On Wednesday, October 11, 2017 at 7:25:06 AM UTC-7, Mike Bayer wrote:
>
> On Wed, Oct 11, 2017 at 3:02 AM, <[email protected]> wrote:
> > Hey Mike,
> >
> > Thanks again for the detailed explanations!
> >
> > I went ahead and tested the code snippet you gave me - I'm not sure I
> > totally understand when this will change behavior though.
> >
> > I tried the following code snippet:
> > (Snippet #1)
> > for i in xrange(10):
> >     db.session.add(A(id=i))
> > db.session.flush()
> >
> > This calls through to the "do_executemany" handler and only executes 1
> > insert statement with multiple VALUES. However, this was already the
> > existing behavior right?
>
> There's multiple forms of "one insert statement".
>
> There is traditional "executemany", which to the database looks like:
>
> INSERT INTO table (x, y) VALUES (:x, :y)
>
> The above statement is invoked for each set of parameters.
>
> then there is multiple values, which to the database looks like:
>
> INSERT INTO table (x, y) VALUES (:x1, :y1), (:x2, :y2), (:x3, :y3), ...
>
> The above statement is invoked once, with all sets of parameters at
> once (in reality they are usually batched in groups and the statement
> is invoked once per group). MySQL drivers do this as does psycopg2
> execute_values().
>
> then there is psycopg2 execute_batch which, yes, I forgot, is doing
> "insert into table (x, y) values (:x, :y); insert into table (x, y)
> values (:x, :y); ...", it's not actually rewriting the VALUES clause.
>
> In any case, SQLAlchemy only knows about the first form within the ORM.
>
> >
> > When I don't specify a primary key:
> > (Snippet #2)
> > for i in xrange(10):
> >     db.session.add(A())  # don't pre-specify a primary key
> > db.session.flush()
> >
> > This does *not* call through to the do_executemany handler, and
> > executes 10 insert statements.
>
> the ORM must retrieve newly generated primary key values if they were
> not already present, which cannot be done with the cursor.executemany()
> method. It must use cursor.execute().
>
>
> >
> > If I understand
> > http://initd.org/psycopg/docs/extras.html#fast-execution-helpers
> > correctly, execute_batch does *not* actually combine the insert
> > statements into a single statement with multiple VALUES clauses, but
> > instead does something like sending multiple INSERT statements to the
> > server at once.
>
> yes, I forgot - execute_batch is actually stringing the statements
> together on a semicolon, not rewriting values(). However, if you look
> at the performance gains at
> https://github.com/psycopg/psycopg2/issues/491#issuecomment-276551038,
> there is negligible difference between batch and values, compared to
> the default, which is 5000% slower.
>
> > We would obviously still expect to see performance benefits from the
> > reduced network round-trips, but it looks like the above tests do not
> > show any improvement there.
>
> I'm not familiar with what testing you are performing. Are you
> testing the new flag? what happens if you run the tests at
>
> https://www.postgresql.org/message-id/20170130215151.GA7081%40deb76.aryehleib.com
>
> e.g. https://gist.github.com/altaurog/84668e034646fb354b5de81bb86a580d ?
> If the feature is working, and the performance gains are being lost
> for some reason, then that's a major issue in psycopg2 we need to
> report.
>
> >
> > Let me know if I'm testing that incorrectly - I'm also happy to test
> > this on our actual production use cases to give you some more
> > accurate speed-up times.
> >
> > On Tuesday, October 10, 2017 at 10:36:33 AM UTC-7, Mike Bayer wrote:
> >>
> >> I've added
> >> https://bitbucket.org/zzzeek/sqlalchemy/issues/4109/support-psycopg2-batch-mode
> >> which will have a code review up soon.
> >>
> >> On Tue, Oct 10, 2017 at 1:06 PM, Mike Bayer <[email protected]>
> >> wrote:
> >> > On Tue, Oct 10, 2017 at 1:47 AM, <[email protected]> wrote:
> >> >> Hello! SQLAlchemy has the (really useful) behavior of batching
> >> >> together inserts when all the primary keys are defined. It looks
> >> >> like it groups similar tables together (while maintaining insert
> >> >> order) to minimize the number of INSERT statements sent to the
> >> >> database.
> >> >>
> >> >> I'm unsure what the expected behavior here is for single-table
> >> >> inheritance
> >> >> tables. Here's some example models:
> >> >> class Base(db.Model):
> >> >>     __tablename__ = 'base'
> >> >>     id = db.Column(db.Integer, primary_key=True)
> >> >>     base_type = db.Column(db.String)
> >> >>
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_on': base_type,
> >> >>     }
> >> >>
> >> >>     def __init__(self, id_):
> >> >>         self.id = id_
> >> >>
> >> >>
> >> >> class SubBase1(Base):
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_identity': '1',
> >> >>     }
> >> >>
> >> >>     col = db.Column(db.String)
> >> >>
> >> >>
> >> >> class SubBase2(Base):
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_identity': '2',
> >> >>     }
> >> >>
> >> >>     col2 = db.Column(db.String)
> >> >>
> >> >>
> >> >> class OtherModel(db.Model):
> >> >>     id = db.Column(db.Integer, primary_key=True)
> >> >>
> >> >>     def __init__(self, id_):
> >> >>         self.id = id_
> >> >>
> >> >>
> >> >> Base, SubBase1, and SubBase2 form a tree, while OtherModel is just
> >> >> another model.
> >> >>
> >> >> When creating alternating SubBase1s and OtherModels, SQLAlchemy
> >> >> batches the INSERTs such that only two INSERT statements are sent
> >> >> to the DB:
> >> >> (Snippet #1)
> >> >> for i in xrange(0, 10, 2):
> >> >>     db.session.add(SubBase1(i))
> >> >>     db.session.add(OtherModel(i + 1))
> >> >>
> >> >> db.session.flush()  # Performs 2 insert statements, each with 5
> >> >>                     # elements in the VALUES clause
> >> >>
> >> >> However, when creating alternating SubBase1s and SubBase2s, it
> >> >> actually performs 10 separate insert statements:
> >> >> (Snippet #2)
> >> >> for i in xrange(0, 10, 2):
> >> >>     db.session.add(SubBase1(i))
> >> >>     db.session.add(SubBase2(i + 1))
> >> >> db.session.flush()  # Performs 10 insert statements, each with 1
> >> >>                     # element in the VALUES clause
> >> >>
> >> >>
> >> >> It seems like SQLAlchemy maintains the insert_order *across*
> >> >> SubBase1 and SubBase2, but also isn't able to do a single INSERT
> >> >> statement with 10 elements in the VALUES clause (or 2 INSERT
> >> >> statements with 5 elements each).
> >> >>
> >> >> Questions:
> >> >> 1) Is Snippet #2 the desired behavior? I read through the
> >> >> unit-of-work summary at http://www.aosabook.org/en/sqlalchemy.html,
> >> >> but I wasn't totally sure what the desired behavior was for
> >> >> single-table inheritance models.
> >> >
> >> > yes, because (you'll note) sub1 and sub2 have different columns
> >> > mapped, the actual columns in each INSERT statement are different.
> >> > At the same time it is maintaining INSERT order into the table
> >> > overall, so it has no choice but to break things up in this way.
> >> >
> >> >> 2) If this is desired behavior, do you have any suggestions of how
> >> >> to optimize the number of queries here in e.g. a before_flush hook?
> >> >> Obviously one way would be to rewrite our application such that
> >> >> inserts of like tables happen together, but if there is a way to do
> >> >> this without rewriting the application code, that would be even
> >> >> better.
> >> >
> >> > if your INSERTs are very simplistic then you can still consider
> >> > using the bulk_save option, gathering your classes into lists and
> >> > then doing your own bulk operation where you ensure all like rows
> >> > are grouped together ahead of time.
> >> >
> >> > Otherwise, if you really are trying to thread the needle through "I
> >> > want full ORM flush convenience" plus "I need to optimize INSERTs
> >> > into batches", you have one more option, which is to manipulate the
> >> > insert_order on the states like this:
> >> >
> >> > for i in xrange(0, 10, 2):
> >> >     s.add(SubBase1(i))
> >> >     s.add(SubBase2(i + 1))
> >> >
> >> > from sqlalchemy import inspect
> >> >
> >> > for idx, obj in enumerate(
> >> >     sorted(
> >> >         s.new,
> >> >         key=lambda obj: (obj.__class__, inspect(obj).insert_order)
> >> >     )
> >> > ):
> >> >     inspect(obj).insert_order = idx
> >> >
> >> >
> >> > I've not documented "insert_order" as an official thing but I can
> >> > support that.
> >> >
> >> > But also if you're really trying to maximize INSERT performance you
> >> > need to use the psycopg2 executemany() extensions:
> >> > http://initd.org/psycopg/docs/extras.html#fast-execution-helpers.
> >> > These were added specifically from an email thread I started with
> >> > them in response to user complaints about performance, but
> >> > SQLAlchemy has not yet built these in as features.
> >> >
> >> > I've done no testing at all of this, however if you have the
> >> > resources, you can turn on statement logging on your Postgresql
> >> > database which will hopefully show INSERT statements being collapsed
> >> > into a single statement with multiple VALUES:
> >> >
> >> > from psycopg2 import extras
> >> > from sqlalchemy import event
> >> >
> >> > e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
> >> >
> >> > @event.listens_for(e, "do_executemany")
> >> > def do_executemany(cursor, statement, parameters, context):
> >> >     context.dialect.supports_sane_multi_rowcount = False
> >> >     extras.execute_batch(cursor, statement, parameters)
> >> >     return True
> >> >
> >> > if you get the time to test this, please let me know as I do want to
> >> > add support for "execute_batch" directly into the psycopg2 dialect.
> >> > thanks!
> >> >
> >> >> One option I've thought of is:
> >> >> - Overwrite the insert_order in the before_flush hook to rearrange
> >> >> the models in the order I want. (Not sure if this has effects that
> >> >> I'm unaware of though)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> --
> >> >> SQLAlchemy -
> >> >> The Python SQL Toolkit and Object Relational Mapper
> >> >>
> >> >> http://www.sqlalchemy.org/
> >> >>
> >> >> To post example code, please provide an MCVE: Minimal, Complete, and
> >> >> Verifiable Example. See http://stackoverflow.com/help/mcve for a
> full
> >> >> description.
> >> >> ---
> >> >> You received this message because you are subscribed to the Google
> >> >> Groups
> >> >> "sqlalchemy" group.
> >> >> To unsubscribe from this group and stop receiving emails from it,
> send
> >> >> an
> >> >> email to [email protected].
> >> >> To post to this group, send email to [email protected].
> >> >> Visit this group at https://groups.google.com/group/sqlalchemy.
> >> >> For more options, visit https://groups.google.com/d/optout.
> >