I ran some tests using raw psycopg2 on my local computer, and also ran some
tests through SQLAlchemy. I misunderstood a few things in my earlier tests;
I've explained the new tests below and believe they are more accurate.
*Context:*
- Each test was run 5 times and averaged.
- Both the Python code and the database were running on the same machine
(2015 MacBook Pro, Postgres 9.5 running in Docker)
*Raw psycopg2:*
# Rough code:
extras.execute_values(cursor, 'INSERT INTO table_a (field1, field2) VALUES %s', [...])
extras.execute_batch(cursor, 'INSERT INTO table_a (field1, field2) VALUES (%s, %s)', [...])
cursor.executemany('INSERT INTO table_a (field1, field2) VALUES (%s, %s)', [...])
# Inserting 1000 rows
execute_values, elapsed time: 0.023967s
execute_batch, elapsed time: 0.051530s
executemany, elapsed time: 0.173563s
# Inserting 10000 rows
execute_values, elapsed time: 0.268656s
execute_batch, elapsed time: 0.488736s
executemany, elapsed time: 2.017565s
# Inserting 100000 rows
execute_values, elapsed time: 1.858675s
execute_batch, elapsed time: 4.062823s
executemany, elapsed time: 19.900875s
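In case it's useful, here's roughly the full harness around those calls (a sketch, not the exact script - the DSN, the table_a schema, and the synthetic row data are placeholders):

```python
import time


def make_rows(n):
    # Synthetic (field1, field2) pairs standing in for the real data.
    return [("f1_%d" % i, "f2_%d" % i) for i in range(n)]


def bench(cursor, rows):
    # psycopg2 is imported here so the rest of the file can be inspected
    # without it installed.
    from psycopg2 import extras

    values_sql = "INSERT INTO table_a (field1, field2) VALUES %s"
    many_sql = "INSERT INTO table_a (field1, field2) VALUES (%s, %s)"
    options = [
        ("execute_values", lambda: extras.execute_values(cursor, values_sql, rows)),
        ("execute_batch", lambda: extras.execute_batch(cursor, many_sql, rows)),
        ("executemany", lambda: cursor.executemany(many_sql, rows)),
    ]
    # Note: each option inserts the same rows again; truncate table_a
    # between runs if you want strictly comparable numbers.
    for label, run in options:
        start = time.time()
        run()
        print("%s, elapsed time: %fs" % (label, time.time() - start))


# Usage (requires a running Postgres with a two-column table_a):
# conn = psycopg2.connect("dbname=test")  # placeholder DSN
# with conn, conn.cursor() as cursor:
#     bench(cursor, make_rows(1000))
```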
*SQLAlchemy layer:*
for field1, field2 in genome_infos(rows):
    db.session.add(TableA(field1, field2))

# Time this part:
db.session.flush()
I used the code you provided above (instead of the flag you recently
pushed), which allowed me to test both execute_batch and execute_values.
Here's what I used:
@event.listens_for(Engine, "do_executemany")
def do_executemany(cursor, statement, parameters, context):
    context.dialect.supports_sane_multi_rowcount = False

    # Option 1: executemany
    cursor.executemany(statement, parameters)

    # Option 2: execute_batch
    extras.execute_batch(cursor, statement, parameters)

    # Option 3: execute_values
    statement = re.sub('VALUES.*', 'VALUES %s', statement)
    parameters = [(info['field1'], info['field2']) for info in parameters]
    extras.execute_values(cursor, statement, parameters)

    return True
Obviously only one option was used at a time :)
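To be explicit about the rewrite in the execute_values option: execute_values expects a single `VALUES %s` placeholder, so the executemany-style statement SQLAlchemy hands to the listener has its VALUES clause replaced. A standalone illustration (the statement shown is representative, not the exact one SQLAlchemy generates):

```python
import re

# A statement of the shape SQLAlchemy passes to the handler; the real one
# uses the mapped table's actual columns.
statement = "INSERT INTO table_a (field1, field2) VALUES (%(field1)s, %(field2)s)"
rewritten = re.sub(r"VALUES.*", "VALUES %s", statement)
print(rewritten)  # -> INSERT INTO table_a (field1, field2) VALUES %s
```

The parameters also have to be converted from dicts to tuples in column order, which is what the list comprehension in the listener does.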
*SQLAlchemy Results:*
Inserting 1000 rows:
execute_values: 0.083958s
execute_batch: 0.110223s
executemany: 0.276129s
Inserting 10000 rows:
execute_values: 1.243230s
execute_batch: 1.388278s
executemany: 3.131808s
Inserting 100000 rows:
execute_values: 13.909975s
execute_batch: 14.942507s
executemany: 29.671092s
*Conclusions:*
- execute_batch is a significant improvement over executemany (10x at
the psycopg2 layer)
- Subtracting the ~10-12 seconds of SQLAlchemy/Python overhead for
inserting 100,000 rows gives roughly the raw psycopg2 times for each
execute_* option.
- The 5000% improvements stated
in https://github.com/psycopg/psycopg2/issues/491#issuecomment-276551038
are probably exaggerated for most users - that was for "transatlantic
network connections", whereas I imagine most users have databases much
closer to their servers. However, it's still a 10x speed-up over
executemany at the psycopg2 layer, and a ~2-3x speed-up including
SQLAlchemy overhead.
- execute_values is still twice as fast as execute_batch at the psycopg2
layer (for this specific table, at least), so incorporating that would be
even better!
- execute_batch only helps with inserts into tables where all primary keys
are defined (as you noted). Thus, if users want to see improvements for
tables with auto-incrementing primary keys and relationships, they'll
likely need to combine this with something like the suggestion in
https://groups.google.com/forum/#!topic/sqlalchemy/GyAZTThJi2I
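For anyone following along, that suggestion boils down to pre-fetching primary keys from the Postgres sequence in one round trip so every row is fully defined and the ORM can batch. A rough sketch (the sequence name 'table_a_id_seq' and the session usage are my assumptions, not code from the linked thread):

```python
def nextval_query(sequence, n):
    """Build a query that draws n ids from a Postgres sequence in one round trip."""
    # sequence and n are trusted, developer-supplied values here;
    # don't interpolate user input this way.
    return "SELECT nextval('%s') FROM generate_series(1, %d)" % (sequence, n)


# Usage with a live session (assign ids up front so flush() can batch):
# objs = [TableA(field1, field2) for field1, field2 in genome_infos(rows)]
# ids = [r[0] for r in db.session.execute(nextval_query('table_a_id_seq', len(objs)))]
# for obj, id_ in zip(objs, ids):
#     obj.id = id_
# db.session.add_all(objs)
# db.session.flush()
```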
Hopefully that was helpful :) Happy to help test in any other ways as well!
On Wednesday, October 11, 2017 at 7:25:06 AM UTC-7, Mike Bayer wrote:
>
> On Wed, Oct 11, 2017 at 3:02 AM, <[email protected]> wrote:
> > Hey Mike,
> >
> > Thanks again for the detailed explanations!
> >
> > I went ahead and tested the code snippet you gave me - I'm not sure I
> > totally understand when this will change behavior though.
> >
> > I tried the following code snippet:
> > (Snippet #1)
> > for i in xrange(10):
> >     db.session.add(A(id=i))
> > db.session.flush()
> >
> > This calls through to the "do_executemany" handler and only executes 1
> > insert statement with multiple VALUES. However, this was already the
> > existing behavior right?
>
> There's multiple forms of "one insert statement".
>
> There is traditional "executemany", which to the database looks like:
>
> INSERT INTO table (x, y) VALUES (:x, :y)
>
> The above statement is invoked for each set of parameters.
>
> then there is multiple values, which to the database looks like:
>
> INSERT INTO table (x, y) VALUES (:x1, :y1), (:x2, :y2), (:x3, :y3), ...
>
> The above statement is invoked once, with all sets of parameters at
> once (in reality they are usually batched in groups and the statement
> is invoked once per group). MySQL drivers do this as does psycopg2
> execute_values().
>
> then there is psycopg2 execute_batch which, yes, I forgot, is doing
> "insert into table (x, y) values (:x, :y); insert into table (x, y)
> values (:x, :y); ...", it's not actually rewriting the VALUES clause.
>
> In any case, SQLAlchemy only knows about the first form within the ORM.
>
> >
> > When I don't specify a primary key:
> > (Snippet #2)
> > for i in xrange(10):
> >     db.session.add(A())  # don't pre-specify a primary key
> > db.session.flush()
> >
> > This does *not* call through to the do_executemany handler, and
> > executes 10 insert statements.
>
> the ORM must retrieve newly generated primary key values if they were
> not already present, which cannot be done with the cursor.executemany()
> method. It must use cursor.execute().
>
>
> >
> > If I understand
> > http://initd.org/psycopg/docs/extras.html#fast-execution-helpers
> > correctly, execute_batch does *not* actually combine the insert
> > statements into a single statement with multiple VALUES clauses, but
> > instead does something like sending multiple INSERT statements to the
> > server at once.
>
> yes, I forgot - execute_batch is actually stringing the statements
> together on a semicolon, not rewriting values(). However, if you look
> at the performance gains at
> https://github.com/psycopg/psycopg2/issues/491#issuecomment-276551038,
> there is negligible difference between batch and values, compared to
> the default, which is 5000% slower.
>
> > We would obviously still expect to see performance benefits from the
> > reduced network round-trips, but it looks like the above tests do not
> > show any improvement there.
>
> I'm not familiar with what testing you are performing. Are you
> testing the new flag? what happens if you run the tests at
>
> https://www.postgresql.org/message-id/20170130215151.GA7081%40deb76.aryehleib.com
>
> e.g. https://gist.github.com/altaurog/84668e034646fb354b5de81bb86a580d ?
> If the feature is working, and the performance gains are being lost
> for some reason, then that's a major issue in psycopg2 we need to
> report.
>
> >
> > Let me know if I'm testing that incorrectly - I'm also happy to test
> > this on our actual production use cases to give you some more
> > accurate speed-up times.
> >
> > On Tuesday, October 10, 2017 at 10:36:33 AM UTC-7, Mike Bayer wrote:
> >>
> >> I've added
> >> https://bitbucket.org/zzzeek/sqlalchemy/issues/4109/support-psycopg2-batch-mode
> >> which will have a code review up soon.
> >>
> >> On Tue, Oct 10, 2017 at 1:06 PM, Mike Bayer <[email protected]>
> >> wrote:
> >> > On Tue, Oct 10, 2017 at 1:47 AM, <[email protected]> wrote:
> >> >> Hello! SQLAlchemy has the (really useful) behavior of batching
> >> >> together inserts when all the primary keys are defined. It looks
> >> >> like it groups similar tables together (while maintaining insert
> >> >> order) to minimize the number of INSERT statements sent to the
> >> >> database.
> >> >>
> >> >> I'm unsure what the expected behavior here is for single-table
> >> >> inheritance
> >> >> tables. Here's some example models:
> >> >> class Base(db.Model):
> >> >>     __tablename__ = 'base'
> >> >>     id = db.Column(db.Integer, primary_key=True)
> >> >>     base_type = db.Column(db.String)
> >> >>
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_on': base_type,
> >> >>     }
> >> >>
> >> >>     def __init__(self, id_):
> >> >>         self.id = id_
> >> >>
> >> >>
> >> >> class SubBase1(Base):
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_identity': '1',
> >> >>     }
> >> >>
> >> >>     col = db.Column(db.String)
> >> >>
> >> >>
> >> >> class SubBase2(Base):
> >> >>     __mapper_args__ = {
> >> >>         'polymorphic_identity': '2',
> >> >>     }
> >> >>
> >> >>     col2 = db.Column(db.String)
> >> >>
> >> >>
> >> >> class OtherModel(db.Model):
> >> >>     id = db.Column(db.Integer, primary_key=True)
> >> >>
> >> >>     def __init__(self, id_):
> >> >>         self.id = id_
> >> >>
> >> >>
> >> >> Base, SubBase1, and SubBase2 form a tree, while OtherModel is just
> >> >> another model.
> >> >>
> >> >> When creating alternating SubBase1s and OtherModels, SQLAlchemy
> >> >> batches the INSERTs such that only two INSERT statements are sent
> >> >> to the DB:
> >> >> (Snippet #1)
> >> >> for i in xrange(0, 10, 2):
> >> >>     db.session.add(SubBase1(i))
> >> >>     db.session.add(OtherModel(i + 1))
> >> >>
> >> >> db.session.flush()  # Performs 2 insert statements, each with 5
> >> >>                     # elements in the VALUES clause
> >> >>
> >> >> However, when creating alternating SubBase1s and SubBase2s, it
> >> >> actually performs 10 separate insert statements:
> >> >> (Snippet #2)
> >> >> for i in xrange(0, 10, 2):
> >> >>     db.session.add(SubBase1(i))
> >> >>     db.session.add(SubBase2(i + 1))
> >> >> db.session.flush()  # Performs 10 insert statements, each with 1
> >> >>                     # element in the VALUES clause
> >> >>
> >> >>
> >> >> It seems like SQLAlchemy maintains the insert_order *across*
> >> >> SubBase1 and SubBase2, but also isn't able to do a single INSERT
> >> >> statement with 10 elements in the VALUES clause (or 2 INSERT
> >> >> statements with 5 elements each).
> >> >>
> >> >> Questions:
> >> >> 1) Is Snippet #2 the desired behavior? I read through the
> >> >> unit-of-work summary at http://www.aosabook.org/en/sqlalchemy.html,
> >> >> but I wasn't totally sure what the desired behavior was for
> >> >> single-table inheritance models.
> >> >
> >> > yes, because (you'll note) sub1 and sub2 have different columns
> >> > mapped, the actual columns in each INSERT statement are different.
> >> > At the same time it is maintaining INSERT order into the table
> >> > overall, so it has no choice but to break things up in this way.
> >> >
> >> >> 2) If this is desired behavior, do you have any suggestions of how
> >> >> to optimize the number of queries here in e.g. a before_flush hook?
> >> >> Obviously one way would be to rewrite our application such that
> >> >> inserts of like tables happen together, but if there is a way to do
> >> >> this without rewriting the application code, that would be even
> >> >> better.
> >> >
> >> > if your INSERTs are very simplistic then you can still consider
> >> > using the bulk_save option, gathering your classes into lists and
> >> > then doing your own bulk operation where you ensure all like rows
> >> > are grouped together ahead of time.
> >> >
> >> > Otherwise, if you really are trying to thread the needle through "I
> >> > want full ORM flush convenience" plus "I need to optimize INSERTs
> >> > into batches", you have one more option, which is to manipulate the
> >> > insert_order on the states like this:
> >> >
> >> > for i in xrange(0, 10, 2):
> >> >     s.add(SubBase1(i))
> >> >     s.add(SubBase2(i + 1))
> >> >
> >> > from sqlalchemy import inspect
> >> >
> >> > for idx, obj in enumerate(
> >> >     sorted(
> >> >         s.new,
> >> >         key=lambda obj: (obj.__class__, inspect(obj).insert_order)
> >> >     )
> >> > ):
> >> >     inspect(obj).insert_order = idx
> >> >
> >> >
> >> > I've not documented "insert_order" as an official thing but I can
> >> > support that.
> >> >
> >> > But also if you're really trying to maximize INSERT performance you
> >> > need to use the psycopg2 executemany() extensions:
> >> > http://initd.org/psycopg/docs/extras.html#fast-execution-helpers.
> >> > These were added specifically from an email thread I started with
> >> > them in response to user complaints about performance, but
> >> > SQLAlchemy has not yet built these in as features.
> >> >
> >> > I've done no testing at all of this, however if you have the
> >> > resources, you can turn on statement logging on your Postgresql
> >> > database which will hopefully show INSERT statements being collapsed
> >> > into a single statement with multiple VALUES:
> >> >
> >> > from psycopg2 import extras
> >> > from sqlalchemy import event
> >> >
> >> > e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
> >> >
> >> > @event.listens_for(e, "do_executemany")
> >> > def do_executemany(cursor, statement, parameters, context):
> >> >     context.dialect.supports_sane_multi_rowcount = False
> >> >     extras.execute_batch(cursor, statement, parameters)
> >> >     return True
> >> >
> >> > if you get the time to test this, please let me know as I do want to
> >> > add support for "execute_batch" directly into the psycopg2 dialect.
> >> > thanks!
> >> >
> >> >> One option I've thought of is:
> >> >> - Overwrite the insert_order in the before_flush hook to rearrange
> >> >> the models in the order I want. (Not sure if this has effects that
> >> >> I'm unaware of though)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> --
> >> >> SQLAlchemy -
> >> >> The Python SQL Toolkit and Object Relational Mapper
> >> >>
> >> >> http://www.sqlalchemy.org/
> >> >>
> >> >> To post example code, please provide an MCVE: Minimal, Complete, and
> >> >> Verifiable Example. See http://stackoverflow.com/help/mcve for a
> full
> >> >> description.
> >> >> ---
> >> >> You received this message because you are subscribed to the Google
> >> >> Groups
> >> >> "sqlalchemy" group.
> >> >> To unsubscribe from this group and stop receiving emails from it,
> send
> >> >> an
> >> >> email to [email protected].
> >> >> To post to this group, send email to [email protected].
> >> >> Visit this group at https://groups.google.com/group/sqlalchemy.
> >> >> For more options, visit https://groups.google.com/d/optout.
> >