On Nov 13, 2013, at 12:22 PM, Claudio Freire <[email protected]> wrote:
> On Wed, Nov 13, 2013 at 2:04 PM, Michael Bayer <[email protected]> wrote:
>>
>> On Nov 13, 2013, at 11:52 AM, Claudio Freire <[email protected]> wrote:
>>
>>> On Wed, Nov 13, 2013 at 1:45 PM, Michael Bayer <[email protected]> wrote:
>>>>
>>>> On Wednesday, November 6, 2013 at 21:58:53 UTC+1, Michael Bayer wrote:
>>>>>
>>>>> I wrote a full post regarding this topic on stackoverflow at
>>>>> http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768
>>>>> . If you start with this, I can answer more specific questions.
>>>>
>>>> The article was very helpful, thanks. I still want to figure out the best
>>>> balance between convenience and speed for my use case. Does the following
>>>> make sense, and is it possible?
>>>>
>>>> I work only with Postgresql, and I'm sure that all involved objects have a
>>>> unique id column which is called 'id'. So before doing a session.commit(),
>>>> I could check how many objects are in my session. As I'm just bulk
>>>> inserting, I know that all of them are new and don't have their id set yet.
>>>> Now I ask the database for that number of new ids, iterate over the objects
>>>> in my session, and set the ids. Internally all ids would come from a single
>>>> sequence, so I don't have to care about object types and so on. Afterwards
>>>> SqlAlchemy would be aware that the ids have already been set, so no
>>>> generated ids would have to be returned, and the session.commit() should be
>>>> much simpler and faster.
>>>>
>>>> Sounds like a still quite simple, but hopefully much faster, solution. Do
>>>> you agree?
>>>>
>>>> sure that should be fine, if you can pre-calc your PKs. It just won't work
>>>> under any kind of concurrency, as in such a situation there could be
>>>> interleaved INSERTs from different processes.
>>>
>>> Postgresql sequences already handle that kind of concurrency scenario.
>>
>> how exactly? If two transactions T1 and T2 both pull a number from a
>> sequence, T1 gets 40, T2 gets 41, and then each pre-calcs 10 inserts that
>> have not yet occurred, T1 uses 40-49 and T2 uses 41-50 - the numbers are
>> totally wrong for both; in reality it would be some random distribution of
>> 40-59 between T1 and T2. No?
>
> No, you ask the same sequence for 10 ids, and the sequence allocates
> T1 40-49 and T2 50-59

oh right, I'm thinking of the trick as applied to MySQL. Sure, if you actually run the seq's, you're fine.
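The block-allocation behaviour Claudio describes can be sketched in pure Python. The class below is a toy stand-in for a PostgreSQL sequence, not SQLAlchemy code; against a real database you would instead pre-fetch the block with something like `SELECT nextval('obj_id_seq') FROM generate_series(1, :n)` (the sequence name here is a hypothetical example):

```python
import itertools
import threading

class FakeSequence:
    """Toy stand-in for a PostgreSQL sequence: every nextval() hands out
    a distinct value, even when called from concurrent transactions."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def nextval(self):
        with self._lock:
            return next(self._counter)

    def allocate(self, n):
        # Pre-fetch a block of n ids, analogous to
        #   SELECT nextval('obj_id_seq') FROM generate_series(1, n)
        # on the real database (sequence name is illustrative).
        return [self.nextval() for _ in range(n)]

seq = FakeSequence(start=40)
t1_ids = seq.allocate(10)  # transaction T1 asks first
t2_ids = seq.allocate(10)  # transaction T2 asks second

# The blocks never interleave: T1 gets 40-49, T2 gets 50-59, which is
# why pre-assigning ids before session.commit() is safe under concurrency.
assert t1_ids == list(range(40, 50))
assert t2_ids == list(range(50, 60))
```

Each transaction would then assign its block to its pending objects' `id` attributes before flushing, so the database never needs to generate (or return) the keys itself.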
