Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-16 Thread Lawrence D'Oliveiro
In message i96squ$aq...@localhost.localdomain, Martin Gregorie wrote:

 On Thu, 14 Oct 2010 16:36:34 +1300, Lawrence D'Oliveiro wrote:
 
 In message 4cb5e659$0$1650$742ec...@news.sonic.net, John Nagle wrote:
 
  Also note that there are some issues with doing a huge volume of
 updates in one MySQL InnoDB transaction.  The system has to keep the
 data needed to undo the updates, and there's a limit on the amount of
 pending transaction history that can be stored.
 
 How does “load data” avoid this? Is that not a transaction too?

 Not usually. It's faster because there's no journalling overhead.

So what happens if there’s an error part way? Do you end up with some 
records inserted and others not?

 The loader takes out an exclusive table lock, dumps the data into the
 table, rebuilds indexes and releases the lock.

That’s not acceptable in a situation where other processes are concurrently 
making queries on the table.

 Seems to me this isn’t going to help, since both old and new tables are
 on the same disk, after all. And it’s the disk access that’s the
 bottleneck.

 There's a lot of overhead in journalling - much more than in applying
 changes to a table. The before and after images *must* be flushed to disk
 on commit. In UNIX terms fsync() must be called on the journal file(s)
 and this is an expensive operation on all OSes because committing a
 series of small transactions can cause the same disk block to be written
 several times. However, the table pages can safely be left in the DBMS
 cache and flushed as part of normal cache operation since, after a crash,
 the table changes can always be recovered from a journal roll-forward. A
 good DBMS will do that automatically when it's restarted.

And how does this address the point of whether to load the new records into 
a new table or the same table?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-16 Thread Martin Gregorie
On Sun, 17 Oct 2010 11:27:17 +1300, Lawrence D'Oliveiro wrote:

 In message i96squ$aq...@localhost.localdomain, Martin Gregorie wrote:
 
 On Thu, 14 Oct 2010 16:36:34 +1300, Lawrence D'Oliveiro wrote:
 
 In message 4cb5e659$0$1650$742ec...@news.sonic.net, John Nagle
 wrote:
 
  Also note that there are some issues with doing a huge volume of
 updates in one MySQL InnoDB transaction.  The system has to keep the
 data needed to undo the updates, and there's a limit on the amount of
 pending transaction history that can be stored.
 
 How does “load data” avoid this? Is that not a transaction too?

 Not usually. It's faster because there's no journalling overhead.
 
 So what happens if there’s an error part way? Do you end up with some
 records inserted and others not?
 
Since the load is usually used to populate an empty table, you simply 
empty the table and do it again if it doesn't complete for some reason. 
If you're doing a partial load you'd normally dump the table and/or take 
a backup beforehand.

 The loader takes out an exclusive table lock, dumps the data into the
 table, rebuilds indexes and releases the lock.
 
 That’s not acceptable in a situation where other processes are
 concurrently making queries on the table.
 
Obviously, but you're unlikely to be running anything while you're doing 
a bulk load. If the database is used to support interactive processing 
you'd normally only use bulk loading during recovery after a disk failure 
or as part of a database migration or restructuring exercise.

In fact the only case I know of where bulk loads are a normal part of the 
data life cycle is for data warehousing, where bulk loads are the norm. 
The last month's transactions are removed from the online system and bulk 
loaded into the data warehouse for analysis. In this case the bulk loads 
are either done overnight or during a weekend. However, any RDBMS that 
has been optimised for data warehousing will almost certainly have the 
ability to segment its fact table. This has two benefits:

- you can bulk load a new segment offline and put it online when the
  load is complete. This can be done during normal operation.

- the DBMS can automatically split a query and run parallel copies
  against each segment before combining the result sets for return
  to the program that made the query.
   
 Seems to me this isn’t going to help, since both old and new tables
 are on the same disk, after all. And it’s the disk access that’s the
 bottleneck.

 There's a lot of overhead in journalling - much more than in applying
 changes to a table. The before and after images *must* be flushed to
 disk on commit. In UNIX terms fsync() must be called on the journal
 file(s) and this is an expensive operation on all OSes because
 committing a series of small transactions can cause the same disk block
 to be written several times. However, the table pages can safely be
 left in the DBMS cache and flushed as part of normal cache operation
 since, after a crash, the table changes can always be recovered from a
 journal roll-forward. A good DBMS will do that automatically when it's
 restarted.
 
 And how does this address the point of whether to load the new records
 into a new table or the same table?

That's a system design choice. There is a limited set of options, and it's 
up to the designer to choose one:
  
- if you want to load data into a new table and rename CURRENT->OLD, 
  NEW->CURRENT, you have to design all queries to operate on both OLD
  and CURRENT tables

- if your DBMS allows table segmentation and gives the ability to offline
  segments you can bulk load as I already described.

- you can stop the system during a quiet time while you do a bulk load
  and/or minimise overheads by using very large transactions.

- you can bite the bullet, accept the performance hit and simply run a
  relatively slow load process using small transactions during normal
  operation.

There are different overheads associated with each choice. A good system 
designer will understand them, know their relative weight for the target 
hardware and database, and use this information to optimise the design.
  
What I said about transactional and journalling overheads was 
explanation: unless you know that, you'll never understand why you should 
expect bulk loading to be a lot faster than doing it with transactions 
designed to logically group related operations or (much, much worse) with 
implied commits. The latter really is the worst of all worlds since it 
maximises journalling overhead.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org   |
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-15 Thread Lawrence D'Oliveiro
In message
ecdbf6b3-3f99-4b59-b7f8-85bd22f97...@w9g2000prc.googlegroups.com, Steve 
Howell wrote:

 On Oct 13, 8:32 pm, Lawrence D'Oliveiro l...@geek-central.gen.new_zealand
 wrote:

 In message
 d2451907-c0d2-4571-b3e1-1e4d4f66a...@a7g2000prb.googlegroups.com, Steve
 Howell wrote:

 Bulk-load strategies usually solve one or more of these problems:

 network latency

 That’s not an issue. This is a bulk operation, after all.

 index maintenance during the upload

 There are usually options to temporarily turn this off.

 parsing of the SQL

 Not usually a problem, as far as I can tell.

 reallocating storage

 You mean for the SQL? Again, not usually a problem, as far as I can tell.
 
 If you are avoiding network latency and turning off indexes, then you
 are using some kind of a bulk-load strategy.

I thought we were talking about “load data” versus using simple variants of 
“insert”. At least, that’s what I was talking about. What were you talking 
about?

 If you are not concerned about parsing costs or storage churn ...

What is this “storage churn” thing, and why are you raising it now?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-14 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 10/12/2010 11:10 AM, Roy Smith wrote:
 PEP 249 says about executemany():

  Prepare a database operation (query or command) and then
  execute it against all parameter sequences or mappings
  found in the sequence seq_of_parameters.

 are there any plans to update the api to allow an iterable instead of
 a sequence?
 
 That question would best be addressed to the pep author
 Marc-André Lemburg mal at lemburg.com

Questions about the DB-API should be discussed on the Python DB-SIG
list (put on CC):

http://mail.python.org/mailman/listinfo/db-sig

Regarding your question:

At the time the PEP was written, Python did not have iterables.

However, even with iterables, please keep in mind that pushing
the data row-per-row over a network does not result in good
performance, so using an iterable will make your update slower.

cursor.executemany() is meant to allow the database module
to optimize sending bulk data to the database and ideally,
it will send the whole sequence to the database in one go.

If you want to efficiently run an update with millions of
entries based on an iterable, it is better to use an intermediate
loop which builds sequences of say 1000 rows and then processes
those with a cursor.executemany() call.

You will likely also do this in multiple transactions to
prevent the database from creating a multi-GB transaction log
for the upload.
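
As a rough sketch of that intermediate loop (the table, columns and %s
paramstyle below are only illustrative, not part of the DB-API)::

    import itertools

    def batches(iterable, size=1000):
        # Yield successive lists of at most `size` parameter tuples.
        it = iter(iterable)
        while True:
            batch = list(itertools.islice(it, size))
            if not batch:
                break
            yield batch

    def bulk_update(connection, rows, size=1000):
        cursor = connection.cursor()
        for batch in batches(rows, size):
            cursor.executemany(
                "UPDATE foo SET bar = %s WHERE id = %s", batch)
            # Commit per batch so the pending-transaction log stays small.
            connection.commit()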

Another aspect to keep in mind is error reporting. When sending
bulk data to a database, some databases only report errors
for the whole data block, so finding the problem can be
troublesome. For that reason, using smaller blocks is better
even when the data is available as a real sequence.

Hope that helps,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 14 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-14 Thread Martin Gregorie
On Thu, 14 Oct 2010 16:36:34 +1300, Lawrence D'Oliveiro wrote:

 In message 4cb5e659$0$1650$742ec...@news.sonic.net, John Nagle wrote:
 
  Also note that there are some issues with doing a huge volume of
 updates in one MySQL InnoDB transaction.  The system has to keep the
 data needed to undo the updates, and there's a limit on the amount of
 pending transaction history that can be stored.
 
 How does “load data” avoid this? Is that not a transaction too?

Not usually. It's faster because there's no journalling overhead. The 
loader takes out an exclusive table lock, dumps the data into the table, 
rebuilds indexes and releases the lock. I can't comment about MySQL 
(don't use it) but this has been the case on the RDBMS databases I have 
used.
 
 Seems to me this isn’t going to help, since both old and new tables are
 on the same disk, after all. And it’s the disk access that’s the
 bottleneck.

There's a lot of overhead in journalling - much more than in applying 
changes to a table. The before and after images *must* be flushed to disk 
on commit. In UNIX terms fsync() must be called on the journal file(s) 
and this is an expensive operation on all OSes because committing a 
series of small transactions can cause the same disk block to be written 
several times. However, the table pages can safely be left in the DBMS 
cache and flushed as part of normal cache operation since, after a crash, 
the table changes can always be recovered from a journal roll-forward. A 
good DBMS will do that automatically when it's restarted.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org   |
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-14 Thread Steve Howell
On Oct 13, 8:32 pm, Lawrence D'Oliveiro l...@geek-
central.gen.new_zealand wrote:
 In message
 d2451907-c0d2-4571-b3e1-1e4d4f66a...@a7g2000prb.googlegroups.com, Steve

 Howell wrote:
  Bulk-load strategies usually solve one or more of these problems:

   network latency

 That’s not an issue. This is a bulk operation, after all.

   index maintenance during the upload

 There are usually options to temporarily turn this off.

   parsing of the SQL

 Not usually a problem, as far as I can tell.

   reallocating storage

 You mean for the SQL? Again, not usually a problem, as far as I can tell.

If you are avoiding network latency and turning off indexes, then you
are using some kind of a bulk-load strategy.

If you are not concerned about parsing costs or storage churn, then
you are simply evaluating the costs of a non-bulk-oriented strategy
and determining that they are minimal for your needs, which is fine.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-14 Thread Lawrence D'Oliveiro
In message mailman.1711.1287055174.29448.python-l...@python.org, M.-A. 
Lemburg wrote:

 However, even with iterables, please keep in mind that pushing
 the data row-per-row over a network does not result in good
 performance, so using an iterable will make your update slower.
 
 cursor.executemany() is meant to allow the database module
 to optimize sending bulk data to the database and ideally,
 it will send the whole sequence to the database in one go.

You seem to be assuming that using an iterator precludes buffering.

What’s wrong with evaluating the iterator to produce as many records as the 
API implementation finds convenient to send at once?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread Jon Clements
On 12 Oct, 20:21, J. Gerlach gerlach_jo...@web.de wrote:
 On 12.10.2010 17:10, Roy Smith wrote:

  [A]re there any plans to update the api to allow an iterable instead of
  a sequence?

 sqlite3 (standard library, Python 2.6.6, Windows 32-bit) does that already::

 import sqlite3 as sql

 connection = sql.connect(":memory:")

 cursor = connection.execute("""
     CREATE TABLE test (
         id INTEGER PRIMARY KEY AUTOINCREMENT,
         text TEXT)
     ;""")
 connection.commit()
 cursor.executemany(
     "INSERT INTO test (text) VALUES ( ? );",
     # A generator expression - delivers one row at a time
     (("hello nr %03d!" % i,) for i in xrange(100)))
 connection.commit()
 cursor.execute("SELECT * FROM test")

 for id_, text in cursor.fetchall():
     print text, id_

What happens if you do itertools.repeat(0) instead of xrange(100) ?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread J. Gerlach
On 13.10.2010 14:26, Jon Clements wrote:
 On 12 Oct, 20:21, J. Gerlach gerlach_jo...@web.de wrote:
 On 12.10.2010 17:10, Roy Smith wrote:

 [A]re there any plans to update the api to allow an iterable instead of
 a sequence?

 [sqlite3 example snipped]
 
 What happens if you do itertools.repeat(0) instead of xrange(100) ?

And there I see why I don't want ``executemany()`` to take a generator
-- the never-ending ones will do some damage.

But on a similar note: would an ``execute_n_repeats(statement, iterable,
maximum_rows)`` be better?
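
A minimal sketch of what such a helper might look like (the name and
signature are just my suggestion above, with an explicit cursor argument
added; nothing here is part of the DB-API)::

    import itertools

    def execute_n_repeats(cursor, statement, iterable, maximum_rows):
        # Take at most maximum_rows items, so even itertools.repeat()
        # cannot run away.
        rows = list(itertools.islice(iterable, maximum_rows))
        if rows:
            cursor.executemany(statement, rows)
        # Returning the count lets the caller loop until the iterable
        # is exhausted.
        return len(rows)

Called in a loop until it returns 0, it would work through an arbitrary
iterable in bounded chunks.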


Jörg

--
Have a nice day!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread Steven D'Aprano
On Wed, 13 Oct 2010 14:01:39 +1300, Lawrence D'Oliveiro wrote:

 In message 4cb4ba4e$0$1641$742ec...@news.sonic.net, John Nagle wrote:
 
 In general, if you find yourself making millions of SQL database
 requests in a loop, you're doing it wrong.
 
 I’ve done this. Not millions, but certainly on the order of tens of
 thousands.

"I've done it" doesn't disprove that it is the wrong thing to do. It may 
simply mean that you too are doing it wrong.

I have no opinion either way, I'm just saying.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread John Nagle

On 10/12/2010 6:01 PM, Lawrence D'Oliveiro wrote:

In message4cb4ba4e$0$1641$742ec...@news.sonic.net, John Nagle wrote:


In general, if you find yourself making millions of
SQL database requests in a loop, you're doing it wrong.


I’ve done this. Not millions, but certainly on the order of tens of
thousands.


   It's a scaling issue.  Millions of INSERT or UPDATE requests can
take hours.  That's when you need the speedup of bulk loading.


  Big database loads are usually done by creating a text file
with the desired data, then using a LOAD DATA INFILE command.


May not always be flexible enough.


   True; you can't do anything with LOAD DATA INFILE but load data.
If you need selects or joins within inserts, you may have to do it
the long way.


This (in MySQL) is tens to hundreds of times faster than doing individual
INSERT or UPDATE commands.


Why should that be? The database cannot read a text file any faster than I
can.


Because the indices are updated in one big update, rather than
after each change to the database.

Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction.  The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.

It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.
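
A rough sketch of that reload-and-rename sequence from Python, using
MySQL syntax; the table name and file path are invented for the example::

    def reload_table(connection):
        cursor = connection.cursor()
        cursor.execute("CREATE TABLE foo_new LIKE foo")
        # Bulk load into the offline copy; readers of foo are unaffected.
        cursor.execute("LOAD DATA INFILE '/tmp/foo.txt' INTO TABLE foo_new")
        # One atomic RENAME: queries see either the old table or the new one.
        cursor.execute("RENAME TABLE foo TO foo_old, foo_new TO foo")
        cursor.execute("DROP TABLE foo_old")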

I have some bulk databases which are updated from external
sources.  The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming.  That load takes hours, once a week.

(All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different.  But a million records
is big for SQLite, which is, after all, a lite database.
At 10,000 records, you don't have to worry about any of this
stuff.  At 1,000,000 records, you do.)

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread Lawrence D'Oliveiro
In message
d2451907-c0d2-4571-b3e1-1e4d4f66a...@a7g2000prb.googlegroups.com, Steve 
Howell wrote:

 Bulk-load strategies usually solve one or more of these problems:
 
  network latency

That’s not an issue. This is a bulk operation, after all.

  index maintenance during the upload

There are usually options to temporarily turn this off.

  parsing of the SQL

Not usually a problem, as far as I can tell.

  reallocating storage

 You mean for the SQL? Again, not usually a problem, as far as I can tell.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-13 Thread Lawrence D'Oliveiro
In message 4cb5e659$0$1650$742ec...@news.sonic.net, John Nagle wrote:

  Also note that there are some issues with doing a huge volume of
 updates in one MySQL InnoDB transaction.  The system has to keep the
 data needed to undo the updates, and there's a limit on the amount of
 pending transaction history that can be stored.

How does “load data” avoid this? Is that not a transaction too?

  It's common to load data into a new, empty table, then, once
 the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
 Rename of multiple tables is atomic and interlocked with other
 operations, so you can replace an entire table on a live server.

Is that to avoid impacting searches on the existing data while loading the 
new records?

Seems to me this isn’t going to help, since both old and new tables are on 
the same disk, after all. And it’s the disk access that’s the bottleneck.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Jon Clements
On 12 Oct, 16:10, Roy Smith r...@panix.com wrote:
 PEP 249 says about executemany():

         Prepare a database operation (query or command) and then
         execute it against all parameter sequences or mappings
         found in the sequence seq_of_parameters.

 are there any plans to update the api to allow an iterable instead of
 a sequence?

I'm not understanding (probably me). Example?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Roy Smith
On Oct 12, 1:20 pm, Jon Clements jon...@googlemail.com wrote:
 On 12 Oct, 16:10, Roy Smith r...@panix.com wrote:

  PEP 249 says about executemany():

          Prepare a database operation (query or command) and then
          execute it against all parameter sequences or mappings
          found in the sequence seq_of_parameters.

  are there any plans to update the api to allow an iterable instead of
  a sequence?

 I'm not understanding (probably me). Example?

I have a dictionary, d, which has a million items in it, and want to
do something like:

executemany("update foo set bar = %s where id = %s",
d.iteritems())

If executemany accepted an iterable, that would work.  But, it only
accepts a sequence, so I need to do:

executemany("update foo set bar = %s where id = %s", d.items())

which generates a million-item temporary list.  Or am I mis-
understanding the PEP?



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Jon Clements
On 12 Oct, 18:32, Roy Smith r...@panix.com wrote:
 On Oct 12, 1:20 pm, Jon Clements jon...@googlemail.com wrote:

  On 12 Oct, 16:10, Roy Smith r...@panix.com wrote:

   PEP 249 says about executemany():

           Prepare a database operation (query or command) and then
           execute it against all parameter sequences or mappings
           found in the sequence seq_of_parameters.

   are there any plans to update the api to allow an iterable instead of
   a sequence?

  I'm not understanding (probably me). Example?

 I have a dictionary, d, which has a million items in it, and want to
 do something like:

     executemany("update foo set bar = %s where id = %s",
 d.iteritems())

 If executemany accepted an iterable, that would work.  But, it only
 accepts a sequence, so I need to do:

     executemany("update foo set bar = %s where id = %s", d.items())

 which generates a million-item temporary list.  Or am I mis-
 understanding the PEP?

Interesting, but here's my guess...

Replace d.items() with itertools.repeat( ('a', 'b') )

So, if you have a sequence, which has a length and known size, at
least you can have an attempt at the DB operations: whether the
transaction fails or not is another thing...In short, a sequence is
finite, while an iterable may be infinite.

That's just my guess and makes sense to me at least!

Jon.



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Terry Reedy

On 10/12/2010 11:10 AM, Roy Smith wrote:

PEP 249 says about executemany():

 Prepare a database operation (query or command) and then
 execute it against all parameter sequences or mappings
 found in the sequence seq_of_parameters.

are there any plans to update the api to allow an iterable instead of
a sequence?


That question would best be addressed to the pep author
Marc-André Lemburg mal at lemburg.com

--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Ned Deily
In article 
2eeb1c54-83f5-4375-93fb-478bdbd7e...@j25g2000yqa.googlegroups.com,
 Jon Clements jon...@googlemail.com wrote:

 On 12 Oct, 18:32, Roy Smith r...@panix.com wrote:
  On Oct 12, 1:20 pm, Jon Clements jon...@googlemail.com wrote:
 
   On 12 Oct, 16:10, Roy Smith r...@panix.com wrote:
 
PEP 249 says about executemany():
 
        Prepare a database operation (query or command) and then
        execute it against all parameter sequences or mappings
        found in the sequence seq_of_parameters.
 
are there any plans to update the api to allow an iterable instead of
a sequence?
 
   I'm not understanding (probably me). Example?
 
  I have a dictionary, d, which has a million items in it, and want to
  do something like:
 
       executemany("update foo set bar = %s where id = %s",
   d.iteritems())
 
  If executemany accepted an iterable, that would work.  But, it only
  accepts a sequence, so I need to do:
 
       executemany("update foo set bar = %s where id = %s", d.items())
 
  which generates a million-item temporary list.  Or am I mis-
  understanding the PEP?
 
 Interesting, but here's my guess...
 
 Replace d.items() with itertools.repeat( ('a', 'b') )
 
 So, if you have a sequence, which has a length and known size, at
 least you can have an attempt at the DB operations: whether the
 transaction fails or not is another thing...In short, a sequence is
 finite, while an iterable may be infinite.

Also, keep in mind that PEP 249 DB adapter implementations are typically 
wrappers around a lower-level client library for a particular DB 
implementation and that most of those client APIs - written in C - will 
likely require all of the items to be passed in one call.  If so, the DB 
adapter would need to immediately evaluate the iterable and produce a 
list in memory anyway.

-- 
 Ned Deily,
 n...@acm.org

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Jon Clements
On 12 Oct, 18:53, Jon Clements jon...@googlemail.com wrote:
 On 12 Oct, 18:32, Roy Smith r...@panix.com wrote:



  On Oct 12, 1:20 pm, Jon Clements jon...@googlemail.com wrote:

   On 12 Oct, 16:10, Roy Smith r...@panix.com wrote:

PEP 249 says about executemany():

        Prepare a database operation (query or command) and then
        execute it against all parameter sequences or mappings
        found in the sequence seq_of_parameters.

are there any plans to update the api to allow an iterable instead of
a sequence?

   I'm not understanding (probably me). Example?

  I have a dictionary, d, which has a million items in it, and want to
  do something like:

      executemany("update foo set bar = %s where id = %s",
  d.iteritems())

  If executemany accepted an iterable, that would work.  But, it only
  accepts a sequence, so I need to do:

      executemany("update foo set bar = %s where id = %s", d.items())

  which generates a million-item temporary list.  Or am I mis-
  understanding the PEP?

 Interesting, but here's my guess...

 Replace d.items() with itertools.repeat( ('a', 'b') )

 So, if you have a sequence, which has a length and known size, at
 least you can have an attempt at the DB operations: whether the
 transaction fails or not is another thing...In short, a sequence is
 finite, while an iterable may be infinite.

 That's just my guess and makes sense to me at least!

 Jon.

Actually, thinking about it some more, I would take the following
approach:
(this is only loosely to do with the Python DB API, mind you...)

1) Start a transaction
2) Create a temporary table
3) Bulk insert your million update records to the temp table (from my
understanding of the PEP, executemany() is allowed to repeatedly call
execute() unless it can do something cleverer)
4) Execute an update with a from statement joining your main table and
temp table (pretty sure that's ANSI standard, and DBs should support
it -- embedded ones may not, though, but if you're dealing with 1mil
records, I'm taking a guess you're not dealing with embedded)
5) End the transaction

Far more efficient: a repeated execute of 'update' hits the DB once per
record, while a single update statement with a from clause gives the DB
a chance to optimise the whole operation.
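
As a rough sketch of those steps (assuming a psycopg2-style driver and
PostgreSQL's UPDATE ... FROM join syntax -- as noted later in the thread,
that join form is not ANSI, so it varies by DBMS -- with invented table
and column names)::

    def bulk_update(connection, items):
        # items: a sequence of (id, bar) pairs to apply to table foo.
        cursor = connection.cursor()
        cursor.execute(
            "CREATE TEMPORARY TABLE foo_updates (id INTEGER, bar TEXT)")
        cursor.executemany(
            "INSERT INTO foo_updates (id, bar) VALUES (%s, %s)", items)
        # One set-based statement instead of one UPDATE per record.
        cursor.execute(
            "UPDATE foo SET bar = foo_updates.bar"
            " FROM foo_updates WHERE foo.id = foo_updates.id")
        connection.commit()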

Hope that makes sense, lemme know.

Cheers,

Jon.




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Antoine Pitrou
On Tue, 12 Oct 2010 13:58:37 -0400
Terry Reedy tjre...@udel.edu wrote:
 On 10/12/2010 11:10 AM, Roy Smith wrote:
  PEP 249 says about executemany():
 
   Prepare a database operation (query or command) and then
   execute it against all parameter sequences or mappings
   found in the sequence seq_of_parameters.
 
  are there any plans to update the api to allow an iterable instead of
  a sequence?
 
 That question would best be addressed to the pep author
 Marc-André Lemburg mal at lemburg.com

Or, rather, to python-ideas. Python development isn't supposed to happen
in private.

Regards

Antoine.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Petite Abeille

On Oct 12, 2010, at 8:35 PM, Jon Clements wrote:

 4) Execute an update with a from statement joining your main table and
 temp table (pretty sure that's ANSI standard, and DBs should support
 it -- embedded ones may not, though, but if you're dealing with 1mil
 records, I'm taking a guess you're not dealing with embedded)

Not ANSI. Perhaps with a subquery, which is ANSI-compliant.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread J. Gerlach
On 12.10.2010 17:10, Roy Smith wrote:

 [A]re there any plans to update the api to allow an iterable instead of
 a sequence?

sqlite3 (standard library, Python 2.6.6, Windows 32-bit) does that already::

import sqlite3 as sql

connection = sql.connect(":memory:")

cursor = connection.execute("""
    CREATE TABLE test (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        text TEXT)
    ;""")
connection.commit()
cursor.executemany(
    "INSERT INTO test (text) VALUES ( ? );",
    # A generator expression - delivers one row at a time
    (("hello nr %03d!" % i,) for i in xrange(100)))
connection.commit()
cursor.execute("SELECT * FROM test")

for id_, text in cursor.fetchall():
    print text, id_
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread John Nagle

On 10/12/2010 11:35 AM, Jon Clements wrote:

On 12 Oct, 18:53, Jon Clementsjon...@googlemail.com  wrote:

On 12 Oct, 18:32, Roy Smithr...@panix.com  wrote:




On Oct 12, 1:20 pm, Jon Clementsjon...@googlemail.com  wrote:



On 12 Oct, 16:10, Roy Smithr...@panix.com  wrote:



PEP 249 says about executemany():



 Prepare a database operation (query or command) and then
 execute it against all parameter sequences or mappings
 found in the sequence seq_of_parameters.



are there any plans to update the api to allow an iterable instead of
a sequence?



I'm not understanding (probably me). Example?



I have a dictionary, d, which has a million items in it, and want to
do something like:



 executemany("update foo set bar = %s where id = %s",
d.iteritems())



If executemany accepted an iterable, that would work.  But, it only
accepts a sequence, so I need to do:



 executemany("update foo set bar = %s where id = %s", d.items())



which generates a million-item temporary list.  Or am I mis-
understanding the PEP?


Interesting, but here's my guess...

Replace d.items() with itertools.repeat( ('a', 'b') )

So, if you have a sequence, which has a length and known size, at
least you can have an attempt at the DB operations: whether the
transaction fails or not is another thing...In short, a sequence is
finite, while an iterable may be infinite.

That's just my guess and makes sense to me at least!

Jon.


Actually, thinking about it some more, I would take the following
approach:
(this is only loosely to do with the Python DB API, mind you...)

1) Start a transaction
2) Create a temporary table
3) Bulk insert your million update records to the temp table (from my
understanding of the PEP, executemany() is allowed to repeatedly call
execute() unless it can do something cleverer)
4) Execute an update with a from statement joining your main table and
temp table (pretty sure that's ANSI standard, and DBs should support
it -- embedded ones may not, though, but if you're dealing with 1mil
records, I'm taking a guess you're not dealing with embedded)
5) End the transaction

Far more efficient: a repeated execute of 'update' hits the DB once per
record, while a single update statement with a from clause gives the DB
a chance to optimise the whole operation.

Hope that makes sense, lemme know.


Right.  In general, if you find yourself making millions of
SQL database requests in a loop, you're doing it wrong.

Big database loads are usually done by creating a text file
with the desired data, then using a LOAD DATA INFILE command.
This (in MySQL) is tens to hundreds of times faster than
doing individual INSERT or UPDATE commands.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Lawrence D'Oliveiro
In message 4cb4ba4e$0$1641$742ec...@news.sonic.net, John Nagle wrote:

 In general, if you find yourself making millions of
 SQL database requests in a loop, you're doing it wrong.

I’ve done this. Not millions, but certainly on the order of tens of 
thousands.

  Big database loads are usually done by creating a text file
 with the desired data, then using a LOAD DATA INFILE command.

May not always be flexible enough.

 This (in MySQL) is tens to hundreds of times faster than doing individual
 INSERT or UPDATE commands.

Why should that be? The database cannot read a text file any faster than I 
can.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 (database api) -- executemany() with iterable?

2010-10-12 Thread Steve Howell
On Oct 12, 6:01 pm, Lawrence D'Oliveiro l...@geek-
central.gen.new_zealand wrote:
 In message 4cb4ba4e$0$1641$742ec...@news.sonic.net, John Nagle wrote:

  In general, if you find yourself making millions of
  SQL database requests in a loop, you're doing it wrong.

 I’ve done this. Not millions, but certainly on the order of tens of
 thousands.

       Big database loads are usually done by creating a text file
  with the desired data, then using a LOAD DATA INFILE command.

 May not always be flexible enough.

  This (in MySQL) is tens to hundreds of times faster than doing individual
  INSERT or UPDATE commands.

 Why should that be? The database cannot read a text file any faster than I
 can.

Bulk-load strategies usually solve one or more of these problems:

 network latency
 index maintenance during the upload
 parsing of the SQL
 reallocating storage

They usually come at some cost or inconvenience:

 only work locally
 indexes not available during the bulk load
 requires non-SQL data format
 up-front cost to preallocate storage

The exact benefits and costs will vary according to the system, of
course.  Also, you might employ multiple strategies; for example, for
a medium-sized problem, there may be no need to sacrifice the
familiarity and convenience of SQL if you can arrange for large
batches of SQL to be submitted from a local client.  If indexes are
the problem, and you don't need 24/7 availability, you can drop them
and rebuild them.




-- 
http://mail.python.org/mailman/listinfo/python-list