Michael,

Thank you for your response - it was very helpful.
It turns out my main problem was that I was importing an order of
magnitude or so more data than I realized, but you were also right
about using flush(), and about the overhead of creating extra indexes.

In the spirit of putting some data in a public space... Starting with
an empty database, with a test dataset of 1200 values (about 1150
unique insertions) and flushing after every insertion, I have the
following timings:

  No extra indexes:     2:00
  Three extra indexes:  2:15

This is using SQLite with a flat file on a 1.8 GHz laptop. The records
are each nearly 1 KB. There's an overhead of about 5 seconds for
reading the data, so most of the above time is loading the database.
I haven't yet had time to judge how the performance varies with larger
datasets.

#g

On Apr 4, 4:29 pm, Michael Bayer <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2008, at 11:09 AM, GK wrote:
>
> > I tried initially without any additional column indexes, and
> > (observed by watching debug output) the program started off quickly
> > enough, but after a hundred or so records the rate of insertion
> > started to drop off dramatically. I reasoned that this was because
> > the code was being forced to perform a linear search for each new
> > entry processed, and hypothesized that this would be improved by
> > judicious addition of some additional column indexes. I duly did
> > this, but the performance is not observably improved. Code snippets
> > below.
>
> > I'm using Python 2.5.1, SA 0.4.4 and SQLite 3.5.7.
>
> > I'm not looking for blinding performance, just "some time today".
> > I'm trying to process about 1300 source records, and so far it's
> > taken a couple of hours at 100% CPU utilization. The sqlite data
> > file is still zero length. I can't tell how far it's got through
> > the source data.
>
> 1300 rows, even if individually fetched and inserted one by one
> (which it seems is what you're doing), shouldn't take more than a
> minute. So some questions to ask are: how big is the table you're
> selecting from? You are filtering on about ten columns. What is
> important to note there is that if you did in fact place indexes on
> all ten columns to speed up selecting, you'd also directly impact
> the speed of insertion negatively. So that's something to consider.
>
> > My questions, then, are:
> > 1. Does the SQLAlchemy ORM/query/filter_by use column indexes to
> >    optimize queries?
>
> Your database is what would be taking advantage of column indexes.
> If you tell SQLA to select from a table and filter on these 10
> columns, there's no particular optimization to be done at the SQL
> construction layer.
>
> > 2. If so, are there any particular steps I need to take to ensure
> >    this happens?
>
> You should experiment with your database's "explain plan" feature to
> show which indexes are taking effect for the SQL being issued. With
> SQLA, turn on SQL echoing to see the conversation taking place.
>
> > As a supplementary question, are there any good reference materials
> > for SA whose intended audience level lies somewhere between the
> > excellent introductory tutorials and the concise API reference
> > available at the web site?
>
> There is an O'Reilly book coming out in June, which I haven't seen
> yet. I can't speak to its accuracy or up-to-dateness, since we
> release new features very frequently.
> > and the access code like this:
> >
> >     # Update occurrence count or add new message
> >     emlq = session.query(Email).filter_by(
> >         sendadr = sadr,
> >         fromadr = fadr,
> >         toadr = tadr,
> >         sendnam = snam,
> >         fromnam = fnam,
> >         tonam = tnam,
> >         subject = msgsubj,
> >         received = msgrcvd,
> >         spam = msgspam,
> >         folderid = folderid)
> >     try:
> >         # An existing message matching all the columns above
> >         eml = emlq.one()
> >     except Exception, e:
> >         # No match (or multiple matches): create a new record
> >         eml = Email(
> >             sadr, snam, fadr, fnam, tadr, tnam,
> >             msgsubj, msgrcvd, msgspam,
> >             mboxname, folderid)
> >     eml.occurs += 1
> >     session.save_or_update(eml)
>
> One thing you definitely want to do here is flush() your session
> after several rows. The reason for your latency might be just lots
> of pending data building up in the session, unflushed.
>
> Additionally, for a datafile import I'd probably not use the ORM at
> all and use straight SQL constructs, i.e.
> mytable.select().where(sendadr=sadr, fromadr=fadr, ...), and then
> mytable.update() or mytable.insert() depending on the results. The
> SQL expression tutorial lays out how to use all of those.
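
P.S. For anyone finding this thread in the archive, here is roughly
what the echo / explain-plan combination looks like. This is a minimal
sketch; the database file, table and column names are illustrative,
not taken from my actual schema, and the exact EXPLAIN QUERY PLAN
output format varies between SQLite versions.

    from sqlalchemy import create_engine

    # echo=True makes SQLAlchemy log every statement it issues, so
    # you can watch the per-row SELECTs and INSERTs as they happen.
    engine = create_engine('sqlite:///emails.db', echo=True)

    # SQLite's EXPLAIN QUERY PLAN reports whether a query can use an
    # index or must scan the whole table. Run it on the same SELECT
    # that the echo output shows you.
    plan = engine.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM emails WHERE sendadr = ? AND fromadr = ?",
        ('[EMAIL PROTECTED]', '[EMAIL PROTECTED]'))
    for row in plan:
        print row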
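The flush-every-few-rows pattern Michael suggests is simple enough to
show as well. A sketch, assuming "rows" is the parsed source data and
make_email() is a hypothetical helper that builds an Email instance;
neither name is from my actual script:

    BATCH_SIZE = 100

    for i, row in enumerate(rows):
        eml = make_email(row)        # hypothetical constructor helper
        session.save_or_update(eml)
        # Flush periodically so pending objects don't pile up in the
        # session; each flush issues the accumulated INSERTs/UPDATEs.
        if (i + 1) % BATCH_SIZE == 0:
            session.flush()

    session.flush()   # push out the final partial batch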
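Finally, the plain-SQL-construct approach. One detail: the where
clause takes column expressions rather than keyword arguments. A
sketch, reusing the engine from the first sketch above, with
emails_table standing in for the Table object that Email is mapped to,
an assumed integer primary key column "id", and only a few of the ten
columns spelled out:

    from sqlalchemy import and_

    conn = engine.connect()

    # Look for an existing row matching the identifying columns.
    sel = emails_table.select(and_(
        emails_table.c.sendadr == sadr,
        emails_table.c.fromadr == fadr,
        emails_table.c.toadr == tadr))   # ...and so on for the rest
    row = conn.execute(sel).fetchone()

    if row is None:
        # No match: insert a fresh row with an occurrence count of 1.
        conn.execute(emails_table.insert(),
                     sendadr=sadr, fromadr=fadr, toadr=tadr, occurs=1)
    else:
        # Match: bump the occurrence count on the existing row,
        # keyed on the (assumed) primary key column.
        conn.execute(emails_table.update(
            emails_table.c.id == row['id'],
            values={'occurs': row['occurs'] + 1}))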
