Michael,

Thank you for your response - it was very helpful.

It turns out my main problem was that I was importing an order of
magnitude or so more data than I realized, but your advice about
using flush() also helped.
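
For anyone following along, the flushing pattern is just something
like this (untested sketch; session and Email are as in the code
quoted below, and make_email() is a hypothetical helper):

    # Flush after every insertion, as in the timings below (SA 0.4 API).
    # make_email() is a hypothetical helper that builds an Email object.
    for record in records:
        eml = make_email(record)
        session.save_or_update(eml)
        session.flush()   # push the pending row to the database now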

You were also right about the overhead of creating extra indexes.  In
the spirit of putting some data in a public space...

Starting with an empty database, with a test dataset of 1200 values
(about 1150 unique insertions) and flushing after every insertion, I
have the following timings (min:sec):

No extra indexes: 2:00
Three extra indexes: 2:15

This is using SQLite with a flat file on a 1.8GHz laptop.  The records
are each nearly 1KB.  There's an overhead of about 5 seconds for
reading the source data, so most of the above time is spent loading
the database.
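
For reference, the three extra indexes were declared along these
lines (illustrative sketch; emails_table and the choice of columns
are assumptions, following the snippets quoted below):

    # Illustrative: extra single-column indexes on the emails table.
    # emails_table is assumed to be the Table underlying the Email class;
    # they get created in the database by metadata.create_all().
    from sqlalchemy import Index

    Index('ix_emails_sendadr', emails_table.c.sendadr)
    Index('ix_emails_toadr', emails_table.c.toadr)
    Index('ix_emails_subject', emails_table.c.subject)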

I haven't yet had time to judge how the performance varies with larger
datasets.

#g

On Apr 4, 4:29 pm, Michael Bayer <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2008, at 11:09 AM, GK wrote:
>
> > I tried initially without any additional column indexes, and (observed
> > by watching debug output) the program started off quickly enough, but
> > after a hundred or so records the rate of insertion started to drop
> > off dramatically.  I reasoned that this was because the code was being
> > forced to perform a linear search for each new entry processed, and
> > hypothesized that this would be improved by judicious addition of some
> > additional column indexes.  I duly did this, but the performance is
> > not observably improved.  Code snippets below.
>
> > I'm using Python 2.5.1, SA 0.4.4 and SQLite 3.5.7
>
> > I'm not looking for blinding performance, just "some time today".  I'm
> > trying to process about 1300 source records, and so far it's taken a
> > couple of hours at 100% CPU utilization.  The sqlite data file is
> > still zero length.  I can't tell how far it's got through the source
> > data.
>
> 1300 rows, even if individually fetched and inserted one by one (which
> it seems is what you're doing), shouldn't take more than a minute.  So
> some questions to ask are: how big is the table you're selecting
> from?   You are filtering on about ten columns.  What's important to
> note there is that if you did in fact place indexes on all ten columns
> to speed up selecting, you'd also directly slow insertion down.  So
> that's something to consider.
>
> > My questions, then, are:
> > 1. does the SQLAlchemy ORM/query/filter_by use column indexes to
> > optimize queries?
>
> Your database is what would be taking advantage of column indexes.  If
> you tell SQLA to select from a table and filter on these 10 columns,
> there's no particular optimization to be done at the SQL construction
> layer.
>
> > 2. if so, are there any particular steps I need to take to ensure this
> > happens?
>
> you should experiment with your database's "explain plan" feature to
> see which indexes take effect for the SQL being issued.   With
> SQLA, turn on SQL echoing to see the conversation taking place.
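
(Concretely, that looks something like the following - an illustrative
sketch, with the table and column names assumed:)

    # Turn on SQL echoing so every statement SQLA issues is logged.
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///emails.db', echo=True)

    # Ask SQLite which indexes a given query would use:
    for row in engine.execute(
            "EXPLAIN QUERY PLAN "
            "SELECT * FROM emails WHERE sendadr = 'a@b.com'"):
        print row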
>
> > As a supplementary question, are there any good reference materials
> > for SA whose intended audience level lies somewhere between the
> > excellent introductory tutorials and concise API reference available
> > at the web site?
>
> There is an O'Reilly book, which I haven't seen yet, coming out in
> June.    I can't speak to its accuracy or how up to date it is, since
> we release new features very frequently.
>
> > and the access code like this:
>
> >                # Update occurrence count or add new message
> >                emlq = session.query(Email).filter_by(
> >                    sendadr  = sadr,
> >                    fromadr  = fadr,
> >                    toadr    = tadr,
> >                    sendnam  = snam,
> >                    fromnam  = fnam,
> >                    tonam    = tnam,
> >                    subject  = msgsubj,
> >                    received = msgrcvd,
> >                    spam     = msgspam,
> >                    folderid = folderid)
> >                try:
> >                    eml = emlq.one()
> >                except Exception, e:
> >                    eml = Email(
> >                        sadr, snam, fadr, fnam, tadr, tnam,
> >                        msgsubj, msgrcvd, msgspam,
> >                        mboxname, folderid)
> >                eml.occurs += 1
> >                session.save_or_update(eml)
>
> One thing you definitely want to do here is flush() your session after
> every few rows.  The reason for your latency might simply be lots of
> unflushed pending data building up in the session.
>
> Additionally, for a datafile import I'd probably not use the ORM at
> all and use straight SQL constructs, i.e.
> mytable.select().where(sendadr=sadr, fromadr=fadr, ...), and then
> mytable.update() or mytable.insert() depending on the results.  The
> SQL expression tutorial lays out how to use all of those.
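
For later readers, here's roughly what that SQL-expression-level
approach looks like - an untested sketch, where emails is assumed to
be the Table underlying the Email mapper and only three of the filter
columns are shown:

    # Sketch of select-then-update-or-insert at the SQL expression
    # layer (SA 0.4-era API); emails is the underlying Table.
    from sqlalchemy import and_, select

    conn = engine.connect()
    crit = and_(emails.c.sendadr == sadr,
                emails.c.fromadr == fadr,
                emails.c.toadr == tadr)  # ...plus the remaining columns
    row = conn.execute(select([emails], crit)).fetchone()
    if row:
        # Executing an update with keyword params sets those columns.
        conn.execute(emails.update(crit), occurs=row['occurs'] + 1)
    else:
        conn.execute(emails.insert(),
                     sendadr=sadr, fromadr=fadr, toadr=tadr, occurs=1)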
