Jason,
Thanks for your examples - there are lots of useful SQLAlchemy coding
hints in there for me...
On Apr 5, 5:30 pm, jason kirtland <[EMAIL PROTECTED]> wrote:
> GK wrote:
> > Michael,
>
> > Thank you for your response - it was very helpful for me.
>
> > It turns out my main problem was that I was importing an order of
> > magnitude or so more data than I realized, but you were also right
> > about using flush().
>
> > You were also right about the overhead of creating extra indexes. In
> > the spirit of putting some data in a public space...
>
> > Starting with an empty database, with a test dataset of 1200 values
> > (about 1150 unique insertions) and flushing after every insertion, I
> > have the following timings:
>
> > No extra indexes: 2:00
> > Three extra indexes: 2:15
>
> > This is using SQLite with a flat file on a 1.8GHz laptop. The records
> > are each nearly 1 KB. There's an overhead of about 5 seconds for
> > reading the data, so most of the above time is spent loading the database.
>
> > I haven't yet had time to judge how the performance varies with larger
> > datasets.
>
> 2:00 seems very high - is that 2 minutes? Below are two similar bulk
> table loads. The first uses the same insert-or-update methodology and
> only the relational layer (no ORM); that clocks in at 1.25 seconds on my
> laptop. The second is an ORM implementation with a different duplicate
> detection methodology; that clocks in at 2.0 seconds.
>
> ---
>
> ##
> ## Relational version
> ##
>
> import os
> import time
> import random
> from sqlalchemy import *
> from sqlalchemy.exceptions import IntegrityError
>
> data_cols = (
>     'sendadr', 'fromadr', 'toadr', 'sendnam', 'fromnam',
>     'tonam', 'subject', 'received', 'spam', 'folderid' )
> chunk = lambda: '%x' % random.getrandbits(400)
> dataset = [dict((col, chunk()) for col in data_cols)
>            for _ in xrange(1200)]
> dupes = random.sample(dataset, 50)
>
> db = '1krows.db'
> if os.path.exists(db):
>     os.unlink(db)
>
> engine = create_engine('sqlite:///%s' % db)
> metadata = MetaData(engine)
> table = Table('t', metadata,
>     Column('id', Integer, primary_key=True),
>     Column('occurs', Integer, default=1),
>     *(Column(col, Text) for col in data_cols))
> table.append_constraint(UniqueConstraint(*data_cols))
> metadata.create_all()
>
> table.insert().execute(dupes)
> assert table.select().count().scalar() == 50
>
> start = time.time()
>
> insert = table.insert()
> update = (table.update().
>           where(and_(*((table.c[col] == bindparam(col))
>                        for col in data_cols))).
>           values({'occurs': table.c.occurs+1}))
> conn = engine.connect()
> tx = conn.begin()
> for row in dataset:
>     try:
>         conn.execute(insert, row)
>     except IntegrityError:
>         conn.execute(update, row)
> tx.commit()
>
> end = time.time()
>
> assert table.select().count().scalar() == 1200
> assert select([func.count(table.c.id)],
>               table.c.occurs==2).scalar() == 50
> print "elapsed: %04f" % (end - start)
>
> ##
> ## ORM version
> ##
>
> import hashlib
> import os
> import time
> import random
> from sqlalchemy import *
> from sqlalchemy.orm import *
>
> data_cols = (
>     'sendadr', 'fromadr', 'toadr', 'sendnam', 'fromnam',
>     'tonam', 'subject', 'received', 'spam', 'folderid' )
> chunk = lambda: '%x' % random.getrandbits(400)
> dataset = [dict((col, chunk()) for col in data_cols)
>            for _ in xrange(1200)]
>
> def hashrow(row):
>     return hashlib.sha1(
>         ','.join(row[c] for c in data_cols)).hexdigest()
>
> dupes = []
> for row in random.sample(dataset, 50):
>     dupe = row.copy()
>     dupe['hash'] = hashrow(dupe)
>     dupes.append(dupe)
>
> db = '1krows.db'
> if os.path.exists(db):
>     os.unlink(db)
>
> engine = create_engine('sqlite:///%s' % db)
> metadata = MetaData(engine)
> table = Table('t', metadata,
>     Column('id', Integer, primary_key=True),
>     Column('occurs', Integer, default=1),
>     Column('hash', String(40), unique=True),
>     *(Column(col, Text) for col in data_cols))
> metadata.create_all()
>
> table.insert().execute(dupes)
> assert table.select().count().scalar() == 50
>
> class Email(object):
>     def __init__(self, **kw):
>         for key, value in kw.items():
>             setattr(self, key, value)
>
>     def hashval(self):
>         return hashrow(dict((col, getattr(self, col))
>                             for col in data_cols))
>
> mapper(Email, table)
>
> start = time.time()
> session = create_session()
> session.begin()
>
> data = [Email(**row) for row in dataset]
>
> chunk, remaining = [], [(e.hashval(), e) for e in data]
> while remaining:
>     chunk, remaining = remaining[:100], remaining[100:]
>     by_hash = dict(chunk)
>     dupes = (session.query(Email).
>              filter(Email.hash.in_(by_hash.keys()))).all()
>     for dupe in dupes:
>         dupe.occurs += 1
>         by_hash.pop(dupe.hash)
>     for hashval, email in by_hash.items():
>         email.hash = hashval
>         session.save(email)
>     session.flush()
> session.commit()
>
> end = time.time()
>
> assert table.select().count().scalar() == 1200
> assert select([func.count(table.c.id)],
>               table.c.occurs==2).scalar() == 50
> print "elapsed: %04f" % (end - start)
Yes, my timing was 2 minutes. I ran your examples and got the same
times as you quote, so I started wondering what was causing the
dramatic difference.
Starting with your ORM sample, I stripped out the optimizations one by
one and got progressively longer run times:
1. as supplied: 2.0 seconds
2. removed the hash optimization (using the original columns for the
uniqueness constraint and the existing-value query): 5.7 seconds
3. removed the chunking optimization (i.e. flushing after every new email
processed instead of every 100): 7.7 seconds
4. removed the session.begin() and session.commit() around the whole
load (see the sketch just below): 131 seconds (!)
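For the record, the only real difference between cases 3 and 4 is whether
the per-email flushes happen inside one enclosing transaction. Here is a
minimal sketch of the case-3 shape, reconstructed with the same 0.4-style
session API as your ORM example (Email and dataset as defined there;
duplicate handling omitted for brevity) - not my actual import script:

    session = create_session()
    session.begin()            # one enclosing transaction
    for row in dataset:
        email = Email(**row)
        session.save(email)
        session.flush()        # a flush per email, but no commit yet
    session.commit()           # single commit at the end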
I went back to my original code and added a begin()/commit() around the
top-level operation, and voila: it now takes about 7 seconds to process
about 500 messages (my previous test data has since been lost). This is
a vast improvement.
So it appears that, even when the work doesn't logically need
transactional semantics, wrapping the whole run in begin()/commit() is
vital for performance; presumably SQLite otherwise commits (and syncs to
disk) after every single statement.
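For anyone doing the same thing with plain table inserts rather than the
ORM, the equivalent connection-level pattern (the same one used in the
relational example above; 'table' and 'rows' here stand in for whatever
you are loading) is roughly:

    conn = engine.connect()
    tx = conn.begin()          # explicit transaction instead of
                               # SQLite's per-statement autocommit
    for row in rows:
        conn.execute(table.insert(), row)
    tx.commit()                # one commit for the whole batch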
Thanks for helping me to find this!
#g