On Thursday 25 January 2007 14:36, Eric Walstad wrote:
> Alan Gauld wrote:
> > "Shadab Sayani" <[EMAIL PROTECTED]> wrote
> >
> >> The data I need to deal with is in 100s of GB.
> >> I am using postgresql backend and SQLALCHEMY ORM.
> >
> > All ORMs will introduce a significant performance hit.
> > If you really need high speed, and populating a 100GB+ database
> > probably is included in that, then you should look at raw SQL.
> > In fact, for really big data loads most databases provide a
> > C API that goes under the SQL, because even SQL is
> > relatively slow.
>
> I've had this experience, too. I routinely load a couple million
> records from a tab-delimited file into our PostgreSQL database with
> Python. My import script:
> 1. iterates over the tab file a line/record at a time
> 2. does some data validation and some data manipulation
> 3. writes an SQL file with the massaged data
> 4. calls psql <dbname> -f <python-generated sql file>
>
> Generating the SQL file in Python goes amazingly fast, considering the
> amount of validation and manipulation (normalizing addresses, generating
> hashes, etc.) - about 1.5 minutes per million records.
>
> The SQL it generates does the usual DB stuff, including dropping indexes
> and constraints, COPY FROM stdin, regenerating indexes, and replacing
> constraints. In my experience, psql is MUCH faster at these kinds of
> imports than even direct Python interaction with psycopg. Adding an ORM
> on top of the database connector (psycopg in my case) slows things down
> even more.
>
> As a rough metric, the import I described takes a little over 2 hours on
> my 2GHz/1GB laptop.
>
> Good luck,
>
> Eric.
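[For illustration, a minimal sketch of the kind of generator script Eric
describes (steps 1-3). The file names, the "records" table and its columns
are made up, and massage() only stands in for real validation; this is not
Eric's actual script.]

import csv
import hashlib

INFILE = "records.tab"        # hypothetical tab-delimited input
OUTFILE = "load_records.sql"  # SQL file handed to psql in step 4

def massage(row):
    # Placeholder for validation/manipulation (normalizing, hashing, etc.)
    name, address = row[0].strip(), row[1].strip()
    addr_hash = hashlib.md5(address.encode("utf-8")).hexdigest()
    return [name, address, addr_hash]

with open(INFILE, newline="") as src, open(OUTFILE, "w") as out:
    # Drop the index up front, bulk-load with COPY, then rebuild it.
    out.write("DROP INDEX IF EXISTS records_name_idx;\n")
    out.write("COPY records (name, address, addr_hash) FROM stdin;\n")
    for row in csv.reader(src, delimiter="\t"):
        out.write("\t".join(massage(row)) + "\n")
    out.write("\\.\n")  # end-of-data marker for COPY FROM stdin in psql
    out.write("CREATE INDEX records_name_idx ON records (name);\n")

[The generated file is then fed to psql as in step 4. psycopg can also
drive COPY directly (cursor.copy_from / copy_expert), but as Eric notes,
psql is much faster for bulk imports of this size.]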
I agree - I'd never thought of automating psql for a COPY ... FROM action,
but it can be done.
--
John Fabiani
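[A sketch of what automating the psql call might look like; the database
and file names are made up, and in practice you would check psql's exit
status, which check=True does here.]

import subprocess

# Run the generated SQL file through psql instead of invoking it by hand.
# ON_ERROR_STOP makes psql abort (and exit non-zero) if any statement fails.
subprocess.run(
    ["psql", "mydb", "-v", "ON_ERROR_STOP=1", "-f", "load_records.sql"],
    check=True,
)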
