Re: [PERFORM] Improve BULK insertion

2004-12-05 Thread Rodrigo Carvalhaes
Hi!
1. I am doing the inserts using pg_restore. The dump was created using
pg_dump in the standard format (COPY statements).
2. See below for the table schema. There are only 7 indexes.
3. My transaction log configuration is: checkpoint_segments = 3 and
checkpoint_timeout = 300, and my transaction logs are on the same disk.
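
For reference, the sort of invocation and settings described above look
roughly like this (the exact commands, flags, and paths are assumptions, not
taken from the thread):

    # dump in pg_dump's custom format, then restore into the target database;
    # pg_restore replays the dump's COPY statements to load the data
    pg_dump -Fc -f dadosadv.dump dadosadv
    pg_restore -d dadosadv dadosadv.dump

    # relevant postgresql.conf settings as reported
    #   checkpoint_segments = 3
    #   checkpoint_timeout  = 300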

I know that I can increase the performance by separating the transaction
logs onto their own disk and building a RAID 5 array, BUT I am really curious
about WHY this performance is so poor and HOW I can try to improve it on the
current machine, because right now these inserts are taking around 90
minutes!!!

Cheers!
Rodrigo
dadosadv=# \d si2010
        Table "public.si2010"
   Column   |       Type       |          Modifiers
------------+------------------+--------------------------------
 i2_filial  | character(2)     | not null default '  '::bpchar
 i2_num     | character(10)    | not null default '  '::bpchar
 i2_linha   | character(2)     | not null default '  '::bpchar
 i2_data    | character(8)     | not null default ''::bpchar
 i2_dc      | character(1)     | not null default ' '::bpchar
 i2_debito  | character(20)    | not null default ''::bpchar
 i2_dcd     | character(1)     | not null default ' '::bpchar
 i2_credito | character(20)    | not null default ''::bpchar
 i2_dcc     | character(1)     | not null default ' '::bpchar
 i2_moedas  | character(5)     | not null default ' '::bpchar
 i2_valor   | double precision | not null default 0.0
 i2_hp      | character(3)     | not null default '   '::bpchar
 i2_hist    | character(40)    | not null default ''::bpchar
 i2_ccd     | character(9)     | not null default ' '::bpchar
 i2_ccc     | character(9)     | not null default ' '::bpchar
 i2_ativdeb | character(6)     | not null default '  '::bpchar
 i2_ativcrd | character(6)     | not null default '  '::bpchar
 i2_vlmoed2 | double precision | not null default 0.0
 i2_vlmoed3 | double precision | not null default 0.0
 i2_vlmoed4 | double precision | not null default 0.0
 i2_vlmoed5 | double precision | not null default 0.0
 i2_dtvenc  | character(8)     | not null default ''::bpchar
 i2_criter  | character(4)     | not null default ''::bpchar
 i2_rotina  | character(8)     | not null default ''::bpchar
 i2_periodo | character(6)     | not null default '  '::bpchar
 i2_listado | character(1)     | not null default ' '::bpchar
 i2_origem  | character(40)    | not null default ''::bpchar
 i2_permat  | character(4)     | not null default ''::bpchar
 i2_filorig | character(2)     | not null default '  '::bpchar
 i2_intercp | character(1)     | not null default ' '::bpchar
 i2_identcp | character(12)    | not null default ''::bpchar
 i2_lote    | character(4)     | not null default ''::bpchar
 i2_doc     | character(6)     | not null default '  '::bpchar
 i2_emporig | character(2)     | not null default '  '::bpchar
 i2_lp      | character(3)     | not null default '   '::bpchar
 i2_itemd   | character(9)     | not null default ' '::bpchar
 i2_itemc   | character(9)     | not null default ' '::bpchar
 i2_prelan  | character(1)     | not null default ' '::bpchar
 i2_tipo    | character(2)     | not null default '  '::bpchar

Re: [PERFORM] Improve BULK insertion

2004-12-05 Thread Gregory S. Williamson
Rodrigo --

You should definitely drop the indexes and any other FK constraints before
loading and then rebuild them afterwards. Check your logs and see if there are
warnings about checkpoint intervals -- only 3 log segments seems like it might
be small; if you have the disk space I would definitely consider raising the
number. If you haven't already posted your config settings you might do so --
this seems very slow. I regularly use COPY to load or unload data sets in the
200k-900k row range and they don't take 90 minutes, even on slower hardware
(and usually only a few minutes on our production servers; rebuilding the
indexes usually takes longer).
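
A rough sketch of that drop/load/rebuild pattern, with hypothetical index and
constraint names (use whatever \d si2010 actually lists):

    -- before the load: drop secondary indexes and FK constraints (names are made up)
    DROP INDEX si2010_i2_num_idx;
    ALTER TABLE si2010 DROP CONSTRAINT si2010_filial_fkey;

    -- run the data load here (pg_restore / COPY)

    -- after the load: rebuild
    CREATE INDEX si2010_i2_num_idx ON si2010 (i2_num);
    ALTER TABLE si2010 ADD CONSTRAINT si2010_filial_fkey
        FOREIGN KEY (i2_filial) REFERENCES some_parent_table (some_key);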
 

This is unloading a 300k+ row data set on a Dell Linux box with not very good
disks and 1 GB of RAM:

Starting copy of parcel staging table parcels_12031 at Thu Dec  2 01:13:52 2004
Done with staging table copy at Thu Dec  2 01:15:16 2004
...
Starting compression of parcel file at Thu Dec  2 01:15:22 2004
gzip: /tmp/parcels_12031.unl.gz already exists; do you wish to overwrite (y or n)? y
Done with compression of parcel file at Thu Dec  2 01:17:23 2004
...

And loading them on a rather faster server:

Starting unzip of parcels at Thu Dec  2 01:29:15 2004
Finished with unzip at Thu Dec  2 01:29:22 2004
...
Target db detail table updated at Thu Dec  2 01:29:29 2004
Dropping indexes
Dropping fk constraint on tracking id
Dropping indexes
Done dropping indexes on target parcels table at Thu Dec  2 01:29:30 2004
NOTICE:  drop cascades to table f12031.parcel_pins
NOTICE:  drop cascades to table f12031.parcel_addresses
NOTICE:  drop cascades to table f12031.parcel_owner_fti
NOTICE:  drop cascades to table f12031.parcel_owners
Removing old parcels entries starting at Thu Dec  2 01:29:30 2004
Done deleting schema and parcels for track_id 10163541 at Thu Dec  2 01:33:04 
2004
Starting load of parcels at Thu Dec  2 01:33:04 2004
Done copying data into parcels at Thu Dec  2 01:35:18 2004
Deleting old v_detail reference for track_id 10163541
Done with delete of old v_detail reference
Starting creation of foreign key constraint at Thu Dec  2 01:39:43 2004
Done with creation of foreign key constraint at Thu Dec  2 01:42:14 2004
Starting spatial index create at Thu Dec  2 01:42:14 2004
Done creating spatial index at Thu Dec  2 01:55:04 2004
Starting stats on geometry column now
Done doing stats for spatial index at Thu Dec  2 02:03:47 2004
Starting index on PIN now
Done creating pin index at Thu Dec  2 02:09:36 2004
Starting index on tracking id now
Done creating trid index at Thu Dec  2 02:12:35 2004
Starting centroid index now
Done creating centroid index at Thu Dec  2 02:24:11 2004
Starting stats on centroid column
Done doing stats for spatial index at Thu Dec  2 02:29:55 2004
Doing City/Street Index on parcels table ...Done creating city/street index at 
Thu Dec  2 02:42:41 2004 with result -1
Committing changes

So this took about 70 minutes to delete 20+ rows from a table with about 5 
million rows, load a new set and reindex them (and do some statistics for 
spatial geometry). If the table had only this data the indexing would have been 
*much* faster. These are moderate size columns -- about 2 dozen columns and 
some spatial data (polygon and point). Both servers have rather more log files 
than your setup, but I am not familiar enough with postgres to know how much of 
an impact that alone will have. The comment about it slowing down part way
through a load makes me suspect indexing issues, somehow (not from postgres
experience, but it rings a bell with other DBs); if you explicitly drop the
indexes first and then load, does it show the same performance behavior?

If you are doing the data read, the database write, and the WAL logging all
on a single disk drive, then I would guess that that is your bottleneck. If
you use vmstat and/or top or the like, is your I/O pegged?
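
For example, something along these lines while the load runs (standard Linux
tools; iostat comes from the sysstat package):

    vmstat 1        # watch the 'wa' (I/O wait) and 'bo' (blocks out) columns
    iostat -x 1     # per-device view; one disk sitting near 100% util is the bottleneck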

HTH

Greg Williamson
DBA
GlobeXplorer LLC

-Original Message-
From:   Rodrigo Carvalhaes [mailto:[EMAIL PROTECTED]
Sent:   Sun 12/5/2004 11:52 AM
To: Christopher Browne
Cc: [EMAIL PROTECTED]
Subject:Re: [PERFORM] Improve BULK insertion

[PERFORM] Improve BULK insertion

2004-12-04 Thread Grupos
Hi !
I need to insert 500,000 records into a table frequently. It's a bulk
insertion from my application.
I am getting very poor performance: PostgreSQL inserts very fast until
around tuple 200,000, and after that the insertion starts to be really slow.
I am looking at the log and there are a lot of transaction log entries,
something like:

2004-12-04 11:08:59 LOG:  recycled transaction log file 00060012
2004-12-04 11:08:59 LOG:  recycled transaction log file 00060013
2004-12-04 11:08:59 LOG:  recycled transaction log file 00060011
2004-12-04 11:14:04 LOG:  recycled transaction log file 00060015
2004-12-04 11:14:04 LOG:  recycled transaction log file 00060014
2004-12-04 11:19:08 LOG:  recycled transaction log file 00060016
2004-12-04 11:19:08 LOG:  recycled transaction log file 00060017
2004-12-04 11:24:10 LOG:  recycled transaction log file 00060018
How can I configure PostgreSQL to get better performance for these bulk
insertions? I have already increased the memory values.

My data:
Conectiva Linux, kernel 2.6.9
PostgreSQL 7.4.6 - 1.5 GB memory
max_connections = 30
shared_buffers = 3
sort_mem = 32768
vacuum_mem = 32768
max_fsm_pages = 3
max_fsm_relations = 1500
The other configurations are default.
Cheers,
Rodrigo Carvalhaes  




Re: [PERFORM] Improve BULK insertion

2004-12-04 Thread Christopher Browne
In the last exciting episode, [EMAIL PROTECTED] (Grupos) wrote:
> Hi!
>
> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance: PostgreSQL inserts very fast until
> around tuple 200,000, and after that the insertion starts to be really slow.
> I am looking at the log and there are a lot of transaction log entries,
> something like:
>
> 2004-12-04 11:08:59 LOG:  recycled transaction log file 00060012
> 2004-12-04 11:08:59 LOG:  recycled transaction log file 00060013
> 2004-12-04 11:08:59 LOG:  recycled transaction log file 00060011
> 2004-12-04 11:14:04 LOG:  recycled transaction log file 00060015
> 2004-12-04 11:14:04 LOG:  recycled transaction log file 00060014
> 2004-12-04 11:19:08 LOG:  recycled transaction log file 00060016
> 2004-12-04 11:19:08 LOG:  recycled transaction log file 00060017
> 2004-12-04 11:24:10 LOG:  recycled transaction log file 00060018

It is entirely normal for there to be a lot of transaction log file
recycling when bulk inserts are taking place; that goes through a lot
of transaction logs.

> How can I configure PostgreSQL to get better performance for these bulk
> insertions? I have already increased the memory values.

Memory is, as likely as not, NOT the issue.

Two questions:

 1.  How are you doing the inserts?  Via INSERT statements?  Or
 via COPY statements?  What sort of transaction grouping
 is involved?

 COPY is way faster than INSERT, and grouping plenty of updates
 into a single transaction is generally a win.

 2.  What is the schema like?  Does the table have a foreign key
 constraint?  Does it have a bunch of indices?

 If there should eventually be lots of indices, it tends to be
 faster to create the table with none/minimal indices, and add
 indexes afterwards, as long as your load process can be trusted
 to not break unique constraints...

 If there is some secondary table with a foreign key constraint,
 and _that_ table is growing, it is possible that a sequential
 scan is being used to search the secondary table where, if you
 did an ANALYZE on that table, an index scan would be preferred
 once it grew to larger size...
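
A minimal sketch of those two points (everything except si2010 is a
hypothetical name or path):

    -- COPY is what pg_dump emits and is far faster than row-by-row INSERTs
    COPY si2010 FROM '/tmp/si2010.data';

    -- if you must use INSERT, group many rows per transaction
    BEGIN;
    INSERT INTO si2010 (i2_filial, i2_num) VALUES ('01', '0000000001');
    -- ... thousands more rows ...
    COMMIT;

    -- and if a foreign key points at a table that is growing during the load,
    -- keep its statistics current so the FK lookups can use an index scan
    ANALYZE some_parent_table;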

There isn't a particular reason for PostgreSQL to hit a wall upon
seeing 200K records; I and coworkers routinely load database dumps
that have millions of (sometimes pretty fat) records, and they don't
choke.  That's true whether talking about loading things onto my
(somewhat wimpy) desktop PC, or a SMP Xeon system with a small RAID
array, or higher end stuff involving high end SMP and EMC disk arrays.
The latter obviously being orders of magnitude faster than desktop
equipment :-).
-- 
(format nil [EMAIL PROTECTED] cbbrowne acm.org)
http://www3.sympatico.ca/cbbrowne/unix.html
Rules of the  Evil Overlord #207. Employees will  have conjugal visit
trailers which  they may use provided  they call in  a replacement and
sign out on  the timesheet. Given this, anyone caught  making out in a
closet  while  leaving  their   station  unmonitored  will  be  shot.
http://www.eviloverlord.com/



Re: [PERFORM] Improve BULK insertion

2004-12-04 Thread Patrick Hatcher
I do mass inserts daily into PG. I drop all the indexes except my primary key
and then use the COPY FROM command. This usually takes less than 30 seconds. I
spend more time waiting for indexes to recreate.

Patrick Hatcher
Macys.Com
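
In psql terms, the pattern he describes is roughly this (the index and file
names are hypothetical; \copy reads the file on the client side, while
COPY ... FROM reads it on the server):

    DROP INDEX si2010_some_idx;       -- drop everything but the primary key
    \copy si2010 from '/tmp/si2010.unl'
    CREATE INDEX si2010_some_idx ON si2010 (i2_num);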

Re: [PERFORM] Improve BULK insertion

2004-12-04 Thread Josh Berkus
Rodrigo,

> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance: PostgreSQL inserts very fast until
> around tuple 200,000, and after that the insertion starts to be really slow.
> I am looking at the log and there are a lot of transaction log entries, something

In addition to what Chris Browne asked:
What's your transaction log setup? Are your database transaction logs on a
separate disk resource? What is checkpoint_segments and checkpoint_timeout
set to?
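
For reference, the current values can be checked from psql, e.g.:

    SHOW checkpoint_segments;
    SHOW checkpoint_timeout;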

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
