[PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Mark Wong
Hi all,

I've started to display the effects of changing the Linux block device
readahead buffer to the sequential read performance using fio.  There
are lots of raw data buried in the page, but this is what I've
distilled thus far.  Please have a look and let me know what you
think:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Readahead_Buffer_Size

Regards,
Mark



Re: [PERFORM] Intel's X25-M SSD

2008-09-10 Thread Chris Browne
[EMAIL PROTECTED] (Merlin Moncure) writes:
 I think the SSD manufacturers made a tactical error chasing the
 notebook market when they should have been chasing the server
 market...

That's a very good point; I agree totally!
-- 
output = reverse(moc.enworbbc @ enworbbc)
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
We are all somehow dreadfully cracked about the head, and sadly need
mending. --/Moby-Dick/, Ch 17 



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Greg Smith

On Tue, 9 Sep 2008, Mark Wong wrote:


I've started to display the effects of changing the Linux block device
readahead buffer to the sequential read performance using fio.


Ah ha, told you that was your missing tunable.  I'd really like to see the 
whole table of one disk numbers re-run when you get a chance.  The 
reversed ratio there on ext2 (59MB read/92MB write) was what tipped me off 
that something wasn't quite right initially, and until that's fixed it's 
hard to analyze the rest.


Based on your initial data, I'd say that the two useful read-ahead 
settings for this system are 1024KB (conservative but a big improvement) 
and 8192KB (point of diminishing returns).  The one-disk table you've got 
(labeled with what the default read-ahead is) and new tables at those two 
values would really flesh out what each disk is capable of.
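
For reference, those two values can be tried out with blockdev; the device
name below is just an example, and --setra counts 512-byte sectors:

blockdev --getra /dev/sda          # show the current readahead setting
blockdev --setra 2048 /dev/sda     # 2048 sectors * 512 bytes = 1024KB
blockdev --setra 16384 /dev/sda    # 16384 sectors * 512 bytes = 8192KB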


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] too many clog files

2008-09-10 Thread Kevin Grittner
 Matt Smiley [EMAIL PROTECTED] wrote: 
 Alvaro Herrera wrote:
 Move the old clog files back where they were, and run VACUUM FREEZE in
 all your databases.  That should clean up all the old pg_clog files, if
 you're really that desperate.
 
 Has anyone actually seen a CLOG file get removed under 8.2 or 8.3?
 
Some of my high-volume databases don't quite go back to 0000, but this
does seem to be a problem.  I have confirmed that VACUUM FREEZE on all
but template0 (which doesn't allow connections) does not clean them
up.  No long running transactions are present.
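
(For reference, one way to run that across every connectable database --
template0 gets skipped because it rejects connections -- is a loop like
this sketch:)

for db in $(psql -At -c "SELECT datname FROM pg_database WHERE datallowconn"); do
    psql -d "$db" -c "VACUUM FREEZE"
done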
 
-Kevin



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Scott Carey
How does that readahead tunable affect random reads or mixed random /
sequential situations?  In many databases, the worst case scenarios aren't
when you have a bunch of concurrent sequential scans but when there is
enough random read/write concurrently to slow the whole thing down to a
crawl.   How the file system behaves under this sort of concurrency

I would be very interested in a mixed fio profile with a background writer
doing moderate, paced random and sequential writes combined with concurrent
sequential reads and random reads.
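
Something along these lines is what I have in mind -- a rough sketch only;
the job names, sizes, and rates are placeholders, and the option spellings
should be checked against the installed fio version:

cat > mixed.fio <<'EOF'
[global]
directory=/mnt/test
size=4g
bs=8k
ioengine=sync

[seq-read]
rw=read

[rand-read]
rw=randread
numjobs=4

[paced-writer]
rw=randwrite
rate=4m
EOF
fio mixed.fio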

-Scott

On Wed, Sep 10, 2008 at 7:49 AM, Greg Smith [EMAIL PROTECTED] wrote:

 On Tue, 9 Sep 2008, Mark Wong wrote:

  I've started to display the effects of changing the Linux block device
 readahead buffer to the sequential read performance using fio.


 Ah ha, told you that was your missing tunable.  I'd really like to see the
 whole table of one disk numbers re-run when you get a chance.  The reversed
 ratio there on ext2 (59MB read/92MB write) was what tipped me off that
 something wasn't quite right initially, and until that's fixed it's hard to
 analyze the rest.

 Based on your initial data, I'd say that the two useful read-ahead settings
 for this system are 1024KB (conservative but a big improvement) and 8192KB
 (point of diminishing returns).  The one-disk table you've got (labeled with
 what the default read-ahead is) and new tables at those two values would
 really flesh out what each disk is capable of.

 --
 * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD





[PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Ryan Hansen

Greetings,

I'm relatively new to PostgreSQL but I've been in the IT applications 
industry for a long time, mostly in the LAMP world.


One thing I'm experiencing some trouble with is running a COPY of a 
large file (20+ million records) into a table in a reasonable amount of 
time.  Currently it's taking about 12 hours to complete on a 64 bit 
server with 3 GB memory allocated (shared_buffer), single SATA 320 GB 
drive.  I don't seem to get any improvement running the same operation 
on a dual opteron dual-core, 16 GB server.


I'm not asking for someone to solve my problem, just some direction in 
the best ways to tune for faster bulk loading, since this will be a 
fairly regular operation for our application (assuming it can work this 
way).  I've toyed with the maintenance_work_mem and some of the other 
params, but it's still way slower than it seems like it should be.

So any contributions are much appreciated.

Thanks!

P.S. Assume I've done a ton of reading and research into PG tuning, 
which I have.  I just can't seem to find anything beyond the basics that 
talks about really speeding up bulk loads.




Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Ryan Hansen

NEVERMIND!!

I found it.  Turns out there was still a constraint on the table.  Once 
I dropped that, the time went down to 44 minutes.


Maybe I am an idiot after all. :)

-Ryan


Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Alan Hodgson
On Wednesday 10 September 2008, Ryan Hansen [EMAIL PROTECTED] 
wrote:
Currently it's taking about 12 hours to complete on a 64 bit
 server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
 drive.  I don't seem to get any improvement running the same operation
 on a dual opteron dual-core, 16 GB server.

 I'm not asking for someone to solve my problem, just some direction in
 the best ways to tune for faster bulk loading, since this will be a
 fairly regular operation for our application (assuming it can work this
 way).  I've toyed with the maintenance_work_mem and some of the other
 params, but it's still way slower than it seems like it should be.
 So any contributions are much appreciated.

Your drive subsystem, such as it is, is inappropriate for a database. Your 
bottleneck is your drive. 

Turning fsync off might help. You should also drop all indexes on the table 
before the COPY and add them back after (which would eliminate a lot of 
random I/O during the COPY).
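
In practice that's something like this sketch (database, table, index, and
file names here are made up):

psql -d mydb -c "DROP INDEX big_table_id_idx"
psql -d mydb -c "COPY big_table FROM '/tmp/data.csv' WITH CSV"
psql -d mydb -c "CREATE INDEX big_table_id_idx ON big_table (id)"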

-- 
Alan



Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Dimitri Fontaine
Hi,

On Wednesday 10 September 2008, Ryan Hansen wrote:
 One thing I'm experiencing some trouble with is running a COPY of a
 large file (20+ million records) into a table in a reasonable amount of
 time.  Currently it's taking about 12 hours to complete on a 64 bit
 server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
 drive.  I don't seem to get any improvement running the same operation
 on a dual opteron dual-core, 16 GB server.

Your single SATA disk is probably very busy going from reading the source file to
writing the data.  You could try raising checkpoint_segments to 64 or more, but a
single SATA disk won't give you high I/O performance.  You're getting what you
paid for...
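
For example, something like this (a sketch; 64 is only a starting point, and
$PGDATA stands in for your data directory):

# in postgresql.conf:
#   checkpoint_segments = 64
pg_ctl reload -D $PGDATA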

You could maybe ease the disk load by launching the COPY from a remote (local
network) machine, and while you're at it, if the file is big, try parallel loading
with pgloader.

Regards,
-- 
dim




Re: [PERFORM] too many clog files

2008-09-10 Thread Scott Marlowe
On Wed, Sep 10, 2008 at 8:58 AM, Kevin Grittner
[EMAIL PROTECTED] wrote:
 Matt Smiley [EMAIL PROTECTED] wrote:
 Alvaro Herrera wrote:
 Move the old clog files back where they were, and run VACUUM FREEZE in
 all your databases.  That should clean up all the old pg_clog files, if
 you're really that desperate.

 Has anyone actually seen a CLOG file get removed under 8.2 or 8.3?

 Some of my high-volume databases don't quite go back to 0000, but this
 does seem to be a problem.  I have confirmed that VACUUM FREEZE on all
 but template0 (which doesn't allow connections) does not clean them
 up.  No long running transactions are present.

I have a pretty high volume server that's been online for one month
and it had somewhere around 53, going back in order to 0000, and it
was recently vacuumdb -az 'ed. Running another one.  No long running
transactions, etc...



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Mark Wong
On Wed, Sep 10, 2008 at 9:26 AM, Scott Carey [EMAIL PROTECTED] wrote:
 How does that readahead tunable affect random reads or mixed random /
 sequential situations?  In many databases, the worst case scenarios aren't
 when you have a bunch of concurrent sequential scans but when there is
 enough random read/write concurrently to slow the whole thing down to a
 crawl.   How the file system behaves under this sort of concurrency

 I would be very interested in a mixed fio profile with a background writer
 doing moderate, paced random and sequential writes combined with concurrent
 sequential reads and random reads.

The data for the other fio profiles we've been using are on the wiki,
if your eyes can take the strain.  We are working on presenting the
data in a more easily digestible manner.  I don't think we'll add any
more fio profiles in the interest of moving on to doing some sizing
exercises with the dbt2 oltp workload.  We're just going to wrap up a
couple more scenarios first and get through a couple of conference
presentations.  The two conferences in particular are the Linux
Plumbers Conference, and the PostgreSQL Conference: West 08, which are
both in Portland, Oregon.

Regards,
Mark



Re: [PERFORM] too many clog files

2008-09-10 Thread Tom Lane
Scott Marlowe [EMAIL PROTECTED] writes:
 On Wed, Sep 10, 2008 at 8:58 AM, Kevin Grittner
 [EMAIL PROTECTED] wrote:
 Some of my high-volume databases don't quite go back to 0000, but this
 does seem to be a problem.  I have confirmed that VACUUM FREEZE on all
 but template0 (which doesn't allow connections) does not clean them
 up.  No long running transactions are present.

 I have a pretty high volume server that's been online for one month
 and it had somewhere around 53, going back in order to 0000, and it
 was recently vacuumdb -az 'ed. Running another one.  No long running
 transactions, etc...

The expected behavior (in 8.2 and newer) is to maintain about
autovacuum_freeze_max_age transactions' worth of clog; which is to say
about 50MB at the default settings.  If you've got significantly more
than that then we should look more closely.
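
(For anyone who wants to sanity-check that: clog stores 2 bits per
transaction, so 200,000,000 transactions * 2 bits / 8 = 50,000,000 bytes,
roughly 50MB.  A quick look on a live system, with $PGDATA standing in for
the data directory:)

psql -c "SHOW autovacuum_freeze_max_age"
du -sh $PGDATA/pg_clog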

I don't remember what the truncation rule was in 8.1, so I can't speak
to the OP's complaint.

regards, tom lane



Re: [PERFORM] too many clog files

2008-09-10 Thread Kevin Grittner
 Tom Lane [EMAIL PROTECTED] wrote: 
 
 The expected behavior (in 8.2 and newer) is to maintain about
 autovacuum_freeze_max_age transactions' worth of clog; which is to say
 about 50MB at the default settings.
 
The active database I checked, where it didn't go all the way back to
0000, had 50 MB of files; so I guess it is working as intended.
 
It sounds like the advice to the OP that running VACUUM FREEZE on all
databases to clean up the files was off base?
 
-Kevin



Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Scott Carey
A single SATA drive may not be the best performer, but:

1. It won't make a load take 12 hours unless we're talking a load that is, in
total, similar to the size of the disk.  A slow, newer SATA drive will read
and write at ~50MB/sec at minimum, so the whole 320GB can be scanned at
3GB per minute.  That's ~5 hours.  It is not likely that 20M records is over
20GB, and at that size there is no way the disk is the bottleneck.

2. To figure out if the disk or CPU is the bottleneck, don't assume.  Check
iostat or top and look at the disk utilization % and I/O wait times.  Check
the backend process CPU utilization.  In my experience, there are many
things that can cause COPY to be completely CPU bound even with slow disks
-- I have seen it bound to a 5MB/sec write rate on a 3GHz CPU, which a drive
from 1998 could handle.
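
For example, while the COPY is running (iostat is part of the sysstat
package):

iostat -x 5     # watch %util and await on the device
top             # check whether the COPY backend is pegging one CPU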

It seems like this case is resolved, but there are some other good tuning
recommendations.  Don't blame the disk until the disk is actually showing
high utilization though.

COPY is bound typically by the disk or a single CPU.  It is usually CPU
bound if there are indexes or constraints on the table, and sometimes even
when there are none.

The pg_bulkload tool will, in almost all cases, be significantly faster, but
it has limitations that make it inappropriate for some to use.



On Wed, Sep 10, 2008 at 10:14 AM, Alan Hodgson [EMAIL PROTECTED] wrote:

 On Wednesday 10 September 2008, Ryan Hansen 
 [EMAIL PROTECTED]
 wrote:
 Currently it's taking about 12 hours to complete on a 64 bit
  server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
  drive.  I don't seem to get any improvement running the same operation
  on a dual opteron dual-core, 16 GB server.
 
  I'm not asking for someone to solve my problem, just some direction in
  the best ways to tune for faster bulk loading, since this will be a
  fairly regular operation for our application (assuming it can work this
  way).  I've toyed with the maintenance_work_mem and some of the other
  params, but it's still way slower than it seems like it should be.
  So any contributions are much appreciated.

 Your drive subsystem, such as it is, is inappropriate for a database. Your
 bottleneck is your drive.

 Turning fsync off might help. You should also drop all indexes on the table
 before the COPY and add them back after (which would eliminate a lot of
 random I/O during the COPY).

 --
 Alan




Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Scott Carey
Correction --
 2 hours to read the whole disk.

1. It won't make a load take 12 hours unless we're talking a load that is, in
 total, similar to the size of the disk.  A slow, newer SATA drive will read
 and write at ~50MB/sec at minimum, so the whole 320GB can be scanned at
 3GB per minute.  That's ~5 hours.  It is not likely that 20M records is over
 20GB, and at that size there is no way the disk is the bottleneck.



Re: [PERFORM] too many clog files

2008-09-10 Thread Scott Carey
And potentially one could tune down the number kept by modifying the appropriate
freeze parameter for 8.1 (I'm not sure of the details), so that it keeps
perhaps 20MB or so rather than 50MB.

On Wed, Sep 10, 2008 at 10:47 AM, Kevin Grittner 
[EMAIL PROTECTED] wrote:

  Tom Lane [EMAIL PROTECTED] wrote:

  The expected behavior (in 8.2 and newer) is to maintain about
  autovacuum_freeze_max_age transactions' worth of clog; which is to say
  about 50MB at the default settings.

 The active database I checked, where it didn't go all the way back to
 0000, had 50 MB of files; so I guess it is working as intended.

 It sounds like the advice to the OP that running VACUUM FREEZE on all
 databases to clean up the files was off base?

 -Kevin




Re: [PERFORM] too many clog files

2008-09-10 Thread Alvaro Herrera
Kevin Grittner wrote:

 It sounds like the advice to the OP that running VACUUM FREEZE on all
 databases to clean up the files was off base?

His responses are not explicit enough to know.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Scott Carey
I am planning my own I/O tuning exercise for a new DB and am setting up some
fio profiles.  I appreciate the work and will use some of yours as a
baseline to move forward.  I will be making some mixed mode fio profiles and
running our own application and database as a test as well.  I'll focus on
ext3 versus xfs (Linux) and zfs (Solaris), however, and I expect to be working
with sequential transfer rates many times larger than in your test; I'm also
interested in performance under heavy concurrency -- so the results may
differ quite a bit.

I'll share the info I can.


On Wed, Sep 10, 2008 at 10:38 AM, Mark Wong [EMAIL PROTECTED] wrote:

 On Wed, Sep 10, 2008 at 9:26 AM, Scott Carey [EMAIL PROTECTED]
 wrote:
  How does that readahead tunable affect random reads or mixed random /
  sequential situations?  In many databases, the worst case scenarios aren't
  when you have a bunch of concurrent sequential scans but when there is
  enough random read/write concurrently to slow the whole thing down to a
  crawl.   How the file system behaves under this sort of concurrency

  I would be very interested in a mixed fio profile with a background writer
  doing moderate, paced random and sequential writes combined with concurrent
  sequential reads and random reads.

 The data for the other fio profiles we've been using are on the wiki,
 if your eyes can take the strain.  We are working on presenting the
 data in a more easily digestible manner.  I don't think we'll add any
 more fio profiles in the interest of moving on to doing some sizing
 exercises with the dbt2 oltp workload.  We're just going to wrap up a
 couple more scenarios first and get through a couple of conference
 presentations.  The two conferences in particular are the Linux
 Plumbers Conference, and the PostgreSQL Conference: West 08, which are
 both in Portland, Oregon.

 Regards,
 Mark



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Greg Smith

On Wed, 10 Sep 2008, Scott Carey wrote:


How does that readahead tunable affect random reads or mixed random /
sequential situations?


It still helps as long as you don't make the parameter giant.  The read 
cache in a typical hard drive nowadays is 8-32MB.  If you're seeking a 
lot, you still might as well read the next 1MB or so after the block 
requested once you've gone to the trouble of moving the disk somewhere. 
Seek-bound workloads will only waste a relatively small amount of the 
disk's read cache that way--the slow seek rate itself keeps that from 
polluting the buffer cache too fast with those reads--while sequential 
ones benefit enormously.


If you look at Mark's tests, you can see approximately where the readahead 
is filling the disk's internal buffers, because what happens then is the 
sequential read performance improvement levels off.  That looks near 8MB 
for the array he's tested, but I'd like to see a single disk to better 
feel that out.  Basically, once you know that, you back off from there as 
much as you can without killing sequential performance completely and that 
point should still support a mixed workload.


Disks are fairly well understood physical components, and if you think in 
those terms you can build a gross model easily enough:


Average seek time:  4ms
Seeks/second:   250
Data read/seek: 1MB (read-ahead number goes here)
Total read bandwidth:   250MB/s

Since that's around what a typical interface can support, that's why I 
suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's 
pretty close to optimal for sequential as well here (big improvement from 
the default Linux RA of 256 blocks=128K).  If you know your work is biased 
heavily toward sequential scans, you might pick the 8MB read-ahead 
instead.  That value (--setra=16384 = 8MB) has actually been the standard 
"start here" setting 3ware suggests on Linux for a while now: 
http://www.3ware.com/kb/Article.aspx?id=11050



I would be very interested in a mixed fio profile with a background writer
doing moderate, paced random and sequential writes combined with concurrent
sequential reads and random reads.


Trying to make disk benchmarks really complicated is a path that leads to 
a lot of wasted time.  I once made this gigantic design plan for something 
that worked like the PostgreSQL buffer management system to work as a disk 
benchmarking tool.  I threw it away after confirming I could do better 
with carefully scripted pgbench tests.


If you want to benchmark something that looks like a database workload, 
benchmark a database workload.  That will always be better than guessing 
what such a workload acts like in a synthetic fashion.  The seeks/second 
number bonnie++ spits out is good enough for most purposes at figuring out 
if you've detuned seeks badly.


pgbench -S run against a giant database gives results that look a lot 
like seeks/second, and if you mix multiple custom -f tests together it 
will round-robin between them at random...
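
For example, something like this approximates a seeks/second test -- a
sketch only; the scale factor and client count are arbitrary, it assumes a
database named bench already exists, and the data set needs to be much
larger than RAM to stay seek-bound:

pgbench -i -s 1000 bench          # initialize; very roughly 15GB at this scale
pgbench -S -c 8 -t 10000 bench    # select-only run with 8 clients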


It's really helpful to measure these various disk subsystem parameters 
individually.  Knowing the sequential read/write, seeks/second, and commit 
rate for a disk setup is mainly valuable at making sure you're getting the 
full performance expected from what you've got.  Like in this example, 
where something was obviously off on the single disk results because reads 
were significantly slower than writes.  That's not supposed to happen, so 
you know something basic is wrong before you even get into RAID and such. 
Beyond confirming whether or not you're getting approximately what you 
should be out of the basic hardware, disk benchmarks are much less useful 
than application ones.


With all that, I think I just gave away what the next conference paper 
I've been working on is about.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Dimitri Fontaine


On 10 Sep 2008, at 19:16, Bill Moran wrote:
There's a program called pgloader which supposedly is faster than  
copy.

I've not used it so I can't say definitively how much faster it is.


In fact pgloader is using COPY under the hood, and doing so via a
network connection (which could be a unix domain socket), whereas COPY on the
server reads the file content directly from the local file.  So no,
pgloader is not inherently faster than COPY.


That said, pgloader is able to split the workload between as many
threads as you want, and so it could saturate the I/O when the disk
subsystem performs well enough for a single CPU not to be able to
overload it.  Two parallel loading modes are supported: pgloader will
either have N parts of the file processed by N threads, or have one
thread read and parse the file and then fill up queues for N threads to
send COPY commands to the server.


Now, it could be that using pgloader with a parallel setup performs
better than plain COPY on the server.  This remains to be tested; the
use case at hand is said to involve data files of hundreds of GB or even
some TB.  I don't have any facilities to test-drive such a setup...


Note that those pgloader parallel options were requested by
PostgreSQL hackers in order to test some ideas with respect to a
parallel pg_restore; maybe re-explaining what has been implemented
will reopen this can of worms :)


Regards,
-- 
dim




Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread Scott Marlowe
On Wed, Sep 10, 2008 at 11:16 AM, Bill Moran
[EMAIL PROTECTED] wrote:
 There's a program called pgloader which supposedly is faster than copy.
 I've not used it so I can't say definitively how much faster it is.

I think you are thinking of pg_bulkloader...



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Scott Carey
Great info Greg,

Some follow-up questions and information in-line:

On Wed, Sep 10, 2008 at 12:44 PM, Greg Smith [EMAIL PROTECTED] wrote:

 On Wed, 10 Sep 2008, Scott Carey wrote:

  How does that readahead tunable affect random reads or mixed random /
 sequential situations?


 It still helps as long as you don't make the parameter giant.  The read
 cache in a typical hard drive noawadays is 8-32MB.  If you're seeking a lot,
 you still might as well read the next 1MB or so after the block requested
 once you've gone to the trouble of moving the disk somewhere. Seek-bound
 workloads will only waste a relatively small amount of the disk's read cache
 that way--the slow seek rate itself keeps that from polluting the buffer
 cache too fast with those reads--while sequential ones benefit enormously.

 If you look at Mark's tests, you can see approximately where the readahead
 is filling the disk's internal buffers, because what happens then is the
 sequential read performance improvement levels off.  That looks near 8MB for
 the array he's tested, but I'd like to see a single disk to better feel that
 out.  Basically, once you know that, you back off from there as much as you
 can without killing sequential performance completely and that point should
 still support a mixed workload.

 Disks are fairly well understood physical components, and if you think in
 those terms you can build a gross model easily enough:

 Average seek time:  4ms
 Seeks/second:   250
 Data read/seek: 1MB (read-ahead number goes here)
 Total read bandwidth:   250MB/s

 Since that's around what a typical interface can support, that's why I
 suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's
 pretty close to optimal for sequential as well here (big improvement from
 the default Linux RA of 256 blocks=128K).  If you know your work is biased
 heavily toward sequential scans, you might pick the 8MB read-ahead instead.
  That value (--setra=16384 - 8MB) has actually been the standard start
 here setting 3ware suggests on Linux for a while now:
 http://www.3ware.com/kb/Article.aspx?id=11050


Ok, so this is a drive level parameter that affects the data going into the
disk cache?  Or does it also get pulled over the SATA/SAS link into the OS
page cache?  I've been searching around with google for the answer and can't
seem to find it.

Additionally, I would like to know how this works with hardware RAID -- Does
it set this value per disk?  Does it set it at the array level (so that 1MB
with an 8 disk stripe is actually 128K per disk)?  Is it RAID driver
dependent?  If it is purely the OS, then it is above the RAID level and affects
the whole array -- and is hence almost useless.  If it is for the whole
array, it would have horrendous negative impact on random I/O per second if
the total readahead became longer than a stripe width -- if it is a full
stripe then each I/O, even those less than the size of a stripe, would cause
an I/O on every drive, dropping the I/O per second to that of a single
drive.
If it is a drive level setting, then it won't affect i/o per sec by making
i/o's span multiple drives in a RAID, which is good.

Additionally, the O/S should have a good heuristic based read-ahead process
that should make the drive/device level read-ahead much less important.  I
don't know how long it's going to take for Linux to do this right:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
http://kerneltrap.org/node/6642


Lets expand a bit on your model above for a single disk:

A single disk, with 4ms seeks, and max disk throughput of 125MB/sec.  The
interface can transfer 300MB/sec.
250 seeks/sec.  Some chunk of data in that seek is free; afterwards it is
surely not.
512KB can be read in 4ms then.  A 1MB read-ahead would result in
4ms seek + 8ms read = 12ms per I/O, or ~83 seeks/sec.
However, some chunk of that 1MB is free with the seek.  I'm not sure how
much per drive, but it is likely on the order of 8K - 64K.

I suppose I'll have to experiment in order to find out.  But I can't see how
a 1MB read-ahead, which should take 2x as long as seek time to read off the
platters, could not have significant impact on random I/O per second on
single drives.   For SATA drives the transfer rate to seek time ratio is
smaller, and their caches are bigger, so a larger read-ahead will impact
things less.






  I would be very interested in a mixed fio profile with a background writer
  doing moderate, paced random and sequential writes combined with concurrent
  sequential reads and random reads.


 Trying to make disk benchmarks really complicated is a path that leads to a
 lot of wasted time.  I once made this gigantic design plan for something that
 worked like the PostgreSQL buffer management system to work as a disk
 benchmarking tool.  I threw it away after confirming I could do better with
 carefully scripted pgbench tests.

 If you want to benchmark something that looks like a 

[PERFORM] Re: [PERFORM] Improve COPY performance for large data sets

2008-09-10 Thread jay
I suspect your table has an index, or checkpoint_segments is small, which
leads PG to checkpoint frequently.
If the table has an index or constraint, drop it before the copy, and after
the copy finishes, create the index or constraint again.
If checkpoint_segments is small, enlarge it.
You can also turn fsync off while you do the copy and turn it back on
afterwards.
You can also enlarge maintenance_work_mem.

If you do the above, the time cost will go down significantly.

 莫建祥
Alibaba Software (Shanghai) Co., Ltd.
R&D Center - IM Server Development Dept.
Contact: 86-0571-85022088-13072
TradeManager ID: jaymo  Taobao ID: jackem
Company website: www.alisoft.com
wiki: http://10.0.32.21:1688/confluence/pages/viewpage.action?pageId=10338

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ryan Hansen
Sent: 11 September 2008 1:14
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Improve COPY performance for large data sets

NEVERMIND!!

I found it.  Turns out there was still a constraint on the table.  Once 
I dropped that, the time went down to 44 minutes.

Maybe I am an idiot after all. :)

-Ryan




Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Greg Smith

On Wed, 10 Sep 2008, Scott Carey wrote:


Ok, so this is a drive level parameter that affects the data going into the
disk cache?  Or does it also get pulled over the SATA/SAS link into the OS
page cache?


It's at the disk block driver level in Linux, so I believe that's all 
going into the OS page cache.  They've been rewriting that section a bit 
and I haven't checked it since that change (see below).



Additionally, I would like to know how this works with hardware RAID -- Does
it set this value per disk?


Hardware RAID controllers usually have their own read-ahead policies that 
may or may not impact whether the OS-level read-ahead is helpful.  Since 
Mark's tests are going straight into the RAID controller, that's why it's 
helpful here, and why many people don't ever have to adjust this 
parameter.  For example, it doesn't give a dramatic gain on my Areca card 
even in JBOD mode, because that thing has its own cache to manage with its 
own agenda.


Once you start fiddling with RAID stripe sizes as well the complexity 
explodes, and next thing you know you're busy moving the partition table 
around to make the logical sectors line up with the stripes better and 
similar exciting work.



Additionally, the O/S should have a good heuristic based read-ahead process
that should make the drive/device level read-ahead much less important.  I
don't know how long its going to take for Linux to do this right:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
http://kerneltrap.org/node/6642


That was committed in 2.6.23:

http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

but clearly some larger minimum hints still helps, as the system we've 
been staring at benchmarks has that feature.



Some chunk of data in that seek is free; afterwards it is surely not...


You can do a basic model of the drive to get a ballpark estimate on these 
things like I threw out, but trying to break down every little bit gets 
hairy.  In most estimation cases you see, where 128kB is the amount being 
read, the actual read time is so small compared to the rest of the numbers 
that it just gets ignored.


I was actually being optimistic about how much cache can get filled by 
seeks.  If the disk is spinning at 15000RPM, that's 4ms to do a full 
rotation.  That means that on average you'll also wait 2ms just to get the 
heads lined up to read that one sector on top of the 4ms seek to get in 
the area; now we're at 6ms before you've read anything, topping seeks out 
at under 167/second.  That number--average seek time plus half a 
rotation--is what a lot of people call the IOPS for the drive.  There, 
typically the time spent actually reading data once you've gone through 
all that doesn't factor in.  IOPS is not very well defined, some people 
*do* include the reading time once you're there; one reason I don't like 
to use it.  There's a nice chart showing some typical computations here at 
http://www.dbasupport.com/oracle/ora10g/disk_IO_02.shtml if anybody wants 
to see how this works for other classes of disk.  The other reason I don't 
like focusing too much on IOPS (some people act like it's the only 
measurement that matters) is that it tells you nothing about the 
sequential read rate, and you have to consider both at once to get a clear 
picture--particularly when there are adjustments that impact those two 
oppositely, like read-ahead.


As far as the internal transfer speed of the heads to the drive's cache 
once it's lined up, those are creeping up toward the 200MB/s range for the 
kind of faster drives the rest of these stats come from.  So the default 
of 128kB is going to take 0.6ms, while a full 1MB might take 5ms.  You're 
absolutely right to question how hard that will degrade seek performance; 
these slightly more accurate numbers suggest that might be as bad as going 
from 6.6ms to 11ms per seek, or from 150 IOPS to 91 IOPS.  It also points 
out how outrageously large the really big read-ahead numbers are once 
you're seeking instead of sequentially reading.


One thing it's hard to know is how much read-ahead the drive was going to 
do on its own, no matter what you told it, anyway as part of its caching 
algorithm.



I suppose I should learn more about pgbench.


Most people use it as just a simple benchmark that includes a mixed 
read/update/insert workload.  But that's internally done using a little 
command substitution language that lets you easily write things like 
"generate a random number between 1 and 1M, read the record from this 
table, and then update this associated record" that scale based on how big 
the data set you've given it is.  You can write your own scripts in that 
form too.  And if you specify several scripts like that at a time, it will 
switch between them at random, and you can analyze the average execution 
time broken down per type if you save the latency logs.  Makes it real easy 
to adjust the number of clients and the mix of 

Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread James Mansion

Greg Smith wrote:

Average seek time:      4ms
Seeks/second:           250
Data read/seek:         1MB   (read-ahead number goes here)
Total read bandwidth:   250MB/s

Most spinning disks now are nearer to 100MB/s streaming.  You've talked 
yourself into twice that, random access!


James




Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Scott Marlowe
On Wed, Sep 10, 2008 at 11:21 PM, James Mansion
[EMAIL PROTECTED] wrote:
 Greg Smith wrote:

 Average seek time:      4ms
 Seeks/second:           250
 Data read/seek:         1MB   (read-ahead number goes here)
 Total read bandwidth:   250MB/s

 Most spinning disks now are nearer to 100MB/s streaming.  You've talked
 yourself into twice that, random access!

The fastest cheetahs on this page hit 171MB/second:

http://www.seagate.com/www/en-us/products/servers/cheetah/

Are there any drives that have a faster sequential transfer rate out there?

Checked out Hitachi's global storage site and their fastest drive
seems just a tad slower.



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread Greg Smith

On Thu, 11 Sep 2008, James Mansion wrote:

Most spinning disks now are nearer to 100MB/s streaming.  You've talked 
yourself into twice that, random access!


The point I was trying to make there is that even under impossibly optimal 
circumstances, you'd be hard pressed to blow out the disk's read cache 
with seek-dominated data even if you read a lot at each seek point.  That 
idea didn't make it from my head into writing very well though.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
