[HACKERS] Re: [ADMIN] v7.1b4 bad performance

2001-02-17 Thread Tatsuo Ishii

lockhart> ... See included png file.
lockhart>
lockhart> What kind of machine was this run on?
lockhart>
lockhart>  - Thomas

Sorry, I forgot to mention that.

SONY VAIO Z505CR/K (notebook PC)
Pentium III 750MHz/256MB memory/20GB IDE HDD
Linux (kernel 2.2.17)
configure --enable-multibyte=EUC_JP
postgresql.conf:
fsync = on
max_connections = 128
shared_buffers = 1024
silent_mode = on
commit_delay = 0
postmaster opts for 7.0.3:
-B 1024 -N 128 -S
pgbench settings:
scaling factor = 1
data excludes connection establishment time
total number of transactions is always 640
   (see the included script I ran for the testing)
--
#! /bin/sh
pgbench -i test
for i in 1 2 4 8 16 32 64 128
do
t=`expr 640 / $i`
pgbench -t $t -c $i test
echo "== sync =="
sync;sync;sync;sleep 10
echo "= sync done =="
done
--
--
Tatsuo Ishii



[HACKERS] Docs generation fixed

2001-02-17 Thread Thomas Lockhart

The html docs should once again be generated automatically on
postgresql.org on a twice-daily basis. Thanks to Peter E for walking me
through the toolset and configuration changes...

   - Thomas



Re: [HACKERS] Re: [ADMIN] v7.1b4 bad performance

2001-02-17 Thread Larry Rosenman

* Tom Lane [EMAIL PROTECTED] [010216 22:49]:
 "Schmidt, Peter" [EMAIL PROTECTED] writes:
  So, is it OK to use commit_delay=0?
 
 Certainly.  In fact, I think that's about to become the default ;-)
 
 I have now experimented with several different platforms --- HPUX,
 FreeBSD, and two considerably different strains of Linux --- and I find
 that the minimum delay supported by select(2) is 10 or more milliseconds
 on all of them, as much as 20 msec on some popular platforms.  Try it
 yourself (my test program is attached).
 
 Thus, our past arguments about whether a few microseconds of delay
 before commit are a good idea seem moot; we do not have any portable way
 of implementing that, and a ten millisecond delay for commit is clearly
 Not Good.
 
   regards, tom lane
Here is another one.  UnixWare 7.1.1 on a P-III 500 with 256 MB RAM:

$ cc -o tgl.test -O tgl.test.c
$ time ./tgl.test 0

real    0m0.01s
user    0m0.01s
sys     0m0.00s
$ time ./tgl.test 1

real    0m10.01s
user    0m0.00s
sys     0m0.01s
$ time ./tgl.test 2

real    0m10.01s
user    0m0.00s
sys     0m0.00s
$ time ./tgl.test 3

real    0m10.11s
user    0m0.00s
sys     0m0.01s
$ uname -a
UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
$ 
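Tom's test program itself didn't survive in this archive, so here is a reconstruction of roughly what it must look like (the shape and names are guesses, not the original attachment):

```c
/* Reconstruction of the kind of select(2) granularity test discussed
 * above -- ask select() for a delay in microseconds and measure what
 * you actually get back. */
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Sleep via select() for request_usec microseconds and return the
 * elapsed wall-clock time, also in microseconds. */
long select_delay_usec(long request_usec)
{
    struct timeval delay, start, stop;

    delay.tv_sec = request_usec / 1000000;
    delay.tv_usec = request_usec % 1000000;

    gettimeofday(&start, NULL);
    (void) select(0, NULL, NULL, NULL, &delay);
    gettimeofday(&stop, NULL);

    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}
```

On a 100 Hz kernel, any nonzero request comes back as 10-20 msec of real time, which is exactly what the UnixWare timings above show.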

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749



[HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.

2001-02-17 Thread Lamar Owen

Ok, after Tatsuo and Peter have both said that building without locale
support should not use the locale support in the OS, and remembering my
6.5.3 experience of a year back, I decided to test it out completely. 
And I am wrong with respect to 7.1beta4.

For 7.1beta4 disabling locale will indeed work properly, at least on
RedHat 6.2.

Testing methodology:
1.) Blow out entire PGDATA tree;
2.) Initdb with locale-enabled backend;
3.) Run regression with locale-enable binaries (locale=en_US);
4.) Rebuild without --enable-locale;
5.) Blow out entire PGDATA tree;
6.) Initdb with non-locale backend;
7.) Run regression with non-locale binaries.

Results:
For --enable-locale RPM's, pg_regress --schedule=parallel_schedule
produces:
parallel group (13 tests):  boolean char name varchar int4 int2 oid
float4 float8 text bit int8 numeric
 boolean  ... ok
 char ... ok
 name ... ok
 varchar  ... ok
 text ... ok
 int2 ... ok
 int4 ... ok
 int8 ... FAILED
 oid  ... ok
 float4   ... ok
 float8   ... ok
 bit  ... ok
 numeric  ... FAILED
test strings  ... ok
test numerology   ... ok
parallel group (18 tests):  point lseg box path polygon circle comments
reltime date abstime interval time inet type_sanity tinterval timestamp
oidjoins opr_sanity
 point... ok
 lseg ... ok
 box  ... ok
 path ... ok
 polygon  ... ok
 circle   ... ok
 date ... ok
 time ... ok
 timestamp... ok
 interval ... ok
 abstime  ... ok
 reltime  ... ok
 tinterval... ok
 inet ... ok
 comments ... ok
 oidjoins ... ok
 type_sanity  ... ok
 opr_sanity   ... ok
test geometry ... ok
test horology ... ok
test create_function_1... ok
test create_type  ... ok
test create_table ... ok
test create_function_2... ok
test copy ... ok
parallel group (7 tests):  create_aggregate create_operator triggers
inherit constraints create_misc create_index
 constraints  ... ok
 triggers ... ok
 create_misc  ... ok
 create_aggregate ... ok
 create_operator  ... ok
 create_index ... ok
 inherit  ... ok
test create_view  ... ok
test sanity_check ... ok
test errors   ... ok
test select   ... ok
parallel group (16 tests):  select_into select_distinct_on
select_distinct select_having select_implicit subselect transactions
union case random arrays aggregates join portals hash_index btree_index
 select_into  ... ok
 select_distinct  ... ok
 select_distinct_on   ... ok
 select_implicit  ... FAILED
 select_having... FAILED
 subselect... ok
 union... ok
 case ... ok
 join ... ok
 aggregates   ... ok
 transactions ... ok
 random   ... failed (ignored)
 portals  ... ok
 arrays   ... ok
 btree_index  ... ok
 hash_index   ... ok
test misc ... ok
parallel group (5 tests):  portals_p2 alter_table rules foreign_key
select_views
 select_views ... FAILED
 alter_table  ... ok
 portals_p2   ... ok
 rules... ok
 foreign_key  ... ok
parallel group (3 tests):  limit temp plpgsql
 limit... ok
 plpgsql  ... ok
 temp ... ok


With locale disabled:
All 76 tests passed.

So, there's the data.  This is different behavior from the 6.5.3
non-locale set I produced a year ago.  Is there interest in a non-locale
RPM distribution, or?  The locale enabled regression results fail due to
currency format and collation errors.  Diffs attached.  I'm not sure I
understand the select_views failure, either.  Locale used was en_US.

Comments?
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
 locale-run.diffs


Re: [HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.

2001-02-17 Thread Tom Lane

Lamar Owen [EMAIL PROTECTED] writes:
 The locale enabled regression results fail due to
 currency format and collation errors.  Diffs attached.  I'm not sure I
 understand the select_views failure, either.  Locale used was en_US.

The select_views delta looks like a sort-order issue as well; nothing
to worry about.

These deltas would go away if you allowed pg_regress to build a temp
installation in which it could force the locale to C.  Of course,
that doesn't presently work without a built source tree to install
from.  I wonder if it is worth adding a third operating mode to
pg_regress that would build a temp PGDATA directory but use the
already-installed bin/lib/share directories ...

regards, tom lane



Re: [HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.

2001-02-17 Thread Lamar Owen

Tom Lane wrote:
 Lamar Owen [EMAIL PROTECTED] writes:
  The locale enabled regression results fail due to
  currency format and collation errors.  Diffs attached.  I'm not sure I
  understand the select_views failure, either.  Locale used was en_US.
 
 The select_views delta looks like a sort-order issue as well; nothing
 to worry about.

Good.  I didn't see any difference -- but maybe that's because I went
cross-eyed :-)
 
 These deltas would go away if you allowed pg_regress to build a temp
 installation in which it could force the locale to C.  Of course,
 that doesn't presently work without a built source tree to install
 from.  I wonder if it is worth adding a third operating mode to

Possibly.  If pg_regress uses a different port for the postmaster, AND a
different PGDATA, you could run regression in a sandbox while a
production system was running, FWIW.  Since that's more of an RPM issue
than a core issue, I can do that third-mode work, as I would be the
direct beneficiary (unless someone else does it first, of course).

Both the locale and non-locale installations were from RPM, BTW, as I
wanted the fewest possible variables.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Lamar Owen

Bruce Momjian wrote:
 In fact, the kernel doesn't even have a way
 to measure microsecond timings.

Linux has patches available to do microsecond timings, but they're
nonportable, of course.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11



Re: [HACKERS] Re: beta5 ...

2001-02-17 Thread The Hermit Hacker

On Sat, 17 Feb 2001, Bruce Momjian wrote:

  
   BTW, is 7.1 going to be a bit slower than 7.0? Or just Beta 5? Just
   curious. Don't mind waiting for 7.2 for the speed-up if necessary.
 
  It is possible that it will be ... the question is whether the slow down
  is unbearable or not, as to whether we'll let it hold things up or not ...
 
  From reading one of Tom's emails, it looks like the changes to 'fix' the
  slowdown are drastic/large enough that it might not be safe (or desirable)
  to fix it at this late of a stage in beta ...
 
  Depending on what is involved, we might put out a v7.1 for March 1st, so
  that ppl can feel confident about using the various features, but have a
  v7.1.1 that follows relatively closely on its heels that addresses the
  performance problem ...

 The easy fix is to just set the delay to zero.  Looks like that will fix
 most of the problem.

Except that Vadim had a reason for setting it to 5, and I'm loath to see
that changed unless someone actually understands the ramifications other
than increasing performance ...

 The near-committers thing may indeed be overkill, and certainly is not
 worth holding beta.

What is this 'near-committers thing'??





Re: [HACKERS] Re: beta5 ...

2001-02-17 Thread Tom Lane

The Hermit Hacker [EMAIL PROTECTED] writes:
 The easy fix is to just set the delay to zero.  Looks like that will fix
 most of the problem.

 Except that Vadim had a reason for setting it to 5,

He claimed to have seen better performance with a nonzero delay.
So far none of the rest of us have been able to duplicate that.
Perhaps he was using a machine where a 5-microsecond select() delay
actually is 5 microseconds?  If so, he's the outlier, not the
rest of us ...

regards, tom lane



[HACKERS] WAL and commit_delay

2001-02-17 Thread Bruce Momjian

I want to give some background on commit_delay, its initial purpose, and
possible options.

First, looking at the process that happens during a commit:

write() - copy WAL dirty page to kernel disk buffer
fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().

What Vadim doesn't want is:

time    backend 1   backend 2
----    ---------   ---------
0       write()
1       fsync()     write()
2                   fsync()

This would be better as:

time    backend 1   backend 2
----    ---------   ---------
0       write()
1                   write()
2       fsync()     fsync()

This was the purpose of the commit_delay.  Having two fsync()'s is not a
problem because only one will see there are dirty buffers.  The other
will probably either return right away, or wait for the other's fsync()
to complete.

With the delay, it looks like:

time    backend 1   backend 2
----    ---------   ---------
0       write()
1       sleep()     write()
2       fsync()     sleep()
3                   fsync()

Which shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time.  However, a very possible
circumstance is:

time    backend 1   backend 2   backend 3
----    ---------   ---------   ---------
0       write()
1       sleep()     write()
2       fsync()     sleep()     write()
3                   fsync()     sleep()
4                               fsync()

In this case, the fsync() by backend 2 does indeed do some work, because
it fsyncs backend 3's write().  Frankly, I don't see how the sleep does
much except delay things, because it doesn't have any smarts about when
the delay is useful and when it is useless.  Without that feedback, I
recommend removing the setting entirely.  For a single backend, the sleep
is clearly a loser.

Another situation it can not deal with is:

time    backend 1   backend 2
----    ---------   ---------
0       write()
1       sleep()
2       fsync()     write()
3                   sleep()
4                   fsync()

My solution can't deal with this either.

---

The quick fix is to remove the commit_delay code.  A more elaborate
performance boost would be to have each backend get feedback from
other backends, so they can block and wait for other about-to-fsync
backends before calling fsync().  This allows the write()'s to bunch up
before the fsync().

Here is the single backend case, which experiences no delays:

time    backend 1   backend 2
----    ---------   ---------
0       get_shlock()
1       write()
2       rel_shlock()
3       get_exlock()
4       rel_exlock()
5       fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

time    backend 1       backend 2
----    ---------       ---------
0       get_shlock()
1       write()
2       rel_shlock()    get_shlock()
3       get_exlock()    write()
4                       rel_shlock()
5       rel_exlock()
6       fsync()         get_exlock()
7                       rel_exlock()
8                       fsync()

Contrast that with the first 2 backend case presented above:

time    backend 1   backend 2
----    ---------   ---------
0       write()
1       fsync()     write()
2                   fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.
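The bunching effect itself can be modeled in a few lines.  This is a toy, single-threaded model with invented names -- in a real multi-backend system every step below would run under a shared-memory lock, and none of this is PostgreSQL code:

```c
/* Toy model of grouped commits: each backend "writes" a WAL record,
 * then flushes only if nobody else's fsync() has already covered it. */
static long written_upto = 0;   /* highest record handed to write()   */
static long flushed_upto = 0;   /* highest record known to be on disk */
static long fsync_calls = 0;    /* how many (expensive) fsyncs we did */

long wal_write(void)
{
    return ++written_upto;       /* cheap: just a kernel-buffer copy */
}

void wal_flush(long my_record)
{
    if (flushed_upto < my_record) {
        fsync_calls++;               /* the expensive fsync() */
        flushed_upto = written_upto; /* one fsync covers every prior write */
    }
    /* else: somebody else's fsync already got our record to disk */
}
```

Three backends that all manage to write() before the first fsync() pay for one fsync between them, which is the whole point of the grouping.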

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Bruce Momjian

 Bruce Momjian [EMAIL PROTECTED] writes:
  A comment on microsecond delays using select().  Most Unix kernels run
  at 100 Hz, meaning that they have a programmable timer that interrupts
  the CPU every 10 milliseconds.
 
 Right --- this probably also explains my observation that some kernels
 seem to add an extra 10msec to the requested sleep time.  Actually
 they're interpreting a one-clock-tick select() delay as "wait till
 the next clock tick, plus one tick".  The actual delay will be between
 one and two ticks depending on just when you went to sleep.
 

The relevant BSDI code, in pselect(), is:

/*
 * If poll wait was tiny, this could be zero; we will
 * have to round it up to avoid sleeping forever.  If
 * we retry below, the timercmp above will get us out.
 * Note that if wait was 0, the timercmp will prevent
 * us from getting here the first time.
 */
timo = hzto(&atv);
if (timo == 0)
timo = 1;
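Modeling that rounding as plain arithmetic shows why a 5-microsecond commit_delay turns into a full clock tick.  This is illustrative only -- it is not the real hzto(), and HZ = 100 is assumed:

```c
/* Toy model of hzto()-style conversion from microseconds to clock
 * ticks, with the round-up-to-one-tick rule from the snippet above.
 * Assumes HZ = 100, i.e. one tick = 10 msec. */
#define HZ 100

int usec_to_ticks(long usec)
{
    int timo = (int) ((long long) usec * HZ / 1000000);

    if (timo == 0 && usec > 0)
        timo = 1;           /* tiny nonzero waits become one full tick */
    return timo;
}
```

So a requested 5-microsecond sleep is billed as one tick -- 10 msec, a 2000x inflation.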

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Bruce Momjian

 I have been thinking some more about the s_lock() delay loop in
 connection with this.  We currently have
 
 /*
  * Each time we busy spin we select the next element of this array as the
  * number of microseconds to wait. This accomplishes pseudo random back-off.
  * Values are not critical but 10 milliseconds is a common platform
  * granularity.
  *
  * Total time to cycle through all 20 entries might be about .07 sec,
  * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
  */
 #define S_NSPINCYCLE  20
 #define S_MAX_BUSY1000 * S_NSPINCYCLE
 
 int   s_spincycle[S_NSPINCYCLE] =
 { 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
   0, 1, 0, 0, 1, 0, 1, 0, 1, 1
 };
 
 Having read the select(2) man page more closely, I now realize that
 it is *defined* not to yield the processor when the requested delay
 is zero: it just checks the file ready status and returns immediately.

Actually, a kernel call is something.  On kernel call return, process
priorities are checked and the CPU may be yielded to a higher-priority
backend that perhaps just had its I/O completed.

I think the 0 and 1 are correct.  They would be zero ticks and one
tick.  You think 5000 and 1 would be better?  I can see that.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Re: beta5 ...

2001-02-17 Thread Bruce Momjian

  The easy fix is to just set the delay to zero.  Looks like that will fix
  most of the problem.
 
 Except that Vadim had a reason for setting it to 5, and I'm loath to see
 that changed unless someone actaully understands the ramifications other
 then increasing performance ...

See my post from a few minutes ago with an analysis of the purpose and
actual effect of Vadim's parameter.  I objected to the delay when it was
introduced because of my analysis, but Vadim's argument was that 5
microseconds is a very small delay, just enough to yield the CPU.  We now
see that it is much longer than that.

 
  The near-committers thing may indeed be overkill, and certainly is not
  worth holding beta.
 
 What is this 'near-committers thing'??

Other backends about to commit.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Having read the select(2) man page more closely, I now realize that
 it is *defined* not to yield the processor when the requested delay
 is zero: it just checks the file ready status and returns immediately.

 Actually, a kernel call is something.  On kernel call return, process
 priorities are checked and the CPU may be yielded to a higher-priority
 backend that perhaps just had its I/O completed.

So *if* some I/O just completed, the call *might* do what we need,
which is yield the CPU.  Otherwise we're just wasting cycles, and
will continue to waste them until we do a select with a nonzero
delay.  I propose we cut out the spinning and just do a nonzero delay
immediately.

 I think the 0 and 1 are correct.  They would be zero ticks and one
 tick.  You think 5000 and 1 would be better?  I can see that.

No, I am not suggesting that, because there is no difference between
5000 and 1.

All of this stuff probably ought to be replaced with a less-bogus
mechanism (POSIX semaphores maybe?), but not in late beta.

regards, tom lane



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Bruce Momjian

 So *if* some I/O just completed, the call *might* do what we need,
 which is yield the CPU.  Otherwise we're just wasting cycles, and
 will continue to waste them until we do a select with a nonzero
 delay.  I propose we cut out the spinning and just do a nonzero delay
 immediately.

Well, any backend with a higher priority would get run before the current
process.  The question is how that would happen.  I will say that
because of CPU cache issues, the system tries _not_ to switch processes
if the current one still needs the CPU, so the zero may be bogus.

 
  I think the 0 and 1 are correct.  They would be zero ticks and one
  tick.  You think 5000 and 1 would be better?  I can see that.
 
 No, I am not suggesting that, because there is no difference between
 5000 and 1.
 
 All of this stuff probably ought to be replaced with a less-bogus
 mechanism (POSIX semaphores maybe?), but not in late beta.

Good question.  We have sched_yield(), but that is a threads function, or
at least it lives only in the pthreads library.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 With the delay, it looks like:

 time    backend 1   backend 2
 ----    ---------   ---------
 0       write()
 1       sleep()     write()
 2       fsync()     sleep()
 3                   fsync()

Actually ... take a close look at the code.  The delay is done in
xact.c between XLogInsert(commitrecord) and XLogFlush().  As near
as I can tell, both the write() and the fsync() will happen in
XLogFlush().  This means the delay is just plain broken: placed
there, it cannot do anything except waste time.

Another thing I am wondering about is why we're not using fdatasync(),
where available, instead of fsync().  The whole point of preallocating
the WAL files is to make fdatasync safe, no?

regards, tom lane



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Tom Lane

I wrote:
 Actually ... take a close look at the code.  The delay is done in
 xact.c between XLogInsert(commitrecord) and XLogFlush().  As near
 as I can tell, both the write() and the fsync() will happen in
 XLogFlush().  This means the delay is just plain broken: placed
 there, it cannot do anything except waste time.

Uh ... scratch that ... nevermind.  The point is that we've inserted
our commit record into the WAL output buffer.  Now we are sleeping
in the hope that some other backend will do both the write and the
fsync for us, and that when we eventually call XLogFlush() it will find
nothing to do.  So the delay is not in the wrong place.

 Another thing I am wondering about is why we're not using fdatasync(),
 where available, instead of fsync().  The whole point of preallocating
 the WAL files is to make fdatasync safe, no?

This still looks like it'd be a win, by reducing the number of seeks
needed to complete a WAL logfile flush.  Right now, each XLogFlush
requires writing both the file's data area and its inode.

regards, tom lane



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Bruce Momjian

  Another thing I am wondering about is why we're not using fdatasync(),
  where available, instead of fsync().  The whole point of preallocating
  the WAL files is to make fdatasync safe, no?
 
 This still looks like it'd be a win, by reducing the number of seeks
 needed to complete a WAL logfile flush.  Right now, each XLogFlush
 requires writing both the file's data area and its inode.

Don't we have to fsync the inode too?  Actually, I was hoping sequential
fsync's could sit on the WAL disk track, but I can imagine it has to
seek around to hit both areas.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Another thing I am wondering about is why we're not using fdatasync(),
 where available, instead of fsync().  The whole point of preallocating
 the WAL files is to make fdatasync safe, no?

 Don't we have to fsync the inode too?  Actually, I was hoping sequential
 fsync's could sit on the WAL disk track, but I can imagine it has to
 seek around to hit both areas.

That's the point: we're trying to get things set up so that successive
writes/fsyncs in the WAL file do the minimum amount of seeking.  The WAL
code tries to preallocate the whole log file (incorrectly, but that's
easily fixed, see below) so that we should not need to update the file
metadata when we write into the file.

  I don't have fdatasync() here.  How does it compare to fsync()?

HPUX's man page says

: fdatasync() causes all modified data and file attributes of fildes
: required to retrieve the data to be written to disk.

: fsync() causes all modified data and all file attributes of fildes
: (including access time, modification time and status change time) to
: be written to disk.

The implication is that the only thing you can lose after fdatasync is
the highly-inessential file mod time.  However, I have been told that
on some implementations, fdatasync only flushes data blocks, and never
writes the inode or indirect blocks.  That would mean that if you had
allocated new disk space to the file, fdatasync would not guarantee
that that allocation was reflected on disk.  This is the reason for
preallocating the WAL log file (and doing a full fsync *at that time*).
Then you know the inode block pointers and indirect blocks are down
on disk, and so fdatasync is sufficient even if you have the cheesy
version of fdatasync.

Right now the WAL preallocation code (XLogFileInit) is not good enough
because it does lseek to the 16MB position and then writes 1 byte there.
On an implementation that supports holes in files (which is most Unixen)
that doesn't cause physical allocation of the intervening space.  We'd
have to actually write zeroes into all 16MB to ensure the space is
allocated ... but that's just a couple more lines of code.

regards, tom lane



Re: [HACKERS] Performance lossage in checkpoint dumping

2001-02-17 Thread The Hermit Hacker

On Sat, 17 Feb 2001, Bruce Momjian wrote:

   No, but I haven't looked at it.  I am now much more concerned with the
   delay, and am wondering if I should start thinking about trying my idea
   of looking for near-committers and post the patch to the list to see if
   anyone likes it for 7.1 final.  Vadim will not be back in enough time to
   write any new code in this area, I am afraid.
 
  Near committers? *puzzled look*

 Umm, uh, it means backends that have entered COMMIT and will be issuing
 an fsync() of their own very soon.  It took me a while to remember what
 I mean too because I was thinking of CVS committers.

That's what I was thinking too, which was what was confusing the hell out
of me ... like, a near committer ... is that the guy sitting beside you
while you commit? :)





Re: [HACKERS] Performance lossage in checkpoint dumping

2001-02-17 Thread The Hermit Hacker

On Sat, 17 Feb 2001, Tom Lane wrote:

 The Hermit Hacker [EMAIL PROTECTED] writes:
  No way to group the writes to you can keep the most recent one open?
  Don't see an easy way, do you?
 
  No, but I haven't looked at it.  I am now much more concerned with the
  delay,

 I concur.  The blind write business is not important enough to hold up
 the release for --- for one thing, it has nothing to do with the pgbench
 results we're seeing, because these tests don't run long enough to
 include any checkpoint cycles.  The commit delay, on the other hand,
 is a big problem.

  and am wondering if I should start thinking about trying my idea
  of looking for near-committers and post the patch to the list to see if
  anyone likes it for 7.1 final.  Vadim will not be back in enough time to
  write any new code in this area, I am afraid.

  Near committers? *puzzled look*

 Processes nearly ready to commit.  I'm thinking that any mechanism for
 detecting that might be overkill, however, especially compared to just
 setting commit_delay to zero by default.

 I've been sitting here running pgbench under various scenarios, and so
 far I can't find any condition where commit_delay>0 is materially better
 than commit_delay=0, even under heavy load.  It's either the same or
 much worse.  Numbers to follow...

Okay, if the whole commit_delay is purely meant as a performance thing,
I'd say go with lowering the default to zero for v7.1, and once Vadim gets
back, we can properly determine why it appears to improve performance in
his case ... since I believe his OS of choice is FreeBSD, and you
mentioned doing tests on it, I can't see how he'd have a finer-grained
select() than you have for testing ...





Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Bruce Momjian

 Right now the WAL preallocation code (XLogFileInit) is not good enough
 because it does lseek to the 16MB position and then writes 1 byte there.
 On an implementation that supports holes in files (which is most Unixen)
 that doesn't cause physical allocation of the intervening space.  We'd
 have to actually write zeroes into all 16MB to ensure the space is
 allocated ... but that's just a couple more lines of code.

Are OS's smart enough to not allocate zero-written blocks?  Do we need
to write non-zeros?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Larry Rosenman

* Bruce Momjian [EMAIL PROTECTED] [010217 14:46]:
  Right now the WAL preallocation code (XLogFileInit) is not good enough
  because it does lseek to the 16MB position and then writes 1 byte there.
  On an implementation that supports holes in files (which is most Unixen)
  that doesn't cause physical allocation of the intervening space.  We'd
  have to actually write zeroes into all 16MB to ensure the space is
  allocated ... but that's just a couple more lines of code.
 
 Are OS's smart enough to not allocate zero-written blocks?  Do we need
 to write non-zeros?
I don't believe so.  Writing zeros is valid.
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 853-3000
   +  If your life is a hard drive, |  830 Blythe Avenue
   +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026
-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Bruce Momjian

 * Bruce Momjian [EMAIL PROTECTED] [010217 14:46]:
   Right now the WAL preallocation code (XLogFileInit) is not good enough
   because it does lseek to the 16MB position and then writes 1 byte there.
   On an implementation that supports holes in files (which is most Unixen)
   that doesn't cause physical allocation of the intervening space.  We'd
   have to actually write zeroes into all 16MB to ensure the space is
   allocated ... but that's just a couple more lines of code.
  
  Are OS's smart enough to not allocate zero-written blocks?  Do we need
  to write non-zeros?
 I don't believe so.  Writing zeros is valid.

The reason I ask is because I know you get zeros when trying to read
data from a file with holes, so it seems some OS could actually drop
those blocks from storage.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Tom Lane

Larry Rosenman [EMAIL PROTECTED] writes:
 I've written swap files and such with:
 dd if=/dev/zero of=SWAPFILE bs=512 count=204800
 and all the blocks are allocated. 

I've also confirmed that writing zeroes is sufficient on HPUX (du
shows that the correct amount of space is allocated, unlike the
current seek-to-the-end method).

Some poking around the net shows that pre-2.4 Linux kernels implement
fdatasync() as fsync(), and we already knew that BSD hasn't got it
at all.  So distinguishing fdatasync from fsync won't be helpful for
very many people yet --- but I still think we should do it.  I'm
playing with a test setup in which I just changed pg_fsync to call
fdatasync instead of fsync, and on HPUX I'm seeing pgbench tps values
around 17, as opposed to 13 yesterday.  (The HPUX man page warns that
these calls are inefficient for large files, and I wouldn't be surprised
if a lot of the run time is now being spent in the kernel scanning
through all the buffers that belong to the logfile.  2.4 Linux is
apparently reasonably smart about this case, and only looks at the
actually dirty buffers.)

Is anyone out there running a 2.4 Linux kernel?  Would you try pgbench
with current sources, commit_delay=0, -B at least 1024, no -F, and see
how the results change when pg_fsync is made to call fdatasync instead
of fsync?  (It's in src/backend/storage/file/fd.c)

regards, tom lane



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Nathan Myers

On Sat, Feb 17, 2001 at 03:45:30PM -0500, Bruce Momjian wrote:
  Right now the WAL preallocation code (XLogFileInit) is not good enough
  because it does lseek to the 16MB position and then writes 1 byte there.
  On an implementation that supports holes in files (which is most Unixen)
  that doesn't cause physical allocation of the intervening space.  We'd
  have to actually write zeroes into all 16MB to ensure the space is
  allocated ... but that's just a couple more lines of code.
 
 Are OS's smart enough to not allocate zero-written blocks?  

No, but some disks are.  Writing zeroes is a bit faster on smart disks.
This has no real implications for PG, but it is one of the reasons that 
writing zeroes doesn't really wipe a disk, for forensic purposes.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] WAL and commit_delay

2001-02-17 Thread Dominic J. Eidson

On Sat, 17 Feb 2001, Tom Lane wrote:

 Another thing I am wondering about is why we're not using fdatasync(),
 where available, instead of fsync().  The whole point of preallocating
 the WAL files is to make fdatasync safe, no?

Linux/x86 fdatasync(2) manpage:

BUGS
   Currently (Linux 2.0.23) fdatasync is equivalent to fsync.


-- 
Dominic J. Eidson
"Baruk Khazad! Khazad ai-menu!" - Gimli
---
http://www.the-infinite.org/  http://www.the-infinite.org/~dominic/




[HACKERS] Linux 2.2 vs 2.4

2001-02-17 Thread Matthew Kirkwood

Hi,

Not sure if anyone will find this of interest, but I ran
pgbench on my main Linux box to see what sort of performance
difference might be visible between 2.2 and 2.4 kernels.

Hardware: A dual P3-450 with 384Mb of RAM and 3 SCSI disks.
The pg datafiles live in a half-gig partition on the first
one.

Software: Red Hat 6.1 plus all sort of bits and pieces.
PostgreSQL 7.1beta4 RPMs.  pgbench hand-compiled from source
for same.  No options changed from defaults.  (I'll look at
that tomorrow -- is there anything worth changing other than
commit_delay and fsync?)

Kernels: 2.2.15 + software RAID patches, 2.4.2-pre2

With 2.2.15:
pgbench -s5 -i: 1:27.78 elapsed
pgbench -s5 -t100:
clients: TPS / TPS (excluding connection establishment)
1: 39.66 / 40.08 TPS
2: 60.77 / 61.64 TPS
4: 76.15 / 77.42
8: 90.99 / 92.73
16: 71.10 / 72.15
32: 49.20 / 49.70
1: 27.76 / 28.00
1: 27.82 / 28.03

pgbench -v -s5 -t100:
1: 30.73 / 30.98


And with 2.4.2-pre2:
pgbench -s5 -i: 1:17.46 elapsed
pgbench -s5 -t100
1: 43.57 / 44.11 TPS
2: 62.85 / 63.86 TPS
4: 87.24 / 89.08 TPS
8: 86.60 / 88.38 TPS
16: 53.22 / 53.88 TPS
32: 60.28 / 61.10 TPS
1: 35.93 / 36.33
1: 34.82 / 35.18

pgbench -v -s5 -t100:
1: 35.70 / 36.01


Overall, two things jump out at me.

Firstly, it looks like 2.4 is mixed news for heavy pgbench users
:)  Low-utilisation numbers are better, but the sweet spot seems
lower and narrower.

Secondly, in both occasions after a run, performance has been
more than 20% lower.  Restarting or performing a full vacuum does
not seem to help.  Is there some sort of fragmentation issue
here?

Matthew.




Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Nathan Myers

On Sat, Feb 17, 2001 at 12:26:31PM -0500, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  A comment on microsecond delays using select().  Most Unix kernels run
  at 100Hz, meaning that they have a programmable timer that interrupts
  the CPU every 10 milliseconds.
 
 Right --- this probably also explains my observation that some kernels
 seem to add an extra 10msec to the requested sleep time.  Actually
 they're interpreting a one-clock-tick select() delay as "wait till
 the next clock tick, plus one tick".  The actual delay will be between
 one and two ticks depending on just when you went to sleep.
 ...
 In short: s_spincycle in its current form does not do anything anywhere
 near what the author thought it would.  It's wasted complexity.
 
 I am thinking about simplifying s_lock_sleep down to simple
 wait-one-tick-on-every-call logic.  An alternative is to keep
 s_spincycle, but populate it with, say, 1, 2 and larger entries,
 which would offer some hope of actual random-backoff behavior.
 Either change would clearly be a win on single-CPU machines, and I doubt
 it would hurt on multi-CPU machines.
 
 Comments?

I don't believe that most kernels schedule only on clock ticks.
They schedule on a clock tick *or* whenever the process yields, 
which on a loaded system may be much more frequently.

The question is whether, when scheduling, the kernel considers processes
that have requested to sleep less than a clock tick as "ready" once
their actual request time expires.  On V7 Unix, the answer was no, 
because the kernel had no way to measure any time shorter than a
tick, so it rounded up all sleeps to "the next tick".

Certainly there are machines and kernels that count time more precisely 
(isn't PG ported to QNX?).  We do users of such kernels no favors by 
pretending they only count clock ticks.  Furthermore, a 1ms clock
tick is pretty common, e.g. on Alpha boxes.  A 10ms initial delay is 
ten clock ticks, far longer than seems appropriate.

This argues for yielding the minimum discernable amount of time (1us)
and then backing off to a less-minimal time (1ms).  On systems that 
chug at 10ms, this is equivalent to a sleep of up-to-10ms (i.e. until 
the next tick), then a sequence of 10ms sleeps; on dumbOS Alphas, it's 
equivalent to a sequence of 1ms sleeps; and on a smartOS on an Alpha it's 
equivalent to a short, variable time (long enough for other runnable 
processes to run and yield) followed by a sequence of 1ms sleeps.  
(Some of the numbers above are doubled on really dumb kernels, as
Tom noted.)

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] Re: WAL and commit_delay

2001-02-17 Thread Nathan Myers

On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
 On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
 
 [snipped]
 
 | Is anyone out there running a 2.4 Linux kernel?  Would you try pgbench
 | with current sources, commit_delay=0, -B at least 1024, no -F, and see
 | how the results change when pg_fsync is made to call fdatasync instead
 | of fsync?  (It's in src/backend/storage/file/fd.c)
 
 I've not run this requested test, but glibc-2.2 provides this bit
 of code for fdatasync, so it /appears/ to me that kernel version
 will not affect the test case.
 
 [glibc-2.2/sysdeps/generic/fdatasync.c]
 
   int
   fdatasync (int fildes)
   {
   return fsync (fildes);
   }

In the 2.4 kernel it says (fs/buffer.c)

   /* this needs further work, at the moment it is identical to fsync() */
   down(inode->i_sem);
   err = file->f_op->fsync(file, dentry);
   up(inode->i_sem);

We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
well before 2.6.

This is moot, though, if you're writing to a raw volume, which
you will be if you are really serious.  Then, fsync really is 
equivalent to fdatasync.

Nathan Myers
[EMAIL PROTECTED]



[HACKERS] Re: Re: WAL and commit_delay

2001-02-17 Thread Brent Verner

On 17 Feb 2001 at 15:53 (-0800), Nathan Myers wrote:
| On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
|  On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
|  
|  [snipped]
|  
|  | Is anyone out there running a 2.4 Linux kernel?  Would you try pgbench
|  | with current sources, commit_delay=0, -B at least 1024, no -F, and see
|  | how the results change when pg_fsync is made to call fdatasync instead
|  | of fsync?  (It's in src/backend/storage/file/fd.c)
|  
|  I've not run this requested test, but glibc-2.2 provides this bit
|  of code for fdatasync, so it /appears/ to me that kernel version
|  will not affect the test case.
|  
|  [glibc-2.2/sysdeps/generic/fdatasync.c]
|  
|int
|fdatasync (int fildes)
|{
|return fsync (fildes);
|}
| 
| In the 2.4 kernel it says (fs/buffer.c)
| 
|/* this needs further work, at the moment it is identical to fsync() */
|down(inode->i_sem);
|err = file->f_op->fsync(file, dentry);
|up(inode->i_sem);
|
| We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
| well before 2.6.

2.4.0-ac11 already has provisions for fdatasync 

[fs/buffer.c]

  352 asmlinkage long sys_fsync(unsigned int fd)
  353 {
  ...
  372   down(inode->i_sem);
  373   filemap_fdatasync(inode->i_mapping);
  374   err = file->f_op->fsync(file, dentry, 0);
  375   filemap_fdatawait(inode->i_mapping);
  376   up(inode->i_sem);

  384 asmlinkage long sys_fdatasync(unsigned int fd)
  385 {
  ...
  403   down(inode->i_sem);
  404   filemap_fdatasync(inode->i_mapping);
  405   err = file->f_op->fsync(file, dentry, 1);
  406   filemap_fdatawait(inode->i_mapping);
  407   up(inode->i_sem);

ext2 does use this third param of its fsync() operation to (potentially)
bypass a call to ext2_sync_inode(inode).

  b




Re: [HACKERS] Linux 2.2 vs 2.4

2001-02-17 Thread Tom Lane

Matthew Kirkwood [EMAIL PROTECTED] writes:
 No options changed from defaults.  (I'll look at
 that tomorrow -- is there anything worth changing other than
 commit_delay and fsync?)

-B for sure ... the default -B is way too small for WAL.

 Firstly, it looks like 2.4 is mixed news for heavy pgbench users
 :)  Low-utilisation numbers are better, but the sweet spot seems
 lower and narrower.

Huh?  With the exception of the 16-user case (possibly measurement
noise), 2.4 looks better across the board, AFAICS.  But see below.

 Secondly, in both occasions after a run, performance has been
 more than 20% lower.

I find that pgbench's reported performance can vary quite a bit from run
to run, at least with smaller values of total transactions.  I think
this is because it's a bit of a crapshoot how many WAL logfile
initializations occur during the run and get charged against the total
time.  Not to mention whatever else the machine might be doing.  With
longer runs (say at least 10000 total transactions) the numbers should
stabilize.  I wouldn't put any faith at all in tests involving less
than about 1000 total transactions...

regards, tom lane



Re: [HACKERS] Microsecond sleeps with select()

2001-02-17 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 Certainly there are machines and kernels that count time more precisely 
 (isn't PG ported to QNX?).  We do users of such kernels no favors by 
 pretending they only count clock ticks.  Furthermore, a 1ms clock
 tick is pretty common, e.g. on Alpha boxes.

Okay, I didn't know there were any popular systems that did that.

 This argues for yielding the minimum discernable amount of time (1us)
 and then backing off to a less-minimal time (1ms).

Fair enough.  As you say, it's the same result on machines with coarse
time resolution, and it should help on smarter boxes.  The main thing
is that I want to change the zero entries in s_spincycle, which
clearly aren't doing what the author intended.

regards, tom lane



Re: [HACKERS] Re: WAL and commit_delay

2001-02-17 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 In the 2.4 kernel it says (fs/buffer.c)

/* this needs further work, at the moment it is identical to fsync() */
down(inode->i_sem);
err = file->f_op->fsync(file, dentry);
up(inode->i_sem);

Hmm, that's the same code that's been there since 2.0 or before.
I had trawled the Linux kernel mail lists and found patch submissions
from several different people to make fdatasync really work, and what
I thought was an indication that at least one had been applied.
Evidently not.  Oh well...

regards, tom lane



Re: [HACKERS] Re: WAL and commit_delay

2001-02-17 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 I.e. yes, Linux 2.4.0 and ext2 do implement the distinction.
 Sorry for the misinformation.

Okay ... meanwhile I've got to report the reverse: I've just confirmed
that on HPUX 10.20, there is *not* a distinction between fsync and
fdatasync.  I was misled by what was apparently an outlier result on my
first try with fdatasync plugged in ... but when I couldn't reproduce
that, some digging led to the fact that the fsync and fdatasync symbols
in libc are at the same place :-(.

Still, using fdatasync for the WAL file seems like a forward-looking
thing to do, and it'll just take another couple of lines of configure
code, so I'll go ahead and plug it in.

regards, tom lane