Re: [HACKERS] Scalable postgresql using sys_epoll

2004-03-10 Thread Matthew Kirkwood
On Wed, 10 Mar 2004, Shachar Shemesh wrote:

 IBM has rewritten their Domino database system to use the new
 sys_epoll call available in the Linux 2.6 kernel.
 
 Would Postgresql benefit from using this API? Is anyone looking at
 this?

 I'm not familiar enough with the postgres internals, but is using
 libevent (http://monkey.org/~provos/libevent/) an option? It uses a
 state-triggered, rather than edge-triggered, interface, and it automatically
 selects the best API for the job (epoll, poll, select). I'm not sure
 whether it's available for all the platforms postgres is available for.

libevent is cool, but postgres uses a process-per-client
model, so the number of file descriptors of active interest
to a backend at any given time is low.

Matthew.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Named arguments in function calls

2004-01-26 Thread Matthew Kirkwood
On Mon, 26 Jan 2004, Tom Lane wrote:

  If that was IS, then foo(x is 13) makes sense.

  I like that syntax.  For example
  select interest(amount is 500.00, rate is 1.3)
  is very readable, yet brief.

 On second thought though, it doesn't work.

   select func(x is null);

 is ambiguous, especially if func() accepts boolean.

You're unlikely to care, but Oracle's syntax is Perlish:

select interest(amount => 500.0, rate => 1.3);

That'd be ambiguous again, though.  Perhaps:

select interest(amount := 500.0, rate := 1.3);

?

Matthew.



Re: [HACKERS] Preventing stack-overflow crashes (improving on

2003-12-31 Thread Matthew Kirkwood
On Wed, 31 Dec 2003, Tom Lane wrote:

  Is ABS enough on a 64-bit architecture ?

 That was pseudocode, I wasn't actually planning to rely on a function.
 Something more like

   long diff;

FWIW, ISO has a ptrdiff_t, which may be useful here.
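
A minimal sketch of the quoted check with ptrdiff_t holding the difference. The function name and the max_bytes parameter are invented for illustration. Strictly speaking, ISO C only defines subtraction for pointers into the same object, so comparing two unrelated stack addresses like this relies on implementation behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether the distance between the recorded
 * stack base and the current stack position exceeds max_bytes.  Taking the
 * absolute value makes it work whichever way the stack grows.
 */
bool
stack_depth_exceeded(const char *stack_base_ptr, const char *stack_top_loc,
                     ptrdiff_t max_bytes)
{
    ptrdiff_t   diff;

    assert(max_bytes >= 0);
    diff = stack_base_ptr - stack_top_loc;
    if (diff < 0)
        diff = -diff;
    return diff > max_bytes;
}
```

ptrdiff_t is as wide as a pointer difference needs to be on the platform, which is the point of using it instead of long.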

Matthew.

   diff = stack_base_ptr - stack_top_loc;
   if (diff < 0)
   diff = -diff;
   if (diff > max)
   elog ...

   regards, tom lane






Re: [HACKERS] Pre-allocation of shared memory ...

2003-06-14 Thread Matthew Kirkwood
On Sat, 14 Jun 2003, Andrew Dunstan wrote:

 The trouble with this advice is that if I am an SA wanting to run a
 DBMS server, I will want to run a kernel supplied by a vendor, not an
 arbitrary kernel released by a developer, even one as respected as
 Alan Cox.

Like, say, Red Hat:

$ ls -l /proc/sys/vm/overcommit_memory
-rw-r--r--    1 root     root            0 Jun 14 18:58 /proc/sys/vm/overcommit_memory
$ uname -a
Linux stinky.hoopy.net 2.4.20-20.1.1995.2.2.nptl #1 Fri May 23 12:18:31 EDT 2003 i686 i686 i386 GNU/Linux

(This is a Rawhide kernel, but I think that control has been
in stock RH kernels for some time now.)

Matthew.




Re: [HACKERS] Pre-allocation of shared memory ...

2003-06-14 Thread Matthew Kirkwood
On Sat, 14 Jun 2003, Kurt Roeckx wrote:

  $ ls -l /proc/sys/vm/overcommit_memory
  -rw-r--r--    1 root     root            0 Jun 14 18:58 /proc/sys/vm/overcommit_memory
  $ uname -a
  Linux stinky.hoopy.net 2.4.20-20.1.1995.2.2.nptl #1 Fri May 23 12:18:31 EDT 2003 i686 i686 i386 GNU/Linux

 I also got that /proc/sys/vm/overcommit_memory on a plain 2.4.21.

This might also be interesting:

http://www.cs.helsinki.fi/linux/linux-kernel/2002-33/0826.html

I couldn't say how much of it is in the stock RH kernels,
or how successful the heuristic is.

Matthew.




Re: [HACKERS] Suggestion; WITH VACUUM option

2002-12-17 Thread Matthew Kirkwood
On Tue, 17 Dec 2002, mlw wrote:

 update largetable set foo=bar;

 Let's also assume that largetable has tens of millions of rows.
[..]
 On some of my databases a statement which updates all the rows is
 unworkable in PostgreSQL; on Oracle, however, there is no problem.

.. provided you have a lot of rollback space, which is
essentially what the datafile growth here is providing.

Matthew.





Re: [HACKERS] HEADS UP: Win32/OS2/BeOS native ports

2002-05-07 Thread Matthew Kirkwood

On Mon, 6 May 2002, Tom Lane wrote:

  As a backend is started up, connect to that socket ... if socket is open
  when trying to start a new frontend, fail as there are currently other
  connections attached to it?

 But the backends would only have the socket open, they'd not be
 actively listening to it.  So how could you tell whether anyone
 had the socket open or not?

It's easy.  At startup, the postmaster (or standalone
backend) creates a Unix socket, binds it to the filename
and calls listen on it.

If another backend is running, it'll get EADDRINUSE from
the bind or listen.

Nobody actually needs to connect to the socket.  Simple,
race-free, 10 lines of code.
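
Roughly what those ten lines might look like, as a sketch only. The function name is invented and nothing here is taken from the postmaster source. One caveat: the socket file normally stays on disk after the owning process exits, so a stale file also produces EADDRINUSE. Telling a stale file from a live owner is not handled here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind and listen on a Unix-domain socket at "path".
 * Returns the listening descriptor, or -1 with errno set; errno is
 * EADDRINUSE when something already occupies the path.
 */
int
acquire_socket_lock(const char *path)
{
    struct sockaddr_un addr;
    int         fd;
    int         saved_errno;

    assert(path != NULL);
    if (strlen(path) >= sizeof(addr.sun_path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);    /* length checked above */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0)
    {
        saved_errno = errno;        /* close() may overwrite errno */
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;                      /* keep open for the process lifetime */
}
```

The caller never accepts connections on the descriptor; it only has to keep it open.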

Matthew.





Re: [HACKERS] HEADS UP: Win32/OS2/BeOS native ports

2002-05-07 Thread Matthew Kirkwood

On Tue, 7 May 2002, Tom Lane wrote:

  Nobody actually needs to connect to the socket.  Simple,
  race-free, 10 lines of code.

 ... and we already do it.  But it protects the port number, not
 the data directory.

If I understood him correctly, Marc was suggesting a further
domain socket inside the data directory.

Matthew.





Re: [HACKERS] HEADS UP: Win32/OS2/BeOS native ports

2002-05-04 Thread Matthew Kirkwood

On Fri, 3 May 2002, Tom Lane wrote:

 But what we must *not* do is allow a new postmaster to start while the
 old backends are still running; that would mean two sets of backends
 running without contact with each other, which would be fatal for data
 integrity.  The SysV API lets us detect that case, but I don't see any
 equally good way to do it if we are using anonymous shared memory.

It's a hack (and has slight security implications), but you
could just allow the postgres backends to keep the listening
socket(s) open.

Matthew.





Re: [HACKERS] Bitmap indexes?

2002-03-19 Thread Matthew Kirkwood

On Tue, 19 Mar 2002, Oleg Bartunov wrote:

Sorry to reply over you, Oleg.

 On 13 Mar 2002, Greg Copeland wrote:

  One of the reasons why I originally started following the hackers list is
  because I wanted to implement bitmap indexes.  I found in the archives,
  the following link, http://www.it.iitb.ernet.in/~rvijay/dbms/proj/, which
  was extracted from this,
  
http://groups.google.com/groups?hl=en&threadm=01C0EF67.5105D2E0.mascarm%40mascari.com&rnum=1&prev=/groups%3Fq%3Dbitmap%2Bindex%2Bgroup:comp.databases.postgresql.hackers%26hl%3Den%26selm%3D01C0EF67.5105D2E0.mascarm%2540mascari.com%26rnum%3D1,
 archive thread.

For every case I have used a bitmap index on Oracle, a
partial index[0] made more sense (especially since it
could usefully be compound).

Our troublesome case (on Oracle) is a table of events
where maybe fifty to a couple of hundred are published
(ie. web-visible) at any time.  The events are categorised
by sport (about a dozen) and by event type (about five).
We never really query events except by PK or by sport/type/
published.

We make a bitmap index on published, and trust Oracle to
use it correctly, and hope that our other indexes are also
useful.

On Postgres[1] we would make a partial compound index:

create index ... on events(sport_id,event_type_id)
where published='Y';

Matthew.

[0] Is this a postgres-only feature?  My tame Oracle and
Sybase DBAs had never heard of such a thing, but
were rather impressed at the idea.
[1] Disclaimer.  Our system doesn't run on PG, though I
do have a nearly equivalent prototype system which
does.  I'd love to hear any success (or otherwise)
stories about PG partial indexes.





Re: [HACKERS] Survey results on Oracle/M$NT4 to PG72/RH72 migration

2002-03-14 Thread Matthew Kirkwood

On Thu, 14 Mar 2002, Jean-Paul ARGUDO wrote:

 This daemon wakes up every 5 seconds. It scans (SELECT...) for new
 inserts in a table (like a trigger). When new tuples are found, it
 launches the work. The work consists in computing total sales of a big
 store...

You might find it worthwhile to investigate listen and
notify -- combined with a rule or trigger, you can get
this effect in near-real-time.

You'll probably still want a sleep(5) at the end of the
loop so you can batch a reasonable number of updates if
there's a lot going on.

Matthew.





Re: [HACKERS] anoncvs and CVS link off developers.postgresql.org

2001-10-06 Thread Matthew Kirkwood

On Sat, 6 Oct 2001, Larry Rosenman wrote:

 If I try:
 cvs -d :pserver:[EMAIL PROTECTED]:/cvsroot login
 I get a time out

Moi aussi.  I can't reach www.postgresql.org either.

It doesn't obviously seem to be a routing problem.

Matthew.





Re: [HACKERS] Notes about int8 sequences

2001-08-07 Thread Matthew Kirkwood

On Mon, 6 Aug 2001, Tom Lane wrote:

 * How should one invoke nextval() and friends on such a sequence?

 Perhaps we could allow people to write nextval(sequencename) and/or
 sequencename.nextval, which would expose the sequence object to the
 parser so that datatype overloading could occur.

I'm not worried about the size of the return type of
a sequence, but I like the idea of Oracle-compatible
seq.nextval syntax.

Matthew.





Re: [HACKERS] Notes about int8 sequences

2001-08-07 Thread Matthew Kirkwood

On Tue, 7 Aug 2001, Tom Lane wrote:

  I'm not worried about the size of the return type of
  a sequence, but I like the idea of Oracle-compatible
  seq.nextval syntax.

 I didn't realize we had any Oracle-compatibility issues here.  What
 exactly does Oracle's sequence facility look like?

It's exactly seqname.nextval.  It seems that it
can be used in exactly the places where PG allows
nextval(seqname) (subject to the usual sprinkling
of from duals, of course).

Matthew.





Re: [HACKERS] Performance TODO items

2001-07-31 Thread Matthew Kirkwood

On Mon, 30 Jul 2001, Bruce Momjian wrote:

 * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
   spinning to obtain lock on multi-cpu systems

You may be interested in a discussion which happened over on
linux-kernel a few months ago.

Quite a lot of people want a lightweight userspace semaphore,
and for pretty much the same reasons.

Linus proposed a pretty interesting solution which has the
same minimal overhead as the current spinlocks in the non-
contention case, but avoids the spin where there's contention:

http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html
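
For readers who do not follow the link, here is a sketch of the general idea: no system call unless the lock is contended. This is not the code from the linux-kernel thread. It is the simpler, well-known arrangement of an atomic counter in front of a semaphore, and the light_lock names are invented. It uses a process-private semaphore, so it only covers threads; backends would need a process-shared semaphore in shared memory.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type: counter = current holder plus waiters. */
typedef struct
{
    atomic_int  count;
    sem_t       sem;        /* only touched when there is contention */
} light_lock;

int
light_lock_init(light_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->count, 0);
    return sem_init(&l->sem, 0, 0);     /* 0 = not shared between processes */
}

void
light_lock_acquire(light_lock *l)
{
    /* Uncontended case: one atomic add, no kernel entry. */
    if (atomic_fetch_add(&l->count, 1) > 0)
    {
        /* Somebody holds the lock: sleep instead of spinning. */
        while (sem_wait(&l->sem) != 0)
            continue;                   /* retry if interrupted */
    }
}

void
light_lock_release(light_lock *l)
{
    /* Wake exactly one sleeper, and only if there is one. */
    if (atomic_fetch_sub(&l->count, 1) > 1)
        sem_post(&l->sem);
}
```

Because the semaphore counts posts, a release that arrives before the waiter has gone to sleep is not lost.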

Matthew.





Re: [HACKERS] Re: New Linux xfs/reiser file systems

2001-05-03 Thread Matthew Kirkwood

On Thu, 3 May 2001, mlw wrote:

 I would bet it is a huge amount of work to use a table space system
 and no one wants that.

From some stracing of 7.1, the most common syscall issued by
postgres is an lseek() to the end of the file, presumably to
find its length, which seems to happen up to about a dozen
times per (pgbench) transaction.
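
For reference, the idiom being described is just a seek to end-of-file, with the returned offset being the length. A standalone sketch follows; the function name is invented. It opens the file itself for the sake of a self-contained example, whereas the backend presumably seeks on a descriptor it already holds.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the length of the file at "path" in bytes,
 * found by seeking to its end, or -1 if it cannot be opened or seeked.
 */
off_t
file_length_by_lseek(const char *path)
{
    int         fd;
    off_t       len;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return (off_t) -1;
    len = lseek(fd, 0, SEEK_END);   /* offset of EOF == file length */
    close(fd);
    return len;
}
```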

Tablespaces would solve this (not that lseek is a particularly
expensive operation, of course).

 Perhaps we can convince the Linux community to create a dbfs which
 is a stripped down simple no nonsense file system designed for
 applications like databases?

Sync-metadata ext2 should be fine.  Filesystems fsck pretty
quick when they contain only a few large files.

Otherwise, something like smugfs (now obsolete) might do.

Matthew.





[HACKERS] Archived redo logs / Managed recovery mode?

2001-04-27 Thread Matthew Kirkwood

Hi,

Firstly, the attached patch implements archiving of off-
line redo logs, via the wal_archive_dir GUC option.  It
builds and appears to work (though it looks like guc-file.l
has some problems with unquoted strings containing slashes).


TODO: handle EXDEV from link/rename, and copy rather
than renaming.
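
One way that TODO could be handled, sketched outside the backend: try link() first and fall back to a byte copy when the archive directory is on another filesystem. The names archive_log and copy_file are invented, and the error handling is illustrative rather than what the patch would need with elog().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: copy src to a new file dst; 0 on success, -1 on error. */
int
copy_file(const char *src, const char *dst)
{
    char        buf[8192];
    ssize_t     nread;
    int         in, out, rc = 0, saved_errno;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (out < 0)
    {
        saved_errno = errno;
        close(in);
        errno = saved_errno;
        return -1;
    }
    while (rc == 0 && (nread = read(in, buf, sizeof(buf))) > 0)
    {
        char       *p = buf;

        while (nread > 0)           /* write() may be partial */
        {
            ssize_t     nwritten = write(out, p, (size_t) nread);

            if (nwritten < 0)
            {
                rc = -1;
                break;
            }
            p += nwritten;
            nread -= nwritten;
        }
    }
    if (rc == 0 && nread < 0)       /* read() failed */
        rc = -1;
    if (rc == 0 && fsync(out) < 0)  /* make the copy durable before unlink */
        rc = -1;
    saved_errno = errno;
    close(in);
    if (close(out) < 0 && rc == 0)
    {
        rc = -1;
        saved_errno = errno;
    }
    if (rc < 0)
    {
        unlink(dst);                /* do not leave a partial archive file */
        errno = saved_errno;
    }
    return rc;
}

/* Hypothetical helper: move a log segment into the archive directory. */
int
archive_log(const char *path, const char *arcpath)
{
    assert(path != NULL && arcpath != NULL);
    if (link(path, arcpath) < 0)
    {
        if (errno != EXDEV)         /* only cross-device errors fall back */
            return -1;
        if (copy_file(path, arcpath) < 0)
            return -1;
    }
    return unlink(path);
}
```

The source is only removed once the link or the synced copy has succeeded, so a failure part-way leaves the original segment in place.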


Clearly this isn't a lot of use at the moment, but what I'd
really like would be a way to implement what our (Oracle)
DBA calls managed recovery.

Essentially, the standby database is opened in read-only
mode (since PG seems to lack this, having it not open at
all should suffice :) and archived redo logs are copied
over from the live database (we do it via rsync, every 5
minutes) and rolled forward.

(Note: for what it's worth, we're using this because
Oracle's Advanced Replication is too unstable.)


Is there an easy way to do this?  I suppose that while
there isn't a readonly option, it might be best done with
an external tool, not unlike resetxlog.

What are the plans for replication in 7.2 (assuming that
is what's next)?  The rserv stuff looks neat, but rather
intricate.  A cheap, out-of-band replication system would
make me very happy.

Matthew.


Index: src/backend/access/transam/xlog.c
===
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.65
diff -u -r1.65 xlog.c
--- src/backend/access/transam/xlog.c   2001/04/05 16:55:21 1.65
+++ src/backend/access/transam/xlog.c   2001/04/27 14:49:44
@@ -97,7 +97,7 @@
 int         XLOG_DEBUG = 0;
 char       *XLOG_sync_method = NULL;
 const char  XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
-char        XLOG_archive_dir[MAXPGPATH];    /* null string means
+char       *XLOG_archive_dir = NULL;        /* null string means
                                              * delete 'em */
 
 /* these are derived from XLOG_sync_method by assign_xlog_sync_method */
@@ -1476,9 +1476,7 @@
     DIR        *xldir;
     struct dirent *xlde;
     char        lastoff[32];
-    char        path[MAXPGPATH];
-
-    Assert(XLOG_archive_dir[0] == 0);   /* ! implemented yet */
+    char        path[MAXPGPATH], arcpath[MAXPGPATH];
 
     xldir = opendir(XLogDir);
     if (xldir == NULL)
@@ -1493,11 +1491,25 @@
             strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
             strcmp(xlde->d_name, lastoff) <= 0)
         {
-            elog(LOG, "MoveOfflineLogs: %s %s", (XLOG_archive_dir[0]) ?
+            elog(LOG, "MoveOfflineLogs: %s %s", XLOG_archive_dir ?
                  "archive" : "remove", xlde->d_name);
             sprintf(path, "%s%c%s", XLogDir, SEP_CHAR, xlde->d_name);
-            if (XLOG_archive_dir[0] == 0)
+            if (XLOG_archive_dir == NULL)
                 unlink(path);
+            else {
+                sprintf(arcpath, "%s%c%s", XLOG_archive_dir, SEP_CHAR, xlde->d_name);
+#ifndef __BEOS__
+                if (link(path, arcpath) < 0)
+                    elog(STOP, "MoveOfflineLogs: %s => %s failed: %m",
+                         path, arcpath);
+                else
+                    unlink(path);
+#else
+                if (rename(path, arcpath) < 0)
+                    elog(STOP, "MoveOfflineLogs: %s => %s failed: %m",
+                         path, arcpath);
+#endif
+            }
         }
         errno = 0;
     }
Index: src/backend/utils/misc/guc.c
===
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.35
diff -u -r1.35 guc.c
--- src/backend/utils/misc/guc.c2001/03/22 17:41:47 1.35
+++ src/backend/utils/misc/guc.c2001/04/27 14:49:48
@@ -13,6 +13,9 @@
 
 #include "postgres.h"
 
+#include <sys/types.h>
+#include <sys/stat.h>
+
 #include <errno.h>
 #include <float.h>
 #include <limits.h>
@@ -41,6 +44,8 @@
 extern int CommitSiblings;
 extern bool FixBTree;
 
+static bool check_dirname(const char *dirname);
+
 #ifdef ENABLE_SYSLOG
 extern char *Syslog_facility;
 extern char *Syslog_ident;
@@ -351,6 +356,9 @@
     XLOG_sync_method_default,
     check_xlog_sync_method, assign_xlog_sync_method},
 
+    {"wal_archive_dir", PGC_SUSET, &XLOG_archive_dir,
+    "", check_dirname, NULL},
+
     {NULL, 0, NULL, NULL, NULL, NULL}
 };
 
@@ -869,6 +877,17 @@
         *cp = '_';
 }
 
+
+static bool

Re: [HACKERS] RE: xlog checkpoint depends on sync() ... seems unsafe

2001-03-13 Thread Matthew Kirkwood

On Tue, 13 Mar 2001, Tom Lane wrote:

  I was told the same a long ago about FreeBSD. How much can we count on
  this undocumented sync() feature?
 
 Sounds quite unreliable to me.  Unless there's some interlock ...
 like, say, the second sync not being able to advance past a buffer
 page that's as yet unwritten by the first sync.  But would all Unixen
 share such a strange detail of implementation?

The Linux manpage says:

NAME
   sync - commit buffer cache to disk.
[..]

DESCRIPTION
   sync  first commits inodes to buffers, and then buffers to
   disk.
[..]

CONFORMING TO
   SVr4, SVID, X/OPEN, BSD 4.3

BUGS
   According to  the  standard  specification  (e.g.,  SVID),
   sync()  schedules  the  writes,  but may return before the
   actual writing is done.   However,  since  version  1.3.20
   Linux  does actually wait.  (This still does not guarantee
   data integrity: modern disks have large caches.)


And it's still true.  On a fast system, if you do:

$ cp /dev/zero /tmp & sleep 1; sync

the sync will often never finish.  (Of course, that's
just an implementation detail really.)

Matthew.





Re: [HACKERS] WAL SHM principles

2001-03-13 Thread Matthew Kirkwood

On Tue, 13 Mar 2001, Ken Hirsch wrote:

  mlock() guarantees that the locked address space is in memory.  This
  doesn't imply that updates are not written to the backing file.

 I've wondered about this myself.  It _is_ true on Linux that mlock
 prevents writes to the backing store,

I don't believe that this is true.  The manpage offers no
such promises, and the semantics are not useful.

 and this is used as a security feature for cryptography software.

mlock() is used to prevent pages being swapped out.  Its
use for crypto software is essentially restricted to anon
memory (allocated via brk() or mmap() of /dev/zero).

If my understanding is accurate, before 2.4 Linux would
never swap out pages which had a backing store.  It would
simply write them back or drop them (if clean).  (This is
why you need around twice as much swap with 2.4.)

 The code for gnupg assumes that if you have mlock() on any operating
 system, it does mean this--which doesn't mean it's true, but perhaps
 whoever wrote it does have good reason to think so.

strace on gpg startup says:

mmap(0, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
getuid()                                = 500
mlock(0x40015000)                       = -1 EPERM (Operation not permitted)

so whatever the authors think, it does not require this semantic.

Matthew.





Re: [HACKERS] WAL SHM principles

2001-03-13 Thread Matthew Kirkwood

On Tue, 13 Mar 2001, Alfred Perlstein wrote:

[..]
 Linux does not filesystem-sync file-backed writable mmap pages on a
 regular basis.

Very interesting.  I'm not sure that is necessarily the case in
2.4, though -- my understanding is that the new all-singing,
all-dancing page cache makes very little distinction between
mapped and unmapped dirty pages.

 Basically any mmap'd data doesn't seem to get sync()'d out on
 a regular basis.

Hmm.. I'd call that a bug, anyway.

   and this is used as a security feature for cryptography software.
 
  mlock() is used to prevent pages being swapped out.  Its
  use for crypto software is essentially restricted to anon
  memory (allocated via brk() or mmap() of /dev/zero).

 What about userland device drivers that want to send parts
 of a disk backed file to a driver's dma routine?

And realtime software.  I'm not disputing that mlock is useful,
but what it can do for security software is not that huge.  The
Linux manpage says:

   Memory locking has two main applications: real-time algorithms
   and high-security data processing.

Matthew.





[HACKERS] Multi-process pgbench?

2001-03-04 Thread Matthew Kirkwood

Hi,

Did I read allegations here a while ago that someone
had a multi-process version of pgbench?  I've poked
around the website and mail archives, but couldn't
find it.

I have access to a couple of 4-CPU boxes, and reckon
that a single-process benching tool could well prove
a bottleneck.

Matthew.





[HACKERS] Re: mmap for zeroing WAL log

2001-02-28 Thread Matthew Kirkwood

On Tue, 27 Feb 2001, Tom Lane wrote:

 Matthew Kirkwood [EMAIL PROTECTED] writes:
  I had assumed that the overhead would come from synchronous
  metadata incurring writes of at least the inode, block bitmap
  and probably an indirect block for each syscall.

 No Unix that I've ever heard of forces metadata to disk after each
 "write" call; anyone who tried it would have abysmal performance.
 That's what fsync and the syncer daemon are for.

My understanding was that that's exactly what ffs' synchronous
metadata writes do.

Am I missing something here?  Do they just schedule I/O, but
return without waiting for its completion?

Matthew.




[HACKERS] Re: mmap for zeroing WAL log

2001-02-27 Thread Matthew Kirkwood

On Sat, 24 Feb 2001, Tom Lane wrote:

  I am confused why mmap() is better than writing to a real file.
 
  It isn't, except that it allows to initialise the logfile in
  one syscall, without first allocating and zeroing (and hence
  dirtying) 16Mb of memory.
 
 Uh, the existing code does not zero 16Mb of memory... it zeroes
 8K and then writes that block repeatedly.

See the "one syscall" bit above.
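
For comparison, the block-at-a-time approach described in the quote looks roughly like this outside the backend. It is a sketch with invented names (zero_fill_file, ZERO_BLOCK_SIZE), not the xlog.c code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_BLOCK_SIZE 8192

/*
 * Hypothetical helper: create "path" and fill it with "total" zero bytes by
 * writing one zeroed 8K buffer repeatedly.  Returns the open descriptor, or
 * -1 on failure (the partial file is removed).
 */
int
zero_fill_file(const char *path, size_t total)
{
    char        zbuffer[ZERO_BLOCK_SIZE];
    size_t      written;
    int         fd;

    assert(path != NULL && total % ZERO_BLOCK_SIZE == 0);
    memset(zbuffer, 0, sizeof(zbuffer));    /* only 8K of memory is dirtied */

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* One write() syscall per block. */
    for (written = 0; written < total; written += ZERO_BLOCK_SIZE)
    {
        if (write(fd, zbuffer, ZERO_BLOCK_SIZE) != ZERO_BLOCK_SIZE)
        {
            close(fd);
            unlink(path);
            return -1;
        }
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

For a 16Mb segment this comes to 2048 write() calls, which is the per-syscall overhead under discussion.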

 It's possible that the overhead of a syscall for each 8K block is
 significant,

I had assumed that the overhead would come from synchronous
metadata incurring writes of at least the inode, block bitmap
and probably an indirect block for each syscall.

 but on the other hand writing a block at a time is a heavily used and
 heavily optimized path in all Unixen.  It's at least as plausible that
 the mmap-as-source-of-zeroes path will be slower!

Results:

On Linux/ext2, it appears good for a gain of 3-5% for log
creations (via a fairly minimal test program).

On FreeBSD 4.1-RELEASE/ffs (with all of sync/async/softupdates)
it is a couple of percent worse in elapsed time, but consumes
around a third more system CPU time (12sec vs 9sec on one test
system).

I am awaiting numbers from reiserfs but, for now, it looks like
I am far from vindicated.

Matthew.




Re: [HACKERS] WAL and commit_delay

2001-02-19 Thread Matthew Kirkwood

On Sun, 18 Feb 2001, Tom Lane wrote:

 I think that there may be a performance advantage to pre-filling the
 logfile even so, assuming that file allocation info is stored in a
 Berkeley/McKusick-like fashion (note: I have no idea what ext2 or
 reiserfs actually do).

ext2 is a lot like [UF]FS.  reiserfs is very different, but does
have similar hole semantics.

BTW, I have attached two patches which streamline log initialisation
a little.  The first (xlog-sendfile.diff) adds support for Linux's
sendfile system call.  FreeBSD and HP/UX have sendfile() too, but the
prototype is different.  If it's interesting, someone will have to
come up with a configure test, as autoconf scares me.

The second removes a further three syscalls from the log init path.
There are a couple of things to note here:
 * I don't know why link/unlink is currently preferred over
   rename.  POSIX offers strong guarantees on the semantics
   of the latter.
 * I have assumed that the close/rename/reopen stuff is only
   there for the benefit of Windows users, and ifdeffed it
   for everyone else.

Matthew.


--- xlog.c.old  Mon Feb 19 12:35:53 2001
+++ xlog.c  Mon Feb 19 13:05:23 2001
@@ -24,6 +24,10 @@
 #include <locale.h>
 #endif
 
+#ifdef _HAVE_LINUX_SENDFILE
+#include <sys/sendfile.h>
+#endif
+
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catversion.h"
@@ -962,6 +966,24 @@
elog(STOP, "InitCreate(logfile %u seg %u) failed: %m",
 logId, logSeg);
 
+#ifdef _HAVE_LINUX_SENDFILE
+   {
+       static int  zfd = -1;
+       ssize_t     len;
+
+       if (zfd < 0) {
+           zfd = BasicOpenFile("/dev/zero", O_RDONLY, 0);
+           if (zfd < 0)
+               elog(STOP, "Can't open /dev/zero: %m");
+       }
+       len = sendfile(fd, zfd, NULL, XLogSegSize);
+       if (len < 0)
+           /* XXX - header support sendfile, but kernel doesn't?  Fall back */
+           elog(STOP, "sendfile failed: %m");
+       if (len < XLogSegSize)
+           elog(STOP, "short read on sendfile: %m");
+   }
+#else
if (lseek(fd, XLogSegSize - 1, SEEK_SET) != (off_t) (XLogSegSize - 1))
elog(STOP, "lseek(logfile %u seg %u) failed: %m",
 logId, logSeg);
@@ -969,6 +991,7 @@
if (write(fd, "", 1) != 1)
elog(STOP, "write(logfile %u seg %u) failed: %m",
 logId, logSeg);
+#endif
 
if (pg_fsync(fd) != 0)
elog(STOP, "fsync(logfile %u seg %u) failed: %m",



--- xlog.c.sf   Mon Feb 19 13:10:38 2001
+++ xlog.c  Mon Feb 19 13:13:55 2001
@@ -1001,22 +1001,20 @@
elog(STOP, "lseek(logfile %u seg %u off %u) failed: %m",
 log, seg, 0);
 
+#ifndef WIN32
    close(fd);
+#endif
 
-#ifndef __BEOS__
-   if (link(tpath, path) < 0)
-#else
    if (rename(tpath, path) < 0)
-#endif
        elog(STOP, "InitRelink(logfile %u seg %u) failed: %m",
             logId, logSeg);
 
-   unlink(tpath);
-
+#ifndef WIN32
    fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
    if (fd < 0)
elog(STOP, "InitReopen(logfile %u seg %u) failed: %m",
 logId, logSeg);
+#endif
 
return (fd);
 }




Re: [HACKERS] WAL and commit_delay

2001-02-19 Thread Matthew Kirkwood

On Mon, 19 Feb 2001, Matthew Kirkwood wrote:

 BTW, I have attached two patches which streamline log initialisation
 a little.  The first (xlog-sendfile.diff) adds support for Linux's
 sendfile system call.

Whoops, don't use this.  It looks like Linux won't sendfile()
from /dev/zero.  I'll endeavour to get this fixed, but it
looks like it'll be rather harder to use sendfile for this.

Bah.

Matthew.




[HACKERS] beta4 RPM bug

2001-02-18 Thread Matthew Kirkwood

Hi,

There seems to be a teeny-tiny bug in the beta4 RPMS.

/etc/rc.d/init.d/postgresql contains:

# PGVERSION is:
PGVERSION=7.1beta3

Matthew.




Re: [HACKERS] Linux 2.2 vs 2.4

2001-02-18 Thread Matthew Kirkwood

On Sat, 17 Feb 2001, Tom Lane wrote:

 the default -B is way too small for WAL.

OK, here are some 2.4 numbers with 1K transactions/client
and -B10240.

 Huh?  With the exception of the 16-user case (possibly measurement
 noise), 2.4 looks better across the board, AFAICS.  But see below.

OK.

Rough methodology:
# service postgresql stop
# rpm -e postgresql-server
# rm -fr /var/lib/pgsql
# service postgresql start
# reboot
# sysctl -w kernel.shmmax=186048768
pg$ createuser matthew
pg$ createdb matthew
me$ ./pgbench -i -s5 -t$T -c$N

Does this look fairly immune to troubles?

  Secondly, in both occasions after a run, performance has been
  more than 20% lower.

 I find that pgbench's reported performance can vary quite a bit from
 run to run, at least with smaller values of total transactions.  I
 think this is because it's a bit of a crapshoot how many WAL logfile
 initializations occur during the run and get charged against the total
 time.  Not to mention whatever else the machine might be doing.  With
 longer runs (say at least 1 total transactions) the numbers should
 stabilize.  I wouldn't put any faith at all in tests involving less
 than about 1000 total transactions...

Ah, good point.  Here are some with 2.4.2pre2 and 1000 transactions.

I'll try to find time tomorrow to do some batch benching with 10K
transactions on various kernels.

I hear allegations that the 2.4.1 disk elevator and VM are subject
to investigation so I'll try to keep some up-to-date numbers if
anyone is interested.

Matthew.

-- 
Numbers:
2.4.2-pre2 (-B10240):

pgbench -s5 -i: 1:13:02 elapsed
pgbench -s5 -t1000
1: 40.06 / 40.10 TPS
2: 53.01 / 53.08
4: 57.14 / 57.23
8: 62.82 / 62.92
16: 62.46 / 62.56
32: 43.15 / 43.20
1: 23.48 / 26.05
1: 30.85 / 30.88

pgbench -v -s5 -t1000
1: 26.37 / 26.39




[HACKERS] Linux 2.2 vs 2.4

2001-02-17 Thread Matthew Kirkwood

Hi,

Not sure if anyone will find this of interest, but I ran
pgbench on my main Linux box to see what sort of performance
difference might be visible between 2.2 and 2.4 kernels.

Hardware: A dual P3-450 with 384Mb of RAM and 3 SCSI disks.
The pg datafiles live in a half-gig partition on the first
one.

Software: Red Hat 6.1 plus all sorts of bits and pieces.
PostgreSQL 7.1beta4 RPMs.  pgbench hand-compiled from source
for same.  No options changed from defaults.  (I'll look at
that tomorrow -- is there anything worth changing other than
commit_delay and fsync?)

Kernels: 2.2.15 + software RAID patches, 2.4.2-pre2

With 2.2.15:
pgbench -s5 -i: 1.27.78 elapsed
pgbench -s5 -t100:
clients: TPS / TPS (excluding connection establishment)
1: 39.66 / 40.08 TPS
2: 60.77 / 61.64 TPS
4: 76.15 / 77.42
8: 90.99 / 92.73
16: 71.10 / 72.15
32: 49.20 / 49.70
1: 27.76 / 28.00
1: 27.82 / 28.03

pgbench -v -s5 -t100:
1: 30.73 / 30.98


And with 2.4.2-pre2:
pgbench -s5 -i: 1:17.46 elapsed
pgbench -s5 -t100
1: 43.57 / 44.11 TPS
2: 62.85 / 63.86 TPS
4: 87.24 / 89.08 TPS
8: 86.60 / 88.38 TPS
16: 53.22 / 53.88 TPS
32: 60.28 / 61.10 TPS
1: 35.93 / 36.33
1: 34.82 / 35.18

pgbench -v -s5 -t100:
1: 35.70 / 36.01


Overall, two things jump out at me.

Firstly, it looks like 2.4 is mixed news for heavy pgbench users
:)  Low-utilisation numbers are better, but the sweet spot seems
lower and narrower.

Secondly, on both occasions after a run, performance has been
more than 20% lower.  Restarting or performing a full vacuum does
not seem to help.  Is there some sort of fragmentation issue
here?

Matthew.




Re: [HACKERS] SSL Connections

2000-12-21 Thread Matthew Kirkwood

On Wed, 20 Dec 2000, Oliver Elphick wrote:

 To create a quick self-signed certificate, use the CA.pl script
 included in OpenSSL:
 
 CA.pl -newcert

Or you can do it manually:

openssl req -new -text -out cert.req (you will have to enter a password)
mv privkey.pem cert.pem.pw
openssl rsa -in cert.pem.pw -out cert.pem  (this removes the password)
openssl req -x509 -in cert.req -text -key cert.pem -out cert.cert

Matthew.