Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-07 Thread Zeugswetter Andreas SB SD


  Keep in mind that we support platforms without O_DSYNC.  I am not
  sure whether there are any that don't have O_SYNC either, but I am
  fairly sure that we measured O_SYNC to be slower than fsync()s on
  some platforms.

This measurement is quite understandable, since the current software
does 8k writes, and the OS only has a chance to write bigger blocks in the
write+fsync case. In the O_SYNC case you need to group bigger blocks yourself
(bigger blocks are essential for maximum I/O throughput).

I am still convinced that writing bigger blocks would allow the fastest
solution. But reading the recent posts, it may be enough to change the
current loop ("for each dirty 8k WAL buffer, write 8k") into one or two
large write calls.
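
To make the idea concrete, here is a minimal sketch of what such a coalesced
write could look like (an illustration only; the buffer layout and names are
assumptions, not the actual backend code):

    #include <sys/types.h>
    #include <unistd.h>

    #define WAL_BLCKSZ 8192

    /* Flush a contiguous run of dirty 8k WAL pages [first..last] held in one
     * in-memory array: one big sequential write gives the OS a chance to
     * issue large I/Os, and a single fsync replaces one fsync per page.
     * Partial writes and retries are ignored for brevity.
     */
    static int
    flush_dirty_run(int fd, const char *wal_buffers, int first, int last)
    {
        const char *start = wal_buffers + (size_t) first * WAL_BLCKSZ;
        size_t      len   = (size_t) (last - first + 1) * WAL_BLCKSZ;

        if (write(fd, start, len) != (ssize_t) len)
            return -1;
        return fsync(fd);
    }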

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] [GENERAL] Large databases, performance

2002-10-07 Thread Shridhar Daithankar

On 3 Oct 2002 at 8:54, Charles H. Woloszynski wrote:

 I'd be curious what happens when you submit more queries than you have 
 processors (you had four concurrent queries and four CPUs), if you care 
 to run any additional tests.  Also, I'd report the query time in 
 absolute (like you did) and also in 'Time/number of concurrent queries. 
  This will give you a sense of how the system is scaling as the workload 
 increases.  Personally I am more concerned about this aspect than the 
 load time, since I am going to guess that this is where all the time is 
 spent.  

OK, I am back from my cave after some more tests. Here are the results. I am
not repeating large parts of it, just answering your questions.

Don't ask me how these numbers changed. I am not the person who conducts the
tests, nor do I have access to the system. The rest (or most) of the setup
remains the same.

MySQL 3.23.52 with innodb transaction support: 

4 concurrent queries  :- 257.36 ms
40 concurrent queries :-  35.12 ms

Postgresql 7.2.2

4 concurrent queries  :- 257.43 ms
40 concurrent queries :-  41.16 ms

Though I cannot report Oracle numbers, suffice it to say that they fall in
between these two.

Oracle seems to be a hell of a lot faster than mysql/postgresql at loading raw
data, even when it's installed on reiserfs. We plan to run XFS tests later in
the hope that that would improve the mysql/postgresql load times.

In this run postgresql has a better load time than mysql/innodb (18270 sec v/s
17031 sec). Index creation times are faster as well (100 sec v/s 130 sec).
Don't know what parameters were changed.

The only worry is database size: postgresql is 111 GB v/s 87 GB for mysql. All
numbers include indexes. This is really going to be a problem when things are
deployed. Any idea how it can be brought down?

WAL is out, it's not counted.

Schema optimisation is a later issue. Right now all three databases are using
the same schema.

Will it help in this situation if I recompile postgresql with a block size of,
say, 32K rather than the 8K default? Will it save some overhead and offer
better performance in data loads etc.?
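
(For reference, and only as an assumption about where the knob lives: the page
size is a compile-time constant, so this would mean editing the source,
rebuilding, and running a fresh initdb. A sketch:)

    /* Assumed location: src/include/pg_config.h -- grep the tree for BLCKSZ
     * to be sure.  The value must be a power of 2; 32768 is the upper limit
     * the page layout allows.  A rebuild and a new initdb are required.
     */
    #define BLCKSZ 32768        /* default is 8192 */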

Will keep you guys updated..

Regards,
 Shridhar

---
Shridhar Daithankar
LIMS CPE Team Member, PSPL.
mailto:[EMAIL PROTECTED]
Phone:- +91-20-5678900 Extn.270
Fax  :- +91-20-5678901 
---


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Hannu Krosing

Tom Lane kirjutas E, 07.10.2002 kell 01:07:
 
 To test this, I made a modified version of pgbench in which each
 transaction consists of a simple
   insert into table_NNN values(0);
 where each client thread has a separate insertion target table.
 This is about the simplest transaction I could think of that would
 generate a WAL record each time.
 
 Running this modified pgbench with postmaster parameters
   postmaster -i -N 120 -B 1000 --wal_buffers=250
 and all other configuration settings at default, CVS tip code gives me
 a pretty consistent 115-118 transactions per second for anywhere from
 1 to 100 pgbench client threads.  This is exactly what I expected,
 since the database (including WAL file) is on a 7200 RPM SCSI drive.
 The theoretical maximum rate of sync'd writes to the WAL file is
 therefore 120 per second (one per disk revolution), but we lose a little
 because once in awhile the disk has to seek to a data file.
 
 Inserting the above patch, and keeping all else the same, I get:
 
 $ mybench -c 1 -t 1 bench1
 number of clients: 1
 number of transactions per client: 1
 number of transactions actually processed: 1/1
 tps = 116.694205 (including connections establishing)
 tps = 116.722648 (excluding connections establishing)
 
 $ mybench -c 5 -t 2000 -S -n bench1
 number of clients: 5
 number of transactions per client: 2000
 number of transactions actually processed: 1/1
 tps = 282.808341 (including connections establishing)
 tps = 283.656898 (excluding connections establishing)

In an ideal world this would be 5*120 = 600 tps.

Do you have any good ideas about what holds it back for the other 300 tps?

If it has a CPU utilisation of only 50%, then there must still be some
moderate lock contention.

btw, what is the number for 1-5-10 clients with fsync off ? 

 $ mybench -c 10 -t 1000 bench1
 number of clients: 10
 number of transactions per client: 1000
 number of transactions actually processed: 1/1
 tps = 443.131083 (including connections establishing)
 tps = 447.406534 (excluding connections establishing)
 
 CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
 to 10% idle at 10 or more.
 
 So this does seem to be a nice win, and unless I hear objections
 I will apply it ...

3x speedup is not just nice, it's great ;)

--
Hannu




---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,

2002-10-07 Thread Larry Rosenman

On Sun, 2002-10-06 at 22:20, Tom Lane wrote:
 Curt Sampson [EMAIL PROTECTED] writes:
  ... Avoiding cross-posting would be nice, since I am getting lots of
  duplicate messages these days.
 
 Cross-posting is a fact of life, and in fact encouraged, on the pg
 lists.  I suggest adapting.  Try sending
   set all unique your-email-address
 to the PG majordomo server; this sets you up to get only one copy
 of each cross-posted message.
That doesn't seem to work any more:

 set all unique [EMAIL PROTECTED]
 The all mailing list is not supported at
 PostgreSQL User Support Lists.

What do I need to send now? 

Marc? 


-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,

2002-10-07 Thread Michael Paesold

 On Sun, 2002-10-06 at 22:20, Tom Lane wrote:
  Curt Sampson [EMAIL PROTECTED] writes:
   ... Avoiding cross-posting would be nice, since I am getting lots of
   duplicate messages these days.
 
  Cross-posting is a fact of life, and in fact encouraged, on the pg
  lists.  I suggest adapting.  Try sending
  set all unique your-email-address
  to the PG majordomo server; this sets you up to get only one copy
  of each cross-posted message.
 That doesn't seem to work any more:

  set all unique [EMAIL PROTECTED]
  The all mailing list is not supported at
  PostgreSQL User Support Lists.

 What do I need to send now?

 Marc?

it is:
set ALL unique your-email

if you also don't want to get emails that have already been cc'd to you, you
can use:

set ALL eliminatecc your-email

for a full list of set options send:

help set

to majordomo.

Regards,
Michael Paesold



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,

2002-10-07 Thread Larry Rosenman

On Mon, 2002-10-07 at 07:01, Michael Paesold wrote:
  On Sun, 2002-10-06 at 22:20, Tom Lane wrote:
   Curt Sampson [EMAIL PROTECTED] writes:
... Avoiding cross-posting would be nice, since I am getting lots of
duplicate messages these days.
  
   Cross-posting is a fact of life, and in fact encouraged, on the pg
   lists.  I suggest adapting.  Try sending
   set all unique your-email-address
   to the PG majordomo server; this sets you up to get only one copy
   of each cross-posted message.
  That doesn't seem to work any more:
 
   set all unique [EMAIL PROTECTED]
   The all mailing list is not supported at
   PostgreSQL User Support Lists.
 
  What do I need to send now?
 
  Marc?
 
 it is:
 set ALL unique your-email
 
 if you also don't want to get emails that have already been cc'd to you, you
 can use:
 
 set ALL eliminatecc your-email
 
 for a full list of set options send:
 
 help set
 
 to majordomo.
Thanks.  That worked great.  (I use Mailman, and didn't realize the ALL
needed to be capitalized.)

LER


-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Tom Lane

Hannu Krosing [EMAIL PROTECTED] writes:
 in an ideal world this would be 5*120=600 tps. 
 Have you any good any ideas what holds it back for the other 300 tps ?

Well, recall that the CPU usage was about 20% in the single-client test.
(The reason I needed a variant version of pgbench is that this machine
is too slow to do more than 120 TPC-B transactions per second anyway.)

That says that the best possible throughput on this test scenario is 5
transactions per disk rotation --- the CPU is just not capable of doing
more.  I am actually getting about 4 xact/rotation for 10 or more
clients (in fact it seems to reach that plateau at 8 clients, and be
close to it at 7).  I'm inclined to think that the fact that it's 4 not
5 is just a matter of not quite there --- there's some additional CPU
overhead due to lock contention, etc, and any slowdown at all will cause
it to miss making 5.  The 20% CPU figure was approximate to begin with,
anyway.
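
(A small illustration of the ceiling being described, added for readers
following the arithmetic; it is not part of the original mail:)

    /* A 7200 RPM drive allows 7200/60 = 120 synchronous commit writes per
     * second; at roughly 20% CPU per transaction about 5 transactions can be
     * prepared per revolution.
     */
    static double
    commit_ceiling_tps(double rpm, double xacts_per_rev)
    {
        double revs_per_sec = rpm / 60.0;     /* 7200 RPM -> 120 rev/s */
        return revs_per_sec * xacts_per_rev;  /* 5/rev -> 600 tps, 4/rev -> ~480 tps */
    }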

The other interesting question is why we're not able to saturate the
machine with only 4 or 5 clients.  I think pgbench itself is probably
to blame for that: it can't keep all its backend threads constantly
busy ... especially not when several of them report back transaction
completion at essentially the same instant, as will happen under
ganged-commit conditions.  There will be intervals where multiple
backends are waiting for pgbench to send a new command.  That delay
in starting a new command cycle is probably enough for them to miss the
bus of getting included in the next commit write.

That's just a guess though; I don't have tools that would let me see
exactly what's happening.  Anyone else want to reproduce the test on
a different system and see what it does?

 If it has CPU utilisation of only 50% then there must be still some
 moderate lock contention. 

No, that's I/O wait I think, forced by the quantization of the number
of transactions that get committed per rotation.

 btw, what is the number for 1-5-10 clients with fsync off ? 

About 640 tps at 1 and 5, trailing off to 615 at 10, and down to 450
at 100 clients (now that must be lock contention...)

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Threaded Sorting

2002-10-07 Thread Hans-Jürgen Schönig

Bingo = great :).
The I/O problem seems to be solved :).

A table space concept would be top of the hitlist :).

The symlink version is not very comfortable and I think it would be a 
real hack.
Also: If we had a clean table space concept it would be real advantage.
In the first place it would be enough to define a directory (alter 
tablespace, changing sizes etc. could be a lot of work).

What could CREATE TABLESPACE look like?
Personally I like the Oracle Syntax.

Is it already time to work on the parser for CREATE/ALTER/DROP TABLESPACE?

Hans



Tom Lane wrote:

Bruce Momjian [EMAIL PROTECTED] writes:

Bingo!  Want to increase sorting performance, give it more I/O
bandwidth, and it will take 1/100th of the time to do threading.

Added to TODO:
  * Allow sorting to use multiple work directories

Yeah, I like that.  Actually it should apply to all temp files not only
sorting.

A crude hack would be to allow there to be multiple pg_temp_NNN/
subdirectories (read symlinks) in a database, and then the code would
automatically switch among these.

Probably a cleaner idea would be to somehow integrate this with
tablespace management --- if you could mark some tablespaces as intended
for temp stuff, the system could round-robin among those as it creates
temp files and/or temp tables.

			regards, tom lane



-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Threaded Sorting

2002-10-07 Thread Hans-Jürgen Schönig

Greg Copeland wrote:

I wouldn't hold your breath for any form of threading.  Since PostgreSQL
is process based, you might consider having a pool of sort processes
which address this but I doubt you'll get anywhere talking about threads
here.

Greg

  


I came across the problem yesterday. We thought about SMP and did some 
tests on huge tables. The postmaster was running full speed to get the 
stuff sorted - even on an IDE system.
I asked my friends who are doing a lot of work with Oracle on huge SMP 
machines. I was told that Oracle has a mechanism which can run efficient 
sorts on SMP machines. It seems to speed up sorting a lot.

If we could reduce the time needed to build up an index by 25% it would 
be a wonderful thing. Just think of a scenario:
1 thread: 24 hours
many threads: 18 hours

We  could gain 6 hours which is a LOT.
We have many people running PostgreSQL on systems having wonderful IO 
systems - in this case IO is not the bottleneck anymore.

I had a brief look at the code used for sorting. It is very well 
documented so maybe it is worth thinking about a parallel algorithm.

When talking about threads: a pool of processes for sorting? Maybe this
could be useful, but I doubt it is the best solution to avoid overhead.
Somewhere in the TODO it says that there will be experiments with a
threaded backend. This makes me think that threads are not a big no-no.

Hans

-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Threaded Sorting

2002-10-07 Thread Hans-Jürgen Schönig

Threads are not the best solution when it comes to portability. I prefer a
process model as well.
My concern was that a process model might be a bit too slow for that, but if
we had a pool of processes already in memory this would be a wonderful thing.
Using it for small amounts of data is pretty useless - I totally agree - but
when it comes to huge amounts of data it can be useful.

It is a mechanism for huge installations with a lot of data.

Hans



That was a fork IIRC.  Threading is not used in baseline PostgreSQL nor
are there any such plans that I'm aware of.  People from time to time ask
about threads for this or that and are always told what I'm telling
you.  The use of threads leads to portability issues, not to mention that
PostgreSQL is entirely built around the process model.

Tom is right to dismiss the notion of adding additional CPUs to
something that is already I/O bound, however, the concept itself should
not be dismissed.  Applying multiple CPUs to a sort operation is well
accepted and understood technology.

At this point, perhaps Tom or one of the other core developers having
insight in this area would be willing to address how readily such a
mechanism could be put in place.

Also, don't be so fast to dismiss what the process model can do.  There
is no reason to believe that having a process pool would not be able to
perform wonderful things if implemented properly.  Basically, the notion
would be that the backend processing the query would solicit assistance
from the sort pool if one or more processes were available.  At that
point, several methods could be employed to divide the work.  Some form
of threshold would also have to be created to prevent the pool from
being used when a single backend is capable of addressing the need.
Basically the idea is, you only have the pool assist with large tuple
counts, and then only when resources are available within the pool.  By
doing this, you avoid additional overhead for small sort efforts and
gain when it matters the most.
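
(To illustrate the dispatch decision described above, a hypothetical sketch;
none of these names exist in the backend, and the pool API is invented purely
for the example:)

    typedef struct SortPool SortPool;                         /* hypothetical pool handle */

    extern int  SortPoolWorkerAvailable(SortPool *pool);      /* hypothetical */
    extern void SortPoolRunSort(SortPool *pool, void *state); /* hypothetical */
    extern void perform_local_sort(void *state);              /* stands in for the existing path */

    /* Hand a sort to a helper process only when the input is large enough to
     * pay for the coordination overhead and a pool worker is actually idle;
     * otherwise sort locally exactly as today.
     */
    static void
    perform_sort(SortPool *pool, void *state, long ntuples, long threshold)
    {
        if (ntuples >= threshold && SortPoolWorkerAvailable(pool))
            SortPoolRunSort(pool, state);   /* large sorts, idle worker */
        else
            perform_local_sort(state);      /* small sorts: skip the overhead */
    }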


Regards,

   Greg

  



-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



[HACKERS] Moving to PostGres

2002-10-07 Thread Benjamin Stewart

Hello,
I am looking at moving our company away from MS SQL. Have been looking at
DB2 and it looks to have some good features. Now wondering if POSTGRESQL
could be a viable alternative. I have a few questions though;
1. What is the postgresql equivalent to stored procedures, and can they be written
in another language such as Java?
2. How well is Java supported for developing DB applications using PG?
3. What are the limitations to PG compared to DB2, Oracle, Sybase ?

Thanks
Ben



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Moving to PostGres

2002-10-07 Thread

In article XU5o9.11017$[EMAIL PROTECTED],
Benjamin Stewart [EMAIL PROTECTED] wrote:

I am looking at moving our company away from MS SQL.


Here's a good place to start:

  http://techdocs.postgresql.org/redir.php?link=/techdocs/sqlserver2pgsql.php

--
http://www.spinics.net/linux/

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Bad rules

2002-10-07 Thread Steve King

Thank you very much for your enlightened comment, it worked a treat.

I do not seem to be able to find references to this kind of useful
information in the postgresql online manual or in books such as Bruce
Momjian's 'PostgreSQL: Introduction and Concepts'. Where is this info to be
found other than the mailing list?

Thanks again.
Regards
Steve


-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: 04 October 2002 15:48
To: Steve King
Cc: PostgreSQL-development
Subject: Re: [HACKERS] Bad rules


Steve King [EMAIL PROTECTED] writes:
 I am using postgres 7.2, and have rule on a table which causes a notify if
 an insert/update/delete is performed on the table.
 The table is very very small.
 When performing a simple (very simple) update on the table this takes
about
 3 secs, when I remove the rule it is virtually instantaneous.
 The rest of the database seems to perform fine, have you any ideas or come
 across this before??

Let's see the rule exactly?  NOTIFY per se is not slow in my experience.

(One thing to ask: have you done a VACUUM FULL on pg_listener in recent
memory?  Heavy use of LISTEN/NOTIFY does tend to bloat that table if you
don't keep after it with VACUUM.)

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



[HACKERS] Implicit Lock Row

2002-10-07 Thread Antoine Lobato




I am currently developing an interface that simulates indexed sequential file
management on top of PostgreSql, and I must reproduce the same philosophy of
record-level locking.
I am looking for a way to implicitly lock and unlock a single row of a table.
The locking of several rows, in the same table or in various tables, can last
a long time, so the locking cannot simply be wrapped in a transaction without
locking the whole table for other users.
Is there a viable solution for this with PostgreSql?
Is there an accessible underlying locking structure?
Thank you.



Re: [HACKERS] Use of sync() [was Re: Potential Large Performance Gain in WAL synching]

2002-10-07 Thread Doug McNaught

Tom Lane [EMAIL PROTECTED] writes:

 Doug McNaught [EMAIL PROTECTED] writes:

  In my understanding, it means all currently dirty blocks in the file
  cache are queued to the disk driver.  The queued writes will
  eventually complete, but not necessarily before sync() returns.  I
  don't think subsequent write()s will block, unless the system is low
  on buffers and has to wait until dirty blocks are freed by the driver.
 
 We don't need later write()s to block.  We only need them to not hit
 disk before the sync-queued writes hit disk.  So I guess the question
 boils down to what queued to the disk driver means --- has the order
 of writes been determined at that point?

It's certainly possible that new write(s) get put into the queue
alongside old ones--I think the Linux block layer tries to do this
when it can, for one.  According to the manpage, Linux used to wait
until everything was written to return from sync(), though I don't
*think* it does anymore.  But that's not mandated by the specs.

So I don't think we can rely on such behavior (not reordering writes
across a sync()), though it will probably happen in practice a lot of
the time.  AFAIK there isn't anything better than sync() + sleep() as
far as the specs go.  Yes, it kinda sucks.  ;)
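
(A tiny sketch of the portable fallback being described; the delay is an
arbitrary heuristic, not a guarantee that the queued writes have reached disk:)

    #include <unistd.h>

    /* Schedule all dirty kernel buffers for writing, then pause to give them
     * a chance to complete.  sync() is not required to wait, so this only
     * makes ordering violations unlikely, not impossible.
     */
    static void
    sync_and_hope(unsigned int seconds)
    {
        sync();
        sleep(seconds);
    }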

-Doug

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Case insensitive columns

2002-10-07 Thread Christoph Strebin

I have a problem similar to that described by Shaw Terwilliger on
2001-03-14 (Subject: Case Insensitive CHECK CONSTRAINTs):

I need some case insensitive char/varchar columns. A unique index on
lower(col_name) wo



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Threaded Sorting

2002-10-07 Thread Hans-Jürgen Schönig

Threads are bad - I know ...
I like the idea of a pool of processes instead of threads - from my 
point of view this would be useful.

I am planning to run some tests (GEQO, AIX, sorts) as soon as I have 
time to do so (still too much work ahead before :( ...).
If I had time I'd love to do something for the PostgreSQL community :(.

As far as sorting is concerned: It would be fine if it was possible to 
define an alternative location for temporary sort files using SET.
If you had multiple disks this would help in the case of concurrent 
sorts because this way people could insert and index many tables at once 
without having to access just one storage system.
This would be an easy way out of the IO limitation ... - at least for 
some problems.

Hans



Bruce Momjian wrote:


We haven't thought about it yet because there are too many buggy thread
implementations.  We are probably just now getting to a point where we
can consider it.  However, lots of databases have moved to threads for
all sorts of things and ended up with a royal mess of code.  Threads
can only improve things in a few areas of the backend so it would be
nice if we could limit the exposure to threads to those areas;  sorting
could certainly be one of them, but frankly, I think disk I/O is our
limiting factor there.  I would be interested to see some tests that
showed otherwise.
  


-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Shridhar Daithankar

On 4 Oct 2002 at 21:13, Hans-Jürgen Schönig wrote:

 Bingo = great :).
 The I/O problem seems to be solved :).
 
 A table space concept would be top of the histlist :).
 
 The symlink version is not very comfortable and I think it would be a 
 real hack.
 Also: If we had a clean table space concept it would be real advantage.
 In the first place it would be enough to define a directory (alter 
 tablespace, changing sizes etc. could be a lot of work).
 
 How could CREATE TABLESPACE look like?
 Personally I like the Oracle Syntax.

Well, I (hopefully) understand the need for table spaces. But I absolutely
hate the way oracle does it. I am repeating all the points I posted before;
there was no follow-up. I hope I get some on this.

1) It tries to be a volume manager, assuming the OS handles volumes
inefficiently. Same mentality as handling all disk I/O by itself. Maybe that
was worth it when oracle did it, but is it worth it now?

2) It allows joining multiple volumes for performance reasons. If you want to
join multiple volumes for performance, let RAID handle it. Is that the job of
an RDBMS?

3) It puts multiple objects together. Why? I never fully understood having an
opaque file sitting on a drive v/s a neatly laid out directory structure. I
would always prefer the directory structure.

Can anybody please tell me in detail (not just by pointing towards TODO items):

1) What is a table space supposed to offer?

2) What does a directory structure not offer that a table space does?

3) How do they compare in terms of advantages/disadvantages?

Oracle familiarity is out; that's not even close to being a good merit IMO. If
postgresql moves to the oracle way of doing things ... well, I won't be as
happy as I am now.

Thanks for your patience..


Bye
 Shridhar

--
Newton's Little-Known Seventh Law:  A bird in the hand is safer than one 
overhead.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Implicit Lock Row

2002-10-07 Thread Shridhar Daithankar

On 5 Oct 2002 at 23:56, Antoine Lobato wrote:

 
 I am currently developing an interface that simulates indexed sequential file
 management on top of PostgreSql, and I must reproduce the same philosophy of
 record-level locking.
 I am looking for a way to implicitly lock and unlock a single row of a table.
 The locking of several rows, in the same table or in various tables, can last
 a long time, so the locking cannot simply be wrapped in a transaction without
 locking the whole table for other users.
 Is there a viable solution for this with PostgreSql?
 Is there an accessible underlying locking structure?

You can use select for update to lock rows.

HTH

Bye
 Shridhar

--
Strategy:   A long-range plan whose merit cannot be evaluated until sometime
after those creating it have left the organization.


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Hans-Jürgen Schönig



Can anybody please tell me in detail.(Not just a pointing towards TODO items)

1) What a table space supposed to offer?


They allow you to define a maximum amount of storage for a certain set 
of data.
They help you to define the location of data.
They help you to define how much data can be used by which resource.

2) What a directory structure does not offer that table space does?


You need to use the command line in order to manage quotas - you might not
want that.
Quotas are handled differently on every platform (if available).
With tablespaces you can assign 30mb to user a, 120mb to user b etc. ...
Table spaces are a nice abstraction layer on top of the file system.


3) How do they compare for advantages/disadvantages..

Oracle familiarity is out. That's not even close to being good merit IMO. If 
postgresql moves to oracle way of doing things, .. well, I won't be as much 
hapy as I am now..

Thanks for your patience..


how would you handle table spaces? just propose it to the hackers' list ...
we should definitely discuss that ...
a bad implementation of table spaces would be painful ...


Hans

-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [GENERAL] Large databases, performance

2002-10-07 Thread Manfred Koizar

On Mon, 07 Oct 2002 15:07:29 +0530, Shridhar Daithankar
[EMAIL PROTECTED] wrote:
Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql. All 
numbers include indexes. This is really going to be a problem when things are 
deployed. Any idea how can it be taken down? 

Shridhar,

if i'm not mistaken, a char(n)/varchar(n) column is stored as a 32-bit
integer specifying the length, followed by as many characters as the
length indicates.  On 32-bit Intel hardware this structure is aligned on a
4-byte boundary.
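
(To make that layout concrete, a rough sketch of the value format -- roughly
what the sources call a varlena; details may differ:)

    #include <stdint.h>

    /* A char(n)/varchar(n)/text value: a 32-bit total length (which includes
     * these 4 bytes) followed by the character data, the whole thing aligned
     * on a 4-byte boundary.
     */
    typedef struct
    {
        int32_t len;        /* total size in bytes, header included */
        char    data[1];    /* character data follows */
    } varsize_sketch;

    /* e.g. a char(10) value: 4 + 10 = 14 bytes of payload, padded to 16 by
     * the 4-byte alignment -- matching the "phys size" column below. */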

For your row layout this gives the following sizes (look at the phys
size column):

| Field     Field       Null   Indexed   phys size   mini
| name      type
|
| type      int         no     no          4           4
| esn       char(10)    no     yes        16          11
| min       char(10)    no     yes        16          11
| datetime  timestamp   no     yes         8           8
| opc0      char(3)     no     no          8           4
| opc1      char(3)     no     no          8           4
| opc2      char(3)     no     no          8           4
| dpc0      char(3)     no     no          8           4
| dpc1      char(3)     no     no          8           4
| dpc2      char(3)     no     no          8           4
| npa       char(3)     no     no          8           4
| nxx       char(3)     no     no          8           4
| rest      char(4)     no     no          8           5
| field0    int         yes    no          4           4
| field1    char(4)     yes    no          8           5
| field2    int         yes    no          4           4
| field3    char(4)     yes    no          8           5
| field4    int         yes    no          4           4
| field5    char(4)     yes    no          8           5
| field6    int         yes    no          4           4
| field7    char(4)     yes    no          8           5
| field8    int         yes    no          4           4
| field9    char(4)     yes    no          8           5
| ------------------------------------------------------
|                                         176         116

Ignoring nulls for now, you have to add 32 bytes for a v7.2 heap tuple
header and 4 bytes for ItemIdData per tuple, ending up with 212 bytes
per tuple, or ca. 85 GB heap space for 432,000,000 tuples.  Depending on
fill factor, similar calculations give some 30 GB for your index.
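
(The same arithmetic spelled out, assuming the 432-million-row table from
earlier in this thread:)

    /* per tuple:   176 bytes column data ("phys size" column above)
     *            +  32 bytes v7.2 heap tuple header
     *            +   4 bytes ItemIdData line pointer
     *            = 212 bytes
     * 212 bytes * 432,000,000 tuples ~= 91.6e9 bytes, i.e. the ~85 GB of
     * heap quoted above (plus ~30 GB of index).
     */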

Now if we had a datatype with only one byte for the string length,
char columns could be byte aligned and we'd have column sizes given
under mini in the table above.  The columns would have to be
rearranged according to alignment requirements.

Thus 60 bytes per heap tuple and 8 bytes per index tuple could be
saved, resulting in a database size of ~ 85 GB (index included).  And
I bet this would be significantly faster, too.

Hackers, do you think it's possible to hack together a quick and dirty
patch, so that string length is represented by one byte?  IOW can a
database be built that doesn't contain any char/varchar/text value
longer than 255 characters in the catalog?

If I'm not told that this is impossible, I'd give it a try.  Shridhar,
if such a patch can be made available, would you be willing to test
it?

What can you do right now?  Try using v7.3 beta and creating your
table WITHOUT OIDS.  This saves 8 bytes per tuple; not much, but
better save 4% than nothing.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] pg_filedump

2002-10-07 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 I'm trying to get something from pg_filedump.  However, the version
 published in sources.redhat.com/rhdb doesn't grok a lot of changes in
 current CVS.  I changed all those and made it compile... but looks like
 that's only the easy part.  I get bogus values everywhere (block sizes,
 item numbers, etc).
 Does somebody know whether it's maintained for current versions?

AFAIK, no one has yet updated it for 7.3's changes in tuple header
format.  That needs to get done sometime soon ...

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Hans-Jürgen Schönig


2) What a directory structure does not offer that table space does?
  

You need to the command line in order to manage quotas - you might not 
want that.



Mount a directory on a partition. If the data exceeds that partition, there
would be a disk error - like a tablespace getting overflown. I have seen both
scenarios in action.
  


Of course it can be done somehow. However, with tablespaces it is more
db-like and you need not be familiar with the operating system itself.
Just think of a company running several different operating systems
(suns, linux, bsd, ...).
What do you think could be done in this case? My answer would be an
abstraction layer called table spaces ...

  

Quotas are handled differently on every platform (if available).



Yeah. But that's the sysadmin's responsibility, not the DBA's.


Maybe many people ARE the sysadmins of their PostgreSQL box ...
When developing a database with an open mind people should try to see a 
problem from more than just one perspective.
Why should anybody just forget about sysdbas???


With tablespaces you can assign 30mb to user a, 120mb to user b etc. ...
Table spaces are a nice abstraction layer on top of the file system.



Hmm.. And how does that fit into the database metaphor? What practical use is
that? I can't imagine, as I am a developer and not a DBA.

  


One of our customers did some minor hosting projects with PostgreSQL.
That's what he wanted to have, because it is a practical issue.
a. you don't want to have more than one instance per machine.
b. you want to assign a certain amount of space to a certain user
without using quotas. Just think of administration tools - tablespaces
are as simple as a select.

Per directory is a first step - a good step and a good idea - but
tablespaces are a useful invention. Just think of hosting companies,
hybrid environments, etc ...
Tablespaces are not a devil, and sysdbas may be developers ...


Hans


-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at



---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Shridhar Daithankar

On 7 Oct 2002 at 16:49, Hans-Jürgen Schönig wrote:

 Mount a directory on a partition. If the data exceeds on that partition, there 
 would be disk error. Like tablespace getting overflown. I have seen both the 
 scenarios in action..
 Of course it can be done somehow. However, with tablespaces it is more 
 db-like and you need not be familiar with the operating system itself.
 Just think of a company having several different operating systems 
 (suns, linux, bsd, ...).
 what do you think could be done in this case? my answer would be an 
 abstraction layer called table spaces ...

OK. Point noted. Suspended till next point.

 Quotas are handled differently on ever platform (if available).
 Yeah. But that's sysadmins responsibility not DBA's.
 Maybe many people ARE the sysadmins of their PostgreSQL box ...
 When developing a database with an open mind people should try to see a 
 problem from more than just one perspective.
 Why should anybody just forget about sysdbas???

If the DBA is the sysadmin, does it make a difference whether he learns about
mount/ln or table spaces? Yes it does. Table spaces are limited to databases,
but mount/ln is useful for any general-purpose sysadmin work.

That answers the last point as well, I guess..

 With tablespaces you can assign 30mb to use a, 120mb to user b etc. ...
 Table spaces are a nice abstraction layer to the file system.
 Hmm.. And how does that fit in database metaphor? What practical use is that? I 
 can't imagine as I am a developer and not a DBA.
 One of our customers did some minor hosting projects with PostgreSQL. 
 That's what he wanted to have because it is a practical issue.
 a. you don't want to have more than one instance per machine.
 b. you want to assign a certain amount of space to a certain user 
 without using quotas. just think of administration tools - tablespaces 
 are as simple as a select.

Agreed. Perfect point, and I didn't think of it.

But it can be done in a directory structure as well. Of course it's quite a
deviation from what one thinks of as a plain old directory structure. But if
this is one point where table spaces win, let's borrow it. There is a lot of
baggage in table spaces that can be left out.

Besides, AFAIU, tablespaces implement quotas using data files which are
pre-allocated. Pre-claiming space/resources is the evil in everything the
likes of oracle do, and it runs in exactly the opposite direction from the
postgresql philosophy.

If postgresql has to implement quotas on objects, it should do so without
pre-claiming space.

Besides, if postgresql offers quotas on a per-object basis in a
directory/object scheme, I am sure that's far more granular than tablespaces.
Choice is good.
 
 per directory is a first step - a good step and a good idea but 
 tablespaces are a useful invention. just think of hosting companies, 
 hybrid environments, etc ...
 tablespaces or not a devil and sysdbas may be developers ...

It's not about devils. It's about re-evaluating the need once again,
especially at the level of the tablespace concept itself.

Bye
 Shridhar

--
Oblivion together does not frighten me, beloved.-- Thalassa (in Anne 
Mulhall's body), Return to Tomorrow, stardate 4770.3.


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance

2002-10-07 Thread Tom Lane

Shridhar Daithankar [EMAIL PROTECTED] writes:
 MySQL 3.23.52 with innodb transaction support: 

 4 concurrent queries  :-  257.36 ms
 40 concurrent queries :-  35.12 ms

 Postgresql 7.2.2 

 4 concurrent queries  :- 257.43 ms
 40 concurrent queries :- 41.16 ms

I find this pretty fishy.  The extreme similarity of the 4-client
numbers seems improbable, from what I know of the two databases.
I suspect your numbers are mostly measuring some non-database-related
overhead --- communications overhead, maybe?

 Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql. All 
 numbers include indexes. This is really going to be a problem when things are
 deployed. Any idea how can it be taken down? 

7.3 should be a little bit better because of Manfred's work on reducing
tuple header size --- if you create your tables WITHOUT OIDS, you should
save 8 bytes per row compared to earlier releases.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance

2002-10-07 Thread Shridhar Daithankar

On 7 Oct 2002 at 10:30, Tom Lane wrote:

 Shridhar Daithankar [EMAIL PROTECTED] writes:
  MySQL 3.23.52 with innodb transaction support: 
 
  4 concurrent queries:-  257.36 ms
  40 concurrent queries   :-  35.12 ms
 
  Postgresql 7.2.2 
 
  4 concurrent queries:- 257.43 ms
  40 concurrent   queries :- 41.16 ms
 
 I find this pretty fishy.  The extreme similarity of the 4-client
 numbers seems improbable, from what I know of the two databases.
 I suspect your numbers are mostly measuring some non-database-related
 overhead --- communications overhead, maybe?

I don't know, but all three numbers - postgresql/mysql/oracle - are 25x.xx ms.
The clients were on the same machine as the server, so there is no real area
to point at.
 
  Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql. All 
  numbers include indexes. This is really going to be a problem when things are
  deployed. Any idea how can it be taken down? 
 
 7.3 should be a little bit better because of Manfred's work on reducing
 tuple header size --- if you create your tables WITHOUT OIDS, you should
 save 8 bytes per row compared to earlier releases.

Got it..

Bye
 Shridhar

--
Sweater, n.:A garment worn by a child when its mother feels chilly.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Tom Lane

=?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes:
 how would you handle table spaces?

The plan that's been discussed simply defines a tablespace as being a
directory somewhere; physical storage of individual tables would remain
basically the same, one or more files under the containing directory.

The point of this being, of course, that the DBA could create the
tablespace directories on different partitions or volumes in order to
provide the behavior he wants.

In my mind this would be primarily a cleaner, more flexible
reimplementation of the existing database location feature.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Hans-Jürgen Schönig



Quotas are handled differently on ever platform (if available).


Yeah. But that's sysadmins responsibility not DBA's.
  

Maybe many people ARE the sysadmins of their PostgreSQL box ...
When developing a database with an open mind people should try to see a 
problem from more than just one perspective.
Why should anybody just forget about sysdbas???



If DBA is sysadmin, does it make a difference if he learnes about mount/ln or 
table spaces. Yes it does. Table spaces are limited to databases but mount/ln 
is useful for any general purpose sysadmin work.

That answers the last point as well, I guess..
  


I agree but still, think of hybrid systems ..


Agreed. Perfect point and I didn't thought of it.

But it can be done in directory structure as well. Of course it's quite a 
deviation from what one thinks as plain old directory structure. But if this is 
one point where table spaces win, let's borrow that. There is lot of baggage in 
table spaces that can be left out..

Besides AFAIU, tablespaces implements quota using data files which are pre-
allocated. Pre-claiming space/resource  is the evil of everything likes of 
oracle do and runs in exact opposite direction of postgresql philosophy.

If postgresql has to implement quotas on object, it should do without 
preclaiming space.

Besides if postgresql offers quota on per object basis in directory/object 
scheme, I am sure that's far more granular than tablespaces. Choice is good..
 

I didn't think of pre-allocation - this is pretty much how Oracle would
do it.
I was thinking of having a maximum size or something like that.
Overhead such as EXTENTs and things like that doesn't seem too useful to
me - that's what a filesystem can be used for.

I agree with Tom: if a tablespace were a directory it would be pretty
simple and pretty useful. If people could define a maximum size it would
be more than perfect.
All I think is necessary is:
- having data in different, user-defined locations
- having the chance to define a maximum size for that tablespace.

Suggestion:
CREATE TABLESPACE: create a directory with a certain size (optional) -
nothing special here.
ALTER TABLESPACE: resize a table space; resizing is possible if the amount
of data in the tablespace < new size of the tablespace.
DROP TABLESPACE: remove a table space. The question in this case is -
what about the objects in the tablespace?
Objects cannot always be deleted (just think of inheritance and
parent tables).

It's not about devil. It's about revaluating need once again. Especially at the 
level of tablespace concept in itself.

  


That's why people should discuss it and think about it :).
People want a good implementation or no implementation :).
This is Open Source - it is designed to be discussed :).

Hans

-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]

2002-10-07 Thread Jim Buttafuoco

Is this not what I have been after for many months now?  I dropped the
tablespace/location idea before 7.2 because there didn't seem to be any
interest.  Please see my past emails for the SQL commands and on-disk
directory layout I have proposed.  I have a working 7.2 system with
tablespaces/locations (whatever you want to call them; I like locations
because tablespaces are an Oracle thing).  I would like to get this code
ported into 7.4.

Jim


 =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes:
  how would you handle table spaces?
 
 The plan that's been discussed simply defines a tablespace as being a
 directory somewhere; physical storage of individual tables would remain
 basically the same, one or more files under the containing directory.
 
 The point of this being, of course, that the DBA could create the
 tablespace directories on different partitions or volumes in order to
 provide the behavior he wants.
 
 In my mind this would be primarily a cleaner, more flexible
 reimplementation of the existing database location feature.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 4: Don't 'kill -9' the postmaster





---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Moving to PostGres

2002-10-07 Thread Neil Conway

Benjamin Stewart [EMAIL PROTECTED] writes:
 1. What is the postgresql equiv to Stored procedures and can they be written
 in another langauage such s JAVA?

PostgreSQL supports user-defined functions; in 7.3 (currently in beta)
they can return sets of tuples.

You can define functions in Java using http://pljava.sf.net , or in a
variety of other languages (Perl, Python, Tcl, Ruby, C, PL/PgSQL, SQL,
sh, etc.)

 2. How well is Java supported for developing DB applications using
PG?

Pretty well, I guess. If you have a specific question, ask it.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



[HACKERS] reminder for those working on docs

2002-10-07 Thread Vince Vielhaber


Since 7.4 is getting real close and the docs are going to be going through
their final once-overs, please remember to have a look at the DocNote
comments that have been submitted.  Once 7.4 is released the current
notes will be gone.

http://www.postgresql.org/idocs/checknotes.php

The above url will show the notes and what they're in relation to with
a link to that particular piece of documentation.

Vince.
-- 
==
Vince Vielhaber -- KA8CSH    email: [EMAIL PROTECTED]    http://www.pop4.net
 56K Nationwide Dialup from $16.00/mo at Pop4 Networking
  http://www.camping-usa.com  http://www.cloudninegifts.com
   http://www.meanstreamradio.com   http://www.unknown-artists.com
==




---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] [GENERAL] Large databases, performance

2002-10-07 Thread Manfred Koizar

On Mon, 07 Oct 2002 19:48:31 +0530, Shridhar Daithankar
[EMAIL PROTECTED] wrote:
I say if it's a char field, there should be no indicator of length as it's not 
required. Just store those many characters straight ahead..

This is out of reach for a quick hack ...

Sure. But the server machine is not available this week. Some other project is 
using it. So the results won't be out unless at least a week from now.

 :-)

This time results are correct. Postgresql loads data faster, indexes it faster 
and queries in almost same time.. Way to go..

Great!  And now let's work on making selects faster, too.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance

2002-10-07 Thread Tom Lane

Shridhar Daithankar [EMAIL PROTECTED] writes:
 I say if it's a char field, there should be no indicator of length as
 it's not required. Just store those many characters straight ahead..

Your assumption fails when considering UNICODE or other multibyte
character encodings.
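
(An added illustration of the point:)

    /* With a multibyte encoding, character count != byte count, so "just
     * store n characters" no longer gives a fixed size.  In UTF-8:
     *   'a' -> 1 byte  (0x61)
     *   'ä' -> 2 bytes (0xC3 0xA4)
     *   '€' -> 3 bytes (0xE2 0x82 0xAC)
     * A char(10) value may need anywhere from 10 to 30+ bytes, so an
     * explicit length word is still required.
     */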

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Antti Haapala


On 6 Oct 2002, Greg Copeland wrote:

 On Sat, 2002-10-05 at 14:46, Curtis Faith wrote:
 
  2) aio_write vs. normal write.
 
  Since as you and others have pointed out aio_write and write are both
  asynchronous, the issue becomes one of whether or not the copies to the
  file system buffers happen synchronously or not.

 Actually, I believe that write will be *mostly* asynchronous while
 aio_write will always be asynchronous.  In a buffer poor environment, I
 believe write will degrade into a synchronous operation.  In an ideal
 situation, I think they will prove to be on par with one another with a
 slight bias toward aio_write.  In less than ideal situations where
 buffer space is at a premium, I think aio_write will get the leg up.
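
(For readers unfamiliar with the interface under discussion, a minimal POSIX
AIO sketch; error handling is omitted, and on older Linux/glibc this lives in
librt, so link with -lrt:)

    #include <sys/types.h>
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    /* Queue an asynchronous write: aio_write() returns as soon as the request
     * is queued and the I/O proceeds in the background, unlike write(), which
     * may block when the buffer cache is under pressure.
     */
    static int
    start_async_write(struct aiocb *cb, int fd, void *buf, size_t len, off_t offset)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = offset;
        return aio_write(cb);               /* 0 = queued, -1 = could not queue */
    }

    static int
    async_write_done(const struct aiocb *cb)
    {
        return aio_error(cb) != EINPROGRESS;    /* finished, successfully or not */
    }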

I browsed the web and came across this piece of text regarding a Linux KAIO
patch by Silicon Graphics...

The asynchronous I/O (AIO) facility implements interfaces defined by the
POSIX standard, although it has not been through formal compliance
certification. This version of AIO is implemented with support from kernel
modifications, and hence will be called KAIO to distinguish it from AIO
facilities available from newer versions of glibc/librt.  Because of the
kernel support, KAIO is able to perform split-phase I/O to maximize
concurrency of I/O at the device. With split-phase I/O, the initiating
request (such as an aio_read) truly queues the I/O at the device as the
first phase of the I/O request; a second phase of the I/O request,
performed as part of the I/O completion, propagates results of the
request.  The results may include the contents of the I/O buffer on a
read, the number of bytes read or written, and any error status.

Preliminary experience with KAIO have shown  over  35% improvement in
database performance tests. Unit tests (which only perform I/O) using KAIO
and Raw I/O have been successful in achieving 93% saturation with 12 disks
hung off 2  X 40 MB/s Ultra-Wide SCSI channels. We believe that these
encouraging results are a direct result of implementing  a significant
part of KAIO in the kernel using split-phase I/O while avoiding or
minimizing the use of any globally contented locks.

Well...

 In a worst-case scenario, it seems that aio_write does get a win.

 I personally would at least like to see an aio implementation and would
 be willing to even help benchmark it to benchmark/validate any returns
 in performance.  Surely if testing reflected a performance boost it
 would be considered for baseline inclusion?


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Greg Copeland

On Mon, 2002-10-07 at 10:38, Antti Haapala wrote:
 Browsed web and came across this piece of text regarding a Linux-KAIO
 patch by Silicon Graphics...
 

Ya, I have read this before.  The problem here is that I'm not aware of
which AIO implementation on Linux is the forerunner, nor do I have any
idea how its implementation or performance details differ from those of
other implementations on other platforms.  I know there are at least two
aio efforts underway for Linux.  There could yet be others.  Attempting
to cite specifics that only pertain to Linux and then, only with a
specific implementation which may or may not be in general use, is
questionable.  Because of this I simply left it as saying that I believe
my analysis is pessimistic.

Anyone have any idea whether Red Hat's Advanced Server uses KAIO or what?

 
 Preliminary experience with KAIO have shown  over  35% improvement in
 database performance tests. Unit tests (which only perform I/O) using KAIO
 and Raw I/O have been successful in achieving 93% saturation with 12 disks
 hung off 2  X 40 MB/s Ultra-Wide SCSI channels. We believe that these
 encouraging results are a direct result of implementing  a significant
 part of KAIO in the kernel using split-phase I/O while avoiding or
 minimizing the use of any globally contented locks.

The problem here is, I have no idea what they are comparing to (worst-case
read/writes, which we know PostgreSQL *mostly* isn't suffering
from).  If we assume that PostgreSQL's read/write operations are
somewhat optimized (as it currently sounds like they are), I'd seriously
doubt we'd see that big of a difference.  On the other hand, I'm hoping
that if an aio postgresql implementation does get done we'll see
something like a 5%-10% performance boost.  Even still, I have nothing
to pin that on other than hope.  If we do see a notable performance
increase for Linux, I have no idea what it will do for other platforms.

Then, there are all of the issues that Tom brought up about
bloat/uglification and maintainability.  So, while I certainly do keep
those remarks in my mind, I think it's best to simply encourage the
effort (or something like it) and help determine where we really sit by
means of empirical evidence.


Greg






Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Neil Conway

Greg Copeland [EMAIL PROTECTED] writes:
 Ya, I have read this before.  The problem here is that I'm not aware of
 which AIO implementation on Linux is the forerunner nor do I have any
 idea how it's implementation or performance details defer from that of
 other implementations on other platforms.

The implementation of AIO in 2.5 is the one by Ben LaHaise (not
SGI). Not sure what the performance is like -- although it's been
merged into 2.5 already, so someone can do some benchmarking. Can
anyone suggest a good test?

Keep in mind that glibc has had a user-space implementation for a
little while (although I'd guess the performance to be unimpressive),
so AIO would not be *that* kernel-version specific.

 Anyone have any idea of Red Hat's Advanced Server uses KAIO or what?

RH AS uses Ben LaHaise's implemention of AIO, I believe.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Table spaces again [was Re: Threaded Sorting]

2002-10-07 Thread Michael Paesold

Shridhar Daithankar [EMAIL PROTECTED] wrote:
[snip]
 On 7 Oct 2002 at 15:52, Hans-Jürgen Schönig wrote:
[snip]
  With tablespaces you can assign 30mb to user a, 120mb to user b etc. ...
  Table spaces are a nice abstraction layer to the file system.

 Hmm.. And how does that fit into the database metaphor? What practical use
 is that? I can't imagine, as I am a developer and not a DBA.

Virtual hosting at ISP's for example.

 I believe giving each database its own transaction log would be a great
 advantage of this scheme.

Well, if you think of Tom's recent patch (ganged WAL writes), from a
performance point of view, this would only be good if each transaction
log had its own disk. Otherwise a single transaction log is still better.

I think tablespaces are a good idea. I also prefer associating tablespaces
with directory structures over the Oracle style.

Regards,
Michael Paesold


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



[HACKERS] BTree metapage lock and freelist structure

2002-10-07 Thread Alvaro Herrera

Hello hackers,

I'm thinking about the btree metapage locking problem.

In the current situation there are three operations that lock the
metapage:
1. when looking for the root page (share lock, getroot)
2. when setting a new root page (exclusive lock, setroot)
3. when extending the relation (exclusive lock, extend).

Now, I want to add three more operations:
4. add a page to the freelist (addfree)
5. get a page from the freelist (getfree)
6. shrink a relation (shrink).

The shrink operation only happens when one tries to add the last page of
the relation to the freelist.  I don't know if that's very common, but
in case of relation truncating or massive deletion this is important.


What I want is to be able to do getroot and setroot without being
blocked by any of the other four.  Of course the other four are all
mutually exclusive.

There doesn't seem to be a way to acquire two different locks on the
same page, so I propose to lock the InvalidBlockNumber for the btree, and
use that as the lock to be obtained before doing extend, addfree,
getfree or shrink.  The setroot/getroot operations would still use the
locking on BTREE_METAPAGE, so they can proceed whether the
InvalidBlockNumber is blocked or not.


On a different topic:  I think the freelist structure should be
represented as an int32 count (btm_numfreepages) in
BTMetaPageData, and a pointer to the first BlockNumber.  Adding a new
page is done by sequentially scanning the list until a zero is found or
the end of the block is reached.  Getting a page sequentially scans the
same list until a blocknumber > 0 is found (iff btm_numfreepages is
greater than zero).  This allows for ~2000 free pages (a limit very
unlikely to actually be reached if the shrink operation is in place).
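
For illustration, here is a rough sketch of that layout and the two scans.
The field and helper names are made up for the sketch; this is not the
actual BTMetaPageData definition:

#include <stdint.h>

typedef uint32_t BlockNumber;

#define FREELIST_SLOTS 2000     /* roughly what fits in the rest of an 8K metapage */
#define EMPTY_SLOT     0        /* "a zero" marks an unused entry */

typedef struct BTMetaFreeList
{
    int32_t     btm_numfreepages;              /* live entries below */
    BlockNumber btm_freelist[FREELIST_SLOTS];  /* runs to the end of the page */
} BTMetaFreeList;

/* addfree: record a freed page in the first empty slot, if any */
static int
freelist_add(BTMetaFreeList *m, BlockNumber blk)
{
    int i;

    for (i = 0; i < FREELIST_SLOTS; i++)
    {
        if (m->btm_freelist[i] == EMPTY_SLOT)
        {
            m->btm_freelist[i] = blk;
            m->btm_numfreepages++;
            return 1;
        }
    }
    return 0;                   /* list full; caller extends the relation instead */
}

/* getfree: hand out a recorded free page, or 0 if there is none */
static BlockNumber
freelist_get(BTMetaFreeList *m)
{
    int i;

    if (m->btm_numfreepages <= 0)
        return EMPTY_SLOT;
    for (i = 0; i < FREELIST_SLOTS; i++)
    {
        if (m->btm_freelist[i] != EMPTY_SLOT)
        {
            BlockNumber blk = m->btm_freelist[i];

            m->btm_freelist[i] = EMPTY_SLOT;
            m->btm_numfreepages--;
            return blk;
        }
    }
    return EMPTY_SLOT;
}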

Comments?  Another solution would be to have a separate page for the
freelist, but it seems to be a waste.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
"Because frankly, if you had to pass an exam to know how to handle
yourself... who would be the tough guy holding a license?"  (Mafalda)

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]

2002-10-07 Thread Jan Wieck

Curtis Faith wrote:

 The current transaction/user state seems to be stored in process
 global space. This could be changed to be a pointer to a struct
 stored in a back-end specific shared memory area which would be
 accessed by the executor process at execution start. The backend
 would destroy and recreate the shared memory and restart execution
 in the case where an executor process dies much like the postmaster
 does with backends now.
 
 To the extent the executor process might make changes to the state,
 which I'd try to avoid if possible (don't know if it is), the
 executors could obtain locks, otherwise if the executions were
 constrained to isolated elements (changes to different indexes for
 example) it seems like it would be possible using an architecture
 where you have:

Imagine there is a PL/Tcl function. On the first call in a session, the
PL/Tcl interpreter gets created (that's during execution, okay?). Now
the procedure that's called inside of that interpreter creates a
global variable ... a global Tcl variable inside of that interpreter,
which is totally unknown to the backend since it doesn't know what Tcl
is at all, and that variable is nothing but an entry in a private hash
table inside of that interpreter. On a subsequent call to any PL/Tcl
function during that session, it might be good if that darn hashtable
entry still exists.

How do you propose to let this happen?

And while at it, the Tcl procedure next calls spi_exec, causing the
PL/Tcl function handler to call SPI_exec(), so your isolated executor
all of a sudden becomes a fully operational backend, doing the
parsing, planning and optimizing, or what?


Jan

-- 

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]

2002-10-07 Thread Curtis Faith

 Curtis Faith wrote:

  The current transaction/user state seems to be stored in process
  global space. This could be changed to be a sointer to a struct
  stored in a back-end specific shared memory area which would be
  accessed by the executor process at execution start. The backend
  would destroy and recreate the shared memory and restart execution
  in the case where an executor process dies much like the postmaster
  does with backends now.
 
  To the extent the executor process might make changes to the state,
  which I'd try to avoid if possible (don't know if it is), the
  executors could obtain locks, otherwise if the executions were
  constrained to isolated elements (changes to different indexes for
  example) it seems like it would be possible using an architecture
  where you have:

Jan Wieck replied:
 Imagine there is a PL/Tcl function. On the first call in a session, the
 PL/Tcl interpreter gets created (that's during execution, okay?). Now
 the procedure that's called inside of that interpreter creates a
 global variable ... a global Tcl variable inside of that interpreter,
 which is totally unknown to the backend since it doesn't know what Tcl
 is at all, and that variable is nothing but an entry in a private hash
 table inside of that interpreter. On a subsequent call to any PL/Tcl
 function during that session, it might be good if that darn hashtable
 entry exists.

 How do you propose to let this happen?

 And while at it, the Tcl procedure next calls spi_exec, causing the
 PL/Tcl function handler to call SPI_exec(), so your isolated executor
 all of a sudden becomes a fully operational backend, doing the
 parsing, planning and optimizing, or what?

You bring up a good point, we couldn't do what I propose for all
situations. I had never anticipated that splitting things up would be the
rule. For example, the optimizer would have to decide whether it made sense
to split up a query from a strictly performance perspective. So now, if we
consider the fact that some things could not be done with split backend
execution, the logic becomes:

if ( splitting is possible && splitting is faster )
    do the split execution;
else
    do the normal execution;

Since the design already splits the backend internally into a separate
execution phase, it seems like one could keep the current
implementation for the typical case where splitting doesn't buy anything or
cases where there is complex state information that needs to be maintained.
If there are no triggers or functions that will be accessed by a given
query then I don't see your concerns applying.

If there are triggers or other conditions which preclude multi-process
execution, we can keep exactly the same behavior as now. The plan execution
entry could easily be a place where it either A) did the same thing it
currently does or B) passed execution off to a pool as per the original
proposal.

I have to believe that most SELECTs won't be affected by your concerns.
Additionally, even in the case of an UPDATE, many times there are large
portions of the operation's actual work that wouldn't be affected even if
there are lots of triggers on the tables being updated. The computation of
the inside of the WHERE could often be split out without causing any
problems with context or state information. The master executor could
always be the original backend as it is now and this would be the place
where the UPDATE part would be processed after the WHERE tuples had been
identified.

As with any optimization, it is more complicated and won't handle all the
cases. It's just an idea to handle common cases that would otherwise be
much slower.

That having been said, I'm sure there are much lower hanging fruit on the
performance tree and likely will be for a little while.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 On Sun, 2002-10-06 at 11:46, Tom Lane wrote:
  I can't personally get excited about something that only helps if your
  server is starved for RAM --- who runs servers that aren't fat on RAM
  anymore?  But give it a shot if you like.  Perhaps your analysis is
  pessimistic.

 snipped I don't find it far fetched to
 imagine situations where people may commit large amounts of memory for
 the database yet marginally starve available memory for file system
 buffers.  Especially so on heavily I/O bound systems or where sporadicly
 other types of non-database file activity may occur.

 snipped Of course, that opens the door for simply adding more memory
 and/or slightly reducing the amount of memory available to the database
 (thus making it available elsewhere).  Now, after all that's said and
 done, having something like aio in use would seemingly allowing it to be
 somewhat more self-tuning from a potential performance perspective.

Good points.

Now for some surprising news (at least it surprised me).

I researched the file system source on my system (FreeBSD 4.6) and found
that the behavior was optimized for non-database access to eliminate
unnecessary writes when temp files are created and deleted rapidly. It was
not optimized to get data to the disk in the most efficient manner.

The syncer on FreeBSD appears to place dirtied filesystem buffers into
work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer
processes one of the queues and increments a counter syncer_delayno.

On my system the setting for SYNCER_MAXDELAY is 32. So each second, 1/32nd
of the writes that were buffered are processed. If the syncer gets behind
and the writes queued for a given second take more than one second to
process, the syncer does not wait but begins processing the next queue.

AFAICT this means that there is no opportunity to have writes combined by
the  disk since they are processed in buckets based on the time the writes
came in.
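
To make the bucketing concrete, here is a toy model of the wheel behavior
described above. The names and counters are made up; the real FreeBSD code
works on kernel buffer/vnode structures, not plain integers:

#define WHEEL_SLOTS 32

static int dirty_count[WHEEL_SLOTS];    /* stand-in for the per-slot work lists */
static int current_slot = 0;

/* a newly dirtied buffer is queued "delay" seconds into the future */
void
queue_dirty_buffer(int delay)
{
    dirty_count[(current_slot + delay) % WHEEL_SLOTS]++;
}

/*
 * Once a second, exactly one slot is pushed out to disk, so writes leave the
 * cache bucketed by when they were dirtied, not ordered to suit the disk arm.
 * Returns how many buffers this tick flushed.
 */
int
run_syncer_tick(void)
{
    int flushed = dirty_count[current_slot];

    dirty_count[current_slot] = 0;
    current_slot = (current_slot + 1) % WHEEL_SLOTS;
    return flushed;
}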

Also, it seems very likely that many installations won't have enough
buffers for 30 seconds worth of changes and that there would be some level
of SYNCHRONOUS writing because of this delay and the syncer process getting
backed up. This might happen once per second as the buffers get full and
the syncer has not yet started for that second interval.

Linux might handle this better. I saw some emails exchanged a year or so
ago about starting writes immediately in a low-priority way but I'm not
sure if those patches got applied to the linux kernel or not. The source I
had access to seems to do something analogous to FreeBSD but using fixed
percentages of the dirty blocks or a minimum number of blocks. They appear
to be handled in LRU order however.

On-disk caches are much much larger these days so it seems that some way of
getting the data out sooner would result in better write performance for
the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M
cache. I don't see these caches getting smaller over time so not letting
the disk see writes will become more and more of a performance drain.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



[HACKERS] Hot Backup

2002-10-07 Thread Sandeep Chadha

Hello to all the Doers of Postgres!!!

Last time I went through the forums, people spoke highly of 7.3 and its capability to 
do hot backups. My problem is that if the database goes down and I lose my main data store, 
then I will lose all transactions back to the time I did the pg_dump.

Other databases (i.e., Oracle) solve this by retaining their archive logs in some 
physically separate storage. So, when you lose your data, you can restore the data 
from backup, then apply your archive log, and avoid losing any committed 
transactions. 

PostgreSQL has been lacking this all along. I've installed postgres 7.3b2 and still 
don't see any archives flushed to any other place. Please let me know how the hot 
backup procedure is implemented in the current 7.3 beta(2) release.


Thanks.

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Tom Lane

I wrote:
 That says that the best possible throughput on this test scenario is 5
 transactions per disk rotation --- the CPU is just not capable of doing
 more.  I am actually getting about 4 xact/rotation for 10 or more
 clients (in fact it seems to reach that plateau at 8 clients, and be
 close to it at 7).

After further thought I understand why it takes 8 clients to reach full
throughput in this scenario.  Assume that we have enough CPU oomph so
that we can process four transactions, but not five, in the time needed
for one revolution of the WAL disk.  If we have five active clients
then the behavior will be like this:

1. Backend A becomes ready to commit.  It locks WALWriteLock and issues
a write/flush that will only cover its own commit record.  Assume that
it has to wait one full disk revolution for the write to complete (this
will be the steady-state situation).

2. While A is waiting, there is enough time for B, C, D, and E to run
their transactions and become ready to commit.  All eventually block on
WALWriteLock.

3. When A finishes its write and releases WALWriteLock, B will acquire
the lock and initiate a write that (with my patch) will cover C, D, and
E's commit records as well as its own.

4. While B is waiting for the disk to spin, A receives a new transaction
from its client, processes it, and becomes ready to commit.  It blocks
on WALWriteLock.

5. When B releases the lock, C, D, E acquire it and quickly fall
through, seeing that they need do no work.  Then A acquires the lock.
GOTO step 1.

So with five active threads, we alternate between committing one
transaction and four transactions on odd and even disk revolutions.

It's pretty easy to see that with six or seven active threads, we
will alternate between committing two or three transactions and
committing four.  Only when we get to eight threads do we have enough
backends to ensure that four transactions are available to commit on
every disk revolution.  This must be so because the backends that are
released at the end of any given disk revolution will not be able to
participate in the next group commit, if there is already at least
one backend ready to commit.
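
To make that concrete, here is a self-contained pseudo-C sketch of the
ganged write each committing backend performs under the patch.  The types
and helper names are invented for the sketch, and LSN comparisons are
written as plain integers, so this is not the actual xlog.c code:

#include <stdint.h>

typedef uint64_t Lsn;                   /* stand-in for XLogRecPtr */

static Lsn flushed_upto;                /* shared: highest WAL position known on disk */
static Lsn inserted_upto;               /* shared: end of WAL inserted so far */

extern void wal_write_lock(void);       /* stand-in for acquiring WALWriteLock */
extern void wal_write_unlock(void);
extern void write_and_fsync_wal(Lsn upto);

void
flush_my_commit(Lsn my_commit_lsn)
{
    if (my_commit_lsn <= flushed_upto)
        return;                         /* an earlier ganged write already covered us */

    wal_write_lock();
    if (my_commit_lsn > flushed_upto)   /* re-check: the lock holder may have done it */
    {
        /*
         * Flush everything inserted so far.  Backends that queued up behind
         * the current lock holder get their commit records written by this
         * one call -- that is the "ganged" write.
         */
        Lsn upto = inserted_upto;

        write_and_fsync_wal(upto);
        flushed_upto = upto;
    }
    wal_write_unlock();
}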

So this solution isn't perfect; it would still be nice to have a way to
delay initiation of the WAL write until just before the disk is ready
to accept it.  I dunno any good way to do that, though.

I went ahead and committed the patch for 7.3, since it's simple and does
offer some performance improvement.  But maybe we can think of something
better later on...

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Hot Backup

2002-10-07 Thread Neil Conway

Sandeep Chadha [EMAIL PROTECTED] writes:
 Postgresql has been lacking this all along. I've installed postgres
 7.3b2 and still don't see any archive's flushed to any other
 place. Please let me know how is hot backup procedure implemented in
 current 7.3 beta(2) release.

AFAIK no such hot backup feature has been implemented for 7.3 -- you
appear to have been misinformed.

That said, I agree that would be a good feature to have.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

Tom, first of all, excellent job improving the current algorithm. I'm glad
you looked at the WALCommitLock code.

 This must be so because the backends that are
 released at the end of any given disk revolution will not be able to
 participate in the next group commit, if there is already at least
 one backend ready to commit.

This is the major reason for my original suggestion about using aio_write.
The writes don't block each other and there is no need for a kernel level
exclusive locking call like fsync or fdatasync.
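
For reference, here is a minimal sketch of the POSIX AIO calls being
proposed.  The descriptor, buffer, and offset handling are placeholders,
and it assumes the WAL file was opened with O_DSYNC so that completion
really means the record is on stable storage:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* queue one WAL block for writing; returns 0 on success, -1 on error */
int
start_wal_write(int wal_fd, const void *buf, size_t len, off_t offset,
                struct aiocb *cb)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = wal_fd;
    cb->aio_buf = (void *) buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;

    return aio_write(cb);       /* queues the write; does not block the backend */
}

/* a committing backend waits only for its own record's completion */
ssize_t
wait_for_wal_write(struct aiocb *cb)
{
    const struct aiocb *list[1];

    list[0] = cb;
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    return aio_return(cb);      /* bytes written, or -1 on error */
}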

Even the theoretical limit you mention of one transaction per revolution
per committing process seems like a significant bottleneck.

Is committing 1 and 4 transactions on alternating revolutions good? It's
certainly better than 1 per revolution.

However, what if we could have done 3 transactions per process in the time
it took for a single revolution?

Then we are looking at (1 + 4)/ 2 = 2.5 transactions per revolution versus
the theoretical maximum of (3 * 5) = 15 transactions per revolution if we
can figure out a way to do non-blocking writes that we can guarantee are on
the disk platter so we can return from commit.

Setting aside whether or not aio is viable: do you not agree that
eliminating the blocking could potentially result in a 6X improvement for
the 5-process case?


 So this solution isn't perfect; it would still be nice to have a way to
 delay initiation of the WAL write until just before the disk is ready
 to accept it.  I dunno any good way to do that, though.

I still think that it would be much faster to just keep writing the WAL log
blocks when they fill up and have a separate process wake the committing
process when the write completes. This would eliminate WAL writing as a
bottleneck.

I have yet to hear anyone say that this can't be done, only that we might
not want to do it because the code might not be clean.

I'm generally only happy when I can finally remove a bottleneck completely,
but speeding one up by 3X like you have done is pretty damn cool for a day
or two's work.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] v7.2.3 - tag'd, packaged ... need it checked ...

2002-10-07 Thread Peter Eisentraut

Marc G. Fournier writes:

 Looks good from my end, Peter, I pulled the same docs that I pulled for
 v7.2.2, which I hope is okay?

Probably not, because the version number needs to be changed and they need
to be rebuilt for each release.

-- 
Peter Eisentraut   [EMAIL PROTECTED]


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Hot Backup

2002-10-07 Thread Sandeep Chadha

Hmmm. Then are there any new enhancements as far as backups are concerned between 
the current 7.2.x and 7.3.x?
For example, can we take a tar backup while the database is up and running, or is there another such feature?

Thanks a bunch in advance.

-Original Message-
From: Neil Conway [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 07, 2002 1:48 PM
To: Sandeep Chadha
Cc: Tom Lane; [EMAIL PROTECTED]; pgsql-general
Subject: Re: [HACKERS] Hot Backup


Sandeep Chadha [EMAIL PROTECTED] writes:
 Postgresql has been lacking this all along. I've installed postgres
 7.3b2 and still don't see any archive's flushed to any other
 place. Please let me know how is hot backup procedure implemented in
 current 7.3 beta(2) release.

AFAIK no such hot backup feature has been implemented for 7.3 -- you
appear to have been misinformed.

That said, I agree that would be a good feature to have.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 Even the theoretical limit you mention of one transaction per revolution
 per committing process seem like a significant bottleneck.

Well, too bad.  If you haven't gotten your commit record down to disk,
then *you have not committed*.  This is not negotiable.  (If you think
it is, then turn off fsync and quit worrying ;-))

An application that is willing to have multiple transactions in flight
at the same time can open up multiple backend connections to issue those
transactions, and thereby perhaps beat the theoretical limit.  But for
serial transactions, there is not anything we can do to beat that limit.
(At least not with the log structure we have now.  One could imagine
dropping a commit record into the nearest one of multiple buckets that
are carefully scattered around the disk.  But exploiting that would take
near-perfect knowledge about disk head positioning; it's even harder to
solve than the problem we're considering now.)

 I still think that it would be much faster to just keep writing the WAL log
 blocks when they fill up and have a separate process wake the commiting
 process when the write completes. This would eliminate WAL writing as a
 bottleneck.

You're failing to distinguish total throughput to the WAL drive from
response time seen by any one transaction.  Yes, a policy of writing
each WAL block once when it fills would maximize potential throughput,
but it would also mean a potentially very large delay for a transaction
waiting to commit.  The lower the system load, the worse the performance
on that scale.

The scheme we now have (with my recent patch) essentially says that the
commit delay seen by any one transaction is at most two disk rotations.
Unfortunately it's also at least one rotation :-(, except in the case
where there is no contention, ie, no already-scheduled WAL write when
the transaction reaches the commit stage.  It would be nice to be able
to say at most one disk rotation instead --- but I don't see how to
do that in the absence of detailed information about disk head position.

Something I was toying with this afternoon: assume we have a background
process responsible for all WAL writes --- not only filled buffers, but
the currently active buffer.  It periodically checks to see if there
are unwritten commit records in the active buffer, and if so schedules
a write for them.  If this could be done during each disk rotation,
just before the disk reaches the active WAL log block, we'd have an
ideal solution.  And it would not be too hard for such a process to
determine the right time: it could measure the drive rotational speed
by observing the completion times of successive writes to the same
sector, and it wouldn't take much logic to empirically find the latest
time at which a write can be issued and have a good probability of
hitting the disk on time.  (At least, this would work pretty well given
a dedicated WAL drive, else there'd be too much interference from other
I/O requests.)
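
A rough sketch of that measurement, assuming a dedicated WAL drive whose
file descriptor was opened with O_SYNC (error handling omitted, constants
arbitrary): since each rewrite of the same sector must wait for the platter
to come back around, the smallest interval observed approximates one
revolution.

#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

#define PROBES 32

double
estimate_rotation_seconds(int fd, const char *block, size_t blksz)
{
    struct timeval t0, t1;
    double best = 1.0;          /* start absurdly high; real values are a few ms */
    double gap;
    int i;

    for (i = 0; i < PROBES; i++)
    {
        gettimeofday(&t0, NULL);
        lseek(fd, (off_t) 0, SEEK_SET);     /* rewrite the very same sector */
        write(fd, block, blksz);            /* O_SYNC: returns once it is on the platter */
        gettimeofday(&t1, NULL);

        gap = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1000000.0;
        if (gap < best)
            best = gap;
    }
    return best;                /* roughly one revolution, e.g. ~8.3 ms at 7200 rpm */
}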

However, this whole scheme falls down on the same problem we've run into
before: user processes can't schedule themselves with millisecond
accuracy.  The writer process might be able to determine the ideal time
to wake up and make the check, but it can't get the Unix kernel to
dispatch it then, at least not on most Unixen.  The typical scheduling
slop is one time slice, which is comparable to if not more than the
disk rotation time.

ISTM aio_write only improves the picture if there's some magic in-kernel
processing that makes this same kind of judgment as to when to issue the
ganged write for real, and is able to do it on time because it's in
the kernel.  I haven't heard anything to make me think that that feature
actually exists.  AFAIK the kernel isn't much more enlightened about
physical head positions than we are.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Bruce Momjian

Curtis Faith wrote:
 Good points.
 
 Now for some surprising news (at least it surprised me).
 
 I researched the file system source on my system (FreeBSD 4.6) and found
 that the behavior was optimized for non-database access to eliminate
 unnecessary writes when temp files are created and deleted rapidly. It was
 not optimized to get data to the disk in the most efficient manner.
 
 The syncer on FreeBSD appears to place dirtied filesystem buffers into
 work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer
 processes one of the queues and increments a counter syncer_delayno.
 
 On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd
 of the writes that were buffered are processed. If the syncer gets behind
 and the writes for a given second exceed one second to process the syncer
 does not wait but begins processing the next queue.
 
 AFAICT this means that there is no opportunity to have writes combined by
 the  disk since they are processed in buckets based on the time the writes
 came in.

This is the trickle syncer.  It prevents bursts of disk activity every
30 seconds.  It is for non-fsync writes, of course, and I assume if the
kernel buffers get low, it starts to flush faster.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Hannu Krosing

On Tue, 2002-10-08 at 00:12, Curtis Faith wrote:
 Tom, first of all, excellent job improving the current algorithm. I'm glad
 you look at the WALCommitLock code.
 
  This must be so because the backends that are
  released at the end of any given disk revolution will not be able to
  participate in the next group commit, if there is already at least
  one backend ready to commit.
 
 This is the major reason for my original suggestion about using aio_write.
 The writes don't block each other and there is no need for a kernel level
 exclusive locking call like fsync or fdatasync.
 
 Even the theoretical limit you mention of one transaction per revolution
 per committing process seem like a significant bottleneck.
 
 Is committing 1 and 4 transactions on every revolution good? It's certainly
 better than 1 per revolution.

Of course committing all 5 at each rev would be better ;)

 However, what if we could have done 3 transactions per process in the time
 it took for a single revolution?

I may be missing something obvious, but I don't see a way to get more
than 1 trx/process/revolution, as each previous transaction in that
process must be written to disk before the next can start, and the only
way it can be written to the disk is when the disk heads are on the
right place and that happens exactly once per revolution. 

In theory we could devise some clever page interleave scheme that would
allow us to go like this: fill one page - write page to disk, commit
trx's - fill the page in next 1/3 of rev - write next page to disk ... ,
but this will work only for some limited set of WAL page sizes.

It could be possible to get near 5 trx/rev for 5 backends if we do the
following (A-E are backends from Tom's explanation):

1. write the page for A's trx to its proper pos P (where P is the page
number)

2. if after sync for A returns and we already have more transactions
waiting for write()+sync() of the same page, immediately write the
_same_ page to pos P+N (where N is a tunable parameter). If N is small
enough then P+N will be on the same cylinder for most cases and thus
will get transactions B-E also committed on the same rev.

3. make sure that the last version will also be written to its proper
place before the end of log will overwrite P+N. (This may be tricky.)

4. When restoring from WAL, always check for a page at EndPos+N for a
possible newer version of last page.

This scheme requires page numbers+page versions to be stored in each
page and could get us near 1 trx/backend/rev performance, but it's hard
to tell if it is really useful in real life.

This could also possibly be extended to more than one end page and
more than one continuation end page copy to get better than 1
trx/backend/rev.

-
Hannu



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 This is the trickle syncer.  It prevents bursts of disk activity every
 30 seconds.  It is for non-fsync writes, of course, and I assume if the
 kernel buffers get low, it starts to flush faster.

AFAICT, the syncer only speeds up when virtual memory paging fills the
buffers past
a threshold and even in that event it only speeds it up by a factor of two.

I can't find any provision for speeding up flushing of the dirty buffers
when they fill for normal file system writes, so I don't think that
happens.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Hannu Krosing

On Tue, 2002-10-08 at 01:27, Tom Lane wrote:

 The scheme we now have (with my recent patch) essentially says that the
 commit delay seen by any one transaction is at most two disk rotations.
 Unfortunately it's also at least one rotation :-(, except in the case
 where there is no contention, ie, no already-scheduled WAL write when
 the transaction reaches the commit stage.  It would be nice to be able
 to say at most one disk rotation instead --- but I don't see how to
 do that in the absence of detailed information about disk head position.
 
 Something I was toying with this afternoon: assume we have a background
 process responsible for all WAL writes --- not only filled buffers, but
 the currently active buffer.  It periodically checks to see if there
 are unwritten commit records in the active buffer, and if so schedules
 a write for them.  If this could be done during each disk rotation,
 just before the disk reaches the active WAL log block, we'd have an
 ideal solution.  And it would not be too hard for such a process to
 determine the right time: it could measure the drive rotational speed
 by observing the completion times of successive writes to the same
 sector, and it wouldn't take much logic to empirically find the latest
 time at which a write can be issued and have a good probability of
 hitting the disk on time.  (At least, this would work pretty well given
 a dedicated WAL drive, else there'd be too much interference from other
 I/O requests.)
 
 However, this whole scheme falls down on the same problem we've run into
 before: user processes can't schedule themselves with millisecond
 accuracy.  The writer process might be able to determine the ideal time
 to wake up and make the check, but it can't get the Unix kernel to
 dispatch it then, at least not on most Unixen.  The typical scheduling
 slop is one time slice, which is comparable to if not more than the
 disk rotation time.

The standard Linux time slice has been 100Hz, but it has been configurable
for some time.

The latest RedHat (8.0) is built with 500Hz, which makes about 4
slices/rev for 7200 rpm disks (a 7200 rpm disk takes ~8.3 ms per
revolution against a 2 ms slice; at 15000 rpm it's ~4 ms per revolution,
or about 2 slices).
 
 ISTM aio_write only improves the picture if there's some magic in-kernel
 processing that makes this same kind of judgment as to when to issue the
 ganged write for real, and is able to do it on time because it's in
 the kernel.  I haven't heard anything to make me think that that feature
 actually exists.  AFAIK the kernel isn't much more enlightened about
 physical head positions than we are.

At least for open source kernels it could be possible to

1. write a patch to the kernel 

or 

2. get the authors of kernel aio interested in doing it.

or

3. the third possibility would be using some real-time (RT) OS or mixed
RT/conventional OS where some threads can be scheduled for hard-RT .
In an RT os you are supposed to be able to do exactly what you describe.


I think that 2 and 3 could be outsourced (the respective developers
talked into supporting it), as both the KAIO and RT Linuxen/BSDs folks are
probably also interested in high-profile applications, so they could boast
that using our stuff enabled the PostgreSQL database to run twice as fast.

Anyway, getting to near-hardware speeds for a database will need more
specific support from the OS than web browsing or compiling does.

---
Hannu








---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

 Well, too bad.  If you haven't gotten your commit record down to disk,
 then *you have not committed*.  This is not negotiable.  (If you think
 it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been
unclear. I'm just assuming that the disk can get a confirmation back to the
INSERTing process in much less than one rotation. This would allow that
process to start working again, perhaps in time to complete another
transaction.

 An application that is willing to have multiple transactions in flight
 at the same time can open up multiple backend connections to issue those
 transactions, and thereby perhaps beat the theoretical limit.  But for
 serial transactions, there is not anything we can do to beat that limit.
 (At least not with the log structure we have now.  One could imagine
 dropping a commit record into the nearest one of multiple buckets that
 are carefully scattered around the disk.  But exploiting that would take
 near-perfect knowledge about disk head positioning; it's even harder to
 solve than the problem we're considering now.)

Consider the following scenario:

Time measured in disk rotations.

Time 1.00 - Process A Commits - Causing aio_write to log and wait
Time 1.03 - aio_completes for Process A write - wakes process A
Time 1.05 - Process A Starts another transaction.
Time 1.08 - Process A Commits
etc.

I agree that a process can't proceed from a commit until it receives
confirmation of the write, but if the write has hit the disk before a full
rotation then the process should be able to continue processing new
transactions.

 You're failing to distinguish total throughput to the WAL drive from
 response time seen by any one transaction.  Yes, a policy of writing
 each WAL block once when it fills would maximize potential throughput,
 but it would also mean a potentially very large delay for a transaction
 waiting to commit.  The lower the system load, the worse the performance
 on that scale.

You are assuming fsync or fdatasync behavior, I am not. There would be no
delay under the scenario I describe. The transaction would exit commit as
soon as the confirmation of the write is received from the aio system. I
would hope that with a decent aio implementation this would generally be
much less than one rotation.

I think that the single transaction response time is very important and
that's one of the chief problems I sought to solve when I proposed
aio_writes for logging in my original email many moons ago.

 ISTM aio_write only improves the picture if there's some magic in-kernel
 processing that makes this same kind of judgment as to when to issue the
 ganged write for real, and is able to do it on time because it's in
 the kernel.  I haven't heard anything to make me think that that feature
 actually exists.  AFAIK the kernel isn't much more enlightened about
 physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as
it gets it, bypassing the system buffers. The code in the disk's
hardware is very good at knowing when the disk head is coming. IMHO,
bypassing the kernel's less-than-enlightened writing system is the main
point of using aio_write.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Greg Copeland

On Mon, 2002-10-07 at 16:06, Curtis Faith wrote:
  Well, too bad.  If you haven't gotten your commit record down to disk,
  then *you have not committed*.  This is not negotiable.  (If you think
  it is, then turn off fsync and quit worrying ;-))
 

At this point, I think we've come full circle.  Can we all agree that
this concept is a *potential* source of improvement in a variety of
situations?  If we can agree on that, perhaps we should move to the next
stage in the process, validation?

How long do you think it would take to develop something worthy of
testing?  Do we have known test cases which will properly (in)validate
the approach that everyone will agree to?  If code is reasonably clean
so as to pass the smell test and shows a notable performance boost, will
it be seriously considered for inclusion?  If so, I assume it would
become a configure option (--with-aio)?


Regards,

Greg






Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Justin Clift

Greg Copeland wrote:
snip
 If so, I assume it would become a configure option (--with-aio)?

Or maybe a GUC use_aio ?

:-)

Regards and best wishes,

Justin Clift

 
 Regards,
 
 Greg
 
   

-- 
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
   - Indira Gandhi

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Greg Copeland

On Mon, 2002-10-07 at 15:28, Bruce Momjian wrote:
 This is the trickle syncer.  It prevents bursts of disk activity every
 30 seconds.  It is for non-fsync writes, of course, and I assume if the
 kernel buffers get low, it starts to flush faster.

Doesn't this also increase the likelihood that people will be running in
a buffer-poor environment more frequently than I previously asserted,
especially in very heavily I/O bound systems?  Unless I'm mistaken, that
opens the door for a general case of why an aio implementation should be
looked into.

Also, on a side note, IIRC, Linux kernel 2.5.x has a new priority
elevator which is said to be MUCH better at saturating disks than ever
before.  Once 2.6 (or whatever its number will be) is released, it may
not be as much of a problem as it seems to be for FreeBSD (I think
that's the one you're using).


Greg






Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 Well, too bad.  If you haven't gotten your commit record down to disk,
 then *you have not committed*.  This is not negotiable.  (If you think
 it is, then turn off fsync and quit worrying ;-))

 I've never disputed this, so if I seem to be suggesting that, I've beee
 unclear. I'm just assuming that the disk can get a confirmation back to the
 INSERTing process in much less than one rotation.

You've spent way too much time working with lying IDE drives :-(

Do you really trust a confirm-before-write drive to make that write if
it loses power?  I sure don't.

If you do trust your drive to hold that data across a crash, then ISTM
the whole problem goes away anyway, as writes will complete quite
independently of disk rotation.  My Linux box has no problem claiming
that it's completing several thousand TPS with a single client ... and
yes, fsync is on, but it's using an IDE drive, and I don't know how to
disable confirm-before-write on that drive.  (That's why I did these
tests on my old slow HP hardware.)  Basically, the ganging of commit
writes happens inside the disk controller on a setup like that.  You
still don't need aio_write --- unless perhaps to reduce wastage of IDE
bus bandwidth by repeated writes, but that doesn't seem to be a scarce
resource in this context.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Greg Copeland

Well, I was thinking that aio may not be available on all platforms,
thus the conditional compile option.   On the other hand, wouldn't you
pretty much want it either on or off for all instances?  I can see that
it would be nice for testing though.  ;)

Greg

On Mon, 2002-10-07 at 16:23, Justin Clift wrote:
 Greg Copeland wrote:
 snip
  If so, I assume it would become a configure option (--with-aio)?
 
 Or maybe a GUC use_aio ?
 
 :-)
 
 Regards and best wishes,
 
 Justin Clift
 
  
  Regards,
  
  Greg
  

 
 -- 
 My grandfather once told me that there are two kinds of people: those
 who work and those who take the credit. He told me to try to be in the
 first group; there was less competition there.
- Indira Gandhi






Re: [HACKERS] 7.2.3 patching done

2002-10-07 Thread Alvaro Herrera

On Mon, Oct 07, 2002 at 09:22:38PM +0200, Peter Eisentraut wrote:
 Tom Lane writes:
 
  But the source distribution hasn't *got* any binary files.
 
 There are some under doc/src/graphics, and then there are
 doc/postgres.tar.gz and doc/man.tar.gz.

And what about publishing xdelta patches?

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
"Listen and you will forget; see and you will remember; do and you will understand" (Confucius)

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

 I may be missing something obvious, but I don't see a way to get more
 than 1 trx/process/revolution, as each previous transaction in that
 process must be written to disk before the next can start, and the only
 way it can be written to the disk is when the disk heads are on the
 right place and that happens exactly once per revolution.

Okay, consider the following scenario.

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill 1/4 platter
rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit
record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter
rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates
to 90 degrees.

This should be very possible and more and more likely in the future as CPUs
get faster and faster relative to disks.

I'm not suggesting this would happen all the time, just that it's possible
and that an SMP machine with good CPUs and a fast I/O subsystem should be
able to keep the log writing at close to I/O bandwidth limits.

The case of bulk inserts is one where I would expect that for simple tables
we should be able to peg the disks given today's hardware and enough
inserting processes.

- Curtis


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



[HACKERS] Where to call SetQuerySnapshot

2002-10-07 Thread Tom Lane

I noticed that the new EXECUTE statement does not call SetQuerySnapshot,
which seems like a bad thing.  The omission is masked by the defensive
code in CopyQuerySnapshot, which will automatically do SetQuerySnapshot
if it hasn't been done yet in the current transaction.  However, this
doesn't provide the right behavior in read-committed mode: if we are
inside a transaction and not the first query, I'd think EXECUTE should
take a new snapshot; but it won't.

Comparable code can be found in COPY OUT, for which tcop/utility.c
does SetQuerySnapshot() before calling the command-specific routine.

Questions for the list:

1. Where is the cleanest place to call SetQuerySnapshot() for utility
statements that need it?  Should we follow the lead of the existing
COPY code, and add the call to the ExecuteStmt case in utility.c?
Or should we move the calls into the command-specific routines (DoCopy
and ExecuteQuery)?  Or perhaps it should be done in postgres.c, which
has this responsibility for non-utility statements?
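
(For illustration, following the existing COPY lead would amount to roughly
the following in utility.c's ExecuteStmt case -- the call signature and
variable names here are from memory and may not match the 7.3 sources
exactly:

    case T_ExecuteStmt:
        /* take a fresh snapshot first, as utility.c already does for COPY OUT */
        SetQuerySnapshot();
        ExecuteQuery((ExecuteStmt *) parsetree, dest);
        break;
)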

2. Would it be a good idea to change CopyQuerySnapshot to elog(ERROR)
instead of silently creating a snapshot when none has been made?
I think I was the one who put in its auto-create-snapshot behavior,
but I'm now feeling like that was a mistake.  It hides omissions that
we should want to find.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Ken Hirsch


I sent this yesterday, but it seems not to have made it to the list... 


I have a couple of comments orthogonal to the present discussion. 

1) It would be fairly easy to write log records over a network to a 
dedicated process on another system.  If the other system has an 
uninterruptible power supply, this is about as safe as writing to disk. 

This would get rid of the need for any fsync on the log at all.  There 
would be extra code needed on restart to get the end of the log from the 
other system, but it doesn't seem like much. 

I think this would be an attractive option to a lot of people.  Most 
people have at least two systems, and the requirements of the logging 
system would be minimal. 
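
A bare-bones sketch of the idea, with placeholder addressing and record 
framing (none of this is existing PostgreSQL code; the remote logger simply 
holds acknowledged records in UPS-backed memory): 

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* connect to the remote logger process (address handling kept minimal) */
int
connect_log_server(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(ip);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Ship one log record and wait for a one-byte ack.  Only after the ack is
 * the commit considered durable -- the remote box keeps the record in
 * UPS-backed memory and writes it out at its leisure, so no local fsync
 * of the log is needed at all.
 */
int
ship_log_record(int fd, const void *rec, size_t len)
{
    char ack;

    if (write(fd, rec, len) != (ssize_t) len)
        return -1;
    if (read(fd, &ack, 1) != 1)
        return -1;
    return 0;
}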


2) It is also possible, with kernel modifications, to have special 
logging partitions where log records are written where the head is. 
Tzi-cker Chueh and Lan Huang at Stony Brook 
(http://www.cs.sunysb.edu/~lanhuang/research.htm) have written this, 
although I don't think they have released any code. 

(A similar idea called WADS is mentioned in Gray & Reuter's book.) 

If the people at Red Hat are interested in having some added value for 
using PostgreSQL on Red Hat Linux, this would be one idea.  It could 
also be used to speed up ext3 and other journaling file systems. 





---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 Greg Copeland [EMAIL PROTECTED] writes:
  Doesn't this also increase the likelihood that people will be
  running in a buffer-poor environment more frequently than I
  previously asserted, especially in very heavily I/O bound
  systems?  Unless I'm mistaken, that opens the door for a
  general case of why an aio implementation should be looked into.

Neil Conway replies:
 Well, at least for *this specific sitation*, it doesn't really change
 anything -- since FreeBSD doesn't implement POSIX AIO as far as I
 know, we can't use that as an alternative.

I haven't tried it yet but there does seem to be an aio implementation that
conforms to POSIX in FreeBSD 4.6.2.  It's part of the kernel and can be
found in:
/usr/src/sys/kern/vfs_aio.c

 However, I'd suspect that the FreeBSD kernel allows for some way to
 tune the behavior of the syncer. If that's the case, we could do some
 research into what settings are more appropriate for FreeBSD, and
 recommend those in the docs. I don't run FreeBSD, however -- would
 someone like to volunteer to take a look at this?

I didn't see anything obvious in the docs but I still believe there's some
way to tune it. I'll let everyone know if I find some better settings.

 BTW Curtis, did you happen to check whether this behavior has been
 changed in FreeBSD 5.0?

I haven't checked but I will.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Neil Conway

Greg Copeland [EMAIL PROTECTED] writes:
 Doesn't this also increase the likelihood that people will be running in
 a buffer-poor environment more frequently than I previously asserted,
 especially in very heavily I/O bound systems?  Unless I'm mistaken, that
 opens the door for a general case of why an aio implementation should be
 looked into.

Well, at least for *this specific situation*, it doesn't really change
anything -- since FreeBSD doesn't implement POSIX AIO as far as I
know, we can't use that as an alternative.

However, I'd suspect that the FreeBSD kernel allows for some way to
tune the behavior of the syncer. If that's the case, we could do some
research into what settings are more appropriate for FreeBSD, and
recommend those in the docs. I don't run FreeBSD, however -- would
someone like to volunteer to take a look at this?

BTW Curtis, did you happen to check whether this behavior has been
changed in FreeBSD 5.0?

 Also, on a side note, IIRC, linux kernel 2.5.x has a new priority
 elevator which is said to be MUCH better as saturating disks than ever
 before.

Yeah, there are lots of new and interesting features for database
systems in the new kernel -- I'm looking forward to when 2.6 is widely
deployed...

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Bruce Momjian

Curtis Faith wrote:
  This is the trickle syncer.  It prevents bursts of disk activity every
  30 seconds.  It is for non-fsync writes, of course, and I assume if the
  kernel buffers get low, it starts to flush faster.
 
 AFAICT, the syncer only speeds up when virtual memory paging fills the
 buffers past
 a threshold and even in that event it only speeds it up by a factor of two.
 
 I can't find any provision for speeding up flushing of the dirty buffers
 when they fill for normal file system writes, so I don't think that
 happens.

So you think if I try to write a 1 gig file, it will write enough to
fill up the buffers, then wait while the sync'er writes out a few blocks
every second, free up some buffers, then write some more?

Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when
it can't get a buffer, it will async write a dirty buffer to disk.

As far as this AIO conversation is concerned, I want to see someone come
up with some performance improvement that we can only do with AIO. 
Unless I see it, I am not interested in pursuing this thread.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance

2002-10-07 Thread Shridhar Daithankar

On 7 Oct 2002 at 11:21, Tom Lane wrote:

 Shridhar Daithankar [EMAIL PROTECTED] writes:
  I say if it's a char field, there should be no indicator of length as
  it's not required. Just store those many characters straight ahead..
 
 Your assumption fails when considering UNICODE or other multibyte
 character encodings.

Correct, but is it possible to have a real char string when the database is not 
Unicode, or, to be exact, when the locale defines the size of a char?

In my case varchar does not make sense, as all strings are guaranteed to be of 
a defined length. While the argument you have put forward is correct, it's causing 
a disk space leak, so to speak.

Bye
 Shridhar

--
Boucher's Observation:  He who blows his own horn always plays the music
several 
octaves higher than originally written.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [GENERAL] [HACKERS] Hot Backup

2002-10-07 Thread Shridhar Daithankar

On 7 Oct 2002 at 13:48, Neil Conway wrote:

 Sandeep Chadha [EMAIL PROTECTED] writes:
  Postgresql has been lacking this all along. I've installed postgres
  7.3b2 and still don't see any archive's flushed to any other
  place. Please let me know how is hot backup procedure implemented in
  current 7.3 beta(2) release.
 AFAIK no such hot backup feature has been implemented for 7.3 -- you
 appear to have been misinformed.

Is replication an answer to hot backup?

Bye
 Shridhar

--
ink, n.:A villainous compound of tannogallate of iron, gum-arabic,  and 
water, 
chiefly used to facilitate the infection of idiocy and promote intellectual 
crime.  -- H.L. Mencken


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly