Re: [HACKERS] floor function in 7.3b2

2002-10-04 Thread Neil Conway

Tom Lane [EMAIL PROTECTED] writes:
 Mario Weilguni [EMAIL PROTECTED] writes:
  Why is floor not working anymore?
 
 Mph.  Seems we have floor(numeric) but not floor(float8), and the latter
 is what you need here.

Sorry, I missed much of the casting discussion -- but is there a
reason why we can't cast from float8 -> numeric implicitly? IIRC the
idea was to allow implicit casts from lower precision types to higher
precision ones.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC





[HACKERS] Threaded Sorting

2002-10-04 Thread Hans-Jürgen Schönig

Did anybody think about threaded sorting so far?
Assume an SMP machine. In the case of building an index or in the case 
of sorting a lot of data there is just one backend working. Therefore 
just one CPU is used.
What about starting a thread for every temporary file being created? 
This way CREATE INDEX could use many CPUs.
Maybe this is worth thinking about because it will speed up huge 
databases and enterprise level computing.

Best regards,

Hans-Jürgen Schönig

-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at





Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Shridhar Daithankar

On 4 Oct 2002 at 9:46, Hans-Jürgen Schönig wrote:

 Did anybody think about threaded sorting so far?
 Assume an SMP machine. In the case of building an index or in the case 
 of sorting a lot of data there is just one backend working. Therefore 
 just one CPU is used.
 What about starting a thread for every temporary file being created? 
 This way CREATE INDEX could use many CPUs.
 Maybe this is worth thinking about because it will speed up huge 
 databases and enterprise level computing.

I have a better plan. I have a thread architecture ready which acts as a 
generic thread template. Even the function pointers in the thread can be 
altered on the fly.

I suggest we use some such architecture for threading. It can be used in any 
module without hardcoding things. For example, in sorting we could assign 
exclusive jobs/data ranges to threads; then there would be minimal locking, 
and one thread could merge the results. Something like that.

All it takes is changing the entry functions to accept one more parameter 
that indicates the range of values to act upon. In the non-threaded version, 
that parameter isn't there because the function acts on the entire data set.

Furthermore, with this model threading support can be turned off easily. In 
the non-threaded model, a wrapper function can call the entry point in series 
with the necessary arguments, so PostgreSQL does not have to deal with 
not-so-good-enough thread implementations. Keeping with the tradition of 
conservative defaults, we can ship with threads off.
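
To sketch the range-parameter idea in plain C (a purely hypothetical
illustration with made-up names, not the C++ template code mentioned
below):

#include <stdio.h>

typedef struct
{
    int lo, hi;                 /* half-open range this worker owns */
} WorkRange;

typedef void (*WorkerFunc) (const int *data, WorkRange r);

static void
sum_range(const int *data, WorkRange r)
{
    long sum = 0;
    int i;

    for (i = r.lo; i < r.hi; i++)
        sum += data[i];
    printf("range [%d,%d): sum = %ld\n", r.lo, r.hi, sum);
}

int
main(void)
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    WorkerFunc fn = sum_range;  /* the function pointer can be swapped */
    WorkRange ranges[2] = {{0, 4}, {4, 8}};
    int i;

    /* non-threaded fallback: a wrapper calls the same entry point over
     * each sub-range in series; a threaded build would instead hand
     * each range to its own worker */
    for (i = 0; i < 2; i++)
        fn(data, ranges[i]);
    return 0;
}

The only change to the entry function is the extra range argument, which is
what lets the same code run either serially or with one worker per range.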

The code is in C++ but it's hardly a couple of pages. I can convert it to C 
and post it if required.

Let me know..

Bye
 Shridhar

--
Parkinson's Fourth Law: The number of people in any working group tends to 
increase regardless of the amount of work to be done.





[HACKERS] Bad rules

2002-10-04 Thread Steve King

I am using postgres 7.2, and have a rule on a table which causes a notify if
an insert/update/delete is performed on the table.
The table is very very small.
When performing a simple (very simple) update on the table this takes about
3 secs, when I remove the rule it is virtually instantaneous.
The rest of the database seems to perform fine, have you any ideas or come
across this before??

Regards
Steve





Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...

2002-10-04 Thread Roland Roberts

 Mike == Mike Mascari [EMAIL PROTECTED] writes:

Mike> Tom Lane wrote:
 Yury Bokhoncovich [EMAIL PROTECTED] writes:

 As reported by my friend: Oracle 8.1.7 (ver.9 behaves the same way):

 [ to_char(sysdate) advances in a transaction ]

 Now I'm really confused; this directly contradicts the report
 of Oracle 8's behavior that we had earlier from Roland Roberts.
 Can someone explain why the different results?

Mike> Roland used an anonymous PL/SQL procedure:

You're right and I didn't think enough about what was happening.  This
also explains why I so often see the same timestamp throughout a
transaction---the transaction is all taking place inside a PL/SQL
procedure.

roland
-- 
   PGP Key ID: 66 BC 3B CD
Roland B. Roberts, PhD RL Enterprises
[EMAIL PROTECTED] 76-15 113th Street, Apt 3B
[EMAIL PROTECTED]   Forest Hills, NY 11375




numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)

2002-10-04 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 Sorry, I missed much of the casting discussion -- but is there a
 reason why we can't cast from float8 -> numeric implicitly? IIRC the
 idea was to allow implicit casts from lower precision types to higher
 precision ones.

The implicit casting hierarchy is now

int2 -> int4 -> int8 -> numeric -> float4 -> float8

Moving to the left requires an explicit cast (or at least an assignment
to a column).  I know this looks strange to someone who knows that our
numeric type beats float4/float8 on both range and precision, but it's
effectively mandated by the SQL spec.  Any combination of exact and
inexact numeric types is supposed to yield an inexact result per
spec, thus numeric + float8 yields float8 not numeric.  Another reason
for doing it this way is that a numeric literal like 123.456 can be
initially typed as numeric, and later implicitly promoted to float4 or
float8 if context demands it.  Doing that the other way 'round would
introduce problems with precision loss.  We had speculated about
introducing an unknown_numeric pseudo-type to avoid that problem, but
the above hierarchy eliminates the need for unknown_numeric.  We can
initially type a literal as the smallest thing it will fit in, and then
do implicit promotion as needed.  (7.3 is not all the way there on that
plan, but 7.4 will be.)

regards, tom lane




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Tom Lane

Hans-Jürgen Schönig [EMAIL PROTECTED] writes:
 Did anybody think about threaded sorting so far?
 Assume an SMP machine. In the case of building an index or in the case 
 of sorting a lot of data there is just one backend working. Therefore 
 just one CPU is used.
 What about starting a thread for every temporary file being created? 
 This way CREATE INDEX could use many CPUs.

In my experience, once you have enough data to force a temp file to be
used, the sort algorithm is I/O bound anyway.  Throwing more CPUs at it
won't help much.

regards, tom lane




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

I wouldn't hold your breath for any form of threading.  Since PostgreSQL
is process based, you might consider having a pool of sort processes
which addresses this, but I doubt you'll get anywhere talking about threads
here.

Greg


On Fri, 2002-10-04 at 02:46, Hans-Jürgen Schönig wrote:
 Did anybody think about threaded sorting so far?
 Assume an SMP machine. In the case of building an index or in the case 
 of sorting a lot of data there is just one backend working. Therefore 
 just one CPU is used.
 What about starting a thread for every temporary file being created? 
 This way CREATE INDEX could use many CPUs.
 Maybe this is worth thinking about because it will speed up huge 
 databases and enterprise level computing.
 
 Best regards,
 
 Hans-Jürgen Schönig
 
 -- 
 *Cybertec Geschwinde u Schoenig*
 Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
 Tel: +43/1/913 68 09; +43/664/233 90 75
 www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
 
 






Re: [HACKERS] Bad rules

2002-10-04 Thread Tom Lane

Steve King [EMAIL PROTECTED] writes:
 I am using postgres 7.2, and have a rule on a table which causes a notify if
 an insert/update/delete is performed on the table.
 The table is very very small.
 When performing a simple (very simple) update on the table this takes about
 3 secs, when I remove the rule it is virtually instantaneous.
 The rest of the database seems to perform fine, have you any ideas or come
 across this before??

Let's see the rule exactly?  NOTIFY per se is not slow in my experience.

(One thing to ask: have you done a VACUUM FULL on pg_listener in recent
memory?  Heavy use of LISTEN/NOTIFY does tend to bloat that table if you
don't keep after it with VACUUM.)

regards, tom lane




Re: [HACKERS] Correlation in cost_index()

2002-10-04 Thread Manfred Koizar

On Thu, 03 Oct 2002 14:50:00 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
  indexCorrelation is calculated by dividing the correlation of the
  first index column by the number of index columns.

 Yeah, I concluded later that that was bogus.  I've been thinking of
 just using the correlation of the first index column and ignoring
 the rest; that would not be great, but it's probably better than what's
 there.  Have you got a better formula?

Unfortunately not.  I think such a formula does not exist for the
information we have.  What we'd need is a notion of correlation of the
nth (n > 1) index column for constant values of the first n-1 index
columns; or a combined correlation for the first n index columns (1 <
n <= number of index columns).

I try to understand the problem with the help of use cases.
[ Jump to the end, if this looks too long-winded. ]

1)  Have a look at invoice items with an index on (fyear, invno,
itemno).  Invoice numbers start at 1 for each financial year, item
numbers start at 1 for each invoice.  In a typical scenario
correlations for fyear, (fyear, invno), and (fyear, invno, itemno) are
close to 1;  invno correlation is expected to be low; and itemno looks
totally chaotic to the analyzer.

When we 
SELECT * FROM item WHERE fyear = 2001 AND invno > 1000
dividing the correlation of the first column by the number of columns
gives 1/3 which has nothing to do with what we want.  (And then the
current implementation of cost_index() squares this and gets 1/9 which
is even farther away from the truth.)  Just using the correlation of
the first index column seems right here.

2)  OTOH consider bookkeeping with entries identified by (fyear,
account, entryno).  Again fyear has a correlation near 1.  For account
we can expect something near 0, and entryno has a distribution
comparable to invno in the first use case, i.e. starting at 1 for each
year.

SELECT * from entry WHERE fyear = 2001 AND account = whatever
Taking the correlation of fyear would imply that the tuples we are
looking for are close to each other, which again can turn out to be
wrong.

So what do we know now?  Even less than before :-(

I have just one idea that might help a bit:  If correlation of the
first index column is near +/-1, cost_index() should not use
baserel->pages, but baserel->pages * selectivity of the first column.
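
In code form, the heuristic might look roughly like this (a hypothetical
sketch with made-up names and an arbitrary cutoff, not the real
cost_index() code):

#include <math.h>

static double
estimate_pages_fetched(double rel_pages,
                       double first_col_correlation,
                       double first_col_selectivity)
{
    /*
     * If the first index column is (almost) perfectly correlated with
     * the physical order, the matching tuples are clustered together,
     * so only the selected fraction of the relation's pages should be
     * charged against the index scan.
     */
    if (fabs(first_col_correlation) > 0.95)
        return rel_pages * first_col_selectivity;

    return rel_pages;           /* otherwise keep the current estimate */
}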

Servus
 Manfred




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 09:40, Hans-Jürgen Schönig wrote:
 
 I had a brief look at the code used for sorting. It is very well 
 documented so maybe it is worth thinking about a parallel algorithm.
 
 When talking about threads: A pool of processes for sorting? Maybe this 
 could be useful, but I doubt it is the best solution to avoid overhead.
 Somewhere in the TODO it says that there will be experiments with a 
 threaded backend. This makes me think that threads are not a big no-no.
 
 Hans

That was a fork, IIRC.  Threading is not used in baseline PostgreSQL, nor
are there any such plans that I'm aware of.  People from time to time ask
about threads for this or that and are always told what I'm telling
you.  The use of threads leads to portability issues, not to mention that
PostgreSQL is entirely built around the process model.

Tom is right to dismiss the notion of adding additional CPUs to
something that is already I/O bound; however, the concept itself should
not be dismissed.  Applying multiple CPUs to a sort operation is
well-accepted and well-understood technology.

At this point, perhaps Tom or one of the other core developers having
insight in this area would be willing to address how readily such a
mechanism could be put in place.

Also, don't be so fast to dismiss what the process model can do.  There
is no reason to believe that a process pool could not do wonderful
things if implemented properly.  Basically, the notion would be that the
backend processing the query would solicit assistance from the sort pool
if one or more processes were available.  At that point, several methods
could be employed to divide the work.  Some form of threshold would also
have to be created to prevent the pool from being used when a single
backend is capable of addressing the need.  Basically, the idea is that
you only have the pool assist with large tuple counts, and then only
when resources are available from within the pool.  By doing this, you
avoid additional overhead for small sort efforts and gain when it
matters the most.
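
As a toy illustration of the split-then-merge idea (a hypothetical sketch
using fork() and an anonymous shared mapping purely for demonstration;
real backends would use the existing shared-memory machinery, and a real
pool would keep its workers alive between requests):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    enum { N = 1000000, WORKERS = 4 };
    int chunk = N / WORKERS;
    int *data;
    int w, i;

    /* shared anonymous mapping so the forked workers see the same array */
    data = mmap(NULL, N * sizeof(int), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED)
        return 1;
    for (i = 0; i < N; i++)
        data[i] = rand();

    /* each worker sorts an exclusive slice, so no locking is needed */
    for (w = 0; w < WORKERS; w++)
    {
        if (fork() == 0)
        {
            int lo = w * chunk;
            int len = (w == WORKERS - 1) ? N - lo : chunk;

            qsort(data + lo, len, sizeof(int), cmp_int);
            _exit(0);
        }
    }
    for (w = 0; w < WORKERS; w++)
        wait(NULL);

    /* the coordinating process would now merge the sorted runs */
    printf("first key of each run: ");
    for (w = 0; w < WORKERS; w++)
        printf("%d ", data[w * chunk]);
    printf("\n");
    return 0;
}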


Regards,

Greg






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 10:37, Hans-Jürgen Schönig wrote:
 My concern was that a process model might be a bit too slow for that, but 
 if we had processes in memory this would be a wonderful thing.

Yes, that's the point of having a pool.  The idea is that not only do you
avoid process creation and destruction, which is notoriously expensive on
many platforms; the processes would also sit idle until signaled to begin
working on their assigned sort operation.  Ideally, these would be
configurable options, including pool size (maximum number of processes in
the pool), max concurrency level (maximum number of processes from the
pool which can contribute to a single backend) and tuple count threshold
(the threshold which triggers solicitation of assistance from the sort
pool).

 Using it for small amounts of data is pretty useless - I totally agree 
 but when it comes to huge amounts of data it can be useful.
 
 It is a mechanism for huge installations with a lot of data.
 
 Hans

Agreed.  Thus the importance of being able to specify some type of
meaningful threshold.

Any of the core developers wanna chime in here on this concept?

Greg







Re: [HACKERS] Correlation in cost_index()

2002-10-04 Thread Manfred Koizar

On Thu, 3 Oct 2002 10:45:08 -0600 (MDT), scott.marlowe
[EMAIL PROTECTED] wrote:
 effective cache size is the default (i.e. commented out)

The default is 1000, meaning ca. 8 MB, which seems to be way too low.
If your server is (almost) exclusively used by Postgres, try setting
it to represent most of your OS cache (as reported by free on Linux).
Otherwise you have to estimate the fraction of the OS cache that gets
used by Postgres.

I'm still trying to get a feeling for how these settings play
together, so I'd be grateful if you report back the effects this has
on your cost estimates.

Caveat:  effective_cache_size is supposed to be the number of cache
pages available to one query (or is it one scan?).  So if you have
several concurrent queries (or complex queries with several scans),
you should choose a lower value.  OTOH if most of your queries operate
on the same data, one query could benefit from pages cached by other
queries ...  You have to experiment a little.

Servus
 Manfred




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote:
 Added to TODO:
 
   * Allow sorting to use multiple work directories

Why wouldn't that fall under the table space effort???

Greg






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Bingo!  Want to increase sorting performance, give it more I/O
  bandwidth, and it will take 1/100th of the time to do threading.
 
  Added to TODO:
  * Allow sorting to use multiple work directories
 
 Yeah, I like that.  Actually it should apply to all temp files not only
 sorting.
 
 A crude hack would be to allow there to be multiple pg_temp_NNN/
 subdirectories (read symlinks) in a database, and then the code would
 automatically switch among these.

TODO updated:

* Allow sorting/temp files to use multiple work directories 

Tom, what temp files do we use that aren't for sorting?  I forgot.

 Probably a cleaner idea would be to somehow integrate this with
 tablespace management --- if you could mark some tablespaces as intended
 for temp stuff, the system could round-robin among those as it creates
 temp files and/or temp tables.

Yes, tablespaces would be the place for this.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




[HACKERS] ALTER TABLE ... ADD COLUMN

2002-10-04 Thread Alvaro Herrera

Hello hackers,

I'm thinking about ALTER TABLE ... ADD COLUMN working properly when
child tables already contain the column.

I have two proposals.  First one:

There are two cases: one when specifying ALTER TABLE ONLY, and other
when specifying recursive (not ONLY).

In the first (ONLY) case, child tables are checked for existence of the
column.  If any child does not contain the column, or it has a different
atttypid or atttypmod (any other field that should be checked?), the
request is aborted.  Else the column is added in the parent only, and
the child attributes have their attinhcount incremented.

In the second case, child tables are checked for existence of the
column.  If the column doesn't exist, the procedure is called
recursively.  If the column exists but has a different atttypid or
atttypmod, the request is aborted.  Else, the attribute has its
attinhcount incremented and attislocal reset.


There are two differences between these two cases:
- ONLY does not add a column in children.  This seems natural.  Obviously
  the scenario where this can lead to trouble produces an error
  (nonexistent attr in child).

- ONLY does not touch attislocal in children.


The second may seem odd, but consider the following scenario:

CREATE TABLE p1 (f1 int);
CREATE TABLE p2 (f2 int);
CREATE TABLE c (f1 int) INHERITS (p1, p2);

In this case, c.f1.attislocal is true.  Now suppose the user wants to
create p2.f1.  If the recursive version is used, attislocal will be
reset and the scenario will be equivalent to

CREATE TABLE p1 (f1 int);
CREATE TABLE p2 (f1 int, f2 int);
CREATE TABLE c () INHERITS (p1, p2);

but the natural way would be

CREATE TABLE p1 (f1 int);
CREATE TABLE p2 (f1 int, f2 int);
CREATE TABLE c (f1 int) INHERITS (p1, p2);

which is what the ONLY case does.


Second proposal: Another way of doing this would be resetting attislocal
iff attinhcount is exactly zero.  This assumes that if attislocal is
true and attinhcount is not zero, the user wants the column as locally
defined and we should keep it that way.  On the other hand, if a child
table already has the column there's no way to push the definition up
the inheritance tree.

What do you think?  I actually have a patch ready that implements the
first proposal; the second one, while simpler conceptually and in terms
of code, occurred to me just as I was writing this, and has an issue of
completeness.

-- 
Alvaro Herrera (alvherre[a]atentus.com)
Las cosas son buenas o malas segun las hace nuestra opinion (Lisias)




[HACKERS] New lock types

2002-10-04 Thread Alvaro Herrera

Hello hackers,

I'm looking at implementing the btree reorganizer described in On-line
reorganization of sparsely-..., ACM SIGMOD proceedings 1996, by Zou and
Salzberg.  It seems to me I'll have to add some amount of lock types
in the lock manager.  Does that bother you?

-- 
Alvaro Herrera (alvherre[a]atentus.com)
On the other flipper, one wrong move and we're Fatal Exceptions
(T.U.X.: Term Unit X  - http://www.thelinuxreview.com/TUX/)




Re: [HACKERS] ALTER TABLE ... ADD COLUMN

2002-10-04 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 I'm thinking about ALTER TABLE ... ADD COLUMN working properly when
 child tables already contain the column.
 There are two cases: one when specifying ALTER TABLE ONLY, and other
 when specifying recursive (not ONLY).

I think ALTER TABLE ONLY ... ADD COLUMN is nonsensical and should be
rejected out of hand.  That solves that part of the problem easily ;-)

The comparison case in my mind is ALTER TABLE ONLY ... RENAME COLUMN,
which has to be rejected.  (Surely you're not going to say we should
support this by allowing the parent column to become associated with
some other child columns than before...)  ALTER ONLY ADD COLUMN cannot
add any functionality that's not perfectly well available with
ALTER ADD COLUMN.


 In the second case, child tables are checked for existence of the
 column.  If the column doesn't exist, the procedure is called
 recursively.  If the column exists but has a different atttypid or
 atttypmod, the request is aborted.  Else, the attribute has its
 attinhcount incremented and attislocal reset.

I don't like resetting attislocal here.  If you do that, then DROPping
the parent column doesn't return you to the prior state.  I think I gave
this example before, but consider

CREATE TABLE p (f1 int);
CREATE TABLE c (f2 int) INHERITS (p);
ALTER TABLE p ADD COLUMN f2 int;
-- sometime later, realize that the ADD was a mistake, so:
ALTER TABLE p DROP COLUMN f2;

If you reset attislocal then the ending state will be that c.f2 is gone.
That seems clearly wrong to me.


 The second may seem odd, but consider the following scenario:

 CREATE TABLE p1 (f1 int);
 CREATE TABLE p2 (f2 int);
 CREATE TABLE c (f1 int) INHERITS (p1, p2);

 In this case, c.f1.attislocal is true.  Now suppose the user wants to
 create p2.f1.  If the recursive version is used, attislocal will be
 reset and the scenario will be equivalent to

 CREATE TABLE p1 (f1 int);
 CREATE TABLE p2 (f1 int, f2 int);
 CREATE TABLE c () INHERITS (p1, p2);

... which is wrong also.  c had a local definition before and should
still, IMHO.  What's the argument for taking away its local definition?

regards, tom lane




Re: [HACKERS] New lock types

2002-10-04 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 I'm looking at implementing the btree reorganizer described in On-line
 reorganization of sparsely-..., ACM SIGMOD proceedings 1996, by Zou and
 Salzberg.  It seems to me I'll have to add some amount of lock types
 in the lock manager.  Does that bother you?

Such as?

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I wrote:
  ... most file systems can't process fsync's
  simultaneous with other writes, so those writes block because the file
  system grabs its own internal locks.


tom lane replies:
 Oh?  That would be a serious problem, but I've never heard that asserted
 before.  Please provide some evidence.

Well, I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write
a test to prove this one way or another. I'll also try to look into
the Linux/BSD source for the common file systems used for PostgreSQL.
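
For what it's worth, the test I have in mind would look something like
this (just a sketch; file name and iteration counts are arbitrary): one
process fsync()s a file in a loop while the other times its write()
calls against the same file.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static double
now_sec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int
main(void)
{
    int fd = open("fsync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    char block[8192];
    double worst = 0.0;
    pid_t pid;
    int i;

    if (fd < 0)
        return 1;
    memset(block, 'x', sizeof(block));

    pid = fork();
    if (pid == 0)
    {
        /* child: hammer the same file with fsync() calls */
        for (i = 0; i < 2000; i++)
            fsync(fd);
        _exit(0);
    }

    /* parent: time each write(); if fsync serializes against writes
     * inside the filesystem, the worst-case latency should jump */
    for (i = 0; i < 2000; i++)
    {
        double t0 = now_sec();
        double dt;

        if (write(fd, block, sizeof(block)) < 0)
            break;
        dt = now_sec() - t0;
        if (dt > worst)
            worst = dt;
    }
    waitpid(pid, NULL, 0);
    printf("worst write() latency: %.6f s\n", worst);
    return 0;
}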

 On a filesystem that does have that kind of problem, can't you avoid it
 just by using O_DSYNC on the WAL files?  Then there's no need to call
 fsync() at all, except during checkpoints (which actually issue sync()
 not fsync(), anyway).


No, they're not exactly the same thing. Consider:

Process A   File System
-   ---
Writes index buffer .idling...
Writes entry to log cache   .
Writes another index buffer .
Writes another log entry.
Writes tuple buffer .
Writes another log entry.
Index scan  .
Large table sort.
Writes tuple buffer .
Writes another log entry.
Writes  .
Writes another index buffer .
Writes another log entry.
Writes another index buffer .
Writes another log entry.
Index scan  .
Large table sort.
Commit  .
File Write Log Entry.
.idling...  Write to cache
File Write Log Entry.idling...
.idling...  Write to cache
File Write Log Entry.idling...
.idling...  Write to cache
File Write Log Entry.idling...
.idling...  Write to cache
Write Commit Log Entry  .idling...
.idling...  Write to cache
Call fsync  .idling...
.idling...  Write all buffers to device.
.DONE.

In this case, Process A is waiting for all the buffers to write
at the end of the transaction.

With asynchronous I/O this becomes:

Process A   File System
-   ---
Writes index buffer .idling...
Writes entry to log cache   Queue up write - move head to cylinder
Writes another index buffer Write log entry to media
Writes another log entryImmediate write to cylinder since head is
still there.
Writes tuple buffer .
Writes another log entryQueue up write - move head to cylinder
Index scan  .busy with scan...
Large table sortWrite log entry to media
Writes tuple buffer .
Writes another log entryQueue up write - move head to cylinder
Writes  .
Writes another index buffer Write log entry to media
Writes another log entryQueue up write - move head to cylinder
Writes another index buffer .
Writes another log entryWrite log entry to media
Index scan  .
Large table sortWrite log entry to media
Commit  .
Write Commit Log Entry  Immediate write to cylinder since head is
still there.
.DONE.

Effectively the real work of writing the cache is done while the CPU
for the process is busy doing index scans, sorts, etc. With the WAL
log on another device and SCSI I/O the log writing should almost always be
done except for the final commit write.
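
For reference, a bare-bones sketch of the aio_write() pattern I'm
describing (hypothetical file name, no real WAL format, error handling
trimmed): queue the log write, keep doing CPU work, and only wait at
commit.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("wal_demo.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    char record[512];
    struct aiocb cb;
    const struct aiocb *list[1];
    long work = 0;
    int i;

    if (fd < 0)
        return 1;
    memset(record, 'r', sizeof(record));

    /* queue the log write; the kernel can start moving the disk head
     * and writing while we go on doing useful work */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = record;
    cb.aio_nbytes = sizeof(record);
    if (aio_write(&cb) != 0)
        return 1;

    /* ... index scans, sorts, more XLogInsert calls would go here ... */
    for (i = 0; i < 10000000; i++)
        work += i;

    /* commit: now we actually have to wait for the write to finish */
    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    printf("log write done, %ld bytes (work=%ld)\n",
           (long) aio_return(&cb), work);
    close(fd);
    return 0;
}

(On Linux this links with -lrt.  It only shows the overlap pattern; a
real commit would still need O_DSYNC/O_SYNC or aio_fsync for durability.)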

  Whether by threads or multiple processes, there is the same
 contention on
  the file through multiple writers. The file system can decide to reorder
  writes before they start but not after. If a write comes after a
  fsync starts it will have to wait on that fsync.

 AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
 on safety grounds (we want to be sure we have a consistent WAL up to the
 end of what we've written).  Even if we can allow some reordering when a
 single transaction puts out a large volume of WAL data, I fail to see
 where any large gain is going to come from.  We're going to be issuing
 those writes sequentially and that ought to match the disk layout about
 as well as can be hoped anyway.

My comment applied to reads and writes of other processes, not the
WAL log. In my original email, recall I mentioned using the O_APPEND
open flag, which will ensure that all log entries are done sequentially.

  Likewise a given process's writes can NEVER be reordered if they are
  submitted synchronously, as is done in the calls to flush the log as
  well as the dirty pages in the buffer in the current code.

 We do not fsync buffer pages; in fact a transaction commit doesn't write
 buffer pages at all.  I think the above is just a misunderstanding of
 what's really 

Re: [HACKERS] ALTER TABLE ... ADD COLUMN

2002-10-04 Thread Alvaro Herrera

On Fri, Oct 04, 2002 at 05:57:02PM -0400, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  I'm thinking about ALTER TABLE ... ADD COLUMN working properly when
  child tables already contain the column.
  There are two cases: one when specifying ALTER TABLE ONLY, and other
  when specifying recursive (not ONLY).

 I don't like resetting attislocal here.  If you do that, then DROPping
 the parent column doesn't return you to the prior state.  I think I gave
 this example before, but consider

Huh, I don't know where I got the idea you were (or someone else was?)
in the position that attislocal should be reset.  I'll clean everything
up and submit the patch I had originally made.

-- 
Alvaro Herrera (alvherre[a]atentus.com)
Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end. (2nd Commandment for C programmers)




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Neil Conway

Curtis Faith [EMAIL PROTECTED] writes:
 It looks to me like BufferAlloc will simply result in a call to
 BufferReplace -> smgrblindwrt -> write for md storage manager objects.
 
 This means that a process will block while the write of dirty cache
 buffers takes place.

I think Tom was suggesting that when a buffer is written out, the
write() call only pushes the data down into the filesystem's buffer --
which is free to then write the actual blocks to disk whenever it
chooses to. In other words, the write() returns, the backend process
can continue with what it was doing, and at some later time the blocks
that we flushed from the Postgres buffer will actually be written to
disk. So in some sense of the word, that I/O is asynchronous.
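
A tiny sketch of that distinction (file name invented):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("buffer_demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    char page[8192];

    if (fd < 0)
        return 1;
    memset(page, 0, sizeof(page));

    /* returns as soon as the data has been copied into the kernel's
     * filesystem cache; the physical disk write happens later, at the
     * kernel's convenience, so the caller is not stalled on the device */
    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        return 1;

    /* only an explicit fsync() (or opening with O_SYNC/O_DSYNC) forces
     * the blocks out to stable storage before returning */
    if (fsync(fd) != 0)
        return 1;

    close(fd);
    return 0;
}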

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC





Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Greg Copeland wrote:
-- Start of PGP signed section.
 On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote:
  Added to TODO:
  
  * Allow sorting to use multiple work directories
 
 Why wouldn't that fall under the table space effort???

Yes, but we make it a separate item so we are sure that is implemented
as part of tablespaces.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




[HACKERS] fsync exclusive lock evidence WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?

Here's the evidence:

Code from: /usr/src/sys/syscalls/vfs_syscalls

int
fsync(p, uap)
    struct proc *p;
    struct fsync_args /* {
        syscallarg(int) fd;
    } */ *uap;
{
    register struct vnode *vp;
    struct file *fp;
    vm_object_t obj;
    int error;

    if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
        return (error);
    vp = (struct vnode *)fp->f_data;
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    if (VOP_GETVOBJECT(vp, &obj) == 0)
        vm_object_page_clean(obj, 0, 0, 0);
    if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
        vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
        bioops.io_fsync)
        error = (*bioops.io_fsync)(vp);
    VOP_UNLOCK(vp, 0, p);
    return (error);
}

Notice the calls to:

vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:


HEADER STUFF .


 VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_ISLOCKED(struct vnode *vp, struct proc *p);

 int
 vn_lock(struct vnode *vp, int flags, struct proc *p);



DESCRIPTION
 These calls are used to serialize access to the filesystem, such as to
 prevent two writes to the same file from happening at the same time.

 The arguments are:

 vp the vnode being locked or unlocked

 flags  One of the lock request types:

  LK_SHARED        Shared lock
  LK_EXCLUSIVE     Exclusive lock
  LK_UPGRADE       Shared-to-exclusive upgrade
  LK_EXCLUPGRADE   First shared-to-exclusive upgrade
  LK_DOWNGRADE     Exclusive-to-shared downgrade
  LK_RELEASE       Release any type of lock
  LK_DRAIN         Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

  LK_NOWAIT        Do not sleep to wait for lock
  LK_SLEEPFAIL     Sleep, then return failure
  LK_CANRECURSE    Allow recursive exclusive lock
  LK_REENABLE      Lock is to be reenabled after drain
  LK_NOPAUSE       No spinloop

The lock type may be or'ed with these control flags:

  LK_INTERLOCK     Specify when the caller already has a simple
                   lock (VOP_LOCK will unlock the simple lock
                   after getting the lock)
  LK_RETRY         Retry until locked
  LK_NOOBJ         Don't create object

 p  process context to use for the locks

 Kernel code should use vn_lock() to lock a vnode rather than calling
 VOP_LOCK() directly.




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  It looks to me like BufferAlloc will simply result in a call to
  BufferReplace -> smgrblindwrt -> write for md storage manager objects.
  
  This means that a process will block while the write of dirty cache
  buffers takes place.
 
 I think Tom was suggesting that when a buffer is written out, the
 write() call only pushes the data down into the filesystem's buffer --
 which is free to then write the actual blocks to disk whenever it
 chooses to. In other words, the write() returns, the backend process
 can continue with what it was doing, and at some later time the blocks
 that we flushed from the Postgres buffer will actually be written to
 disk. So in some sense of the word, that I/O is asynchronous.


Isn't that true only as long as there is buffer space available?  When
there isn't buffer space available, it seems the window for blocking comes
into play.  So I guess you could say it is optimally asynchronous and
worst-case synchronous.  I think the worst-case situation is the one he's
trying to address.

At least that's how I interpret it.

Greg






Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Yes, I realize it is during parsing.  I was just wondering if making
 constants coming in from the parser NUMERIC is a performance hit?

Offhand I don't see a reason to think that coercing to NUMERIC (and then
something else) is slower than coercing to FLOAT (and then something else).
Yeah, you would come out a little behind when the final destination type
is FLOAT, but on the other hand you win a little when it's NUMERIC.
I see no reason to think this isn't a wash overall.

 I see
 in gram.y that FCONST comes in as a Float so I don't even see were we
 make it NUMERIC.

It's make_const in parse_node.c that has the first contact with the
grammar's output.  Up to that point the value's just a string, really.
The grammar does *not* coerce it to float8.

regards, tom lane




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

I see.  I just always assumed that it would be done as part of the
tablespace effort, as it's such a de facto feature.

I am curious as to why no one has commented on the other rather obvious
performance enhancement which was brought up in this thread.  Allowing
for parallel sorting seems rather obvious and is a common enhancement,
yet it seems to have been completely dismissed as people seem to be
fixated on I/O.  Go figure.

Greg



On Fri, 2002-10-04 at 14:02, Bruce Momjian wrote:
 Greg Copeland wrote:
 -- Start of PGP signed section.
  On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote:
   Added to TODO:
   
 * Allow sorting to use multiple work directories
  
  Why wouldn't that fall under the table space effort???
 
 Yes, but we make it a separate item so we are sure that is implemented
 as part of tablespaces.
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
 






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Greg Copeland wrote:
-- Start of PGP signed section.
 I see.  I just always assumed that it would be done as part of table
 space effort as it's such a defacto feature.
 
 I am curious as to why no one has commented on the other rather obvious
 performance enhancement which was brought up in this thread.  Allowing
 for parallel sorting seems rather obvious and is a common enhancement
 yet seems to of been completely dismissed as people seem to be fixated
 on I/O.  Go figure. 

We think we are fixated on I/O because we think that is where the delay
is.  Is there a reason we shouldn't think that?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 Curtis Faith [EMAIL PROTECTED] writes:
 It looks to me like BufferAlloc will simply result in a call to
 BufferReplace -> smgrblindwrt -> write for md storage manager objects.
 
 This means that a process will block while the write of dirty cache
 buffers takes place.

 I think Tom was suggesting that when a buffer is written out, the
 write() call only pushes the data down into the filesystem's buffer --
 which is free to then write the actual blocks to disk whenever it
 chooses to.

Exactly --- in all Unix systems that I know of, a write() is
asynchronous unless one takes special pains (like opening the file
with O_SYNC).  Pushing the data from userspace to the kernel disk
buffers does not count as I/O in my mind.

I am quite concerned about Curtis' worries about fsync, though.
There's not any fundamental reason for fsync to block other operations,
but that doesn't mean that it's been implemented reasonably everywhere
:-(.  We need to take a look at that.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I resent this since it didn't seem to get to the list.

After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?

Here's the evidence:

Code from: /usr/src/sys/syscalls/vfs_syscalls

int
fsync(p, uap)
    struct proc *p;
    struct fsync_args /* {
        syscallarg(int) fd;
    } */ *uap;
{
    register struct vnode *vp;
    struct file *fp;
    vm_object_t obj;
    int error;

    if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
        return (error);
    vp = (struct vnode *)fp->f_data;
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    if (VOP_GETVOBJECT(vp, &obj) == 0)
        vm_object_page_clean(obj, 0, 0, 0);
    if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
        vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
        bioops.io_fsync)
        error = (*bioops.io_fsync)(vp);
    VOP_UNLOCK(vp, 0, p);
    return (error);
}

Notice the calls to:

vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:


HEADER STUFF .


 VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_ISLOCKED(struct vnode *vp, struct proc *p);

 int
 vn_lock(struct vnode *vp, int flags, struct proc *p);



DESCRIPTION
 These calls are used to serialize access to the filesystem, such as to
 prevent two writes to the same file from happening at the same time.

 The arguments are:

 vp the vnode being locked or unlocked

 flags  One of the lock request types:

  LK_SHARED        Shared lock
  LK_EXCLUSIVE     Exclusive lock
  LK_UPGRADE       Shared-to-exclusive upgrade
  LK_EXCLUPGRADE   First shared-to-exclusive upgrade
  LK_DOWNGRADE     Exclusive-to-shared downgrade
  LK_RELEASE       Release any type of lock
  LK_DRAIN         Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

  LK_NOWAIT        Do not sleep to wait for lock
  LK_SLEEPFAIL     Sleep, then return failure
  LK_CANRECURSE    Allow recursive exclusive lock
  LK_REENABLE      Lock is to be reenabled after drain
  LK_NOPAUSE       No spinloop

The lock type may be or'ed with these control flags:

  LK_INTERLOCK     Specify when the caller already has a simple
                   lock (VOP_LOCK will unlock the simple lock
                   after getting the lock)
  LK_RETRY         Retry until locked
  LK_NOOBJ         Don't create object

 p  process context to use for the locks

 Kernel code should use vn_lock() to lock a vnode rather than calling
 VOP_LOCK() directly.




Re: [HACKERS] [GENERAL] Small patch for PL/Perl Misbehavior with

2002-10-04 Thread John Worsley

On Thu, 3 Oct 2002, Tom Lane wrote:
 That seems a little weird.  Does Perl really expect people to do that
 (ie, is it a documented part of some API)?  I wonder whether there is
 some other action that we're supposed to take instead, but are
 missing...

Not that I know of: clearing out the $@ variable manually was just my way
of getting around the problem in practice. I think the underlying issue
may be tied to the fact that it's running a function generated within a
Safe Module, but I'm not enough of a Perl Guru to say anything more
decisive than that.

 If this is what we'd have to do, I think a better way would be

    perlerrmsg = pstrdup(SvPV(ERRSV, PL_na));
    sv_setpv(perl_get_sv("@",FALSE),"");
    elog(ERROR, "plperl: error from function: %s", perlerrmsg);

 Splitting the ERROR into a NOTICE with the useful info and an ERROR
 without any isn't real good, because the NOTICE could get dropped on the
 floor (either because of min_message_level or a client that just plain
 loses notices).

Yeah, that's a cleaner solution. I take it anything pstrdup'd by
PostgreSQL gets freed automatically by the backend? (I wasn't familiar
enough with the backend to know how to ask for memory confident in the
understanding that it would at some point be freed. ;)

Jw.
-- 
John Worsley - [EMAIL PROTECTED]
http://www.openvein.com/





Re: [HACKERS] [GENERAL] Small patch for PL/Perl Misbehavior with Runtime Error Reporting

2002-10-04 Thread Tom Lane

John Worsley [EMAIL PROTECTED] writes:
 Yeah, that's a cleaner solution. I take it anything pstrdup'd by
 PostgreSQL gets freed automatically by the backend?

Pretty much.  The only situation where it wouldn't be is if
CurrentMemoryContext is pointing at TopMemoryContext or another
long-lived context --- but we are *very* chary about how much code we
allow to run with such a setting.  User-definable functions can safely
assume that palloc'd space will live only long enough for them to return
something to their caller in it.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 After some research I still hold that fsync blocks, at least on
 FreeBSD. Am I missing something?

 Here's the evidence:
 [ much snipped ]
 vp = (struct vnode *)fp-f_data;
 vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);

Hm, I take it a vnode is what's usually called an inode, ie the unique
identification data for a specific disk file?

This is kind of ugly in general terms but I'm not sure that it really
hurts Postgres.  In our present scheme, the only files we ever fsync()
are WAL log files, not data files.  And in normal operation there is
only one WAL writer at a time, and *no* WAL readers.  So an exclusive
kernel-level lock on a WAL file while we fsync really shouldn't create
any problem for us.  (Unless this indirectly blocks other operations
that I'm missing?)

As I commented before, I think we could do with an extra process to
issue WAL writes in places where they're not in the critical path for
a foreground process.  But that seems to be orthogonal from this issue.

regards, tom lane




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

Well, that's why I was soliciting developer input as to exactly what
goes on with sorts.  From what I seem to be hearing, all sorts result in
temp files being created and/or used.  If that's the case then yes, I
can understand the fixation.  Of course that opens the door for it being
a horrible implementation.  If that's not the case, then parallel sorts
still seem like a rather obvious route to look into.

Greg


On Fri, 2002-10-04 at 14:15, Bruce Momjian wrote:
 Greg Copeland wrote:
 -- Start of PGP signed section.
  I see.  I just always assumed that it would be done as part of table
  space effort as it's such a defacto feature.
  
  I am curious as to why no one has commented on the other rather obvious
  performance enhancement which was brought up in this thread.  Allowing
  for parallel sorting seems rather obvious and is a common enhancement
  yet seems to of been completely dismissed as people seem to be fixated
  on I/O.  Go figure. 
 
 We think we are fixated on I/O because we think that is where the delay
 is.  Is there a reason we shouldn't think that?
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Tom, what temp files do we use that aren't for sorting;  I forgot.

MATERIALIZE plan nodes are the only thing I can think of offhand that
uses a straight temp file.  But ISTM that if this makes sense for
our internal temp files, it makes sense for user-created temp tables
as well.

regards, tom lane




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Greg Copeland wrote:
-- Start of PGP signed section.
 Well, that's why I was soliciting developer input as to exactly what
 goes on with sorts.  From what I seem to be hearing, all sorts result in
 temp files being created and/or used.  If that's the case then yes, I
 can understand the fixation.  Of course that opens the door for it being
 a horrible implementation.  If that's not the case, then parallel sorts
 still seem like a rather obvious route to look into.

We use tape sorts, ala Knuth, meaning we sort in memory as much as
possible, but when there is more data than fits in memory, rather than
swapping, we write to temp files then merge the temp files (aka tapes).
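
To make the "tapes" picture concrete, here is a toy version of the merge
step (the run files and the one-integer-per-line format are invented for
the example; the real code merges tuples, not ints):

#include <stdio.h>

#define NTAPES 3

int
main(void)
{
    /* pretend these are temp files, each holding an already-sorted run */
    const char *names[NTAPES] = {"run0.tmp", "run1.tmp", "run2.tmp"};
    FILE *tape[NTAPES];
    int head[NTAPES];           /* current front element of each tape */
    int live[NTAPES];           /* does this tape still have data? */
    int t;

    for (t = 0; t < NTAPES; t++)
    {
        tape[t] = fopen(names[t], "r");
        live[t] = tape[t] && fscanf(tape[t], "%d", &head[t]) == 1;
    }

    /* repeatedly emit the smallest front element: one sequential pass
     * over each run produces fully sorted output without re-sorting */
    for (;;)
    {
        int best = -1;

        for (t = 0; t < NTAPES; t++)
            if (live[t] && (best < 0 || head[t] < head[best]))
                best = t;
        if (best < 0)
            break;
        printf("%d\n", head[best]);
        live[best] = fscanf(tape[best], "%d", &head[best]) == 1;
    }

    for (t = 0; t < NTAPES; t++)
        if (tape[t])
            fclose(tape[t]);
    return 0;
}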

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom, what temp files do we use that aren't for sorting;  I forgot.
 
 MATERIALIZE plan nodes are the only thing I can think of offhand that
 uses a straight temp file.  But ISTM that if this makes sense for
 our internal temp files, it makes sense for user-created temp tables
 as well.

Yes, I was thinking that, but of course, those are real tables, rather
than just files.  Not sure how clean it will be to mix those in the same
directory.  We haven't in the past.  Is it a good idea?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

It appears the fsync problem is pervasive. Here's Linux 2.4.19's
version from fs/buffer.c:

lock ->     down(&inode->i_sem);
            ret = filemap_fdatasync(inode->i_mapping);
            err = file->f_op->fsync(file, dentry, 1);
            if (err && !ret)
                ret = err;
            err = filemap_fdatawait(inode->i_mapping);
            if (err && !ret)
                ret = err;
unlock ->   up(&inode->i_sem);

But this is probably not a big factor as you outline below because
the WALWriteLock is causing the same kind of contention.

tom lane wrote:
 This is kind of ugly in general terms but I'm not sure that it really
 hurts Postgres.  In our present scheme, the only files we ever fsync()
 are WAL log files, not data files.  And in normal operation there is
 only one WAL writer at a time, and *no* WAL readers.  So an exclusive
 kernel-level lock on a WAL file while we fsync really shouldn't create
 any problem for us.  (Unless this indirectly blocks other operations
 that I'm missing?)

I hope you're right but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume Transaction A which writes a lot of buffers and XLog entries,
so the Commit forces a relatively lengthy fsynch.

Transactions B - E block not on the kernel lock from fsync but on
the WALWriteLock. 

When A finishes the fsync and subsequently releases the WALWriteLock
B unblocks and gets the WALWriteLock for its fsync for the flush.

C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT.

B Releases and now C writes its XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This
is a shame for a system that has no locking at the logical level and
therefore seems like it could be very, very fast and offer
incredible concurrency.

 As I commented before, I think we could do with an extra process to
 issue WAL writes in places where they're not in the critical path for
 a foreground process.  But that seems to be orthogonal from this issue.
 
It's only orthogonal to the fsync-specific contention issue. We now
have to worry about the WALWriteLock semantics causing the same contention.
Your idea of a separate LogWriter process could very nicely solve this
problem and accomplish a few other things at the same time if we make
a few enhancements.

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
ends after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present or
of smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to
allow the LogWriter to write to disk while WALWriteLocks are acquired
by back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would
only need to grab a lock to determine the current end of the log
buffer. Since it would be writing blocks that occur earlier in the
cache than the XLogInsert log writers it won't need to grab a
WALWriteLock before writing the cache buffers.

Many transactions would commit on the same fsync (now really a write
with O_DSYNC) and we would get optimal write throughput for the log
system.
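
A minimal single-process sketch of that write path, assuming an O_DSYNC,
O_APPEND log file (the file name and record contents are invented; a real
LogWriter would pull records from the shared WAL buffers and then wake
the waiting backends):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    /* hypothetical segment name; the real name would come from xlog.c */
    int fd = open("walwriter_demo.log",
                  O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0600);
    char buf[8192];             /* stand-in for the shared WAL buffer */
    size_t used = 0;
    int xid;

    if (fd < 0)
        return 1;

    /* pretend three backends appended commit records to the buffer */
    for (xid = 1; xid <= 3; xid++)
        used += snprintf(buf + used, sizeof(buf) - used,
                         "xid %d: COMMIT\n", xid);

    /*
     * One O_DSYNC write flushes every pending record: when write()
     * returns, the data is on stable storage, so no separate fsync()
     * is needed and all three commits share a single physical flush.
     */
    if (write(fd, buf, used) != (ssize_t) used)
        return 1;

    close(fd);
    return 0;
}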

This would handle all the issues I had and it doesn't sound like a
huge change. In fact, it ends up being almost semantically identical 
to the aio_write suggestion I made originally, except the
LogWriter is doing the background writing instead of the OS and we
don't have to worry about aio implementations and portability.

- Curtis






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 14:31, Bruce Momjian wrote:
 We use tape sorts, ala Knuth, meaning we sort in memory as much as
 possible, but when there is more data than fits in memory, rather than
 swapping, we write to temp files then merge the temp files (aka tapes).

Right, which is what I originally assumed.  On lower end systems, that
works great.  Once you allow that people may actually have high-end
systems with multiple CPUs and lots of memory, wouldn't it be nice to
allow for huge improvements on large sorts?  Basically, you have two
ends of the spectrum.  One, where you don't have enough memory and
become I/O bound.  The other is where you have enough memory but are CPU
bound; where potentially you have extra CPUs to spare.  Seems to me they
are not mutually exclusive.

Unless I've missed something, the ideal case is to never use tapes for
sorting.  Which is to say, you're trying to optimize an already less than
ideal situation (which is of course good).  I'm trying to discuss making
it a near ideal use of available resources.  I can understand why
addressing the seemingly more common I/O bound case would receive
priority, however, I'm at a loss as to why the other would be completely
ignored.  Seems to me, implementing both would even work in a
complementary fashion on the low-end cases and yield more breathing room
for the high-end cases.

What am I missing for the other case to be completely ignored?

Greg






Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 ...  But ISTM that if this makes sense for
 our internal temp files, it makes sense for user-created temp tables
 as well.

 Yes, I was thinking that, but of course, those are real tables, rather
 than just files.  Not sure how clean it will be to mix those in the same
 directory.  We haven't in the past.  Is it a good idea?

Sure we have --- up till recently, pg_temp files just lived in the
database directory.  I think it was you that added the pg_temp
subdirectory, and the reason you did it was to let people symlink the
temp files to someplace else.  But that's just a zeroth-order
approximation to providing a tablespace facility for these things.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


Bruce Momjian wrote:
 I am again confused.  When we do write(), we don't have to lock
 anything, do we?  (Multiple processes can write() to the same file just
 fine.)  We do block the current process, but we have nothing else to do
 until we know it is written/fsync'ed.  Does aio more easily allow the
 kernel to order those writes?  Is that the issue?  Well, certainly the
 kernel already orders the writes.  Just because we write() doesn't mean
 it goes to disk.  Only fsync() or the kernel do that.

We don't have to lock anything, but most file systems can't process
fsyncs simultaneously with other writes, so those writes block because
the file system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually longer,
so it holds the locks longer and over more pages and structures.
That is the real issue: the contention caused by fsync'ing very
frequently, which blocks other writers and readers.
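
One quick way to check how much this costs is a throwaway standalone
test, nothing from the PostgreSQL tree (DO_FSYNC and the file name below
are made up): have a child process fsync the file in a loop while the
parent times plain write()s to the same file, then flip DO_FSYNC off and
compare the per-write latency.

/*
 * Throwaway fsync contention probe.  The child fsyncs the shared file
 * in a loop; the parent times small write()s to the same file.  Run it
 * with DO_FSYNC 1 and 0 and compare the per-write times.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DO_FSYNC   1
#define N_WRITES   2000
#define WRITE_SIZE 8192

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(void)
{
    const char *path = "contention.test";
    char   buf[WRITE_SIZE];
    double start, elapsed;
    pid_t  child;
    int    fd, i;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    child = fork();
    if (child == 0)
    {
        /* child: hammer the same file with fsyncs (or just idle) */
        int cfd = open(path, O_WRONLY);
        for (;;)
        {
            if (DO_FSYNC)
                fsync(cfd);
            else
                usleep(1000);
        }
    }

    memset(buf, 'x', sizeof(buf));
    start = now_sec();
    for (i = 0; i < N_WRITES; i++)
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write");
    elapsed = now_sec() - start;

    printf("%d writes of %d bytes: %.3f s total, %.3f ms each\n",
           N_WRITES, WRITE_SIZE, elapsed, elapsed * 1000.0 / N_WRITES);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    unlink(path);
    close(fd);
    return 0;
}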

For the buffer manager, the blocking of readers is probably even more
problematic when the cache is a small percentage (say  10% to 15%) of
the total database size because most leaf node accesses will result in
a read. Each of these reads will have to wait on the fsync as well. Again,
a very well written file system probably can minimize this but I've not
seen any.

Further comment on:
We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot
of CPU cycles, and then waiting is not as efficient as writing them out
while those CPU cycles are being used. We are currently wasting the
time it takes for a given process to write.

The thinking probably has been that this is no big deal because other
processes, say B, C and D can use the CPU cycles while process A blocks.
This is true UNLESS the other processes are blocking on reads or
writes caused by process A doing the final writes and fsync.

 Yes, but Oracle is threaded, right, so, yes, they clearly could win with
 it.  I read the second URL and it said we could issue separate writes
 and have them be done in an optimal order.  However, we use the file
 system, not raw devices, so don't we already have that in the kernel
 with fsync()?

Whether by threads or multiple processes, there is the same contention
on the file from multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after an fsync
starts, it will have to wait on that fsync.

Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.

 Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
 into the trap where we add stuff that just makes things faster on
 SMP/threaded systems when it makes our code _slower_ on single CPU
 systems, which is exaclty what Informix did in Informix 7, and we know
 how that ended (lost customers, bought by IBM).  I don't think that's
 going to happen to us, but I thought I would mention it.

Yes, I hate improvements that make things worse for most people. Any
changes I'd contemplate would be simply another configuration driven
optimization that could be turned off very easily.

- Curtis


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Large databases, performance

2002-10-04 Thread Shridhar Daithankar

On 3 Oct 2002 at 18:53, Manfred Koizar wrote:

 On Thu, 03 Oct 2002 21:47:03 +0530, Shridhar Daithankar
 [EMAIL PROTECTED] wrote:
 I believe that was vacuum analyze only.
 
 Well there is
 
   VACUUM [tablename]; 
 
 and there is
 
   ANALYZE [tablename];
 
 And
 
   VACUUM ANALYZE [tablename];
 
 is VACUUM followed by ANALYZE.

I was using vacuum analyze. 

Good that you pointed that out. Now I will modify the postgresql auto vacuum 
daemon that I wrote to analyze only in case of excessive inserts. I hope 
that's lighter on performance compared to vacuum analyze..

Bye
 Shridhar

--
Mix's Law:  There is nothing more permanent than a temporary building.
There is nothing more permanent than a temporary tax.


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Zeugswetter Andreas SB SD


  I'd give you the first and third of those.  As Andrew noted, the
  argument that it's more standard-compliant is not very solid.
 
 The standard doesn't say anything about transaction in this regard.

Yes, it says statement.

Note also, that a typical SELECT only session would not advance 
CURRENT_TIMESTAMP at all in the typical autocommit off mode that
the Spec is all about. 

 What do others think?

I liked your proposal to advance CURRENT_TIMESTAMP at each statement start.
(It would not advance inside a stored procedure).

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...

2002-10-04 Thread Rob Fullerton


Howdy All,

You have to explicitly commit transactions in Oracle using SQL*Plus.
However, DUAL (e.g. SELECT current_timestamp FROM DUAL;) is special in this
case.  It is a table in the sys schema, used for selecting constants,
pseudo-columns, etc.

I'm not sure if this helps but see:

http://download-east.oracle.com/otndoc/oracle9i/901_doc/server.901/a90125/queries2.htm#2054162

rob
'Oracle 9 tester' :P

On Mon, 30 Sep 2002, Bruce Momjian wrote:


 OK, I just received this answer from an Oracle 9 tester.  It shows
 CURRENT_TIMESTAMP changing during the transaction.  Thanks, Dan.

 Dan, it wasn't clear if this was in a transaction or not.  Does Oracle
 have autocommit off by default so you are always in a transaction?

 ---

 Dan Langille wrote:
  A very quick answer:
 
  --- Forwarded message follows ---
  Date: Mon, 30 Sep 2002 13:03:51 -0400 (EDT)
  From: Agent Drek [EMAIL PROTECTED]
  To: [EMAIL PROTECTED] [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED] [EMAIL PROTECTED]
  Subject: Re: Any Oracle 9 users?  A test please...
  In-Reply-To: 3D984877.19685.801EEC30@localhost
  Message-ID: Pine.BSF.4.44.0209301303030.50384-
  [EMAIL PROTECTED]
  MIME-Version: 1.0
  Content-Type: TEXT/PLAIN; charset=US-ASCII
  Sender: [EMAIL PROTECTED]
 
  On Mon, 30 Sep 2002, Dan Langille wrote:
 
   Date: Mon, 30 Sep 2002 12:49:59 -0400
   From: Dan Langille [EMAIL PROTECTED]
   Reply-To: [EMAIL PROTECTED]
   [EMAIL PROTECTED] To: [EMAIL PROTECTED]
   [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]
   [EMAIL PROTECTED] Subject: Any Oracle 9 users?  A test
   please...
  
   Followups to [EMAIL PROTECTED] please!
  
   Any Oracle 9 users out there?
  
   I need this run:
  
   BEGIN;
   SELECT CURRENT_TIMESTAMP;
   -- wait 5 seconds
   SELECT CURRENT_TIMESTAMP;
  
   Are those two timestamps the same?
  
   Thanks
  
 
  Our DBA says:
 
  snip from irc
 
  data SQL SELECT current_timestamp FROM DUAL;
  data CURRENT_TIMESTAMP
  data ---------------------------------------------------------------
  data 30-SEP-02 01.06.42.660969 PM -04:00
  data SQL SELECT current_timestamp FROM DUAL;
  data CURRENT_TIMESTAMP
  data ---------------------------------------------------------------
  data 30-SEP-02 01.06.48.837372 PM -04:00
  data (you have to include 'from dual' for 'non-table' selects)
 
  --
 Derek Marshall
 
  Smash and Pow Inc  'digital plumber'
  http://www.smashpow.net
 
 
  To Unsubscribe: send mail to [EMAIL PROTECTED]
  with unsubscribe freebsd-database in the body of the message
 
  --- End of forwarded message ---
  --
  Dan Langille
  I'm looking for a computer job:
  http://www.freebsddiary.org/dan_langille.php
 
 

 --
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] [COMMITTERS] pgsql/contrib/rserv ApplySnapshot.in CleanLog. ...

2002-10-04 Thread Alexey V. Borzov

Hello Peter,

Tuesday, October 01, 2002, 1:42:46 AM, you wrote:

PE Bruce Momjian writes:

 Peter, the author is questioning why his Makefile changes were wrong.
 Would you elaborate?

PE Because we rely on the built-in library lookup functionality instead of
PE hardcoding the full file name.

  Agh! I finally read up on module loading
http://developer.postgresql.org/docs/postgres/xfunc-c.html#XFUNC-C-DYNLOAD
  and now I seem to understand. You see, the problem with the current
  Makefile is as follows: it substitutes '$libdir' into both .sql and
  perl files. While this is good enough for sql, $libdir is consumed
  by Perl and thus perl scripts do NOT work.
  Thanks for elaborating, I'll try to produce a better patch ASAP.

-- 
Best regards, Alexey Borzov
Internet projects department, RDW-Media LLC
http://www.rdw.ru 
http://www.vashdosug.ru


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...

2002-10-04 Thread Dan Langille

The original tester says this is an anonymous procedure.

On 30 Sep 2002 at 15:07, Bruce Momjian wrote:

 
 It is not clear to me;  is this its own transaction or a function
 call?
 
 --
 -
 
 Dan Langille wrote:
  And just for another opinion, which supports the first.
  
  From now, unless you indicate otherwise, I'll only report tests
  which 
  have both values the same.
  
  From: Shawn O'Connor [EMAIL PROTECTED]
  To: Dan Langille [EMAIL PROTECTED]
  Subject: Re: Any Oracle 9 users?  A test please...
  In-Reply-To: 3D985663.24174.80554E83@localhost
  Message-ID: [EMAIL PROTECTED]
  MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII
  X-PMFLAGS: 35127424 0 1 P2A7A0.CNM
  
  Okay, here you are:
  --
  
  DECLARE
   time1 TIMESTAMP;
   time2 TIMESTAMP;
   sleeptime NUMBER;
  BEGIN
   sleeptime := 5;
   SELECT CURRENT_TIMESTAMP INTO time1 FROM DUAL;
   DBMS_LOCK.SLEEP(sleeptime);
   SELECT CURRENT_TIMESTAMP INTO time2 FROM DUAL;
   DBMS_OUTPUT.PUT_LINE(TO_CHAR(time1));
   DBMS_OUTPUT.PUT_LINE(TO_CHAR(time2));
  END;
  /
  30-SEP-02 11.54.09.583576 AM
  30-SEP-02 11.54.14.708333 AM
  
  PL/SQL procedure successfully completed.
  
  --
  
  Hope this helps!
  
   -Shawn
  
  
  On Mon, 30 Sep 2002, Dan Langille wrote:
  
   We're testing this just to see what Oracle does.  What you are
   saying is what we expect to happen.  But could you do that test
   for us from the command line?  Thanks.
  
   On 30 Sep 2002 at 10:31, Shawn O'Connor wrote:
  
I'm assuming you're doing this as some sort of anonymous
PL/SQL function:
   
Don't you need to do something like:
   
SELECT CURRENT_TIMESTAMP FROM DUAL INTO somevariable?
   
and to wait five seconds probably:
   
EXECUTE DBMS_LOCK.SLEEP(5);
   
But to answer your question-- When this PL/SQL function
is run the values of current_timestamp are not the same, they
will be separated by five seconds or so.
   
Hope this helps!
   
 -Shawn
   
On Mon, 30 Sep 2002, Dan Langille wrote:
   
 Followups to [EMAIL PROTECTED] please!

 Any Oracle 9 users out there?

 I need this run:

 BEGIN;
 SELECT CURRENT_TIMESTAMP;
 -- wait 5 seconds
 SELECT CURRENT_TIMESTAMP;

 Are those two timestamps the same?

 Thanks
 --
 Dan Langille
 I'm looking for a computer job:
 http://www.freebsddiary.org/dan_langille.php


 To Unsubscribe: send mail to [EMAIL PROTECTED]
 with unsubscribe freebsd-database in the body of the message

   
   
  
  
   --
   Dan Langille
   I'm looking for a computer job:
   http://www.freebsddiary.org/dan_langille.php
  
  
  
  --- End of forwarded message ---
  -- 
  Dan Langille
  I'm looking for a computer job:
  http://www.freebsddiary.org/dan_langille.php
  
  
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001 +  If your
   life is a hard drive, |  13 Roberts Road +  Christ can be your
   backup.|  Newtown Square, Pennsylvania 19073
 


-- 
Dan Langille
I'm looking for a computer job:
http://www.freebsddiary.org/dan_langille.php


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [GENERAL] [HACKERS] Anyone want to assist with the translation

2002-10-04 Thread Tino Wildenhain

Hi Justin,

--On Donnerstag, 3. Oktober 2002 09:23 +1000 Justin Clift 
[EMAIL PROTECTED] wrote:

 Hi Michael,

 Michael Paesold wrote:
 snip
 Hi Justin,

 I am from Austria, and I would like to help. I could provide a German
 translation. The Babelfish's translation is really funny. Machine
 translation is readable, but it is not advocacy. ;-) I do not really need
 an interface, but just tell me in what way you want the texts.

 Cool.  Could you deal with an OpenOffice Calc or M$ Excel file having
 the lines of English text in one column, and doing the German
 translation into a second column?

 That might be easiest, and will allow a cut-n-paste of the German
 version straight into the database backend.

Haha cut&paste ;-) Ever heard of CSV? :-))

However, I can also have a look at it, if desired.

Regards
Tino

 Sound workable to you?

 :-)

 Regards and best wishes,

 Justin Clift

 Best Regards,
 Michael Paesold

 --
 My grandfather once told me that there are two kinds of people: those
 who work and those who take the credit. He told me to try to be in the
 first group; there was less competition there.
- Indira Gandhi

 ---(end of broadcast)---
 TIP 2: you can get off all lists at once with the unregister command
 (send unregister YourEmailAddressHere to [EMAIL PROTECTED])



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [GENERAL] [HACKERS] Anyone want to assist with the

2002-10-04 Thread Tino Wildenhain

Hi Justin,

 Good point.  For the moment we've whipped up that MS Excel document
 (created in OpenOffice of course) of all the English text strings in the
 site and emailed it to the volunteers.  :)

Btw. did you ever unzip the native OpenOffice (aka StarOffice)
file?


 So far community members have volunteered for German, Turkish, French,
 Spanish, Brazilian Portuguese, and Polish.

 Cool.  :)

 Want to co-ordinate with the other two German language volunteers?

Sure. So I'm here :-)

Regards
Tino

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] [GENERAL] Large databases, performance

2002-10-04 Thread Charles H. Woloszynski

Can you comment on the tools you are using to do the insertions (Perl, 
Java?) and the distribution of data (all random, all static), and the 
transaction scope (all inserts in one transaction, each insert as a 
single transaction, some group of inserts as a transaction).

I'd be curious what happens when you submit more queries than you have 
processors (you had four concurrent queries and four CPUs), if you care 
to run any additional tests.  Also, I'd report the query time in 
absolute (like you did) and also in 'Time/number of concurrent queries'.
This will give you a sense of how the system is scaling as the workload 
increases.  Personally I am more concerned about this aspect than the 
load time, since I am going to guess that this is where all the time is 
spent.  

Was the original posting on GENERAL or HACKERS?  Is this moving to 
PERFORMANCE for follow-up?  I'd like to follow this discussion and want 
to know if I should join another group.

Thanks,

Charlie

P.S.  Anyone want to comment on their expectation for 'commercial' 
databases handling this load?  I know that we cannot speak about 
specific performance metrics on some products (licensing restrictions) 
but I'd be curious if folks have seen some of the databases out there 
handle these dataset sizes and respond reasonably.


Shridhar Daithankar wrote:

Hi,

Today we concluded test for database performance. Attached are results and the 
schema, for those who have missed earlier discussion on this.

We have (almost) decided that we will partition the data across machines. The 
theme is, after every short interval a burst of data will be entered into a 
new table in the database, then indexed and vacuumed. The table(s) will be 
inherited so that a query on the base table will fetch results from all the 
children. The application has to consolidate all the data on a per-node 
basis. If the database is not postgresql, the app has to consolidate data 
across partitions as well.

Now we need to investigate whether selecting on the base table to include 
children would use the indexes created on the children tables.

It's estimated that when the entire data set is gathered, the total number 
of child tables would be around 1K-1.1K across all machines. 

This is in view of the average rate of data insertion, i.e. 5K records/sec, 
and the total data size, estimated to be 9 billion rows max, i.e. an 
estimated database size of 900GB. Obviously it's impossible to keep the 
insertion rate on an indexed table high as the data grows, so 
partitioning/inheritance looks like the better approach. 

Postgresql is not the final winner as yet. Mysql is in close range. I will keep 
you guys posted about the result.

Let me know about any comments..

Bye
 Shridhar

--
Price's Advice:It's all a game -- play it to have fun.


  



Machine
  Compaq Proliant Server ML 530
  Intel Xeon 2.4 Ghz Processor x 4
  4 GB RAM, 5 x 72.8 GB SCSI HDD
  RAID 0 (Striping) Hardware Setup, Mandrake Linux 9.0
  Cost - $13,500 ($1,350 for each additional 72GB HDD)

Performance Parameter                MySQL 3.23.52        MySQL 3.23.52        PostgreSQL 7.2.2
                                     WITHOUT InnoDB       WITH InnoDB for      with built-in
                                     support for          transactional        support for
                                     transactions         support              transactions

Complete Data
Inserts + building a composite index
40 GB data, 432,000,000 tuples       3738 secs            18720 secs           20628 secs
about 100 bytes each, schema on
'schema' sheet
composite index on 3 fields
(esn, min, datetime)

Load Speed                           115570 tuples/second 23076 tuples/second  20942 tuples/second

Database Size on Disk                48 GB                87 GB                111 GB

Average per partition

Re: [GENERAL] [HACKERS] Anyone want to assist with the

2002-10-04 Thread Tino Wildenhain

Hi Michael,

yeah, I got :-)

I'm busy reviewing :-)

Regards
Tino

--On Donnerstag, 3. Oktober 2002 21:54 +0200 Michael Paesold 
[EMAIL PROTECTED] wrote:

 Tino Wildenhain [EMAIL PROTECTED] wrote:

 Hi Justin,

  Good point.  For the moment we've whipped up that MS Excel document
  (created in OpenOffice of course) of all the English text strings in
  the site and emailed it to the volunteers.  :)

 Btw. did you ever unzip the native OpenOffice (aka StarOffice)
 file?

 
  So far community members have volunteered for German, Turkish, French,
  Spanish, Brazilian Portuguese, and Polish.
 
  Cool.  :)
 
  Want to co-ordinate with the other two German language volunteers?

 Sure. So I'm here :-)

 Regards
 Tino

 You should have already got at least two mails, haven't you?

 Michael




---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 Note also, that a typical SELECT only session would not advance 
 CURRENT_TIMESTAMP at all in the typical autocommit off mode that
 the Spec is all about. 

True, but the spec also says to default to serializable transaction
mode.  So in a single-transaction session like you are picturing,
the successive SELECTs would all see a frozen snapshot of the database.
Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes
a lot of sense, because it tells you exactly what time your snapshot
of the database state was taken.

This line of thought opens another can of worms: should the behavior
of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode?
Maybe SetQuerySnapshot is the routine that ought to capture the
statement-start-time timestamp value.  We could define
CURRENT_TIMESTAMP as the time of the active database snapshot.
Or at least offer a fourth parameter to that parameterized now() to
return this time.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Return of INSTEAD rules

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I am confused how yours differs from mine.  I don't see how the last
  matching tagged query would not be from an INSTEAD rule.
 
 You could have both INSTEAD and non-INSTEAD rules firing for the same
 original query.  If the alphabetically-last rule is a non-INSTEAD rule,
 then there's a difference.

How do we get multiple rules on a query?  I thought it was mostly
INSERT/UPDATE/DELETE, and those all operate on a single table.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Curtis Faith wrote:
 Bruce Momjian wrote:
  I may be missing something here, but other backends don't block while
  one writes to WAL.
 
 I don't think they'll block until they get to the fsync or XLogWrite
 call while another transaction is fsync'ing.
 
 I'm no Unix filesystem expert but I don't see how the OS can
 handle multiple writes and fsyncs to the same file descriptors without
 blocking other processes from writing at the same time. It may be that
 there are some clever data structures they use but I've not seen huge
 praise for most of the file systems. A well written file system could
 minimize this contention but I'll bet it's there with most of the ones
 that PostgreSQL most commonly runs on.
 
 I'll have to write a test and see if there really is a problem.

Yes, I can see some contention, but what does aio solve?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Hans-Jürgen Schönig wrote:
 Did anybody think about threaded sorting so far?
 Assume an SMP machine. In the case of building an index or in the case 
 of sorting a lot of data there is just one backend working. Therefore 
 just one CPU is used.
 What about starting a thread for every temporary file being created? 
 This way CREATE INDEX could use many CPUs.
 Maybe this is worth thinking about because it will speed up huge 
 databases and enterprise level computing.

We haven't thought about it yet because there are too many buggy thread
implementations.  We are probably just now getting to a point where we
can consider it.  However, lots of databases have moved to threads for
all sorts of things and ended up with a royal mess of code.  Threads
can only improve things in a few areas of the backend so it would be
nice if we could limit the exposure to threads to those areas;  sorting
could certainly be one of them, but frankly, I think disk I/O is our
limiting factor there.  I would be interested to see some tests that
showed otherwise.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Moving to the left requires an explicit cast (or at least an assignment
 to a column).  I know this looks strange to someone who knows that our
 numeric type beats float4/float8 on both range and precision, but it's
 effectively mandated by the SQL spec.  Any combination of exact and
 inexact numeric types is supposed to yield an inexact result per
 spec, thus numeric + float8 yields float8 not numeric.  Another reason
 for doing it this way is that a numeric literal like 123.456 can be
 initially typed as numeric, and later implicitly promoted to float4 or
 float8 if context demands it.  Doing that the other way 'round would
 introduce problems with precision loss.  We had speculated about
 introducing an unknown_numeric pseudo-type to avoid that problem, but
 the above hierarchy eliminates the need for unknown_numeric.  We can
 initially type a literal as the smallest thing it will fit in, and then
 do implicit promotion as needed.  (7.3 is not all the way there on that
 plan, but 7.4 will be.)

Do we know that defaulting floating constants will not be a performance
hit?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Return of INSTEAD rules

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 I am confused how yours differs from mine.  I don't see how the last
 matching tagged query would not be from an INSTEAD rule.
 
 You could have both INSTEAD and non-INSTEAD rules firing for the same
 original query.  If the alphabetically-last rule is a non-INSTEAD rule,
 then there's a difference.

 How do we get multiple rules on a query?  I thought it was mostly
 INSERT/UPDATE/DELETE, and those all operate on a single table.

You can create as many rules as you want.  One reasonably likely
scenario is that you have a view, you make an ON INSERT DO INSTEAD
rule to support insertions into the view (by inserting into some
underlying table(s) instead), and then you add some not-INSTEAD
rules to perform logging into other tables that aren't part of the
view but just keep track of activity.

You'd not want the logging activity to usurp the count result for this
setup, I think, even if it happened last.  (Indeed, that might be
*necessary*, if for some reason it needed to access the rows inserted
into the view's base table.)

This approach would give us a general principle that applies in all
cases: not-INSTEAD rules don't affect the returned command result.
Perhaps that would answer Manfred's thought that we should be able
to label which rules affect the result.  If you have any INSTEAD rules,
then it doesn't matter exactly how many you have, so you can mark them
INSTEAD or not to suit your fancy.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Do we know that defaulting floating constants will not be a performance
 hit?

Uh ... what's your concern exactly?  The datatype coercion (if any) will
happen once at parse time, not at runtime.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] CURRENT_TIMESTAMP

2002-10-04 Thread Michael Paesold

Bruce Momjian [EMAIL PROTECTED] wrote:

 Yes, clearly, we will need to have all three time values available to
 users.  With three people now suggesting we don't change, I will just
 add to TODO:

 Add now(transaction|statement|clock) functionality

 Is that good?

CURRENT_TIMESTAMP etc. are converted to now()::TIMESTAMP, at least in 7.2,
right?
So when there are all three options available, it would be easy to change
the behaviour of CURRENT_DATE/TIME/TIMESTAMP, right?

SET .. or GUC would be options, no?

Best Regards,
Michael Paesold


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Return of INSTEAD rules

2002-10-04 Thread Manfred Koizar

On Thu, 3 Oct 2002 22:21:27 -0400 (EDT), Bruce Momjian
[EMAIL PROTECTED] wrote:
so I propose we handle
INSTEAD rules this way:  that we return the oid and tuple count of the
last INSTEAD rule query with a tag matching the main query.  

Bruce, this won't work for this example

  CREATE RULE visible_delete AS -- DELETE rule
  ON DELETE TO visible 
  DO INSTEAD 
  COUNT UPDATE table3
  SET deleted = 1
  WHERE pk = old.pk;

because here we don't have a rule query with a matching tag.  Same
applies for

  CREATE RULE v_update AS -- UPDATE rule
  ON UPDATE TO v 
  DO INSTEAD NOTHING;

I wrote:
 One argument against automatically don't count non-INSTEAD rules and
 count the last statement in INSTEAD rules

Seems I introduced a little bit of confusion here by arguing against
something that has never been proposed before.  Funny that this
non-existent proposal is now seriously discussed :-(

Has the idea of extending the syntax to explicitly mark queries as
COUNTed already been rejected?  If yes, I cannot help here.  If no, I
keep telling you that this approach can emulate most of the other
possible solutions still under discussion.

Bruce wrote:
If there is more than one matching tag in
the INSTEAD rule, the user has the option to place the query he wants
for the return at the end of the rule.

Are you sure this is always possible without unwanted side effects?

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


I wrote:
  I'm no Unix filesystem expert but I don't see how the OS can
  handle multiple writes and fsyncs to the same file descriptors without
  blocking other processes from writing at the same time. It may be that
  there are some clever data structures they use but I've not seen huge
  praise for most of the file systems. A well written file system could
  minimize this contention but I'll bet it's there with most of the ones
  that PostgreSQL most commonly runs on.
  
  I'll have to write a test and see if there really is a problem.

Bruce Momjian wrote:

 Yes, I can see some contention, but what does aio solve?
 

Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes.
The disk i/o scheduler can therefore issue the writes using spinlocks or
something very fast since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.
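
For anyone who hasn't played with it, this is roughly what the POSIX
flavour looks like (just a standalone sketch using <aio.h>; the file
name and sizes are arbitrary).  Several writes get queued without the
caller blocking on each one, and the kernel is free to order them:

/*
 * Standalone POSIX AIO sketch: queue several writes without blocking
 * on each one, then wait for them all.  Usually needs -lrt on Linux.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define N_REQ 4
#define CHUNK 8192

int main(void)
{
    static char  buf[N_REQ][CHUNK];
    struct aiocb cb[N_REQ];
    const struct aiocb *list[N_REQ];
    int fd, i, done;

    fd = open("aio.test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    for (i = 0; i < N_REQ; i++)
    {
        memset(buf[i], 'a' + i, CHUNK);
        memset(&cb[i], 0, sizeof(cb[i]));
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = CHUNK;
        cb[i].aio_offset = (off_t) i * CHUNK;  /* may be reordered */
        if (aio_write(&cb[i]) != 0)
            perror("aio_write");
        list[i] = &cb[i];
    }

    /*
     * The caller is free to do useful work here while the I/O proceeds
     * in the background; eventually wait for all requests to finish.
     */
    do
    {
        done = 1;
        for (i = 0; i < N_REQ; i++)
            if (aio_error(&cb[i]) == EINPROGRESS)
                done = 0;
        if (!done)
            aio_suspend(list, N_REQ, NULL);
    } while (!done);

    for (i = 0; i < N_REQ; i++)
        if (aio_return(&cb[i]) < 0)
            perror("aio_return");

    close(fd);
    unlink("aio.test");
    return 0;
}

That's obviously not how the backend code would be structured; it just
shows the basic property being discussed: the submitting process does
not sit inside write() while the data is being pushed out.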

I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important because they should be able to
handle a much higher level of concurrent i/o.

Now whether or not the common file systems handle this well, I can't say.

Take a look at some comments on how Oracle uses asynchronous I/O

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this
issue will become more and more important as users contemplate SMP systems
or if threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then
see what kind of difference it makes.

I wanted to run the idea past this group first. We can all postulate
whether or not it will work but we won't know unless we try it. My real
issue is one of what happens in the event that it does work.

I've had very good luck implementing this sort of thing for other systems
but I don't yet know the range of i/o requests that PostgreSQL makes.

Assuming we can demonstrate no detrimental effects on system reliability
and that the change is implemented in such a way that it can be turned
on or off easily, will a 50% or better increase in speed for updates
justify the sort of change I am proposing? 20%? 10%?

- Curtis

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



[HACKERS] any experience with IA-64

2002-10-04 Thread Oleg Bartunov

I, probably, will have a chance to work with postgres on Linux IA-64.
Is there any optimization for postgresql ?

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Michael Paesold

Tom Lane [EMAIL PROTECTED] wrote:

 Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
  Note also, that a typical SELECT only session would not advance
  CURRENT_TIMESTAMP at all in the typical autocommit off mode that
  the Spec is all about.

 True, but the spec also says to default to serializable transaction
 mode.  So in a single-transaction session like you are picturing,
 the successive SELECTs would all see a frozen snapshot of the database.
 Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes
 a lot of sense, because it tells you exactly what time your snapshot
 of the database state was taken.

 This line of thought opens another can of worms: should the behavior
 of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode?
 Maybe SetQuerySnapshot is the routine that ought to capture the
 statement-start-time timestamp value.  We could define
 CURRENT_TIMESTAMP as the time of the active database snapshot.
 Or at least offer a fourth parameter to that parameterized now() to
 return this time.

 regards, tom lane

That is a very good point. At least with serializable transactions it seems
perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think
about read-committed level? Can time be committed? ;-)
It would be even more surprising to new users if the implementation of
CURRENT_TIMESTAMP would depend on trx serialization level.

Regards,
Michael Paesold


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread scott.marlowe

On Fri, 4 Oct 2002, Bruce Momjian wrote:

 Hans-Jürgen Schönig wrote:
  Did anybody think about threaded sorting so far?
  Assume an SMP machine. In the case of building an index or in the case 
  of sorting a lot of data there is just one backend working. Therefore 
  just one CPU is used.
  What about starting a thread for every temporary file being created? 
  This way CREATE INDEX could use many CPUs.
  Maybe this is worth thinking about because it will speed up huge 
  databases and enterprise level computing.
 
 We haven't thought about it yet because there are too many buggy thread
 implementations.  We are probably just now getting to a point where we
 can consider it.  However, lots of databases have moved to threads for
 all sorts of things and ended up with a royal mess of code.  Threads
 can only improve things in a few areas of the backend so it would be
 nice if we could limit the exposure to threads to those areas;  sorting
 could certainly be one of them, but frankly, I think disk I/O is our
 limiting factor there.  I would be interested to see some tests that
 showed otherwise.

Wouldn't the type of disk subsystem really make a big difference here?

With a couple of U160 cards and a dozen 15krpm hard drives, I would 
imagine I/O would no longer be as much of an issue as a single drive 
system would be.

It seems like sometimes we consider these issues more from the one or two 
SCSI drives perspective instead of the big box o drives perspective.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Return of INSTEAD rules

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 You can create as many rules as you want.  One reasonably likely
 scenario is that you have a view, you make an ON INSERT DO INSTEAD
 rule to support insertions into the view (by inserting into some
 underlying table(s) instead), and then you add some not-INSTEAD
 rules to perform logging into other tables that aren't part of the
 view but just keep track of activity.
 
 You'd not want the logging activity to usurp the count result for this
 setup, I think, even if it happened last.  (Indeed, that might be
 *necessary*, if for some reason it needed to access the rows inserted
 into the view's base table.)
 
 This approach would give us a general principle that applies in all
 cases: not-INSTEAD rules don't affect the returned command result.
 Perhaps that would answer Manfred's thought that we should be able
 to label which rules affect the result.  If you have any INSTEAD rules,
 then it doesn't matter exactly how many you have, so you can mark them
 INSTEAD or not to suit your fancy.

Oh, I like that, and rules fire alphabetically, right?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Return of INSTEAD rules

2002-10-04 Thread Bruce Momjian

Manfred Koizar wrote:
 On Thu, 3 Oct 2002 22:21:27 -0400 (EDT), Bruce Momjian
 [EMAIL PROTECTED] wrote:
 so I propose we handle
 INSTEAD rules this way:  that we return the oid and tuple count of the
 last INSTEAD rule query with a tag matching the main query.  
 
 Bruce, this won't work for this example
 
 CREATE RULE visible_delete AS -- DELETE rule
 ON DELETE TO visible 
 DO INSTEAD 
 COUNT UPDATE table3
 SET deleted = 1
 WHERE pk = old.pk;
 
 because here we don't have a rule query with a matching tag.  Same
 applies for

True, but because we have said we are going to return the tag of the
original command, I don't think we have anything valid to return in this
case to match the tag.

 CREATE RULE v_update AS -- UPDATE rule
 ON UPDATE TO v 
 DO INSTEAD NOTHING;

This is OK because the default is return zeros.

 I wrote:
  One argument against automatically don't count non-INSTEAD rules and
  count the last statement in INSTEAD rules
 
 Seems I introduced a little bit of confusion here by argueing against
 something that has never been proposed before.  Funny, that this
 non-existent proposal is now seriously discussed :-(
 
 Has the idea of extending the syntax to explicitly mark queries as
 COUNTed already been rejected?  If yes, I cannot help here.  If no, I

Well, I am hoping to find something that was automatic.  If we do our
best, and we still get complaints, we can add some syntax.  I am
concerned that adding syntax is just over-designing something that isn't
necessary.

 keep telling you that this approach can emulate most of the other
 possible solutions still under discussion.
 
 Bruce wrote:
 If there is more than one matching tag in
 the INSTEAD rule, the user has the option to place the query he wants
 for the return at the end of the rule.
 
 Are you sure this is always possible without unwanted side effects?

I am sure it isn't always possible, but let's do our best and see how
people react.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Bruce Momjian

Michael Paesold wrote:
 Tom Lane [EMAIL PROTECTED] wrote:
 
  Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
   Note also, that a typical SELECT only session would not advance
   CURRENT_TIMESTAMP at all in the typical autocommit off mode that
   the Spec is all about.
 
  True, but the spec also says to default to serializable transaction
  mode.  So in a single-transaction session like you are picturing,
  the successive SELECTs would all see a frozen snapshot of the database.
  Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes
  a lot of sense, because it tells you exactly what time your snapshot
  of the database state was taken.
 
  This line of thought opens another can of worms: should the behavior
  of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode?
  Maybe SetQuerySnapshot is the routine that ought to capture the
  statement-start-time timestamp value.  We could define
  CURRENT_TIMESTAMP as the time of the active database snapshot.
  Or at least offer a fourth parameter to that parameterized now() to
  return this time.
 
  regards, tom lane
 
 That is a very good point. At least with serializable transactions it seems
 perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think
 about read-committed level? Can time be committed? ;-)
 It would be even more surprising to new users if the implementation of
 CURRENT_TIMESTAMP would depend on trx serialization level.

Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read
committed would be quite confusing.  Also, because our default is read
committed, we would end up with CURRENT_TIMESTAMP being statement level,
which actually does give us a logical place to allow CURRENT_TIMESTAMP
to change, but I thought people voted against that.

However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause
to find items that were not in the future.  Would a CURRENT_TIMESTAMP
test in a multi-statement transaction want to check based on transaction
start, or on the tuples visible at the time the statement started?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Curtis Faith wrote:
  Yes, I can see some contention, but what does aio solve?
  
 
 Well, theoretically, aio lets the file system handle the writes without
 requiring any locks being held by the processes issuing those reads. 
 The disk i/o scheduler can therefore issue the writes using spinlocks or
 something very fast since it controls the timing of each of the actual
 writes. In some systems this is handled by the kernal and can be very
 fast.

I am again confused.  When we do write(), we don't have to lock
anything, do we?  (Multiple processes can write() to the same file just
fine.)  We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.  Does aio more easily allow the
kernel to order those writes?  Is that the issue?  Well, certainly the
kernel already orders the writes.  Just because we write() doesn't mean
it goes to disk.  Only fsync() or the kernel do that.

 
 I suspect that with large RAID controllers or intelligent disk systems
 like EMC this is even more important because they should be able to
 handle a much higher level of concurrent i/o.
 
 Now whether or not the common file systems handle this well, I can't say,
 
 Take a look at some comments on how Oracle uses asynchronous I/O
 
 http://www.ixora.com.au/notes/redo_write_multiplexing.htm
 http://www.ixora.com.au/notes/asynchronous_io.htm
 http://www.ixora.com.au/notes/raw_asynchronous_io.htm

Yes, but Oracle is threaded, right, so, yes, they clearly could win with
it.  I read the second URL and it said we could issue separate writes
and have them be done in an optimal order.  However, we use the file
system, not raw devices, so don't we already have that in the kernel
with fsync()?

 It seems that OS support for this will likely increase and that this
 issue will become more and more important as users contemplate SMP systems
 or if threading is added to certain PostgreSQL subsystems.

Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
into the trap where we add stuff that just makes things faster on
SMP/threaded systems when it makes our code _slower_ on single CPU
systems, which is exactly what Informix did in Informix 7, and we know
how that ended (lost customers, bought by IBM).  I don't think that's
going to happen to us, but I thought I would mention it.

 Assuming we can demonstrate no detrimental effects on system reliability
 and that the change is implemented in such a way that it can be turned
 on or off easily, will a 50% or better increase in speed for updates
 justify the sort of change I am proposing? 20%? 10%?

Yea, let's see what boost we get, and the size of the patch, and we can
review it.  It is certainly worth researching.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Hans-Jürgen Schönig wrote:
 Threads are bad - I know ...
 I like the idea of a pool of processes instead of threads - from my 
 point of view this would be useful.
 
 I am planning to run some tests (GEQO, AIX, sorts) as soon as I have 
 time to do so (still too much work ahead before :( ...).
 If I had time I'd love to do something for the PostgreSQL community :(.
 
 As far as sorting is concerned: It would be fine if it was possible to 
 define an alternative location for temporary sort files using SET.
 If you had multiple disks this would help in the case of concurrent 
 sorts because this way people could insert and index many tables at once 
 without having to access just one storage system.
 This would be an easy way out of the IO limitation ... - at least for 
 some problems.

Bingo!  Want to increase sorting performance, give it more I/O
bandwidth, and it will take 1/100th of the time to do threading.

Ingres had a nice feature where you could specify sort directories and
it would cycle through those directories while it did the tape sort.

Added to TODO:

* Allow sorting to use multiple work directories

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

scott.marlowe wrote:
  We haven't thought about it yet because there are too many buggy thread
  implementations.  We are probably just now getting to a point where we
  can consider it.  However, lots of databases have moved to threads for
  all sorts of things and ended up with a royal mess of code.  Threads
  can only improve things in a few areas of the backend so it would be
  nice if we could limit the exposure to threads to those areas;  sorting
  could certainly be one of them, but frankly, I think disk I/O is our
  limiting factor there.  I would be interested to see some tests that
  showed otherwise.
 
 Wouldn't the type of disk subsystem really make a big difference here?
 
 With a couple of U160 cards and a dozen 15krpm hard drives, I would 
 imagine I/O would no longer be as much of an issue as a single drive 
 system would be.
 
 It seems like sometimes we consider these issues more from the one or two 
 SCSI drives perspective instead of the big box o drives perspective.

Yes, it is mostly for non-RAID drives, but also, sometimes single drives
can be faster.  When you have a drive array, it isn't as easy to hit
each drive and keep it running sequentially.  Of course, I don't have
any hard numbers on that.  ;-)

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Michael Paesold

Bruce Momjian [EMAIL PROTECTED] wrote:

  That is a very good point. At least with serializable transactions it
seems
  perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you
think
  about read-commited level? Can time be commited? ;-)
  It would be even more surprising to new users if the implementation of
  CURRENT_TIMESTAMP would depend on trx serialization level.

 Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read
 committed would be quite confusing.  Also, because our default is read
 committed, we would end up with CURRENT_TIMESTAMP being statement level,
 which actually does give us a logical place to allow CURRENT_TIMESTAMP
 to change, but I thought people voted against that.

 However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause
 to find items that were not in the future.  Would a CURRENT_TIMESTAMP
 test in a multi-statement transaction want to check based on transaction
 start, or on the tuples visible at the time the statement started?

Well, in a serializable transaction there would be no difference at all, at
least if CURRENT_TIMESTAMP is consistent within the transaction. Any changes
outside the transaction after SetQuerySnapshot would not be seen by the
transaction anyway.

In read-committed, I think it's different. If CURRENT_TIMESTAMP is frozen,
then the behavior would be the same as in serializable level; if
CURRENT_TIMESTAMP advances with each statement, the result would also
change. That is an inherent problem with read-committed though and has not 
so much to do with the timestamp behavior.

Regards,
Michael Paesold




---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Justin Clift

Bruce Momjian wrote:
 
 scott.marlowe wrote:
snip
  It seems like sometimes we consider these issues more from the one or two
  SCSI drives perspective instead of the big box o drives perspective.
 
 Yes, it is mostly for non-RAID drives, but also, sometimes single drives
 can be faster.  When you have a drive array, it isn't as easy to hit
 each drive and keep it running sequentially.  Of course, I don't have
 any hard numbers on that.  ;-)

Arrghh... please remember that big bunch of drives != all in one
array.

It's common to have a bunch of drives and allocate different ones for
different tasks appropriately, whether in array sets, individually,
mirrored, etc.

100% totally feasible to have a separate 15k SCSI drive or two just
purely for doing sorts if it would assist in throughput.

:-)

Regards and best wishes,

Justin Clift


 
 --
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
 
 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

-- 
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
   - Indira Gandhi

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Do we know that defaulting floating constants will not be a performance
  hit?
 
 Uh ... what's your concern exactly?  The datatype coercion (if any) will
 happen once at parse time, not at runtime.

Yes, I realize it is during parsing.  I was just wondering if making
constants coming in from the parser NUMERIC is a performance hit?  I see
in gram.y that FCONST comes in as a Float so I don't even see where we
make it NUMERIC.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] any experience with IA-64

2002-10-04 Thread Bruce Momjian

Oleg Bartunov wrote:
 I, probably, will have a chance to work with postgres on Linux IA-64.
 Is there any optimization for postgresql ?

None we know of.  I think people have already gotten it working.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [GENERAL] Anyone want to assist with the translation of the

2002-10-04 Thread Tino Wildenhain

Hi Justin,

you probably want to use language negotiation
rather than a query variable :-)

Regards
Tino

--On Donnerstag, 3. Oktober 2002 08:53 +1000 Justin Clift 
[EMAIL PROTECTED] wrote:

 Hi everyone,

 Have just put together a prototype page to show off the multi-lingual
 capabilities that the Advocacy sites' infrastructure has:

 http://advocacy.postgresql.org/?lang=de

 The text was translated to german via Altavista's Babelfish, so it's
 probably only about 80% accurate, but it conveys the concept.

 Is anyone interested in translating the English version to other
 languages?  All Latin based languages should be fine (German, French,
 Italian, Spanish, Portuguese, Turkish, Greek, etc).

 If there's strong interest, then an interface to let volunteers
 translators do it easily can be constructed over the next fortnight or
 so.

 :-)

 Regards and best wishes,

 Justin Clift

 --
 My grandfather once told me that there are two kinds of people: those
 who work and those who take the credit. He told me to try to be in the
 first group; there was less competition there.
- Indira Gandhi

 ---(end of broadcast)---
 TIP 4: Don't 'kill -9' the postmaster



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] [GENERAL] Anyone want to assist with the translation of the Advocacy

2002-10-04 Thread Diogo Biazus



Hi everyone,

Have just put together a prototype page to show off the multi-lingual
capabilities that the Advocacy sites' infrastructure has:

http://advocacy.postgresql.org/?lang=de

The text was translated to German via Altavista's Babelfish, so it's
probably only about 80% accurate, but it conveys the concept.

Is anyone interested in translating the English version to other
languages?  All Latin-based languages should be fine (German, French,
Italian, Spanish, Portuguese, Turkish, Greek, etc).

If there's strong interest, then an interface to let volunteer
translators do it easily can be constructed over the next fortnight or
so.

:-)

Regards and best wishes,

Justin Clift

  

Justin,

I would be glad to translate it to Brazilian Portuguese.
Here we have a lot of companies starting to use PostgreSQL, including 
the one where I work and some of our clients.
It would be very nice to have this site translated.

Cheers,

-- 
Diogo de Oliveira Biazus
[EMAIL PROTECTED]
Ikono Sistemas e Automação
http://www.ikono.com.br



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Bingo!  Want to increase sorting performance?  Give it more I/O
 bandwidth, and it will take 1/100th of the time that threading would.

 Added to TODO:
   * Allow sorting to use multiple work directories

Yeah, I like that.  Actually it should apply to all temp files, not only
sorting.

A crude hack would be to allow there to be multiple pg_temp_NNN/
subdirectories (read symlinks) in a database, and then the code would
automatically switch among these.

Probably a cleaner idea would be to somehow integrate this with
tablespace management --- if you could mark some tablespaces as intended
for temp stuff, the system could round-robin among those as it creates
temp files and/or temp tables.
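
For illustration, here is a minimal sketch of that round-robin idea in plain
C.  Nothing below is actual backend code; the pg_temp_0/1/2 directory names,
the counter, and create_temp_file() are all made up, and the directories are
assumed to exist already.

/* Hypothetical sketch only: round-robin placement of new temp files
 * across several temp directories. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

static const char *temp_dirs[] = { "pg_temp_0", "pg_temp_1", "pg_temp_2" };
static const int n_temp_dirs = 3;
static int next_dir = 0;            /* per-backend round-robin counter */

/* Create the next temp file in whichever directory is next in rotation. */
static int create_temp_file(void)
{
    char path[256];
    const char *dir = temp_dirs[next_dir];

    next_dir = (next_dir + 1) % n_temp_dirs;
    snprintf(path, sizeof(path), "%s/pgsql_tmp.%d.%d",
             dir, (int) getpid(), rand());
    return open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
}

int main(void)
{
    int fd = create_temp_file();

    if (fd < 0)
        perror("create_temp_file");
    else
        close(fd);
    return 0;
}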

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Tom Lane

Greg Copeland [EMAIL PROTECTED] writes:
 ... I can understand why
 addressing the seemingly more common I/O-bound case would receive
 priority; however, I'm at a loss as to why the other would be completely
 ignored.

Bruce already explained that we avoid threads because of portability and
robustness considerations.

The notion of a sort process pool seems possibly attractive.  I'm
unconvinced that it's going to be a win though because of the cost of
shoving data across address-space boundaries.  Another issue is that
the sort comparison function can be anything, including user-defined
code that does database accesses or other interesting stuff.  This
would mean that the sort auxiliary process would have to adopt the
database user identity of the originating process, and quite possibly
absorb a whole lot of other context information before it could
correctly/safely execute the comparison function.  That pushes the
overhead up a lot more.

(The need to allow arbitrary operations in the comparison function would
put a pretty substantial crimp on a thread-based approach, too, even if
we were willing to ignore the portability issue.)

Still, if you want to try it out, feel free ... this is an open-source
project, and if you can't convince other people that an idea is worth
implementing, that doesn't mean you can't implement it yourself and
prove 'em wrong.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 15:07, Tom Lane wrote:
 the sort comparison function can be anything, including user-defined
 code that does database accesses or other interesting stuff.  This

This is something that I'd not considered.

 would mean that the sort auxiliary process would have to adopt the
 database user identity of the originating process, and quite possibly
 absorb a whole lot of other context information before it could
 correctly/safely execute the comparison function.  That pushes the
 overhead up a lot more.

Significantly!  Agreed.

 
 Still, if you want to try it out, feel free ... this is an open-source
 project, and if you can't convince other people that an idea is worth
 implementing, that doesn't mean you can't implement it yourself and
 prove 'em wrong.

No Tom, my issue wasn't whether I could convince someone, but rather
that something had been put on the table requesting additional feedback
on its feasibility and had been completely ignored.  Fact is, I knew I
didn't know enough about the implementation details to even attempt to
convince anyone of anything.  I simply wanted to explore the idea, or
rather the feasibility of the idea.  In theory, it's a great idea.  In
practice, I had no idea, thus my desire to seek additional input.  As
such, it seems a practical implementation may prove difficult.  I now
understand.  Thank you for taking the time to respond in a manner that
satisfies my curiosity.  That's all I was looking for.
:)

Best Regards,

Greg






Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 ... most file systems can't process fsyncs
 simultaneously with other writes, so those writes block because the file
 system grabs its own internal locks.

Oh?  That would be a serious problem, but I've never heard that asserted
before.  Please provide some evidence.

On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files?  Then there's no need to call
fsync() at all, except during checkpoints (which actually issue sync()
not fsync(), anyway).
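
To make the two approaches concrete, here is a minimal standalone C sketch
(the file names are illustrative, and O_DSYNC availability varies by
platform):

/* Sketch only: contrasts write()+fsync() with writes on an O_DSYNC
 * descriptor. */
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char block[8192];
    memset(block, 0, sizeof(block));

    /* Variant 1: ordinary write, then an explicit fsync(). */
    int fd = open("wal_test.fsync", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, block, sizeof(block));
    fsync(fd);          /* may take a vnode-level lock on some systems */
    close(fd);

    /* Variant 2: O_DSYNC -- each write() returns only once the data
     * is on stable storage, so no separate fsync() is needed. */
    fd = open("wal_test.dsync", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, block, sizeof(block));
    close(fd);
    return 0;
}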

 Whether by threads or multiple processes, there is the same contention on
 the file through multiple writers. The file system can decide to reorder
 writes before they start but not after. If a write comes after an
 fsync starts, it will have to wait on that fsync.

AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
on safety grounds (we want to be sure we have a consistent WAL up to the
end of what we've written).  Even if we can allow some reordering when a
single transaction puts out a large volume of WAL data, I fail to see
where any large gain is going to come from.  We're going to be issuing
those writes sequentially and that ought to match the disk layout about
as well as can be hoped anyway.

 Likewise, a given process's writes can NEVER be reordered if they are
 submitted synchronously, as the current code does in the calls that flush
 the log as well as the dirty buffer pages.

We do not fsync buffer pages; in fact a transaction commit doesn't write
buffer pages at all.  I think the above is just a misunderstanding of
what's really happening.  We have synchronous WAL writing, agreed, but
we want that AFAICS.  Data block writes are asynchronous (between
checkpoints, anyway).

There is one thing in the current WAL code that I don't like: if the WAL
buffers fill up then everybody who would like to make WAL entries is
forced to wait while some space is freed, which means a write, which is
synchronous if you are using O_DSYNC.  It would be nice to have a
background process whose only task is to issue write()s as soon as WAL
pages are filled, thus reducing the probability that foreground
processes have to wait for WAL writes (when they're not committing that
is).  But this could be done portably with one more postmaster child
process; I see no real need to dabble in aio_write.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Zeugswetter Andreas SB SD


  ... most file systems can't process fsyncs
  simultaneously with other writes, so those writes block because the file
  system grabs its own internal locks.
 
 Oh?  That would be a serious problem, but I've never heard that asserted
 before.  Please provide some evidence.
 
 On a filesystem that does have that kind of problem, can't you avoid it
 just by using O_DSYNC on the WAL files?

To make this competitive, the WAL writes would need to be improved to 
do more than one block (up to 256k or 512k per write) with one write call 
(if that much is to be written for this tx to be able to commit).
This should actually not be too difficult since the WAL buffer is already 
contiguous memory.

If that is done, then I bet O_DSYNC will beat any other config we currently 
have.

With this, a separate disk for WAL, and large transactions, you should be able
to see your disks hit the maximum I/O figures they are capable of :-)

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 To make this competitive, the WAL writes would need to be improved to 
 do more than one block (up to 256k or 512k per write) with one write call 
 (if that much is to be written for this tx to be able to commit).
 This should actually not be too difficult since the WAL buffer is already 
 contiguous memory.

Hmmm ... if you were willing to dedicate a half meg or meg of shared
memory for WAL buffers, that's doable.  I was originally thinking of
having the (still hypothetical) background process wake up every time a
WAL page was completed and available to write.  But it could be set up
so that there is some slop, and it only wakes up when the number of
writable pages exceeds N, for some N that's still well less than the
number of buffers.  Then it could write up to N sequential pages in a
single write().
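
A rough sketch of that wake-up-at-N scheme follows.  Threads and a shared
struct stand in for the background process and shared memory purely to keep
the example self-contained; every name and the buffer layout are
hypothetical, and error handling and buffer wraparound are omitted.

/* Hypothetical sketch: a background WAL writer that wakes when at least
 * N completed pages are pending and flushes them with one large write(). */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define WAL_PAGE_SIZE 8192
#define N_PAGES       64
#define N_THRESHOLD   16            /* the "N" discussed above */

struct wal_shared {
    pthread_mutex_t lock;
    pthread_cond_t  more_pages;
    char  *buffer;                  /* contiguous WAL buffer */
    int    write_page;              /* next page to write out */
    int    insert_page;             /* first page not yet completely filled */
    int    wal_fd;
};

static void *wal_writer(void *arg)
{
    struct wal_shared *s = arg;

    for (;;) {
        pthread_mutex_lock(&s->lock);
        while (s->insert_page - s->write_page < N_THRESHOLD)
            pthread_cond_wait(&s->more_pages, &s->lock);
        int first = s->write_page;
        int count = s->insert_page - s->write_page;
        s->write_page = s->insert_page;
        pthread_mutex_unlock(&s->lock);

        /* One large sequential write covering 'count' completed pages. */
        write(s->wal_fd, s->buffer + (long) first * WAL_PAGE_SIZE,
              (size_t) count * WAL_PAGE_SIZE);
    }
    return NULL;
}

int main(void)
{
    struct wal_shared s;
    pthread_t tid;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.more_pages, NULL);
    s.buffer = calloc(N_PAGES, WAL_PAGE_SIZE);
    s.write_page = 0;
    s.insert_page = 0;
    s.wal_fd = open("wal_sketch", O_WRONLY | O_CREAT, 0600);

    pthread_create(&tid, NULL, wal_writer, &s);

    /* Pretend foreground backends fill WAL pages and poke the writer. */
    for (int i = 0; i < N_PAGES; i++) {
        pthread_mutex_lock(&s.lock);
        s.insert_page++;
        pthread_cond_signal(&s.more_pages);
        pthread_mutex_unlock(&s.lock);
    }
    sleep(1);                       /* let the writer drain, then exit */
    return 0;
}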

However, this would only be a win if you had few and large transactions.
Any COMMIT will force a write of whatever we have so far, so the idea of
writing hundreds of K per WAL write can only work if it's hundreds of K
between commit records.  Is that a common scenario?  I doubt it.

If you try to set it up that way, then it's more likely that what will
happen is the background process seldom awakens at all, and each
committer effectively becomes responsible for writing all the WAL
traffic since the last commit.  Wouldn't that lose compared to someone
else having written the previous WAL pages in background?

We could certainly build the code to support this, though, and then
experiment with different values of N.  If it turns out N==1 is best
after all, I don't think we'd have wasted much code.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Improving backend startup interlock

2002-10-04 Thread Giles Lean


Tom Lane writes:

 $ man flock
 No manual entry for flock.
 $
 
 HPUX has generally taken the position of adopting both BSD and SysV
 features, so if it doesn't exist here, it's not portable to older
 Unixen ...

If only local locking is at issue then finding any one of fcntl()
locking, flock(), or lockf() would do.  All Unixen will have one or
more of these and autoconf machinery exists to find them.
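
For concreteness, here is a minimal sketch of the fcntl() flavour of such an
interlock.  The lock file name is illustrative only, and a real
implementation would also have to cope with fcntl locks being dropped when
any descriptor on the file is closed.

/* Sketch: take an exclusive fcntl() lock on a lock file in the data
 * directory.  A second postmaster trying the same lock fails as long
 * as the first holder is alive; the lock is released automatically
 * when the holding process exits. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("postmaster.lock", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = { 0 };
    fl.l_type = F_WRLCK;            /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 = lock the whole file */

    if (fcntl(fd, F_SETLK, &fl) < 0) {
        perror("data directory appears to be locked by another postmaster");
        return 1;
    }

    pause();                        /* hold the lock; Ctrl-C to quit */
    return 0;
}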

The issue Tom raised about NFS support remains: locking over NFS
introduces new failure modes.  It also only works for NFS clients
that support NFS locking, which not all do.

Mind you, NFS users are currently entirely unprotected from someone
starting a postmaster on a different NFS client using the same data
directory, which file locking would prevent. So there is
some win for NFS users as well as local filesystem users.  (Anyone
using NFS care to put their hand up?  Maybe nobody does?)

Is the benefit of better local filesystem behaviour plus multiple
client protection for NFS users who have file locking enough to
outweigh the drawbacks?  My two cents says it is, but my two cents are
worth approximately USD$0.01, which is to say not very much ...

Regards,

Giles

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Improving backend startup interlock

2002-10-04 Thread Michael Paesold

Giles Lean [EMAIL PROTECTED] wrote:

 Tom Lane writes:
 
  $ man flock
  No manual entry for flock.
  $
  
  HPUX has generally taken the position of adopting both BSD and SysV
  features, so if it doesn't exist here, it's not portable to older
  Unixen ...
 
 If only local locking is at issue then finding any one of fcntl()
 locking, flock(), or lockf() would do.  All Unixen will have one or
 more of these and autoconf machinery exists to find them.
 
 The issue Tom raised about NFS support remains: locking over NFS
 introduces new failure modes.  It also only works for NFS clients
 that support NFS locking, which not all do.
 
 Mind you NFS users are currently entirely unprotected from someone
 starting a postmaster on a different NFS client using the same data
 directory right now, which file locking would prevent. So there is
 some win for NFS users as well as local filesystem users.  (Anyone
 using NFS care to put their hand up?  Maybe nobody does?)
 
 Is the benefit of better local filesystem behaviour plus multiple
 client protection for NFS users who have file locking enough to
 outweigh the drawbacks?  My two cents says it is, but my two cents are
 worth approximately USD$0.01, which is to say not very much ...

Well, I am going to do some tests with PostgreSQL and our NetApp
filer later in October. If that setup proves to work fast and reliably,
I would also be interested in such locking. I don't care about
the feature if I find the PostgreSQL/NFS/NetApp-filer setup to be
unreliable or to perform badly.

I'll see.

Regards,
Michael Paesold


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Zeugswetter Andreas SB SD


 Hmmm ... if you were willing to dedicate a half meg or meg of shared
 memory for WAL buffers, that's doable.

Yup, we configure Informix with three 2 MB buffers (LOGBUF 2048) here.

 However, this would only be a win if you had few and large transactions.
 Any COMMIT will force a write of whatever we have so far, so the idea of
 writing hundreds of K per WAL write can only work if it's hundreds of K
 between commit records.  Is that a common scenario?  I doubt it.

It should help most for data loading, or mass updating, yes.

Andreas

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Giles Lean


Curtis Faith writes:

 I'm no Unix filesystem expert but I don't see how the OS can handle
 multiple writes and fsyncs to the same file descriptors without
 blocking other processes from writing at the same time.

Why not?  Other than the necessary synchronisation for attributes such
as file size and modification times, multiple processes can readily
write to different areas of the same file at the same time.

fsync() may not return until after the buffers it schedules are
written, but it doesn't have to block subsequent writes to different
buffers in the file either.  (Note too Tom Lane's responses about
when fsync() is used and not used.)

 I'll have to write a test and see if there really is a problem.

Please do.  I expect you'll find things aren't as bad as you fear.

In another posting, you write:

 Hmm, I keep hearing that buffer block writes are asynchronous but I don't
 read that in the code at all. There are simple write calls with files
 that are not opened with O_NOBLOCK, so they'll be done synchronously. The
 code for this is relatively straightforward (once you get past the
 storage manager abstraction) so I don't see what I might be missing.

There is a confusion of terminology here: the write() is synchronous
from the point of view of the application only in that the data is copied into
kernel buffers (or pages remapped, or whatever) before the system call
returns.  For files opened with O_DSYNC the write() would wait for the
data to be written to disk.  Thus O_DSYNC is synchronous I/O, but
there is no equivalently easy name for the regular flush to disk
after write() returns that the Unix kernel has done ~forever.

The asynchronous I/O that you mention (aio) is a third thing,
different from both regular write() and write() with O_DSYNC. I
understand that with aio the data is not even transferred to the
kernel before the aio_write() call returns, but I've never programmed
with aio and am not 100% sure how it works.
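
A small standalone sketch contrasting the three flavours (file names are
illustrative; on some systems the aio calls need -lrt at link time, and
error handling is minimal):

/* Sketch of the three flavours discussed above: plain write(), write()
 * on an O_DSYNC descriptor, and POSIX aio_write(). */
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    memset(buf, 'x', sizeof(buf));

    /* 1. Plain write(): returns once the data is in kernel buffers;
     *    the kernel flushes it to disk some time later. */
    int fd1 = open("plain.dat", O_WRONLY | O_CREAT, 0600);
    write(fd1, buf, sizeof(buf));

    /* 2. O_DSYNC: write() returns only after the data is on disk. */
    int fd2 = open("dsync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    write(fd2, buf, sizeof(buf));

    /* 3. aio_write(): the call queues the I/O and returns immediately;
     *    completion is checked separately. */
    int fd3 = open("aio.dat", O_WRONLY | O_CREAT, 0600);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd3;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    if (aio_write(&cb) == 0) {
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL); /* wait for completion */
        printf("aio result: %zd\n", aio_return(&cb));
    }

    close(fd1); close(fd2); close(fd3);
    return 0;
}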

Regards,

Giles



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Improving backend startup interlock

2002-10-04 Thread Joe Conway

Michael Paesold wrote:
 Giles Lean [EMAIL PROTECTED] wrote:
Mind you NFS users are currently entirely unprotected from someone
starting a postmaster on a different NFS client using the same data
directory right now, which file locking would prevent. So there is
some win for NFS users as well as local filesystem users.  (Anyone
using NFS care to put their hand up?  Maybe nobody does?)

Is the benefit of better local filesystem behaviour plus multiple
client protection for NFS users who have file locking enough to
outweigh the drawbacks?  My two cents says it is, but my two cents are
worth approximately USD$0.01, which is to say not very much ...
 
 
 Well, I am going to do some tests with postgresql and our netapp
 filer later in October. If that setup proves to work fast and reliable
 I would also be interested in such a locking. I don't care about
 the feature if I find the postgresql/NFS/netapp-filer setup to be
 unreliable or bad performing.
 

We have multiple Oracle databases running over NFS from an HPUX server to a 
netapp and have been pleased with the performance overall. It does require 
some tuning to get it right, and it hasn't been entirely without issues, but I 
don't see us going back to local storage. We also just recently set up a Linux 
box running Oracle against an NFS mounted netapp. Soon I'll be adding Postgres 
on the same machine, initially using locally attached storage, but at some 
point I may need to shift to the netapp due to data volume.

If you do try Postgres on the netapp, please post your results/experience and 
I'll do the same.

Anyway, I guess I qualify as interested in an NFS-safe locking method.

Joe


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Threaded Sorting

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  ...  But ISTM that if this makes sense for
  our internal temp files, it makes sense for user-created temp tables
  as well.
 
  Yes, I was thinking that, but of course, those are real tables, rather
  than just files.  Not sure how clean it will be to mix those in the same
  directory.  We haven't in the past.  Is it a good idea?
 
 Sure we have --- up till recently, pg_temp files just lived in the
 database directory.  I think it was you that added the pg_temp
 subdirectory, and the reason you did it was to let people symlink the
 temp files to someplace else.  But that's just a zeroth-order
 approximation to providing a tablespace facility for these things.

OK, TODO updated:

* Allow sorting, temp files, and temp tables to use multiple work
  directories

FYI, I originally created that directory so a postmaster startup could
clear that dir.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  After some research I still hold that fsync blocks, at least on
  FreeBSD. Am I missing something?
 
  Here's the evidence:
  [ much snipped ]
  vp = (struct vnode *) fp->f_data;
  vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
 
 Hm, I take it a vnode is what's usually called an inode, ie the unique
 identification data for a specific disk file?

Yes, Virtual Inode.  I think it is virtual because it is used for NFS,
where the handle really isn't an inode.

 This is kind of ugly in general terms but I'm not sure that it really
 hurts Postgres.  In our present scheme, the only files we ever fsync()
 are WAL log files, not data files.  And in normal operation there is
 only one WAL writer at a time, and *no* WAL readers.  So an exclusive
 kernel-level lock on a WAL file while we fsync really shouldn't create
 any problem for us.  (Unless this indirectly blocks other operations
 that I'm missing?)

I think the small issue is:

proc1   proc2
write
fsync   write
        fsync

Proc2 has to wait for proc1's fsync, but the write is so short compared
to the fsync that I don't see an issue.  Now, if someone came up with
code that did only one fsync for the above case, that would be a big
win.
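
That "one fsync covering several commits" idea is usually called group
commit.  A hypothetical sketch of the piggy-backing logic follows; threads
stand in for backends purely to keep it self-contained, and all names are
made up.

/* Hypothetical group-commit sketch: a committer calls fsync() only if
 * nobody has already flushed past its own commit record, so one fsync()
 * can cover several concurrent commits. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t flush_lock  = PTHREAD_MUTEX_INITIALIZER;
static int  wal_fd;
static long written_upto = 0;       /* bytes written into the WAL file */
static long flushed_upto = 0;       /* bytes known to be fsync()ed     */

static void commit_one_record(const char *rec, size_t len)
{
    long my_end;

    /* Append our commit record and note where it ends. */
    pthread_mutex_lock(&insert_lock);
    write(wal_fd, rec, len);
    written_upto += (long) len;
    my_end = written_upto;
    pthread_mutex_unlock(&insert_lock);

    /* Flush only if nobody has flushed past our record yet; whoever
     * fsyncs covers every record written before the fsync started. */
    pthread_mutex_lock(&flush_lock);
    if (flushed_upto < my_end) {
        long target;

        pthread_mutex_lock(&insert_lock);
        target = written_upto;      /* everything written so far */
        pthread_mutex_unlock(&insert_lock);

        fsync(wal_fd);              /* one fsync covers all of it */
        if (flushed_upto < target)
            flushed_upto = target;
    }
    pthread_mutex_unlock(&flush_lock);
}

static void *backend(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100; i++)
        commit_one_record("COMMIT\n", 7);
    return NULL;
}

int main(void)
{
    pthread_t tids[4];

    wal_fd = open("wal_group_commit", O_WRONLY | O_CREAT, 0600);
    for (int i = 0; i < 4; i++)
        pthread_create(&tids[i], NULL, backend, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);
    return 0;
}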

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP

2002-10-04 Thread Bruce Momjian


OK, are we agreed to leave CURRENT_TIMESTAMP/now() alone and just add
now(string)?  If no one replies, I will assume that is a yes and I
will add it to TODO.

---

Michael Paesold wrote:
 Bruce Momjian [EMAIL PROTECTED] wrote:
 
   That is a very good point. At least with serializable transactions it seems
   perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think
   about read-committed level? Can time be committed? ;-)
   It would be even more surprising to new users if the implementation of
   CURRENT_TIMESTAMP depended on the transaction serialization level.
 
  Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read
  committed would be quite confusing.  Also, because our default is read
  committed, we would end up with CURRENT_TIMESTAMP being statement level,
  which actually does give us a logical place to allow CURRENT_TIMESTAMP
  to change, but I thought people voted against that.
 
  However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause
  to find items that were not in the future.  Would a CURRENT_TIMESTAMP
  test in a multi-statement transaction want to check based on transaction
  start, or on the tuples visible at the time the statement started?
 
 Well, in a serializable transaction there would be no difference at all, at
 least if CURRENT_TIMESTAMP is consistent within the transaction. Any changes
 outside the transaction after SetQuerySnapshot would not be seen by the
 transaction anyway.
 
 In read-committed, I think it's different. If CURRENT_TIMESTAMP is frozen,
 then the behavior would be the same as in serializable level; if
 CURRENT_TIMESTAMP advances with each statement, the result would also
 change. That is an inherent problem with read-committed, though, and has not so
 much to do with the timestamp behavior.
 
 Regards,
 Michael Paesold
 
 
 
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly