Re: [HACKERS] Cause of moving-target FSM space-needed reports

2006-09-21 Thread Kevin Brown
Tom Lane wrote:
 In fact, maybe vacuum should just throw a WARNING when it
 finds a single rel with more than MaxFSMPages pages with useful free
 space?

I fully agree with this.  This (in particular, how many useful free
pages a rel has) is information a good DBA can make very good use of,
and is needed in the case where it exceeds MaxFSMPages.

I would also suggest having vacuum verbose print an INFO message with
the rel's number of free pages for rels that don't exceed MaxFSMPages
(that number should be printed in the WARNING when the number exceeds
MaxFSMPages).

Are there other ways of getting this information from the system?  If
not, then I'd consider this proposed change important.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] advisory locks and permissions

2006-09-21 Thread Kevin Brown
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Doesn't creating many temp tables in a transaction do the same thing?
 
 True, but it's a tad harder/less likely that you'd accidentally cause
 a problem that way.

Then why not use a GUC (that only an administrator can set) to control
the maximum number of advisory locks a given backend can take at any
one time?  Then it becomes the DBA's problem (and solution) if someone
manages to run the database out of shared memory through this
mechanism.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PATCHES] [HACKERS] Fix linking of OpenLDAP libraries

2006-09-11 Thread Kevin Brown
Tom Lane wrote:
 Albe Laurenz [EMAIL PROTECTED] writes:
  Let me expand a little on some of the peculiarities of
  shared libraries on AIX:
 
  - A normal AIX shared library is called libXX.a
It is an 'ar' archive that contains the shared object(s).
 
 Ah, so the problem really boils down to funny naming conventions.
 If they use .a for both shared and static libraries, how does anyone
 tell the difference?

It sounds to me like there is no difference.  Notice how his example
ldd output shows dependencies on specific .o entries within the
various .a files that reside on the system, rather than on the .a
files as a whole.  If those entries had been statically linked then
they wouldn't have shown up in the ldd output at all.

So the difference is no longer between static libraries and shared
libraries -- they're now just libraries.  The only difference is how
you link to them.

What IBM has done here is very sensible, and is really what the other
Unixes should have done from the beginning: a library is just a
library, and what differs is how it's used.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Foreign keys

2006-09-10 Thread Kevin Brown
Joshua D. Drake wrote:
 
 In any case the same logic that leads to it being desirable to report
 all the errors to the user in a UI and not just report them one by one
 also applies to the database. I'm not sure it's the most important
 issue in the world, but it does seem like it would be a nice feature
 if it reported all the errors in the statement, not just the first one
 it finds.
 
 
 Seems kind of extraneous to me. I am guessing it would cause yet further 
 overhead with our foreign key checks.

But in this case, it would be (or should be) overhead only in the case
of failure.  In the case of success, all the constraints are checked
anyway -- they just succeed.

I would expect that the number of applications for which a constraint
violation is the norm and not the exception is very small.


But Tom's concern is a valid one.  I expect a reasonable compromise
would be to record and show the errors for only the non-deferred
constraints in the currently executing statement, because after that
point the transaction is in an error state anyway and thus can't
continue without a rollback to a savepoint.  It probably wouldn't make
sense to evaluate the deferred constraints within the erroring
statement anyway -- they're deferred, which by definition means they
don't get evaluated until commit, so evaluating them at failure time
could easily show errors that are only there because subsequent
statements never got executed.

As for the deferred constraints, it might be reasonable to show errors
only up to some limit (controlled by a GUC, perhaps), with the default
limit being 1, which is what we have now.  Otherwise you run the risk
of throwing millions of errors, which is surely not desirable.  The
downside to this is until you've hit the limit, you have to evaluate
*all* the deferred constraints, which could take a while, whereas the
current setup will return immediately upon encountering the first
constraint error.



 My testing shows that the use of foreign keys on high velocity single
 transaction loads can easily cause a 50% reduction in performance. Why
 add to that? What we need to be doing is finding a way to decrease the
 impact of foreign key checks.

I definitely agree here, but this should be independent of how foreign
key failures are handled once they're detected.  In other words, what
you're experiencing is the performance hit that comes from evaluating
the constraints, not from reporting the errors afterwards.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Fixed length data types issue

2006-09-10 Thread Kevin Brown
Tom Lane wrote:
 (does anyone know the cost of ntohl() on modern
 Intel CPUs?)

I wrote a simple test program to determine this:

#include <arpa/inet.h>
#include <stdint.h>

int main (int argc, char *argv[]) {
    unsigned long i;
    uint32_t a;

    a = 0;
    /* Loop bound partially lost in the archive; the exact original
       constant is unknown, but it needs to be in the hundreds of
       millions for the timings below to be measurable. */
    for (i = 0 ; i < 400000000L ; ++i) {
#ifdef CALL_NTOHL
        a = ntohl(i);
#endif
    }
    return a;
}


I have a system with an Athlon 64 3200+ (2.0 GHz) running in 64-bit
mode, another with the same processor running in 32-bit mode, a third
running a 1.5 GHz Pentium 4, and a fourth running a pair of 2.8 GHz
Xeons in hyperthreading mode.

I compiled the test program on the 32-bit systems with the -std=c9x
option so that the constant would be treated as unsigned.  Other than
that, the compilation method I used was identical: no optimization,
since the optimizer would otherwise skip the loop entirely in the
version without the ntohl() call.  I compiled it both with and without
defining CALL_NTOHL, and measured the difference in billed CPU seconds.

Based on the above, on both Athlon 64 systems, each ntohl() invocation
and assignment takes 1.04 nanoseconds to complete (I presume the
assignment is to a register, but I'd have to examine the assembly to
know for sure).  On the 1.5 GHz P4 system, each iteration takes 8.49
nanoseconds.  And on the 2.8 GHz Xeon system, each iteration takes
5.01 nanoseconds.


That seems reasonably fast to me...




-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Getting a move on for 8.2 beta

2006-09-07 Thread Kevin Brown
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Well, it's taken us the full month to get this far through the queue, so
  I'd sure like to have more people on board reviewing next time.  Alvaro
  helped but I miss Neil Conway, and some other people have helped in the
  past.  However --- the present situation has nothing to do with lack of
  reviewers, it's lack of time to finish the patches.
 
  I did try to get us additional help in reviewing.  Neil was unavailable,
  and Alvaro could only give part of his time.
 
 It strikes me that setting feature freeze in midsummer might not be the
 best strategy for having manpower available to review --- people tend to
 be on vacation in August.  Maybe the answer is just to move the dates a
 bit one way or the other.

Hmm...but if you're going to do that, why not do that now: push the
beta date back by, say, a month (or however long you had in mind) for
this cycle.  That way, the two major patches that are likely to be
dropped for this cycle stand a chance to make it into this release,
and you accomplish your goal of moving the dates a bit all at the same
time.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Coding style for emacs

2006-08-30 Thread Kevin Brown
Andrew Dunstan wrote:
 David Fetter wrote:
 If we put them in a place that's visible before you get the source, we
 can help people use the settings globally :)
 
 The likely level of our influence on someone who hasn't used the 
 settings and isn't editing our source code is approximately equal to the 
 influence of Pluto's gravitational field on your mood when you got out 
 of bed this morning.

I always wondered what affected his mood when he got out of bed this
morning.  Now I know!

It's amazing the things you can learn by lurking here...  :-)



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] [PATCHES] LWLock statistics collector

2006-08-03 Thread Kevin Brown
Tom Lane wrote:
 If I thought that average users would have a need for LWLock statistics,
 I'd be more sympathetic to expending effort on a nice frontend for
 viewing the statistics, but this is and always will be just a concern
 for hardcore hackers ...

That may be true of the output, but that's not a very strong argument
against making it much easier to gather and display the LWLock
statistics.  I can easily imagine the patch being a useful performance
troubleshooting tool in a high load environment.  Depends on how
easy/intrusive it is to enable/use the stderr method on a production
system, though, as well as how much of a performance impact the
measurements have on overall operation...



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Corrupt data pages...

2006-04-18 Thread Kevin Brown
Tom Lane wrote:
 It's fairly hard to see how that could happen inside Postgres.  One can
 readily imagine bugs that might replace one whole page with another,
 but there aren't any operations that manipulate half-a-page.  On the
 other hand, if your kernel uses 4K blocksize, this could be explained
 as substitution of one buffer for another at the kernel level.  So my
 money is on a kernel bug.  As I mentioned, we've once or twice before
 seen reports that looked like similar substitutions of single pages by
 the kernel.

Yeah, I agree that's what's likely to have happened here.  The
kernel's page size is 4k, as is the default block size used by XFS.

Tracking this one down any further is going to be nigh unto
impossible, I think.

There have been a pile of fixes to XFS in 2.6.16, so I'll start using
that, at least...


Thanks for looking at this.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Corrupt data pages...

2006-04-17 Thread Kevin Brown
Tom Lane wrote:
 You should at least show the page you think is corrupt.

I attempted to send this additional info to the list but I think the
message got dropped on the floor by the mailing list software or by
the spam filter.

I'll put the files on a web server and post links to them here.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Corrupt data pages...

2006-04-17 Thread Kevin Brown
I wrote:
 I attempted to send this additional info to the list but I think the
 message got dropped on the floor by the mailing list software or by
 the spam filter.
 
 I'll put the files on a web server and post links to them here.

You can find them here:

https://gazebo.sysexperts.com/~kevin/postgresql

The files are bad-page-info.txt and bad-page.txt.



-- 
Kevin Brown   [EMAIL PROTECTED]



[HACKERS] Corrupt data pages...

2006-04-16 Thread Kevin Brown
 possible that this is causing problems,
but to be honest this is the first time I've ever seen this happen.

If you guys have any suggestions on what I might do to help track this
one down, please let me know.  I tarred up the entire 'stagezilla'
database after shutting down PostgreSQL so that I can do additional
examination of the datafiles if that proves useful.


Additionally, if you'd like to see the output of pg_filedump, I'll be
happy to post it here (or email it separately).  I hesitate to attach
it here because it's not clear the mailing list's spam filtering would
let it pass.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Kevin Brown
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  We already do something similar for page deletions.  Empty pages are not
  deleted right away, but they are marked with BTP_DEAD, and then deleted
  on a subsequent vacuum.  Or something like that, I don't remember the
  exact details.
 
 Right, and the reason for that is exactly that there might be a
 concurrent indexscan already in flight to the newly-dead page.
 We must wait to recycle the page until we are certain no such scans
 remain.
 
 It doesn't matter whether a concurrent indexscan visits the dead
 page or not, *because it's empty* and so there's nothing to miss.
 So there's no race condition.  But if you try to move valid data
 across pages then there is a race condition.

Hmm...

Well, REINDEX is apparently a very expensive operation right now.  But
how expensive would it be to go through the entire index and perform
the index page merge operation being discussed here, and nothing else?

If it's fast enough, might it be worthwhile to implement just this
alone as a separate maintenance command (e.g., VACUUM INDEX) that
acquires the appropriate lock (AccessExclusive, I'd expect) on the
index to prevent exactly the issues you're concerned about?

If it's fast enough even on large tables, it would be a nice
alternative to REINDEX, I'd think.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-05 Thread Kevin Brown
Tom Lane wrote:
 The concern about deadlock applies to the various proposals that involve
 upgrading to a write-prevention lock at some late point in the process.
 That clearly has the potential to deadlock against relatively weak lock
 requests.

After looking at the various lock types, I don't see how this is the
case at all (which may mean that I'm more confused than ever.  But
please read on).  It seems to me that only ops that promote to
AccessExclusiveLock can deadlock against at least some of the proposed
REINDEX implementations.

REINDEX would have to initially grab AccessShareLock, of course, but
AccessExclusiveLock is the only lock type that blocks against it, so
in the case of lock promotion the only operations that would cause
REINDEX to really deadlock (as opposed to simply blocking) are
operations on the entire table (ALTER TABLE, DROP TABLE, etc.).

None of the common operations block against an AccessShareLock, and
the order of acquisition against objects (table vs index) is already
enforced, so where's the deadlock potential?


REINDEX would, I expect, promote its lock to ShareLock when it's time
for it to block writers.  That would block against quite a number of
operations, of course, but that's not a problem in and of itself,
because it need only wait until the operations in question are
finished.  The lock won't be granted until those other operations are
finished, and nothing aside from table-level ops will block against
the REINDEX until that lock is granted.  A true deadlock won't happen
against common operations unless REINDEX promotes its lock again to
something stronger than ShareLock, and that's easy to avoid: just have
REINDEX promote directly from AccessShareLock to the strongest lock it
will ever need.




-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-04 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Even ignoring that, you *still* have a lock upgrade problem
  in this sketch.
 
  Hmm, well, I can see a deadlock potential for those operations that
  have to acquire multiple locks simultaneously, and I suppose that the
  table + FSM lock would qualify in the sequence here, but the rest of
  it involves just a single read lock against the table.  What am I
  missing?
 
 At some point you have to lock out writers, else you can never be
 certain you have all the tuples.  I was taking your read lock to
 actually mean lock out writers; otherwise the sketch doesn't work
 at all.

Right, but the idea is to lock out writers for as brief a time as
possible.  That not only minimizes the possibility of lock contention
but guarantees that REINDEX will get a complete view of the database.

That said, it hinges on some sort of efficient way of identifying the
new tuples created by other transactions that are/were running during
the bulk of the time REINDEX was running.  If there's no good way to do
that, then there's no way to avoid blocking writers for an extended
period of time.

 The real situation is that you must hold at least AccessShareLock on the
 table throughout the entire operation, else you have no defense against
 (say) someone dropping the index or the entire table out from under you.
 And when you add onto this lock in order to lock out writers
 temporarily, you are engaging in a lock upgrade, which can deadlock
 against any sort of exclusive lock request.  

But won't that depend on the order in which the lock requests appear?
If locks A, B, and C are taken in that same order by every
transaction, then there's no possibility of deadlock, right?

 The fact that you've been holding the AccessShareLock for quite a
 long time means that the window for deadlock problems is very wide.

But with respect to deadlocks, that's true only if deadlocks are
possible, which is true only if the order of lock acquisition differs
between transactions.

I guess the real question here is: is it possible to, in code,
guarantee the order of lock acquisition by any given transaction?  For
REINDEX, the problem is simplified somewhat because it's operating
against a single index and a single table, so the locks it has to
acquire are somewhat limited in scope compared with a generic
transaction.

An endeavor to acquire all locks in the same order throughout the code
would not only take care of this REINDEX deadlock problem but would
essentially eliminate all possible deadlocks arising from code-ordered
lock acquisition in the system, which I expect would be considered a
very good thing indeed.  But I expect it would be a lot of effort and
wouldn't be worth it just to make REINDEX behave differently than it
does now.

So what am I missing/overlooking here?


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-04 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  I guess the real question here is: is it possible to, in code,
  guarantee the order of lock acquisition by any given transaction?
 
 Yes, but not in our code :-(.  This is largely determined by what the
 application does.

Yeah, that's what I figured.

But what of the more limited problem of lock acquisition relative to
the locks that REINDEX might need to acquire?  Since those locks are
limited to a single table and a single index, I'd think the problem
wouldn't be insurmountable.  No?

Suppose the following rules were enforced in the code:

- when requesting a type of lock, one must first acquire all lesser
  lock types against the object in order of strength.  Hence, one must
  acquire AccessShareLock before acquiring AccessExclusiveLock.

- one must always acquire a given lock type against the table before
  acquiring it against the index.

The latter rule might be sufficient, if the former rule is already
implied by the lock types and how they're acquired.

Thus, acquisition of AccessExclusiveLock against the index
automatically implies acquisition of AccessShareLock(table), then
AccessShareLock(index), then AccessExclusiveLock(table), then
AccessExclusiveLock(index).

I could see that possibly causing performance problems (and would be
interested in knowing what performance problems it would likely
cause), but given the limited scope of the locks that REINDEX needs,
wouldn't the above be sufficient to keep REINDEX deadlock-free against
other transactions?



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-04 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  - when requesting a type of lock, one must first acquire all lesser
lock types against the object in order of strength.  Hence, one must
acquire AccessShareLock before acquiring AccessExclusiveLock.
 
 This is exactly wrong ...

And now I see why, since it will introduce deadlocks (sigh).  But what
of the other rule (always acquiring locks against the table before the
index)?  You may have stopped reading at the above...

One thing I don't quite understand about the discussion is why there's
particular attention being paid to deadlocks with respect to REINDEX
when it clearly can happen in the general case when lock promotion is
involved.  Why is REINDEX special here?

If the problem is that REINDEX has to hold an AccessShareLock to
prevent the table or index from being dropped, but does not need to
prevent writers in general (because the presumption is that there is
some way of efficiently discovering the additional modifications made
during the bulk of REINDEX processing), then doesn't that mean that an
AccessShareLock is the wrong kind of lock for REINDEX to be holding,
and that the appropriate type of lock should be created if it doesn't
already exist?

Additionally, I was under the impression that normal INSERTs, UPDATEs,
and DELETEs didn't generally need to acquire AccessExclusiveLock,
because of MVCC.  If that's the case, then aren't the operations that
could deadlock REINDEX relatively rare?  And if those operations *do*
need to acquire that lock type, then what exactly does MVCC buy you?



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-03 Thread Kevin Brown
Greg Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  What's worse, once you have excluded writes you have to rescan the entire
  table to be sure you haven't missed anything. So in the scenarios where this
  whole thing is actually interesting, ie enormous tables, you're still
  talking about a fairly long interval with writes locked out. Maybe not as
  long as a complete REINDEX, but long.
 
 I was thinking you would set a flag to disable use of the FSM for
 inserts/updates while the reindex was running. So you would know where to find
 the new tuples, at the end of the table after the last tuple you
 read.

If REINDEX works by seqscanning the table then the inclusion of new
tuples would happen for free if you turn off the FSM before beginning
the REINDEX operation -- you're guaranteed to see them last.  But that
only works if REINDEX behaves this way.

Then it's a question of what to do with in-flight updates at the time
the REINDEX hits the end of the table.

Even if REINDEX hits the table in non-sequential order, turning off
the FSM should still work.  REINDEX wouldn't need to acquire any
additional locks until after it has scanned the appended area.  So the
way I (perhaps naively) envision it working is:

1.  Acquire read lock on the table
2.  Turn off FSM
3.  Note the location of the end of the table
4.  Release read lock on the table
5.  Perform REINDEX operation
6.  Read and index the bit of the table starting with the location
noted previously.
7.  Note new end of table
8.  Acquire read lock on the table
9.  Scan any entries that have been appended past new end of table.
10. Release read lock on table
11. Turn on FSM


In the above for large relations, the bulk of the REINDEX should
happen without any locks being held by the REINDEX operation.  For
small tables (where the amount of new insert activity can be a large
percentage of the total table size), it would almost certainly be more
efficient to just take a read lock for the whole operation.  So it
might be wise to set up some sort of threshold, and to take a read
lock during the whole operation if the table size is smaller than the
threshold.

The reason the sequence I enumerate above involves taking any locks at
all is to avoid the issues that Tom brought up about having to rescan
the entire table to make sure nothing gets missed, to avoid possible
race conditions between steps 2 and 3, and to allow step 9 to
definitively complete, since otherwise in-flight updates would be
missed.


In the context of the original discussion (reduction of lock
acquisition), REINDEX isn't a common operation even if it is a
critical one, so acquisition of more than the usual number of locks
here shouldn't be a big deal.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [pgsql-www] [HACKERS] Upcoming PG re-releases

2005-12-03 Thread Kevin Brown
David Fetter wrote:
 On Wed, Nov 30, 2005 at 11:56:33PM -0400, Marc G. Fournier wrote:
  So, if Sun, SRA, Pervasive, Command Prompt, etc were to submit a patch for 
  v7.2, we'd refuse it?
 
 That depends on what you mean by refuse.  Such a patch wouldn't
 resurrect the original Postgres with POSTQUEL and cause us to support
 it, and it won't cause us to start supporting PostgreSQL 7.2 again
 either.

Okay, but suppose the patch in question breaks the version in question
in some subtle but horrible way?  If the community isn't supporting
the release in question then it implies that it won't go to the effort
of testing the patch, subjecting it to a beta period, etc.  But since
the patch would be applied by the community, the implication would be
that the community *endorses* the patch in question, since the
official source would be changed to reflect it.  If the patch breaks
the release horribly, just blindly accepting it wouldn't do good
things to the community's reputation.

And that means that the only really good way to guard against such an
occurrence is to subject the patch to the same process that's used for
officially supported releases.  At that point, there's no real
distinction between officially supported and not officially
supported.  I doubt the community wants to go down that road.


The acceptance of a patch by the community probably implies a lot more
than one would think at first glance, so this is certainly an issue
that should be thought all the way through.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Reducing relation locking overhead

2005-12-03 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  In the above for large relations, the bulk of the REINDEX should
  happen without any locks being held by the REINDEX operation.
 
 As I just pointed out to Greg, the arm-waving notion that you can turn
 off the FSM requires a great deal of low-level locking that is not
 there now.  

Yeah, I thought that the check for use of the FSM was already being
done by the lower level operators, and that contention would only
occur on the modification of the FSM enabled flag.  Obviously this
doesn't work at all if the FSM is just assumed to be in use at all
times, or if the FSM values are read only at transaction start or
something...


 Even ignoring that, you *still* have a lock upgrade problem
 in this sketch.

Hmm, well, I can see a deadlock potential for those operations that
have to acquire multiple locks simultaneously, and I suppose that the
table + FSM lock would qualify in the sequence here, but the rest of
it involves just a single read lock against the table.  What am I
missing?


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] SIGSEGV taken on 8.1 during dump/reload

2005-11-13 Thread Kevin Brown
Martijn van Oosterhout wrote:
 None of this applies to PostgreSQL because we open the modules
 directly, and don't rely on the linker loader.

Ah, right.  I forgot the context was the server, not one of the
utilities...

Sorry for the waste of bandwidth...



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] SIGSEGV taken on 8.1 during dump/reload

2005-11-12 Thread Kevin Brown
Tom Lane wrote:
 On the other hand, it'd be relatively easy for clueless lusers to
 defeat; I can readily imagine someone copying foo.so.8.2 to foo.so.8.3
 when the backend complained that it couldn't find the latter.  So
 maybe it's not what we want.

Hmm...but isn't the version number also something that can be stored
in the shared library itself during link time (e.g., via the -soname
option to the linker)?  The manpage for ld under Linux implies that
this will cause the executable that's linked against the shared object
to look explicitly for a library with the soname specified by the
shared object.  I don't know if that just causes the dynamic linker to
look for a file with the specified soname or if it will actually
examine the shared object under consideration to make sure it has the
DT_SONAME field in question, however.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Question about Ctrl-C and less

2005-10-22 Thread Kevin Brown
Sean Utt wrote:
 
 If you send a recent version of vim a CONTROL-C, and you're just sitting
 there at a prompt, it gives you a hint:
 
 Type  :quitEnter  to exit Vim
 
 Any reason not to just trap the CONTROL-C in psql when paging and offer a
 hint? Especially since we don't really know that the user really wanted to
 type CONTROL-C instead of q for quit. I know that I have always meant to
 type q and was just distracted whenever I've typed CONTROL-C in the pager,
 and so passing the CONTROL-C on to less is not actually heeding my wishes,
 it is instead giving me enough rope to shoot myself in the foot.

It won't work properly that way.  SIGINT gets sent to all the members
of the process group, not just the child.  Psql isn't responsible for
sending ctrl-c through to the child.

When you're in an editor such as vi that makes use of the terminal,
the editor itself is likely the only program that is doing anything.
Its parent is doing a wait() on the editor.  The parent in that
instance can ignore SIGINT because it's not involved at all at that
point.

That's not the case here.  Psql and the pager are really two
cooperating parts of the same task.  They just happen to be running in
two different process spaces.  Because they're both cooperatively
active at the same time, they both need to handle SIGINT, because when
the user invokes SIGINT, he intends for the overall task to return
some kind of control to him.  For psql, which is gathering data and
sending it to the pager, that means that it needs to stop doing so and
wait for the pager to finish.  For the pager, it means at a minimum
that it needs to display what it has so far and give interactive
control to the user (it may or may not attempt to continue to read
what's being sent to it).  Some pagers (like more) will just exit.
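
That process-group behaviour is easy to see for yourself. A small
sketch (Python standing in for psql's C, purely illustrative): a child
spawned the way popen() spawns the pager stays in the parent's process
group, which is exactly why a terminal-generated SIGINT reaches both.

```python
import os
import subprocess

# Spawn a "pager" the default way (no setsid/start_new_session), as
# popen() does.  It lands in the parent's process group, so a SIGINT
# generated at the terminal is delivered to parent and child alike.
child = subprocess.Popen(["sleep", "30"])
print(os.getpgid(child.pid) == os.getpgid(0))  # same process group
child.terminate()
child.wait()
```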



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Question about Ctrl-C and less

2005-10-21 Thread Kevin Brown
[EMAIL PROTECTED] wrote:
 On Thu, Oct 20, 2005 at 03:42:10PM -0700, Kevin Brown wrote:
  Martijn van Oosterhout wrote:
   You can't do a pclose in a signal handler, it's not one of the
   reentrant-safe functions and could lead to deadlocks. The signal
   manpage documents the ones you can use. Just set a flag. Setting the
   descriptor to NULL is worse because then we have to check before every
   output function. fprintf(NULL, ...) will segfault on most
   architectures I wager. 
  Yeah, I was thinking that you'd do the check for the flag and invoke a
  cleanup handler after the write() to the output file descriptor.  It's
  not clear that you'd need to do the check anyplace else.  It's been a
  while since I've messed with this stuff, but if I recall correctly,
  the write() will return immediately after receipt of a signal, and
  will indicate how much was actually written.  So receipt of a SIGINT
  should wind up being handled in a reasonably timely fashion.
 
  Additionally the normal SIGINT signal handler (the one that gets
  invoked when the pager is turned off) can be called from the cleanup
  handler in order to maintain the proper semantics.
 
 I disagree that psql should make *any* assumptions about what SIGINT
 means to the child process. Consider less again, and Control-C used
 to abort a search. You are suggesting that Control-C should not only
 abort the search, but should also cut off the input from less. Less
 won't die. Less will just see a terminated input stream. What has been
 gained from this? Is this intuitive behaviour?

It's behaviour that's consistent with every other pipeline application
set I've ever seen.

The only difference between those and this situation is that for a
standard pipeline set, SIGINT will kill all the processes in the
pipeline.  Less explicitly ignores (or traps) SIGINT, so the effect of
generating a SIGINT where less is at the end of the pipeline is to
kill everything that precedes it, and less will then show what results
it received.  And obviously, that implies that the pipeline gets
closed.

If psql does not close the pipeline right then and there, then its
behaviour will obviously be different from what the user likely
expects, based on other pipelined uses of less.  After all, if they
wanted to see the entire result set then they wouldn't have sent
SIGINT, would they?

 If the pager does die in response to SIGINT, the write() will fail with
 SIGPIPE. Completely clean, without any need for psql to pay attention
 to SIGINT.

We're not talking about the semantics of the pager, we're talking
about the semantics of psql.  You said it yourself: psql can't make
any assumptions about what SIGINT means to the child process.  So it
has to consider what to do if the child does *not* die in response to
SIGINT.  What are the proper semantics for psql + the child in that
situation?
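
(Aside: the quoted claim about the write failing when the pager *does*
die is easy to demonstrate. A sketch in Python, where the EPIPE
surfaces as BrokenPipeError because Python ignores SIGPIPE by default;
in psql's C the raw write() would instead deliver SIGPIPE:)

```python
import subprocess

# Start a "pager" that exits immediately, then try to write to it.
p = subprocess.Popen(["true"], stdin=subprocess.PIPE)
p.wait()                        # child gone; the pipe's read end is closed
try:
    p.stdin.write(b"x" * 100000)   # large enough to force the syscall
    p.stdin.flush()
    print("write succeeded")
except BrokenPipeError:            # EPIPE from the kernel
    print("broken pipe")
```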

Well, I think it's clear that having psql ignore SIGINT in that
situation is *not* correct, because it implies that whether or not
SIGINT causes processing to stop (as the user would expect) depends
entirely on the child.  More specifically, it would depend on the
child either dying or explicitly closing its stdin upon receipt of
SIGINT.

The bottom line is that, at least in my opinion, the semantics of
SIGINT as regards psql should be the same whether or not there's a
pager involved, with one crucial difference: if there's a pager
involved, psql should wait for it to terminate before showing the
prompt.  But otherwise, the semantics should be identical, because
they are what the user expects.

 I think the only reasonable behaviour is to ignore SIGINT within the
 parent, until the child exits. I don't see why other behaviours are
 even being considered. To me, it points at a misunderstanding of the
 problem.

Not at all.  When you send SIGINT to a process, you want that process
to stop doing whatever it's doing and return control to you.  That's
what it means, and that's what it's for.  If we ignore SIGINT then
obviously we will *not* be heeding the wishes of the user who sends
SIGINT, and that is not likely what the user expects.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Question about Ctrl-C and less

2005-10-21 Thread Kevin Brown
Andrew - Supernews wrote:
 On 2005-10-19, Kevin Brown [EMAIL PROTECTED] wrote:
  Making assumptions about what the pager will do upon receipt of SIGINT
  is folly as well.
 
  Setting up SIGINT to be ignored may be the right answer (I don't
  believe it is -- see below), but if so then it needs to be done
  properly.  If it gets ignored prior to the popen(), then the child
  will also end up ignoring it by default, because signal disposition is
  inherited by child processes.  If we ignore SIGINT, it should be after
  the popen(), not before.
 
 I do not believe it is possible to do the signal disposition correctly
 and still use popen() to run the pager. (You would need to reimplement
 popen using raw syscalls.)

I'm not sure I see why this is so.  popen() just creates the pipeline,
fork()s, closes the proper file descriptor (depending on whether it's
in the parent or the child and whether the pipe was open for read or
write) and then exec()s in the child.  In the parent, it returns
control after the fork() and file descriptor cleanup.  So the parent
can set up its own internal signal disposition immediately after
popen() returns.  This sequence of events is exactly what we'd end up
doing if we did everything ourselves using raw syscalls, save for the
use of stdio instead of direct syscalls for the file operations.
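
That sequence is small enough to sketch directly. Here it is in
Python standing in for the raw C syscalls (a write-mode popen; the
function name is mine, not a real API):

```python
import os

def popen_write(cmd):
    """Minimal write-mode popen(): pipe, fork, fd cleanup, exec."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child
        os.close(w)                    # drop the parent's end
        os.dup2(r, 0)                  # pipe's read end becomes stdin
        os.close(r)
        try:
            os.execvp("/bin/sh", ["/bin/sh", "-c", cmd])
        finally:
            os._exit(127)              # only reached if exec fails
    os.close(r)                        # parent: drop the child's end
    # Control returns here, so the parent can set up its own signal
    # disposition immediately -- exactly as after a real popen().
    return pid, os.fdopen(w, "w")
```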


  So I think the right answer here is for psql to handle SIGINT
  internally by doing a pclose() first
 
 The chances that psql can do this safely approach zero. pclose() is not a
 signal-safe function, so it can only be called from a signal handler if
 you _know_ that the signal did not interrupt any non-signal-safe function.
 (Nor can the signal handler longjmp out in such a case, unless the code is
 never again going to call any unsafe function.)

I agree.  I guess I need to be a little more explicit about what I
envision here.

There would be two possible signal handlers.  The first is the one we
have now, which cleans up various things upon receipt of SIGINT.  The
second simply sets a flag that says that SIGINT has been received.

The signal handler that gets assigned to SIGINT depends on whether or
not a pager is going to be used.  If it's not, then we point it to the
first signal handler.  If it is, then we point it to the second, and
clear the flag.

When a pager is being used, we check for the flag immediately after
doing a write()/fwrite() to the pipe.  If it's set, we pclose(), clear
the flag, and then manually invoke the non-pager signal handler.
SIGINT should cause the write() to return immediately, possibly with
EINTR.
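
In outline (a Python sketch of the same idea -- the real thing would
of course be C inside psql, and the handler/flag names are mine):

```python
import os
import signal

sigint_pending = False

def pager_handler(signum, frame):
    # Pager in use: do nothing unsafe here, just record the signal.
    global sigint_pending
    sigint_pending = True

def normal_handler(signum, frame):
    # Non-pager handler: the usual cancel/cleanup logic would live here.
    print("normal SIGINT handling")

signal.signal(signal.SIGINT, pager_handler)
os.kill(os.getpid(), signal.SIGINT)   # stand-in for the user's Ctrl-C

# ...immediately after each write() to the pager pipe:
if sigint_pending:
    sigint_pending = False
    # pclose() the pager here, then fall through to the normal handler
    normal_handler(signal.SIGINT, None)
```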


Make sense?



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Question about Ctrl-C and less

2005-10-20 Thread Kevin Brown
Martijn van Oosterhout wrote:
 You can't do a pclose in a signal handler, it's not one of the
 reentrant-safe functions and could lead to deadlocks. The signal
 manpage documents the ones you can use. Just set a flag. Setting the
 descriptor to NULL is worse because then we have to check before every
 output function. fprintf(NULL, ...) will segfault on most
 architectures I wager. 

Yeah, I was thinking that you'd do the check for the flag and invoke a
cleanup handler after the write() to the output file descriptor.  It's
not clear that you'd need to do the check anyplace else.  It's been a
while since I've messed with this stuff, but if I recall correctly,
the write() will return immediately after receipt of a signal, and
will indicate how much was actually written.  So receipt of a SIGINT
should wind up being handled in a reasonably timely fashion.

Additionally the normal SIGINT signal handler (the one that gets
invoked when the pager is turned off) can be called from the cleanup
handler in order to maintain the proper semantics.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Question about Ctrl-C and less

2005-10-18 Thread Kevin Brown
Martijn van Oosterhout wrote:
 Very well, patch attached. It's quite simple actually. However, there
 seems to be some push back from people suggesting that jumping back to
 the main loop before the pager has quit is not buggy behaviour.
 Assuming that a ^C will kill the pager is just folly.

Making assumptions about what the pager will do upon receipt of SIGINT
is folly as well.

Setting up SIGINT to be ignored may be the right answer (I don't
believe it is -- see below), but if so then it needs to be done
properly.  If it gets ignored prior to the popen(), then the child
will also end up ignoring it by default, because signal disposition is
inherited by child processes.  If we ignore SIGINT, it should be after
the popen(), not before.


When the user sends SIGINT, he means to interrupt whatever processing
is currently occurring.  He expects to regain control of the
terminal.  If psql is in the process of sending data to the pager,
then a SIGINT should cause psql to stop doing so.

So I think the right answer here is for psql to handle SIGINT
internally by doing a pclose() first (and at this point, it probably
should ignore SIGINT altogether), then returning to the main loop
(and, of course, cleaning up anything that needs it along the way).
If the child hasn't exited then pclose() will block until it has.  The
end result should be the semantics you want: if psql is in the middle
of sending a bunch of rows of output to the pager, this will interrupt
that process.  If the pager remains running then it will hopefully
give the user the ability to scroll through whatever results were sent
to it.


 Tom asked if we should be blocking SIGQUIT and SIGHUP too. Standard
 procedure for spawning external interactive processes includes blocking
 SIGQUIT too (see system() manpage).

SIGQUIT has a different standard meaning in Unix than SIGINT: it
causes the process to drop core.  We should not be blocking it -- we
should be leaving it alone.  The reason is that it's quite possible
that the user wants to have psql generate a core file while it's
writing output to the pager.


 Logically speaking, when the user sends an interrupt from the
 keyboard they expect to interrupt the currently active *interaxtive*
 process.

They expect to interrupt the currently active processing.  Not quite
the same thing.


 Hence, once psql has spawned
 the pager, it should ignore such interrupts until control is returned
 (after pclose). So yes, I would suggest blocking SIGQUIT also, if only
 to prevent terminal corruption problems. Interactive programs like less
 trap SIGQUIT to restore the terminal settings on exit, but the exit
 anyway.

They should be dropping core upon receipt of SIGQUIT.  It might be
nice if they cleaned up the terminal first, but receipt of a SIGQUIT
generally means that the user wants to run the resulting core file
through a debugger, and trapping the signal could alter the stack such
that the resulting core would be less useful.  I'd rather have to
clean up the terminal manually than have an unusable core file.





-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] GRANT/roles problem: grant is shown as from login role

2005-10-12 Thread Kevin Brown
Tom Lane wrote:
 So I think we don't have much choice but to implement theory #2; which
 is essentially the same thing I said earlier, ie, ACLs have to record
 the grantor of a privilege as being the role actually holding the grant
 option, not the role-member issuing the GRANT.

There are really two different considerations here.

The first is the meaning of the role relationships involved.  With
respect to this, I'm in agreement that the recorded grantor of the
privilege should be the role actually holding the option.

But the second is auditing.  It's useful to know which user/role
actually performed the grant in question, independent of the grant
relationships themselves.

These two are at odds with each other only if the system can record
only one of the two things.  The auditing consideration really argues
for the implementation of an audit trail table/structure, if one
doesn't already exist (and if it already exists, then clearly the ACLs
should be storing the id of the role holding the grant, since the
audit structure will separately record the user/role issuing the
grant).



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC

2005-07-16 Thread Kevin Brown
Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I don't think we should care too much about indexes. We can rebuild
  them...but losing heap sectors means *data loss*.
 
 If you're so concerned about *data loss* then none of this will be
 acceptable to you at all.  We are talking about going from a system
 that can actually survive torn-page cases to one that can only tell
 you whether you've lost data to such a case.  Arguing about the
 probability with which we can detect the loss seems beside the
 point.

I realize I'm coming into this discussion a bit late, and perhaps my
thinking on this is simplistically naive.  That said, I think I have
an idea of how to solve the torn page problem.

If the hardware lies to you about the data being written to the disk,
then no amount of work on our part can guarantee data integrity.  So
the below assumes that the hardware doesn't ever lie about this.

If you want to prevent a torn page, you have to make the last
synchronized write to the disk as part of the checkpoint process a
write that *cannot* result in a torn page.  So it has to be a write of
a buffer that is no larger than the sector size of the disk.  I'd make
it 256 bytes, to be sure of accommodating pretty much any disk hardware
out there.

In any case, the modified sequence would go something like:

1.  write the WAL entry, and encode in it a unique magic number
2.  sync()
3.  append the unique magic number to the WAL again (or to a separate
file if you like, it doesn't matter as long as you know where to
look for it during recovery), using a 256 byte (at most) write
buffer.
4.  sync()


After the first sync(), the OS guarantees that the data you've written
so far is committed to the platters, with the possible exception of a
torn page during the write operation, which will only happen during a
crash during step 2.  But if a crash happens here, then the second
occurrence of the unique magic number will not appear in the WAL (or
separate file, if that's the mechanism chosen), and you'll *know* that
you can't trust that the WAL entry was completely committed to the
platter.

If a crash happens during step 4, then either the appended magic
number won't appear during recovery, in which case the recovery
process can assume that the WAL entry is incomplete, or it will
appear, in which case it's *guaranteed by the hardware* that the WAL
entry is complete, because you'll know for sure that the previous
sync() completed successfully.


The amount of time between steps 2 and 4 should be small enough that
there should be no significant performance penalty involved, relative
to the time it takes for the first sync() to complete.
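
A file-level sketch of the protocol (Python; fsync() standing in for
the sync() steps above, a random 16-byte token playing the "unique
magic number", and all function names mine):

```python
import os
import uuid

def append_wal_entry(path, payload):
    """Steps 1-4: entry + magic, sync, tiny (sub-sector) marker, sync."""
    magic = uuid.uuid4().bytes                    # unique 16-byte token
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        os.write(fd, magic + payload)             # 1. WAL entry with magic
        os.fsync(fd)                              # 2. first sync
        os.write(fd, magic)                       # 3. marker, well under 256B
        os.fsync(fd)                              # 4. second sync
    finally:
        os.close(fd)
    return magic

def entry_is_complete(path, magic):
    # Recovery trusts the entry only if the magic appears twice.
    with open(path, "rb") as f:
        return f.read().count(magic) >= 2
```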


Thoughts?



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC

2005-07-16 Thread Kevin Brown
Bruce Momjian wrote:
 
 I don't think our problem is partial writes of WAL, which we already
 check, but heap/index page writes, which we currently do not check for
 partial writes.

Hmm...I've read through the thread again and thought about the problem
further, and now think I understand what you're dealing with.

Ultimately, the problem is that you're storing diffs in the WAL, so
you have to be able to guarantee that every data/index page has been
completely written, right?

There's no way to detect a torn page without some sort of marker in
each disk-indivisible segment of the page, unless you're willing to
checksum the entire page.  With that in mind, the method Microsoft
uses for SQL Server is probably about as simple as it gets.  In our
case, I suppose we may as well allocate one byte per 256-byte segment
for the torn page marker.  Just increment the marker value each time
you write the page (you'll have to read it from the page prior to
incrementing it, of course).

Other than that, torn page detection is really just a special case of
page corruption detection.
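
A sketch of that per-segment marker, assuming an 8K page and 256-byte
sectors (the sizes and names here are illustrative, not PostgreSQL's
actual on-disk layout):

```python
PAGE_SIZE = 8192
SEGMENT = 256

def stamp_page(page: bytearray, counter: int) -> None:
    # Reserve the last byte of every 256-byte segment and stamp them
    # all with the same counter, incremented on each write of the page.
    for off in range(SEGMENT - 1, PAGE_SIZE, SEGMENT):
        page[off] = counter & 0xFF

def page_is_torn(page: bytes) -> bool:
    # A torn write leaves segments stamped with different counters.
    markers = {page[off] for off in range(SEGMENT - 1, PAGE_SIZE, SEGMENT)}
    return len(markers) > 1
```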



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] fool-toleranced optimizer

2005-03-10 Thread Kevin Brown
Greg Stark wrote:
 
 Kevin Brown [EMAIL PROTECTED] writes:
 
  Hence, it makes sense to go ahead and run the query, but issue a
   warning at the very beginning, e.g. WARNING: query JOINs tables <list
   of tables> without otherwise referencing or making use of those
  tables.  This may cause excessively poor performance of the query.
 
 Well the problem with a warning is what if it *is* intentional? It's
 not ok to fill my logs up with warnings for every time the query is
 executed. That just forces me to turn off warnings.

WARNING is probably the wrong level (I wasn't thinking in terms of PG
logging, though I probably should have been).  What about NOTICE?
Basically, you want something that will alert the interactive user
that what they're doing is likely to be stupid, but at the same time
won't be a burden on the system or the DBA...

 It would be ok to have an option to block cartesian joins entirely. I might
 even choose to run with that enabled normally. I can always disable it for
 queries I know need cartesion joins.

Which wouldn't work all that well for people who are trying to write
their software in a reasonably portable fashion, unfortunately.
However, the number of people who care would now be much smaller.

 For that matter, I wonder whether it's time to consider an option to
 disable implicit (ie, pre-ansi join syntax) joins entirely. It seems
 like lots of shops are likely imposing coding standards that require
 ansi join syntax anyways. In environments like that you would expect
 "a CROSS JOIN b" not just "select * from a,b" anyways.
 
 Shops like that might appreciate the ability to enforce a blanket
 coding standard on that point and get protection from accidental
 cartesian joins as a side benefit.

That could be handy, but of course it should default to off, which
with respect to cross joins would unfortunately wind up benefitting
only those people who already are potentially aware of the issue and
care about it (or, at least, those people who have DBAs that care
about it).



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] fool-toleranced optimizer

2005-03-09 Thread Kevin Brown
Neil Conway wrote:
 Simon Riggs wrote:
 Oleg is saying that the optimizer doesn't protect against foolish SQL
 requests. His query is an example of a foolishly written query.
 
 IMHO calling this a foolishly written query is completely arbitrary. I 
 can imagine plenty of applications for which a cartesian join makes 
 sense. In this case the user didn't write the query they meant to write 
 -- but it is surely hopeless to prevent that in the general case :)

Sure, but this case, at least, is (hopefully) easily detectable (as
such things go), has a high cost when it occurs, and is *usually* not
what the user intended.

Hence, it makes sense to go ahead and run the query, but issue a
warning at the very beginning, e.g. WARNING: query JOINs tables <list
of tables> without otherwise referencing or making use of those
tables.  This may cause excessively poor performance of the query.


That said, the real question is whether or not it's worth putting in
the effort to detect this condition and issue the warning.  I'd say
probably not, but if this is a big enough itch for someone then why
should we discourage them from coding up a fix?



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] We are not following the spec for HAVING without GROUP BY

2005-03-09 Thread Kevin Brown
Tom Lane wrote:
 What that means is that neither the HAVING clause nor the targetlist
 can use any ungrouped columns except within aggregate calls; that is,
 
   select col from tab having 2>1
 
 is in fact illegal per SQL spec, because col isn't a grouping column
 (there are no grouping columns in this query).

[...]

 Comments?  Can anyone confirm whether DB2 or other databases allow
 ungrouped column references with HAVING?


Oracle does not allow such references.  It issues "ORA-00979: not a
GROUP BY expression" when you try to hand it such a reference.

MS SQL Server does not allow such references either, yielding
"<columnname> is invalid in the HAVING clause because it is not
contained in either an aggregate function or the GROUP BY clause."

Can't comment about DB2.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Help me recovering data

2005-02-17 Thread Kevin Brown
Tom Lane wrote:
 Gaetano Mendola [EMAIL PROTECTED] writes:
  BTW, why not do an automatic vacuum instead of shutdown ? At least the
  DB do not stop working untill someone study what the problem is and
  how solve it.
 
 No, the entire point of this discussion is to whup the DBA upside the
 head with a big enough cluestick to get him to install autovacuum.
 
 Once autovacuum is default, it won't matter anymore.

I have a concern about this that I hope is just based on some
misunderstanding on my part.

My concern is: suppose that a database is modified extremely
infrequently?  So infrequently, in fact, that over a billion read
transactions occur before the next write transaction.  Once that write
transaction occurs, you're hosed, right?  Autovacuum won't catch this
because it takes action based on the write activity that occurs in the
tables.

So: will autovacuum be coded to explicitly look for transaction
wraparound, or to automatically vacuum every N number of transactions
(e.g., 500 million)?



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] IBM patent

2005-01-29 Thread Kevin Brown
Marc G. Fournier wrote:

 On Wed, 26 Jan 2005, Christopher Browne wrote:
 
 Actually, the latter isn't so.
 
 If Mammoth or Pervasive or such release their own release of
 PostgreSQL, nothing has historically mandated that they make that
 release available under the BSD license.
 
 Presumably acceptance of the patent would change that.
 
 You and I might not have individual objections to this situation, but
 one or another of the companies putting together PostgreSQL releases
 very well might.
 
  But, there is nothing stopping them from replacing the ARC code with
 their own variant though ...

Not only that, I'd go further and say that they have a duty to either
do that or pay someone to do it.  They are, after all, the entities
that probably care about the situation the most.

This type of situation seems to me to be one that has to be examined
from a greatest good point of view.  If IBM were to allow all open
source projects to make free use of a patent (thus exposing only those
entities which sell commercial versions under a non-open-source
license to risk), then the PG group might be faced with the tradeoff
of using a superior but patented (though free for open source use)
algorithm, or using a possibly inferior but unencumbered one.  I'd
wager that the vast majority of PostgreSQL users received their copy
via the open source license.  Unless the encumbered algorithm is not
significantly superior to the unencumbered one, the greater good is
likely to be to make use of the patented algorithm and force the
non-open-source vendors to deal with removing the algorithm
themselves.


None of that really applies to the specific situation we're
discussing, however: the current ARC implementation is apparently not
showing itself to be a clearly superior approach, so some other
approach is probably warranted.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Allow GRANT/REVOKE permissions to be applied to all schema objects with one command

2005-01-28 Thread Kevin Brown
Alvaro Herrera wrote:
 On Fri, Jan 28, 2005 at 09:17:46PM +0100, Matthias Schmidt wrote:
 
  a) accept some sort of wildcard for the grant on table syntax:
 GRANT ... ON TABLE schema.*
 
 What about a list,
 
 GRANT ... ON TABLE table1, table2, ... TO user1, user2, ...;
 
 It would be good if it was a list of wildcards.  Not sure if that is
 workable.

Actually, what I'd *love* to see is for statements such as GRANT to
allow select result sets to be used in place of arguments, e.g.:

GRANT ... ON TABLE (SELECT table_schema || '.' || table_name FROM
  information_schema.tables WHERE table_schema IN ('public', 'postgres'))
  TO (SELECT usename from PG_USER WHERE usecatupd = true);


Actually, it would be very nice if all DDL statements could work that
way.
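
In the meantime, the usual workaround is to run those two SELECTs
separately and generate one GRANT per combination client-side. A rough
sketch (Python; the table and role names are purely illustrative):

```python
# Results of the two SELECTs from the hypothetical statement above:
tables = ["public.orders", "public.customers"]   # from information_schema.tables
roles = ["alice", "bob"]                         # from pg_user

# Emit one GRANT statement per (table, role) pair.
statements = [
    "GRANT SELECT ON TABLE %s TO %s;" % (table, role)
    for table in tables
    for role in roles
]
for stmt in statements:
    print(stmt)
```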


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] ARC patent

2005-01-17 Thread Kevin Brown
Nicolai Tufar wrote:
 Second, a pending patent is not a granted patent,
 one is not infringing anything by distributing
 technology based in a pending patent. 

Given the patents the USPTO has been granting in recent times, if a
patent is pending, it's almost certainly going to be granted.
Especially if it comes from an entity such as IBM (the USPTO wouldn't
want to upset its biggest paying customers, would it?), and especially
if it's on something that isn't completely trivial.

For that reason, I think it's quite reasonable to treat any pending
patent from IBM as if it were a granted patent.  The only way I could
see the patent not being granted is if some large corporate entity
like Microsoft filed an objection.  That's possible, I suppose, but
not something I would want to count on.  But objections raised by
small entities such as individuals will almost certainly be dropped on
the floor, because such entities don't matter to the USPTO (or the
rest of the government, for that matter), unless they are flush with
cash.


 IBM can NEVER sue customers for using infringing
 code before first informing them of infringement and
 giving reasonable time to upgrade to uninfringing
 version.

This is the United States.  People (and especially large corporations)
can sue anybody for anything anytime they wish.  And they do.  Reason
doesn't enter into it.  Only money.  See the SCO debacle for proof,
and note that they're not suing in any other countries.



If I sound bitter and cynical, well, there's lots of good reason for
it.  You need only look around, at least if you're in the U.S.



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] [pgsql-hackers-win32] [BUGS] More SSL questions..

2005-01-13 Thread Kevin Brown
Bruce Momjian wrote:
   FWIW, I've seen several apps that use .txt for config files, but I can't
   think of an example right now. Most don't though - .cfg or .conf is
   probably most common. Except for the majority of windows programs that
   don't use config files - they use the registry. But I see no reason *at
   all* for us to want to do that :-) It also more or less requires you to
   write a GUI to change the config stuff and in that case the file
   extension becomes irrelevant.
  
  Where are we on this?  I think Andrew and I both think *.txt is
  confusing.  We need to decide on Monday if we should change the current
  *.txt names.  We can either leave it unchanged, remove *.txt, or change
  it to *.config.
 
   APPDATA/postgresql/pgpass.txt
   APPDATA/postgresql/psqlrc.txt
 
 Another idea is to use *.conf.

For what it's worth, I always thought that text configuration files on
Windows platforms generally used the '.ini' extension.  I believe on
most Windows systems that extension is by default associated with
Notepad.



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] psql \e broken again

2004-11-16 Thread Kevin Brown
Tom Lane wrote:
 Magnus Hagander [EMAIL PROTECTED] writes:
  Agreed, no quotes on Unix.  I'm just wondering what to do 
  on Windows.
 
  I don't think I've seen a single windows program that uses the EDITOR
  variable. There are some ported unix tools, but that's it. Native
  windows program will have a per-program setting for this. The system
  default is picked based on file extension. So I doubt it will break a
  lot of things.
 
 So are you in favor of not quoting at all --- ie, reverting to the Unix
 behavior?  I'm pretty sure that psql got changed because someone
 complained, so it seems like we'd be going in circles if we just do that
 and don't have any special behavior at all on Windows.

You probably already know this but others on the list may not, so...

The behavior of EDITOR under Unix comes about as a result of how it's
invoked -- usually via the system() library call, which invokes a
shell to parse the command.  The fact that spaces in EDITOR's value
are treated as argument delimiters instead of part of the path of the
editor itself is a direct result of how the shell interprets the
command string.

It almost certainly doesn't make sense to retain those semantics under
Windows, because to achieve real equivalence we would have to expand
other shell metacharacters ourselves.

Since Windows generally doesn't even make use of EDITOR as such, it
probably makes the most sense for \e on that platform to save the edit
buffer to a .txt file and execute it -- Windows will then invoke
whichever editor is associated with text files (Notepad by default).
It would have to be invoked in such a way that psql could wait for it
to complete, of course.

I suppose it might be nice to be able to override that, and do
something else if EDITOR is defined.  In that event it's probably
safer to avoid parsing EDITOR and instead have it just refer to the
full path to the program to be used.  Someone who is savvy enough to
define EDITOR is also savvy enough to throw together a simple batch
file which invokes his real editor with whatever options he wants.  It
seems to me that being able to properly specify the path of the
program to use with a minimum of fuss (hence no strange quoting
conventions) takes precedence.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] code question: storing INTO relation

2004-11-12 Thread Kevin Brown
Bruce Momjian wrote:
 
 Added to TODO:
 
   * Eliminate WAL logging for CREATE INDEX/REINDEX/CREATE TABLE AS
 when not doing WAL archiving

But that means that these operations can't be performed in a
transaction unless WAL archiving is on, right?  (If I misunderstand
and thus am wrong about this then you can ignore the rest of what
follows, of course)

Suppose I do the following within a transaction with WAL archiving
turned off:

CREATE TABLE blah (x integer primary key, y varchar(32));
INSERT INTO blah SELECT x, y FROM foo;

and then roll it back because something didn't work out the way I
wanted it to.

If CREATE INDEX isn't WAL logged, then either the rollback of the
above will result in an inconsistent database, or the fact that CREATE
INDEX isn't WAL logged will cause the CREATE TABLE to fail because the
index creation (and thus the table creation) can't be done within a
transaction.


Being able to do DDL within a transaction is one of the coolest and
(from the POV of a DBA) most useful features of the database.  If
we're going to eliminate WAL logging of certain operations, it should
be done at the explicit request of the DBA, preferably through a GUC,
and preferably with a relevant GUC for each operation.

Since WAL archiving requires that the operations in question be WAL
logged, the GUCs that control WAL logging of those statements would
clearly be ineffective if WAL archiving is turned on.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Unixware 714 pthreads

2004-10-28 Thread Kevin Brown
Tom Lane wrote:
 [EMAIL PROTECTED] writes:
  On Thu, 28 Oct 2004, Tom Lane wrote:
  No.  Why should the signal handler need re-arming?
 
  My impression was that once caught, signal handler for a particular signal
  is reset to SIG_DFL.
 
 No.  If your signal support is POSIX-compatible, it should not do that
 because we don't set SA_RESETHAND when calling sigaction(2).  If you
 don't have POSIX signals, you had better have BSD-style signal(2),
 which doesn't reset either.  If this is not happening as expected,
 you will have much worse problems than whether statement_timeout works :-(

SysV-style signal(2) handling does indeed require that the signal
handler be re-enabled.  The attached program demonstrates this on
Solaris, and probably on Unixware as well (I don't have access to the
latter).  Just run it and interrupt it with ctrl-c.  It should print
something the first time around, and actually be interrupted the
second time.


So if Unixware doesn't have sigaction() or it's not being picked up by
autoconf then yeah, he'll have big problems...



-- 
Kevin Brown   [EMAIL PROTECTED]
#include <signal.h>
#include <stdio.h>
#include <unistd.h>


void sighandler(int sig) {
    printf("Received signal %d\n", sig);
}


int main(int argc, char *argv[]) {
    signal(SIGINT, sighandler);

    while (1) {
        sleep(1);
    }
}



Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-27 Thread Kevin Brown
Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  So I would suggest using something like 100us as the threshold for
  determining whether a buffer fetch came from cache.
 
 I see no reason to hardwire such a number.  On any hardware, the
 distribution is going to be double-humped, and it will be pretty easy to
 determine a cutoff after minimal accumulation of data.  The real question
 is whether we can afford a pair of gettimeofday() calls per read().
 This isn't a big issue if the read actually results in I/O, but if it
 doesn't, the percentage overhead could be significant.
 
 If we assume that the effective_cache_size value isn't changing very
 fast, maybe it would be good enough to instrument only every N'th read
 (I'm imagining N on the order of 100) for this purpose.  Or maybe we
 need only instrument reads that are of blocks that are close to where
 the ARC algorithm thinks the cache edge is.

If it's decided to instrument reads, then perhaps an even better use
of it would be to tune random_page_cost.  If the storage manager knows
the difference between a sequential scan and a random scan, then it
should easily be able to measure the actual performance it gets for
each and calculate random_page_cost based on the results.

While the ARC lists can't be tuned on the fly, random_page_cost can.

 One small problem is that the time measurement gives you only a lower
 bound on the time the read() actually took.  In a heavily loaded system
 you might not get the CPU back for long enough to fool you about whether
 the block came from cache or not.

True, but that's information that you'd want to factor into the
performance measurements anyway.  The database needs to know how much
wall clock time it takes for it to fetch a page under various
circumstances from disk via the OS.  For determining whether or not
the read() hit the disk instead of just OS cache, what would matter is
the average difference between the two.  That's admittedly a problem
if the difference is less than the noise, though, but at the same time
that would imply that given the circumstances it really doesn't matter
whether or not the page was fetched from disk: the difference is small
enough that you could consider them equivalent.


You don't need 100% accuracy for this stuff, just statistically
significant accuracy.


 Another issue is what we do with the effective_cache_size value once
 we have a number we trust.  We can't readily change the size of the
 ARC lists on the fly.

Compare it with the current value, and notify the DBA if the values
are significantly different?  Perhaps write the computed value to a
file so the DBA can look at it later?

Same with other values that are computed on the fly.  In fact, it
might make sense to store them in a table that gets periodically
updated, and load their values from that table, and then the values in
postgresql.conf or the command line would be the default that's used
if there's nothing in the table (and if you really want fine-grained
control of this process, you could stick a boolean column in the table
to indicate whether or not to load the value from the table at startup
time).


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Should libpq set close-on-exec flag on its socket?

2004-10-21 Thread Kevin Brown
Tom Lane wrote:
 Dennis Bjorklund [EMAIL PROTECTED] writes:
  On Thu, 21 Oct 2004, Tom Lane wrote:
  It was suggested to me off-list that libpq should do
  fcntl(fd, F_SETFD, FD_CLOEXEC) on the socket connecting to the server.
  This would prevent any child program from accidentally or maliciously
  interfering with the connection.
 
  Either way that the lib sets it, the client can alter the setting itself
  by issuing a new SETFD command.
 
 That's a fair point, and certainly passing it down to the child
 intentionally wouldn't be a common case.  I'll put the change in.

Since program authors who would care about this one way or another
probably won't be expecting this behavior, it should also be
documented reasonably well -- something which I'm rather sure you were
going to do anyway.




-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] [BUGS] BUG #1290: Default value and ALTER...TYPE

2004-10-21 Thread Kevin Brown
Tom Lane wrote:
 PostgreSQL Bugs List [EMAIL PROTECTED] writes:
  troels=# create table lookat_feature(
  troels(#   feature_id char(4),
  troels(#   status varchar(2) default 'TODO'
  troels(# );
  CREATE TABLE
  troels=# alter table lookat_feature
  troels-#   alter column status type varchar(4);
  ALTER TABLE
  troels=# insert into lookat_feature (feature_id) values('B034');
  ERROR:  value too long for type character varying(2)
 
 Hmm.  What's going on here is that the stored default expression is
 actually of the form
   ('TODO'::varchar)::varchar(2)
 where you don't see the coercion to varchar(2) in \d because ruleutils.c
 doesn't show implicit casts.  After the ALTER COLUMN it's of the form
   (('TODO'::varchar)::varchar(2))::varchar(4)
 which of course will give an error when used.
 
 Possibly we should make ALTER COLUMN strip any implicit coercions that
 appear at the top level of the default expression before it adds on the
 implicit coercion to the new column datatype.  I am not sure that this
 is a good idea, however; it seems like it might alter the semantics in
 unexpected ways.  (The default expression could potentially come through
 differently than an actually stored value of the column would do.)
 
 The alternative would seem to be decreeing that this is not a bug.
 
 Comments anyone?

I think the conversion from 'TODO' to varchar(2) and then to
varchar(4) is much more surprising than converting 'TODO' directly to
varchar(4) after the ALTER TABLE.  In short, as a DBA I would expect
the database to do any conversion into the column target type based on
the original specified default value, and not some intermediate form
that exists only because of the history of the column's datatype.

So, my vote is for the form to be ('TODO'::varchar)::varchar(4) after
the ALTER TABLE in the example.


FWIW...



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Tablespace issues (comment on, moving indexes)

2004-08-10 Thread Kevin Brown
Christopher Kings-Lynne wrote:
 Errr, unlike all the other uses for alter table and friends? ie:
 
 OWNER TO
 
 Which changes the attributes of the table...
 
 And indexes.

Sure.  But not *just* indexes.

 RENAME TO
 
 Same.
 
 And indexes.

It does?  I thought the indexes pointed to relations directly, not to
tables by name, and so changing the name of the table wouldn't have any
effect on the indexes, right?

 SET TABLESPACE
 
 Which again changes the attributes of the table..
 
 And indexes.

But it does change more than just the indexes.


But the context here is changing the tablespace of indexes independently
of the tablespace for the table.  For that, how exactly does it affect
the table metadata?  Not at all, I'd wager.


If you're going to go use ALTER TABLE to make changes to the attributes
of indexes, might I suggest that you also use ALTER TABLE to create and
destroy them as well?  Otherwise you end up with an inconsistent language,
which is fine if the spec calls for it or if you somehow are attempting
to maintain compatibility with something.  But what we're talking about
here is brand new functionality for which the language hasn't been
defined yet.  It would be a bit unfortunate to introduce inconsistencies
where they're not needed, wouldn't you say?



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Tablespace issues (comment on, moving indexes)

2004-08-10 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  ...  But what we're talking about
  here is brand new functionality for which the language hasn't been
  defined yet.
 
 You're missing the point, which is that there *is* a precedent of long
 standing.  ALTER TABLE has worked on indexes (and sequences, and views)
 for those cases in which the operation sensibly applied for a long time.
 In particular, the original 7.1 implementation of ALTER TABLE OWNER
 would work on tables, indexes, sequences, and views.  Should we really
 have insisted on inventing four syntaxes for the identical operation?
 Maybe, but we didn't, and now there is a precedent to follow.

And yet we have ALTER SEQUENCE.  In 7.4, we seem to have:

ALTER AGGREGATE
ALTER CONVERSION
ALTER DATABASE
ALTER DOMAIN
ALTER FUNCTION
ALTER GROUP
ALTER LANGUAGE
ALTER OPERATOR CLASS
ALTER SCHEMA
ALTER SEQUENCE
ALTER TABLE
ALTER TRIGGER
ALTER USER


Within ALTER TABLE, you can change:

1. columns
2. the table name
3. constraints
4. table ownership
5. index clustering

and within those, only (2) and (4) apply to sequences and views, and (5)
is the only ALTER TABLE operation that applies to indexes (corrections
to this welcome).  Furthermore, the rename operation for triggers,
languages, groups, functions, databases, conversions, and aggregates are
all implemented in their own ALTER statement (indeed, the rename
operation is the only ALTER operation for some of those).

The decision to roll some of the functionality affecting sequences and
views into ALTER TABLE is at least somewhat sensible: those things look
like tables in at least one key way, namely that they can be SELECTed
from.  That's not true of indexes, and so that reasoning does not apply
to using ALTER TABLE to change an index's tablespace.


It appears to me that the precedent for creating a new ALTER statement
is actually much bigger than the precedent for rolling functionality
into ALTER TABLE, based on the above.



But that's just my bird's eye view on things.  I'm sure lots of people
disagree with me on this.  :-)


I'm certainly not arguing for a wholesale rework of the syntax in order
to achieve maximum consistency (nice as that might be), but it seems to
me that it would be a mistake to introduce more inconsistency than is
already there when it's not necessary to do so.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Tablespace issues (comment on, moving indexes)

2004-08-10 Thread Kevin Brown
I wrote:
 I'm certainly not arguing for a wholesale rework of the syntax in order
 to achieve maximum consistency (nice as that might be), but it seems to
 me that it would be a mistake to introduce more inconsistency than is
 already there when it's not necessary to do so.

What I mean here is that I think it would be in our best interests to
define the syntax for any new operation to be as easily guessed as
possible.  I believe that ALTER INDEX would be more easily guessed by
more people as the means by which one would alter an index's tablespace
than ALTER TABLE, even if those people have a decent amount of PG
experience.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Tablespace issues (comment on, moving indexes)

2004-08-09 Thread Kevin Brown
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
  1. there is no COMMENT ON TABLESPACE support
 
 That's right.
 
  2. how is one supposed to move indexes(not tables) to another tablespace?
 
 Use ALTER TABLE on the index.

Hmm...not ALTER INDEX?  Now that there's an operation that actually
modifies an index instead of the table itself, should there be an ALTER
INDEX?  It would be cleaner and more consistent, IMO...


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Tablespace issues (comment on, moving indexes)

2004-08-09 Thread Kevin Brown
Christopher Kings-Lynne wrote:
 Hmm...not ALTER INDEX?  Now that there's an operation that actually
 modifies an index instead of the table itself, should there be an ALTER
 INDEX?  It would be cleaner and more consistent, IMO...
 
 Errr, unlike all the other uses for alter table and friends? ie:
 
 OWNER TO

Which changes the attributes of the table...

 RENAME TO

Same.

 SET TABLESPACE

Which again changes the attributes of the table..


But using ALTER TABLE to change the tablespace that an index belongs to
doesn't change an attribute of a table, it changes the attribute of an
index.


 etc.
 
 Lots of things against tables work against indexes and views.  Some 
 stuff for commenting on columns say works on views, composite types and 
 indexes!

No doubt.  Of course, that something's been done a certain way in the
past doesn't imply that it's the right way to do something new, nor does
it imply that the new thing must be done that way.


I mean, it's not a terribly big deal or anything, but since we're talking
about stuff that isn't in the SQL spec it seems reasonable to define the
commands in such a way that they don't violate the principle of least
surprise.  Using ALTER TABLE to alter the characteristics of an index
violates that principle, at least in my opinion.  It's not the first
command I would have thought of when asking myself "how do I change the
tablespace of an index?" -- ALTER INDEX is.  And the reason is simple:
we use CREATE INDEX to create an index and DROP INDEX to drop one -- we
don't use ALTER TABLE subcommands to create or drop indexes.  Why, then,
should modification of an index's properties be treated any differently
than the rest of the index manipulation commands?

I just happen to like consistency.  :-)



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] DROP TABLESPACE causes panic during recovery

2004-08-08 Thread Kevin Brown
Tom Lane wrote:
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Maybe we could avoid removing it until the next checkpoint?  Or is that 
  not enough.  Maybe it could stay there forever :/
 
 Part of the problem here is that this code has to serve several
 purposes.  We have different scenarios to worry about:
 
   * crash recovery from the most recent checkpoint
 
   * PITR replay over a long interval (many checkpoints)
 
   * recovery in the face of a partially corrupt filesystem
 
 It's the last one that is mostly bothering me at the moment.  I don't
 want us to throw away data simply because the filesystem forgot an
 inode.  Yeah, we might not have enough data in the WAL log to completely
 reconstruct a table, but we should push out what we do have, *not* toss
 it into the bit bucket.

I like the idea tossed out by one of the others the most: create a
recovery system tablespace, and use it to resolve issues like this.

The question is: what do you do with the tables in that tablespace once
recovery is complete?  Leave them there?  That's certainly a possibility
(in fact, it seems the best option, especially now that we're doing
PITR), but it means that the DBA would have to periodically clean up that
tablespace so that it doesn't run out of space during a later recovery.
Actually, it seems to me to be the only option that isn't the equivalent
of throwing away the data...

 In the first case (straight crash recovery) I think it is true that any
 reference to a missing file is a reference to a file that will get
 deleted before recovery finishes.  But I don't think that holds for PITR
 (we might be asked to stop short of where the table gets deleted) nor
 for the case where there's been filesystem damage.

But doesn't PITR assume that a full filesystem-level restore of the
database as it was prior to the events in the first event log being
replayed has been done?  In that event, wouldn't the PITR process Just
Work?


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] DROP TABLESPACE causes panic during recovery

2004-08-04 Thread Kevin Brown
Tom Lane wrote:
 In CVS tip, try running the regression tests against an installed
 postmaster (ie, make installcheck); then as soon as the tests are
 done, kill -9 the bgwriter process to force a database restart.
 Most of the time you'll get a PANIC during recovery:

[...]

 This is impossible to fix nicely because the information to reconstruct
 the tablespace is simply not available.  We could make an ordinary
 directory (not a symlink) under pg_tblspc and then limp along in the
 expectation that it would get removed before we finish replay.  Or we
 could just skip logged operations on files within the tablespace, but
 that feels pretty uncomfortable to me --- it amounts to deliberately
 discarding data ...
 
 Any thoughts?

How is a dropped table handled by the recovery code?  Doesn't it present
the same sort of issues (though on a smaller scale)?



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Too-many-files errors on OS X

2004-06-07 Thread Kevin Brown
Larry Rosenman wrote:
 I had to hack on the code some more for FreeBSD:
 (the realloc call needed the multiplication).  I ran this same code
 on UnixWare.

I feel like a moron, having missed that.  Probably explains the bad
file number error I was getting on AIX, too...




-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Too-many-files errors on OS X

2004-06-06 Thread Kevin Brown
Tom Lane wrote:
 However, it seems that the real problem here is that we are so far off
 base about how many files we can open.  I wonder whether we should stop
 relying on sysconf() and instead try to make some direct probe of the
 number of files we can open.  I'm imagining repeatedly open() until
 failure at some point during postmaster startup, and then save that
 result as the number-of-openable-files limit.

I strongly favor this method.  In particular, the probe should probably
be done after all shared libraries have been loaded and initialized.

I originally thought that each shared library that was loaded would eat
a file descriptor (since I thought it would be implemented via mmap())
but that doesn't seem to be the case, at least under Linux (for those
who are curious, you can close the underlying file after you perform
the mmap() and the mapped region still works).  If it's true under any
OS then it would certainly be prudent to measure the available file
descriptors after the shared libs have been loaded (another reason is
that the init function of a library might itself open a file and keep
it open, but this isn't likely to happen very often).

 I also notice that OS X 10.3 seems to have working SysV semaphore
 support.  I am tempted to change template/darwin to use SysV where
 available, instead of Posix semaphores.  I wonder whether inheriting
 100-or-so open file descriptors every time we launch a backend isn't
 in itself a nasty performance hit, quite aside from its effect on how
 many normal files we can open.

I imagine this could easily be tested.  I rather doubt that the
performance hit would be terribly large, but we certainly shouldn't rule
it out without testing it first.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] PostgreSQL configuration

2004-04-16 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  The goal here is simply to make it obvious to a system administrator where
  the PG data directory that a given postmaster is using resides.
 
 Why would it not be sufficient to add a read-only GUC variable that
 tells that?  Connect to the postmaster and do "show datadir" and you're
 done.  (Without this, it's not clear you've made any particular gain
 anyway, since a given postmaster would typically mean "the one I can
 connect to at this port", no?)

That would probably be sufficient for most cases.  It wouldn't take care
of the case where there's a strict separation of powers between the
system administrator and the DBA, but only if the system were managed
badly (i.e., the SA and the DBA don't talk to each other very well).
That's probably something we shouldn't concern ourselves with.

 In any case I don't see how removing PGDATA would make this more
 obvious.  You yourself just pointed out that the command-line arguments
 of a postmaster aren't necessarily visible through ps; if they're not,
 what have you gained in transparency by forbidding PGDATA?

I think you misunderstood what I was saying (which means I didn't say it
right).

There are ways within a program to change what 'ps' shows as the
command line.  We use those methods to make it possible to see what
a given backend is doing by looking at the 'ps' output.  It would be
possible to have the postmaster use those ways in order to show which data
directory it is using even if it wasn't specified on the command line.
But in my experience, those ways don't work reliably on all systems.
On the systems that those methods don't work, what 'ps' shows is the
original command line that was used.  So clearly, the only way 'ps'
will show the data directory in that instance is if it was actually
specified on the command line.

  In any case, I'm not at all opposed to having the backend stuff know
  about PGDATA during development, but for production you should have to
  explicitly specify the data directory on the command line.
 
 If you wish to do things that way, you can; but that doesn't mean that
 everyone else should have to do it that way too.  If there were a
 security or reliability hazard involved, I might agree with taking the
 fascist approach, but I see no such hazard here ...

Fair enough.  The PGDATA issue isn't a big enough one that I'm terribly
concerned about it, especially if a read-only GUC variable is available
to give that information (something that, I think, should be there
anyway).


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] PostgreSQL configuration

2004-04-15 Thread Kevin Brown
Simon Riggs wrote:
 Very much agreed. PGDATA is important, lets keep it, please.

To me it's not so much whether or not PGDATA is kept around for the
system as a whole so much as how it's used.

In the general case, scripts are used to start the postmaster.  So using
PGDATA even if the postmaster doesn't directly make use of it is a
simple matter of adding '-D $PGDATA' to the command that invokes the
postmaster.

The goal here is simply to make it obvious to a system administrator where
the PG data directory that a given postmaster is using resides.  We can't
rely on the mechanism used to change the command string that ps shows for
the process: in my experience it's something that often does not work.
And in any case, the system administrator will also want to know exactly
what options were passed to the postmaster when it was invoked.


If there's any group that can figure out how to effortlessly get PGDATA
onto the command line of the backend utilities, it's the developer
group.  :-)


In any case, I'm not at all opposed to having the backend stuff know
about PGDATA during development, but for production you should have to
explicitly specify the data directory on the command line.  That seems
easy enough to do: #ifdef is your friend.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] PostgreSQL configuration

2004-04-12 Thread Kevin Brown
Tom Lane wrote:
 Well, the patch says that the command line switch wins, which is
 consistent with what we do for other command line switches (they all
 override the equivalent postgresql.conf entries).  This does seem a
 bit at variance with the stated goal of making the configuration more
 clearly documented, though :-(.  

Hmm...well, think of it as a tool.  It makes it *possible* to make the
configuration more clearly documented, and in fact makes it easy to do
so, but doesn't guarantee safety in all cases.

 If you actually use the capability then
 your config file will be lying to you about where things are.

Of course.  Just like your config file is lying about any configuration
option that is overridden on the command line.

I don't see why this is a problem, unless we intend to change the way
the entire GUC system works.

 It's worth pointing out in this connection that for the most part
 I think people are moving *away* from using command line switches;
 it's better to set the value in postgresql.conf, both for documentation
 reasons and because that way you have some chance of changing the value
 via config file update and SIGHUP.  The only way to change a value on
 the command line is to restart the postmaster.  Plus, if you're using a
 distribution-supplied init script to start the postmaster, it's hard to
 get any switches in without hacking the script anyway.

Now this raises a very interesting problem.  Namely, what happens if
you use the -C option to the postmaster as is being advocated, then
change the datadir entry in the config file, and send SIGHUP to the
postmaster?  Ooops.  Score one for Tom.  :-)

 Most of these objections also apply to values obtained from environment
 variables (the exception is that postgresql.conf can override
 environment variables).

To be honest, I think the use of the PGDATA environment variable is the
biggest impediment to self documentation - the postmaster should not
use it.

The reason is that if PGDATA is used to specify the location of the
data directory, you won't be able to find out where a running
postmaster's data directory is located without doing some heavy-duty
investigation.  Not all operating systems make it possible to determine
the values of a particular process' environment variables.

By requiring that the data directory be specified on the postmaster
command line, it becomes possible to always determine where a
postmaster's data directory resides just by looking at the ps output.


Now, I know you guys who do heavy duty development make use of PGDATA.
I see no problem with having the code in postmaster that looks at
PGDATA be surrounded by an #ifdef that is active whenever you're doing
development work.  But it should *not* be active on a production system.



Oh, as to the safety issue of a config file not properly corresponding
to a given data directory, that seems easy enough to solve: if a file
(call it "magic" for the purposes of discussion, though perhaps a better
name would be "do_not_remove" :-)  ) exists in the data directory, then
the value of a configuration variable (call it "magic", too) must match
the contents of that file.  If the values don't match then the postmaster
will issue an error and refuse to start.  If the file doesn't exist then
no magic configuration option need exist in the config file, and the
postmaster will start as usual.  So any administrator who wants to
ensure that a configuration file is explicitly targeted at the data
directory can do so.  End result: if you use the -D option on the command
line with an inappropriate -C option, the postmaster will refuse to run.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] PostgreSQL configuration

2004-04-09 Thread Kevin Brown
Tom Lane wrote:
 Honza Pazdziora [EMAIL PROTECTED] writes:
  On Thu, Apr 08, 2004 at 10:31:44AM -0400, Tom Lane wrote:
  It seems that MySQL *will* read /etc/my.cnf if it
  exists, whether it's appropriate or not, and so it's impossible to have
  a truly independent test installation, even though you can configure it
  to build/install into nonstandard directories.  Let's not emulate that
  bit of brain damage.
 
  A counterexample of Apache shows that you can easily use -f or another
  command line option to point the server to alternate master config
  file (which I believe is the same with MySQL).
 
 According to
 http://www.mysql.com/documentation/mysql/bychapter/manual_Using_MySQL_Programs.html#Option_files
 /etc/my.cnf will be read if it exists, no matter what you say on the
 command line.  So AFAICS the only way to make a private installation is
 to make sure that you have overridden each and every setting in
 /etc/my.cnf in a private config file that you do control.  This is
 tedious and breakage-prone, of course.

Yes.  But we don't have to do that.

If we're truly concerned about the possibility of multiple installations
attempting to use the same config, then the answer is simple: require
that the location of the config file be specified on the command line
and don't compile a default location into the binary.  Similarly, don't
take the value from an environment variable.

Packaged installations won't have trouble with this: they supply a startup
script which would pass the appropriate argument to the postmaster.


If we want to be a bit paranoid (justifiable if you've got really
important data on the line), we could also require that a version
string exist in the config file.  If the version string doesn't match
the version of the postmaster being started, the postmaster exits with
an error (and a hint of what to set the version string to and what the
name of the version string parameter is).  That way, even if you screw
up on the command line, you won't hose a database by starting the wrong
version of the postmaster against it.  Not sure if this would break
anything, though.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Function to kill backend

2004-04-08 Thread Kevin Brown
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Not having a way to kill backends is like having no way to kill a
  process except rebooting the server.
 
 Some people think that making a database hard to kill is a good thing.

Sure.  But we're not talking about taking down the whole database, we're
talking about taking down a connection.

Killing the database is the equivalent of killing the OS.  It should be
hard to do either.  But it should be easy to kill a process on an OS if
you have the right permissions, and similarly it should be easy to kill
a connection to the database if you have the right permissions.


With respect to nested transactions and other things that might make
properly shutting down difficult, it seems to me that the SIGINT case is
actually a harder case to deal with.  Why?  Because for the SIGTERM case,
you basically have to do whatever is done whenever the connection itself
drops.  If we can't handle the connection itself dropping out arbitrarily
then we have more serious problems than just how to handle SIGTERM.  :-)
But for SIGINT you have to decide whether to just abort the innermost
transaction or the outermost one, and if it's the outermost one you have
to abort then you have to provide the mechanism for it -- something that
you might not have to deal with otherwise.

So it seems that handling SIGTERM might actually be easy: you have the
signal handler close the backend's side of the connection and let the
connection-dropping logic kick in automatically, no?


Thoughts?  Am I completely off my rocker here?  :-)



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PERFORM] [HACKERS] fsync method checking

2004-03-19 Thread Kevin Brown
I wrote:
 Note, too, that the preferred method isn't likely to depend just on the
 operating system, it's likely to depend also on the filesystem type
 being used.
 
 Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
 and that's just off the top of my head.  I imagine the performance of
 the various syncing methods will vary significantly between them.

For what it's worth, my database throughput for transactions involving
a lot of inserts, updates, and deletes is about 12% faster using
fdatasync() than O_SYNC under Linux using JFS.

I'll run the test program and report my results with it as well, so
we'll be able to see if there's any consistency between it and the live
database.
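For anyone wanting to reproduce this kind of comparison, here is a minimal sketch (not the actual test program from this thread; the write size and return convention are arbitrary choices of mine) that times the two methods on a given path:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Time `iters` 8 KB durable writes to `path`.  With use_osync == 0 each
 * write() is followed by fdatasync(); otherwise the file is opened with
 * O_SYNC so every write() is synchronous.  Returns elapsed seconds, or
 * -1.0 on error. */
double time_writes(const char *path, int use_osync, int iters)
{
    char buf[8192];
    struct timeval t0, t1;
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    int i, fd;

    memset(buf, 'x', sizeof buf);
    if (use_osync)
        flags |= O_SYNC;
    if ((fd = open(path, flags, 0600)) < 0)
        return -1.0;

    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf ||
            (!use_osync && fdatasync(fd) != 0)) {
            close(fd);
            return -1.0;
        }
    }
    gettimeofday(&t1, NULL);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}
```

Comparing `time_writes(path, 0, n)` against `time_writes(path, 1, n)` on the filesystem that will hold the WAL gives a rough per-filesystem answer; the absolute numbers will of course differ from a live Postgres workload, for the reasons discussed upthread.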




-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Kevin Brown
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Well, I wrote the program to allow testing.  I don't see a complex test
  as being that much better than simple one.  We don't need accurate
  numbers.  We just need to know if fsync or O_SYNC is faster.
 
 Faster than what?  The thing everyone is trying to point out here is
 that it depends on context, and we have little faith that this test
 program creates a context similar to a live Postgres database.

Note, too, that the preferred method isn't likely to depend just on the
operating system, it's likely to depend also on the filesystem type
being used.

Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
and that's just off the top of my head.  I imagine the performance of
the various syncing methods will vary significantly between them.


It seems reasonable to me that decisions such as which sync method to
use should initially be made at installation time: have the test program
run on the target filesystem as part of the installation process, and
build the initial postgresql.conf based on the results.  You might even
be able to do some additional testing such as measuring the difference
between random block access and sequential access, and again feed the
results into the postgresql.conf file.  This is no substitute for
experience with the platform, but I expect it's likely to get you closer
to something optimal than doing nothing.  The only question, of course,
is whether or not it's worth going to the effort when it may or may not
gain you a whole lot.  Answering that is going to require some
experimentation with such an automatic configuration system.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Too-many-files errors on OS X

2004-02-23 Thread Kevin Brown
I wrote:
 Larry Rosenman wrote:
  I had to hack on the code some more for FreeBSD:
  (the realloc call needed the multiplication).  I ran this same code
  on UnixWare.
 
 I feel like a moron, having missed that.  Probably explains the bad
 file number error I was getting on AIX, too...

And sure enough, that was it.  Got the same results on AIX 5 as on other
systems:

[EMAIL PROTECTED]:~$ ./eatfds /usr/lib/librpm.so.0 /usr/lib/librpmbuild.so.0
dup() failed: Too many open files
Was able to use 1997 file descriptors
dup() failed: Too many open files
Was able to use 1997 file descriptors after opening 2 shared libs
[EMAIL PROTECTED]:~$ uname -a
AIX m048 1 5 0001063A4C00



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Too-many-files errors on OS X

2004-02-22 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  I originally thought that each shared library that was loaded would eat
  a file descriptor (since I thought it would be implemented via mmap())
  but that doesn't seem to be the case, at least under Linux
 
 Hmm.  This may be OS-specific.  The shlibs certainly show up in the
 output of lsof in every variant I've checked, but do they count against
 your open-file limit?

It seems not, for both shared libraries that are linked in at startup
time by the dynamic linker and shared libraries that are explicitly
opened via dlopen().  This seems to be true for Linux and Solaris (I
wasn't able to test on HP-UX, and AIX yields a strange "bad file number"
error that I've yet to track down).

Attached is the test program I used.  It takes as its arguments a list
of files to hand to dlopen(), and will show how many files it was able
to open before and after running a batch of dlopen() commands.


-- 
Kevin Brown   [EMAIL PROTECTED]
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <dlfcn.h>

int *fd;
int size = 1024;

int eatallfds(void) {
	int i = 0;
	int j, myfd;

	while (1) {
		myfd = dup(0);
		if (myfd < 0) {
			fprintf (stderr, "dup() failed: %s\n", strerror(errno));
			break;
		}
		fd[i++] = myfd;
		if (i >= size) {
			size *= 2;
			fd = realloc(fd, size * sizeof(*fd));
			if (fd == NULL) {
				fprintf (stderr, "Can't allocate: %s\n",
						strerror(errno));
				fprintf (stderr, "Had used %d descriptors\n",
						i);
				exit(1);
			}
		}
	}
	for (j = 0 ; j < i ; ++j) {
		close(fd[j]);
	}
	return i;
}


int main (int argc, char *argv[]) {
	int n, na;
	int i;
	void *addr;

	size = 1024;
	fd = malloc(size * sizeof(*fd));
	if (fd == NULL) {
		fprintf (stderr, "Can't allocate: %s\n", strerror(errno));
		return 1;
	}
	n = eatallfds();
	printf ("Was able to use %d file descriptors\n", n);

	na = 0;
	for (i = 1 ; i < argc ; ++i) {
		addr = dlopen(argv[i], RTLD_LAZY);
		if (addr != NULL) na++;
	}
	n = eatallfds();
	printf ("Was able to use %d file descriptors after opening %d shared libs\n", n, na);
	return 0;
}




Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-07 Thread Kevin Brown
I wrote:
 But that someplace else
 could easily be a process forked by the backend in question whose sole
 purpose is to go through the list of files generated by its parent backend
 and fsync() them.  The backend can then go about its business and upon
 receipt of the SIGCHLD notify anyone that needs to be notified that the
 fsync()s have completed.

Duh, what am I thinking?  Of course, the right answer is to have the
child notify anyone that needs notification that fsync()s are done.  No
need for involvement of the parent (i.e., the backend in question)
unless the architecture of PG requires it somehow.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-06 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Well, running out of space in the list isn't that much of a problem.  If
  the backends run out of list space (and the max size of the list could
  be a configurable thing, either as a percentage of shared memory or as
  an absolute size), then all that happens is that the background writer
  might end up fsync()ing some files that have already been fsync()ed.
  But that's not that big of a deal -- the fact they've already been
  fsync()ed means that there shouldn't be any data in the kernel buffers
  left to write to disk, so subsequent fsync()s should return quickly.
 
 Yes, it's a big deal.  You're arguing as though the bgwriter is the
 thing that needs to be fast, when actually what we care about is the
 backends being fast.  If the bgwriter isn't doing the vast bulk of the
 writing (and especially the fsync waits) then we are wasting our time
 having one at all.  So we need a scheme that makes it as unlikely as
 possible that backends will have to do their own fsyncs.  Small
 per-backend fsync lists aren't the way to do that.

Ah, okay.  Pardon me, I was writing on low sleep at the time.

If we want to make the backends as fast as possible then they should
defer synchronous writes to someplace else.  But that someplace else
could easily be a process forked by the backend in question whose sole
purpose is to go through the list of files generated by its parent backend
and fsync() them.  The backend can then go about its business and upon
receipt of the SIGCHLD notify anyone that needs to be notified that the
fsync()s have completed.  This approach on any reasonable OS will have
minimal overhead because of copy-on-write page handling in the kernel
and the fact that the child process isn't going to exec() or write to
a bunch of memory.  The advantage is that each backend can maintain its
own list in per-process memory instead of using shared memory.  The
disadvantage is that a given file could have multiple simultaneous (or
close to simultaneous) fsync()s issued against it.  As noted previously,
that might not be such a big deal.

You could still build a list in shared memory of the files that backends
are accessing but it would then be a cache of sorts because it would
be fixed in size.  As soon as you run out of space in the shared list,
you'll have to expire some entries.  An expired entry simply means
that multiple fsync()s might be issued for the file being referred to.
But I suspect that such a list would have far too much contention,
and that it would be more efficient to simply risk issuing multiple
fsync()s against the same file by multiple backend children.

Another advantage to the child-of-backend-fsync()s approach is that it
would cause simultaneous fsync()s to happen, and on more advanced OSes
the OS itself should be able to coalesce the work to be done into a more
efficient pattern of writes to the disk.  That won't be possible if
fsync()s are serialized by PG.  It's not as good as a syscall that would
allow you to fsync() a bunch of file descriptors simultaneously, but it
might be close.
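As a sketch of the fork-a-child idea (hypothetical code, not anything in the PostgreSQL tree; the function name is my invention): the child inherits the open descriptors via a copy-on-write fork(), fsync()s each one, and reports failures through its exit status, which the parent can collect via waitpid() or SIGCHLD.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that fsync()s each descriptor in fds[0..nfds-1] and
 * exits with the number of failures (capped at 255 by exit-status
 * semantics).  The parent returns immediately with the child's pid
 * and goes about its business. */
pid_t fsync_in_child(const int *fds, int nfds)
{
    pid_t pid = fork();

    if (pid == 0) {                 /* child: do the synchronous work */
        int i, failures = 0;

        for (i = 0; i < nfds; i++)
            if (fsync(fds[i]) != 0)
                failures++;
        _exit(failures > 255 ? 255 : failures);
    }
    return pid;                     /* parent (or -1 if fork failed) */
}
```

Because the child never exec()s and writes to almost no memory, the fork should be cheap on any reasonable copy-on-write kernel, as argued above.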

I have no idea whether or not this approach would work in Windows.

  Perhaps a better way to do it would be to store the list of all the
  relfilenodes of everything in pg_class, with a flag for each indicating
  whether or not an fsync() of the file needs to take place.
 
 You're forgetting that we have a fixed-size workspace to do this in ...
 and no way to know at postmaster start how many relations there are in
 any of our databases, let alone predict how many there might be later on.

Unfortunately, this is going to apply to most any approach.  The number
of blocks being dealt with is not fixed, because even though the cache
itself is fixed in size, the number of block writes it represents (and
thus the number of files involved) is not.  The list of files itself is
not fixed in size, either.

However, this *does* suggest another possible approach: you set up a
fixed size list and fsync() the batch when it fills up.


It sounds like we need to define the particular behavior we want first.
We're optimizing for some combination of throughput and responsiveness,
and those aren't necessarily the same thing.  I suppose this means that
the solution chosen has to have enough knobs to allow the DBA to pick
where on the throughput/responsiveness curve he wants to be.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-04 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Instead, have each backend maintain its own separate list in shared
  memory.  The only readers of a given list would be the backend it belongs
  to and the bgwriter, and the only time bgwriter attempts to read the
  list is at checkpoint time.
 
  The sum total size of all the lists shouldn't be that much larger than
  it would be if you maintained it as a global list.
 
 I fear that is just wishful thinking.  Consider the system catalogs as a
 counterexample of files that are likely to be touched/modified by many
 different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot
of backends, just that the number of such files is likely to be relatively
small -- a few tens of files, perhaps.  But that admittedly can add up
fast.  But see below.


 The bigger problem though with this is that it makes the problem of
 list overflow much worse.  The hard part about shared memory management
 is not so much that the available space is small, as that the available
 space is fixed --- we can't easily change it after postmaster start.
 The more finely you slice your workspace, the more likely it becomes
 that one particular part will run out of space.  So the inefficient case
 where a backend isn't able to insert something into the appropriate list
 will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem.  If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.
How quickly depends on the individual kernel's implementation of the
dirty buffer list as it relates to file descriptors.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.  When anything
writes to a file without O_SYNC or a trailing fsync(), it sets the flag
for the relfilenode of what it's writing.  Then at checkpoint time, the
bgwriter can scan the list and fsync() everything that has been flagged.

The relfilenode list should be relatively small in size: at most 16
bytes per item (and that on a 64-bit machine).  A database that has 4096
file objects would have a 64K list at most.  Not bad.
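Concretely, each entry in such a list might look like this (the struct and field names are my own invention for illustration, not PostgreSQL's):

```c
#include <stdint.h>

/* One entry per file object.  Any backend that writes the file without
 * O_SYNC or a trailing fsync() sets the flag; the bgwriter clears it at
 * checkpoint time.  A real version would need locking or atomics. */
typedef struct FsyncEntry {
    uint64_t relfilenode;   /* on-disk file identifier */
    uint32_t dbid;          /* owning database */
    uint32_t needs_fsync;   /* dirty flag (word-sized for alignment) */
} FsyncEntry;               /* 16 bytes, matching the estimate above */

/* Mark an entry after an unsynced write. */
void mark_dirty(FsyncEntry *e) { e->needs_fsync = 1; }

/* At checkpoint time: returns nonzero if the file must be fsync()ed,
 * clearing the flag so the next checkpoint skips it. */
int take_dirty(FsyncEntry *e)
{
    int was = (int) e->needs_fsync;
    e->needs_fsync = 0;
    return was;
}
```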

Because each database backend can only see the class objects associated
with the database it's connected to or the global objects (if there's a
way to see all objects I'd like to know about it, but pg_class only
shows objects in the current database or objects which are visible to
all databases), the relfilenode list might have to be broken up into one
list per database, with perhaps a separate list for global objects.

The interesting question in that situation is how to handle object
creation and removal, which should be a relatively rare occurrence
(fortunately), so it supposedly doesn't have to be all that efficient.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-03 Thread Kevin Brown
Bruce Momjian wrote:
 Here is my new idea.  (I will keep throwing out ideas until I hit on a
 good one.)  The bgwriter it going to have to check before every write to
 determine if the file is already recorded as needing fsync during
 checkpoint.  My idea is to have that checking happen during the bgwriter
 buffer scan, rather than at write time.  if we add a shared memory
 boolean for each buffer, backends needing to write buffers can writer
 buffers already recorded as safe to write by the bgwriter scanner.  I
 don't think the bgwriter is going to be able to keep up with I/O bound
 backends, but I do think it can scan and set those booleans fast enough
 for the backends to then perform the writes.  (We might need a separate
 bgwriter thread to do this or a separate process.)

That seems a bit excessive.

It seems to me that contention is only a problem if you keep a
centralized list of files that have been written by all the backends.
So don't do that.

Instead, have each backend maintain its own separate list in shared
memory.  The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write
lock on the list, copies it into its own memory space, truncates the
list, and then releases the read lock.  It then deletes the entries
out of its own list that have entries in the backend list it just read.
It then fsync()s the files that are left, under the assumption that the
backends will fsync() any file they write to directly.

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list.  I'd conjecture that
backends that touch many of the same files are not likely to be touching a
large number of files per checkpoint, and those systems that touch a large
number of files probably do so through a lot of independent backends.


One other thing: I don't know exactly how checkpoints are orchestrated
between individual backends, but it seems clear to me that you want to do
a sync() *first*, then the fsync()s.  The reason is that sync() allows
the OS to order the writes across all the files in the most efficient
manner possible, whereas fsync() only takes care of the blocks belonging
to the file in question.  This won't be an option under Windows, but
on Unix systems it should make a difference.  On Linux it should make
quite a difference, since its sync() won't return until the buffers
have been flushed -- and then the following fsync()s will return almost
instantaneously since their data has already been written (so there
won't be any dirty blocks in those files).  I suppose it's possible that
on some OSes fsync()s could interfere with a running sync(), but for
those OSes we can just drop back to doing only fsync()s.
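The proposed ordering is simple to express (a sketch under the assumptions above; the function name is made up, and the claim that the trailing fsync()s act as cheap completion barriers depends on the OS's sync() actually flushing rather than merely scheduling):

```c
#include <stdlib.h>
#include <unistd.h>

/* Checkpoint flush sketch: one global sync() lets the kernel order the
 * writeback across all files, then per-file fsync() calls wait for each
 * file's data to reach disk.  Returns the number of fsync() failures. */
int checkpoint_flush(const int *fds, int nfds)
{
    int i, failures = 0;

    sync();                       /* globally ordered writeback first */
    for (i = 0; i < nfds; i++)    /* then per-file completion barriers */
        if (fsync(fds[i]) != 0)
            failures++;
    return failures;
}
```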


As usual, I could be completely full of it.  Take this for what it's
worth.  :-)


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] index scan with functional indexes -- solved

2004-01-30 Thread Kevin Brown
Dave Cramer wrote:
 Interesting it works now, and the good news is it is *WAY* faster, this
 might be able to speed up marc's doc search by orders of magnitude
 
 this is searching 100536 rows
 
  select * from url where fn_strrev(url) like fn_strrev('%beta12.html');
 1.57ms
 
 
 explain select * from url where url like '%beta12.html';
  3310.38 ms


The nice thing about this is that you can create your query thusly:

SELECT * from table WHERE column like 'string' AND fn_strrev(column)
LIKE fn_strrev('string')

and, if you have both a standard index on column and a functional index
on fn_strrev(column), the query will be fast (well, as fast as the
pattern in question allows) as long as 'string' is anchored on either end.


I've implemented the 'locate' utility in Perl using a PG backend instead
of the standard locate database.  I internally convert globs given as
arguments into LIKE strings, and with a functional index like that the
searches are now blazingly fast -- faster than the original 'locate'
utility.  It has the added advantage that you can specify a file type
to further narrow the search (thus 'locate --type file core' will find
all regular files named 'core' in the database).

I'll be happy to share my code with anyone who's interested.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] psql \d option list overloaded

2004-01-10 Thread Kevin Brown
in the know about how to show the available databases, or how to
describe a table.

  sold me on MySQL when I first learned it.  For me, it's like:
  
 'dir' in DOS,
 'ls' in Unix
 'SHOW' in MySQL
 ??? in PostgreSQL ?
 
 We've been over this. It's \d*. 

For listing databases it's \l.  Not exactly consistent with the rest of
the related psql commands.

  Sure, with time as my database needs grew and I matured as a developer,
  I eventually gained more respect for PostgreSQL and have made the switch
  even without this feature, but to this day, I really think MySQL *did it
  right* with those extensions.  You can't become a PostgreSQL guru without
  being a newbie first.  I vote we make it easier for newbies.
 
 What really frightens me here is that I know of several applications (shudder,
 LAMP applications) which use the output of show tables or other of your
 extensions. The problem with this is precisely that it /isn't/ sql, and it
 can't be supported as a static command. 

Of course not.  But applications which rely on information such as that
provided by show tables will typically not be possible to write while
adhering to the feature intersection of all major databases anyway.

 It is intended to be there for people
 to use interactively. 

Nonsense.  It's there to be used.  Whether it's used interactively or
not is irrelevant.  The command provides useful information.  But see
below.

 Making pseudo sql will encourage more developers to
 (and I'd apologize for this if it weren't true) code in Postgres the same
 lazy way they code in MySQL.

This is a strawman argument, although I understand your concern here.
To be honest, for application development I'd much rather see people
use information_schema, but that's only because information_schema is
in the SQL standard and as such should be the preferred way to retrieve
the information that the SHOW commands in MySQL return.  That said,
the inclusion of information_schema is a very recent development on the
PostgreSQL side of things, and doesn't even exist on some other major
databases such as MSSQL.

Of course, a PG equivalent to MySQL's show would be an even more
recent development...  :-)


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] pg_restore and create FK without verification check

2003-11-26 Thread Kevin Brown
Tom Lane wrote:
 Andreas Pflug [EMAIL PROTECTED] writes:
  In principle you could do this today, but we don't have enough
  support code in place to make it work smoothly, eg WAL segment files
  aren't labeled with enough identifying information to let you manage
  an archive full of 'em.  Still it doesn't seem that far away.
 
  So I issue CHECKPOINT, and tar the cluster or database. Still, I got two 
  questions:
  - how to restore a single database
 
 You don't.  As I said, any physical backup is going to be
 all-or-nothing.  These techniques are not a replacement for pg_dump.

But this is just an artifact of the fact that the WAL is a single
instance-wide entity, rather than a per-database entity.  But since
databases are completely separate entities that cannot be simultaneously
accessed by any query (corrections welcome), there isn't any reason in
principle that the WAL files cannot also be created on a per-database
basis.

I'm sure, of course, that doing so would bring with it a new set of
problems and tradeoffs, so it might not be worth it...


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] pg_restore and create FK without verification check

2003-11-26 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  You don't.  As I said, any physical backup is going to be
  all-or-nothing.  These techniques are not a replacement for pg_dump.
 
  But this is just an artifact of the fact that the WAL is a single
  instance-wide entity, rather than a per-database entity.  But since
  databases are completely separate entities that cannot be simultaneously
  accessed by any query (corrections welcome), there isn't any reason in
  principle that the WAL files cannot also be created on a per-database
  basis.
 
 WAL is not the bottleneck ... as I already mentioned today, pg_clog (and
 more specifically the meaning of transaction IDs) is what really makes a
 cluster an indivisible whole at the physical level.
 
 If you want to do separate physical dumps/restores, the answer is to set
 up separate clusters (separate postmasters).  Not so hard, is it?

Well, aside from the fact that separate clusters have completely separate
user databases, listen on different ports, will compete with other
clusters on the same system for resources that would be better managed
by a single cluster, and generally have to be maintained as completely
separate entities from start to finish, no it's not that hard.  ;-)


The ability to restore a single large database quickly is, I think,
a reasonable request, it's just that right now it's difficult (perhaps
impossible) to satisfy that request.


It's probably something that we'll have to deal with if we want PG to be
useful to people managing really large databases on really, really big
iron, though.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Release cycle length

2003-11-17 Thread Kevin Brown
Neil Conway wrote:
 Peter Eisentraut [EMAIL PROTECTED] writes:
  First, if you develop something today, the first time users would
  realistically get a hand at it would be January 2005.  Do you want
  that?  Don't you want people to use your code?
 
 Sure :-) But I don't mind a long release cycle if it is better for
 users.

Given that users can run whatever they like, it's not clear that a long
release cycle is better for users.

   (c) As PostgreSQL gets more mature, putting out stable,
   production-worthy releases becomes even more important. In
   theory, longer release cycles contribute to higher quality
   releases: we have more time to implement new features properly,
   polish rough edges and document things, test and find bugs, and
   ensure that features we've implemented earlier in the release
   cycle are properly thought out, and so forth.

On the other hand, the longer you wait to release a new feature, the
longer it will be before you get your REAL testing done.  You don't want
to release something that hasn't at least been looked over and checked
out by the development community first, of course, but waiting beyond that
point to release a new version of PG doesn't help you that much, because
most people aren't going to run the latest CVS version -- they'll run
the latest released version, whatever that may be.  So the time between
the testing phase for the feature you implement and the version release
is essentially dead time for testing of that feature, because most
developers have moved on to working on and/or testing something else.

That's why the release methodology used by the Linux kernel development
team is a reasonable one.  Because the development releases are still
releases, people who wish to be more on the bleeding edge can do so
without having to grab the source from CVS and compile it themselves.
And package maintainers are more likely to package up the development
version if it's given to them in a nice, prepackaged format, even if
it's just a source tarball.

   Note that whether or not we are using those 355 days effectively
   is another story -- it may well be true that there are ways we could
   make parts of the development process much more efficient.
 
 Furthermore, longer release cycles reduce, to some degree, the pain of
 upgrades. Unless we make substantial improvements to the upgrade story
 any time soon, I wouldn't be surprised if many DBAs are relieved at
 only needing to upgrade once a year.

But DBAs only need to upgrade as often as they feel like.  Any
reasonable distribution will give them an option of using either the
stable version or the development version anyway, if we're talking about
prepackaged versions.

  The longer you develop, the more parallel efforts are underway, and
  it becomes impossible to synchronize them to a release date.
 
 I think this is inherent to the way PostgreSQL is developed: Tom has
 previously compared PostgreSQL release scheduling to herding cats :-)
 As long as much of the work on the project is done by volunteers in
 their spare time, ISTM that coordinating everyone toward a single
 release date is going to be difficult, if not impossible. The length
 of the release cycle doesn't really affect this, IMHO.

Linux, too, is done largely by volunteers in their spare time.  Yet
Linux kernel releases are much more frequent than PostgreSQL releases.
One difference is that the Linux community makes a distinction between
development releases and stable releases.  The amount of time between
stable releases is probably about the same as it is for PostgreSQL.  The
difference is that the *only* releases PostgreSQL makes are stable
releases (or release candidates, when a stable release is close).
That's something we might want to re-think.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Release cycle length

2003-11-17 Thread Kevin Brown
Matthew T. O'Connor wrote:
 I agree with Peter's other comment, that the longer the development 
 cycle, the longer the beta / bug shakeout period, perhaps a shorter dev 
 cycle would yield a shorter beta period, but perhaps it would also 
 result in a less solid release.

Perhaps.  Perhaps not.  The fewer the changes, the less complexity you
have to manage.

But it would certainly result in a smaller set of feature changes per
release.  Some people might regard that as a good thing.

The advantage to doing more frequent releases is that new features end
up with more real-world testing within a given block of time, on
average, because a lot more people pick up the releases than the CVS
snapshots or even release candidates.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Still a few flaws in configure's default CFLAGS selection

2003-10-27 Thread Kevin Brown
Bruce Momjian wrote:

 Well, we don't want to use debug for non-gcc (no optimization) so do we
 do -g for gcc, and then --enable-debug does nothing.  Seems people can
 decide for themselves.

But doesn't --enable-debug turn off optimization?

It's really a question of what the default behavior should be for each
option.  Clearly for non-gcc compilers, the default should be -O only
since they probably can't simultaneously handle -g.  But gcc builds are
an exception, one which I think is worth considering.  Hence my opinion
that for gcc builds, the default should be -g and -O.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Still a few flaws in configure's default CFLAGS selection

2003-10-26 Thread Kevin Brown
Bruce Momjian wrote:
 Peter Eisentraut wrote:
  Tom Lane writes:
  
   What Peter was advocating in that thread was that we enable -g by
   default *when building with gcc*.  I have no problem with that, since
   there is (allegedly) no performance penalty for -g with gcc.  However,
   the actual present behavior of our configure script is to default to -g
   for every compiler, and I think that that is a big mistake.  On most
   non-gcc compilers, -g disables optimizations, which is way too high a
   price to pay for production use.
  
  You do realize that as of now, -g is the default for gcc?

It is?

[EMAIL PROTECTED]:~/tmp$ gcc -c foo.c
[EMAIL PROTECTED]:~/tmp$ ls -l foo.o
-rw-r--r--  1 kevin  kevin    876 Oct 26 18:52 foo.o
[EMAIL PROTECTED]:~/tmp$ gcc -g -c foo.c
[EMAIL PROTECTED]:~/tmp$ ls -l foo.o
-rw-r--r--  1 kevin  kevin  12984 Oct 26 18:52 foo.o
Reading specs from /usr/lib/gcc-lib/i386-linux/3.3/specs


Doesn't look like it to me...

If -g is the default, it must be very recent, in which case it's
obviously not something for our configuration scripts to rely on.

  Was that the intent?
 
 I was going to ask that myself.  It seems strange to include -g by default ---
 we have --enable-debug, and that should control -g on all platforms.

I thought --enable-debug had other implications, e.g. enabling assert()s
and other such things you might want enabled for debugging but not for
production.  It certainly makes sense for it to have such semantics even
if it doesn't right now.

When combined with gcc, -g is, IMO, too useful to eliminate: it makes it
possible to get good stacktraces in the face of crashes, and makes it
possible to examine variables and such when looking at core files.

 Also, -g bloats the executable, encouraging people/installers to run
 strip, which removes all symbols.  Without -g and without strip, at
 least we get function names in the backtrace.

This should be up to the individual.  I'd argue that disk space is so
plentiful and so cheap these days that executable bloat is hardly worth
considering.

But even if it were, a database tends to be so critical to so many
things that you probably want to know why and how it crashes more than
you would most other things.  So even if you might be inclined to strip
most of your binaries, you might think twice about doing the same for
the PG binaries.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Call for port reports

2003-10-25 Thread Kevin Brown
Bruce Momjian wrote:
 How does everyone like this patch?  It removes -g from non-debug
 compiles, and changes -O2 to -O for FreeBSD/Alpha.

I'd be hesitant to remove -g from non-debug compiles.  If something
crashes, it's useful to be able to get a good stacktrace from the
resulting core file.  The -g option makes that possible for optimized
code when compiling with gcc.

Is there any way we can have configure put -g in when it detects gcc?



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Dreaming About Redesigning SQL

2003-10-24 Thread Kevin Brown
Dawn M. Wolthuis wrote:

 So, nope, I'm not trolling.  I've been doing some research the past
 couple of years and I'm convinced that it is time to do something new
 (and yet old) with data persistence.

Perhaps.

But before you go down that road, you have to answer the following
simple, yet possibly difficult-to-answer, question:

What problem are you trying to solve?

Data persistence is far too vague a term to be meaningful in its own
right -- the term needs some context to have any real meaning here.

We store data for a reason.  Similarly, we retrieve it for a reason.
The data we're interested in looking for and the way we are interested
in looking for it will have a huge impact on any data retrieval solution
one could craft.

The relational model of data storage and retrieval is just a model,
highly suitable to some things and not suitable at all to others.  The
amount of development work that has gone into it and the amount of use
it has gotten shows that the relational model is actually reasonably
good at meeting many of the data storage and retrieval needs that people
have.  As with any method, it has tradeoffs and drawbacks.

There is no magic bullet and there never will be (or so experience
says).  I have no reason to believe that the problem of data persistence
and retrieval is any exception to that.

If you have a particular set of data retrieval problems in mind that you
wish to solve, by all means feel free to develop the mathematical
foundation to solve them.  Feel free to tell us that the relational
model is not suitable for that set of problems -- we might even agree
with you on that.

But don't make the claim that the relational model is lacking as a
result of not being a storage and retrieval method that is suitable to
all problems, and that there is a Better Way that will Solve Everything.
Many have made such claims about many different technologies.  They were
wrong, too.


I may be misreading you and responding to arguments you aren't making or
implying, but if so I doubt I'm alone, based on some of the other
responses I've seen here.


By the way, language is only a means of expression, and the only sort
of question (relevant to this discussion, anyway) that a language is the
answer to is "what's the best way for me to express X?"  It is neither
an answer to the question of how to retrieve data nor is it a solution to
the problem of storing data in a persistent manner.  The answer to the
question of how best to query data is certainly going to be a language,
but the specific language one comes up with in answer to the question
will depend on what the person asking wants.  English is likely to
be the best answer only under certain circumstances.  SQL is likely to
be the best answer (or, at least, a very good answer) only under other
circumstances.  It just depends.  But as with any solution to any problem,
there is no one-size-fits-all solution.  As a mathematician, you should
know this: the English language is horribly inadequate for expressing
mathematical concepts.  That's why mathematicians don't use it as their
primary language for doing math.  Why, then, should we expect English,
or Java, or any other language to be any better for performing certain
kinds of queries against data than some other, more directed language?
Say what you want about SQL, but at least it was designed with querying
table-based data in mind and is at least somewhat good at its job.


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] 2-phase commit

2003-09-28 Thread Kevin Brown
Bruce Momjian wrote:
 Kevin Brown wrote:
  Actually, all that's really necessary is the ability to call a stored
  procedure when some event occurs.  The stored procedure can take it from
  there, and since it can be written in C it can do anything the postgres
  user can do (for good or for ill, of course).
 
 But the postmaster doesn't connect to any database, and in a serious
 failure, might not be able to start one.

Ah, true.  But I figured that in the context of 2PC and replication that
most of the associated failures were likely to occur in an active
backend or something equivalent, where a stored procedure was likely to
be accessible.

But yes, you certainly want to account for failures where the database
itself is unavailable.  So I guess my original comment isn't strictly
true.  :-)


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.

2003-09-27 Thread Kevin Brown
Stephan Szabo wrote:
 The case at hand (with *'s on the ri queries) assuming pk already
 has an id=1 row would be.
 T1: begin;
 T1: set transaction isolation level serializable;
 T1 ... (something that does a select, not necessarily on either pk or fk)
 T2: begin;
 T2: insert into fk values (1);
 T2*:select * from pk where id=1 for update;
 T2: commit;
 T1: delete from pk where id=1;
 T1*:select * from fk where id=1 for update;
 T1: commit;
 
 If you want to treat the serial execution as T1 followed by T2.  Then
 T2* would have to show no rows for pk and T2 rolls back.
 
 If you want to treat the order as T2,T1, then T1* would have to see the
 row that T2 inserted and T1 rolls back.
 
 Right now, you won't get that, you'll get T2* showing 1 row and T1*
 showing 0 rows.

Does it also behave this way *without* any actual foreign key
constraints in place?  In other words, if you perform the RI queries
explicitly?

If so, then the problem is with the serialization code.  Sounds like
that's pretty much what you're saying.


The problem in the scenario you described should be solved if we mark any
rows that are selected with the for update option (either implicitly,
as with RI triggers, or explicitly) as having been modified by the
selecting transaction, the equivalent of (in the case of T2*) update pk
set id=id where id=1 but without firing any of the ON MODIFY triggers.
A rollback would, of course, not have any effect on the data in those
rows since there weren't any real changes.  This fix won't work,
of course, if the serialization code is so broken that it doesn't work
properly even in the face of updates (something I'd find hard to believe).


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.

2003-09-27 Thread Kevin Brown
Stephan Szabo wrote:
  The problem in the scenario you described should be solved if we mark any
  rows that are selected with the for update option (either implicitly,
  as with RI triggers, or explicitly) as having been modified by the
  selecting transaction, the equivalent of (in the case of T2*) update pk
  set id=id where id=1 but without firing any of the ON MODIFY triggers.
  A rollback would, of course, not have any effect on the data in those
  rows since there weren't any real changes.  This fix won't work,
  of course, if the serialization code is so broken that it doesn't work
  properly even in the face of updates (something I'd find hard to believe).
 
 That fixes the case above which will fix the ri constraints for right now
 (although they really have to stop using for update eventually), but
 doesn't really solve the serialization problem since it still exists
 AFAICS without for update. Without the for update, you still have T2*
 getting 1 row and T1* getting 0 which can't happen for either ordering of
 the transactions.  It gets worse if that select as a holder at the
 beginning of T1 was say select * from fk where id=1 because SQL tells us
 that the later select can't see a different set of rows from the earlier
 one, so T2 shouldn't be allowed to commit before T1.

That's what I was afraid of, and what I figured serialization really
meant: what you see is a snapshot of the database as it was at
transaction start time.

I can't think of any good way to implement proper serialization without
destroying a serialized transaction's read performance, because it
seems to me that the only way to properly implement serialization is to
somehow record on-disk all the rows a serializable transaction visits,
which means that a serializable transaction is going to be *much* slower
than a read-committed transaction.  You have to mark such rows because
other transactions (even read-committed transactions) have to abort
if they attempt to modify such a row, and the list of such rows can
grow far too large to record it in shared memory.  Worse, you have to
maintain a dynamic list of serializable transactions that have seen the
row and remove a transaction from the list once it commits or rolls back,
because the only time a transaction needs to care about this when changing
a row is when there's a currently-running transaction that's seen it.

We could use the MVCC mechanism to implement it: duplicate the row being
examined and assign the reader's transaction ID to the duplicate just
as if it had modified the row.  But you also have to somehow flag the
duplicate as being there as a result of a serializable read, so that
other serializable transactions that try to modify the row after the one
in question has committed won't themselves throw a serialization error
(because without the flag they'd think they were attempting to read a
row that had been modified by someone else during their lifetime).


The other situation you have to deal with is when you have two
transactions, 1 and 2, that start and commit in that order but which
have overlapping times of execution.  If transaction 1 modifies a row
after transaction 2 starts, then commits before transaction 2 reads it,
transaction 2 has to be able to detect that and throw a serialization
error.

The way around that problem is to assign a commit ID to each transaction
at commit time.  The commit ID is just the transaction ID that will be
assigned to the next transaction that runs.  It might make sense for
assignment of commit IDs to increment the transaction ID counter the
way assignment of a transaction ID does.  Anyway, if a serializable
transaction reads a row that has a commit ID greater than the reader's
transaction ID, it throws a serialization error.  It's probably sufficient
to store the commit ID along with the transaction ID of the committer
in the transaction log as well as in shared memory, so that the commit
ID can be quickly looked up from the transaction ID.
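The commit-ID idea above can be sketched as a toy model (all names here are
invented for illustration; a real implementation would live inside the
server and record commit IDs in the transaction log):

```python
class SerializationError(Exception):
    pass

class IdCounter:
    """One counter hands out both transaction IDs and commit IDs."""
    def __init__(self):
        self.next_id = 0
    def advance(self):
        self.next_id += 1
        return self.next_id

counter = IdCounter()
commit_ids = {}          # writer's transaction ID -> its commit ID

def begin():
    return counter.advance()

def commit(txn_id):
    # assigning a commit ID bumps the same counter, as suggested above
    commit_ids[txn_id] = counter.advance()

def serializable_read(reader_txn_id, row_writer_txn_id):
    """Raise if the row was committed after the reader started."""
    cid = commit_ids.get(row_writer_txn_id)
    if cid is not None and cid > reader_txn_id:
        raise SerializationError("row committed during reader's lifetime")
    return "row contents"

# T1 starts, then T2 starts, modifies a row, and commits first:
t1 = begin()
t2 = begin()
commit(t2)               # T2's commit ID is now greater than t1

try:
    serializable_read(t1, t2)
    outcome = "ok"
except SerializationError:
    outcome = "serialization error"
```

A transaction that begins after T2's commit reads the same row without
error, since its transaction ID exceeds T2's commit ID.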



Maybe there's a better way around all this, but I certainly can't think
of one at the moment.  :-(



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] 2-phase commit

2003-09-27 Thread Kevin Brown
Bruce Momjian wrote:
 Marc G. Fournier wrote:
  
  
  On Sat, 27 Sep 2003, Bruce Momjian wrote:
  
   I have been thinking it might be time to start allowing external
   programs to be called when certain events occur that require
   administrative attention --- this would be a good case for that.
   Administrators could configure shell scripts to be run when the network
   connection fails or servers drop off the network, alerting them to the
   problem.  Throwing things into the server logs isn't _active_ enough.
  
  Actually, apparently you can do this now ... there is apparently a mail
  module for PostgreSQL that you can use to have the database send email's
  out ...
 
 The only part that needs to be added is the ability to call an external
 program when some event occurs, like a database write failure.

Actually, all that's really necessary is the ability to call a stored
procedure when some event occurs.  The stored procedure can take it from
there, and since it can be written in C it can do anything the postgres
user can do (for good or for ill, of course).


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.

2003-09-26 Thread Kevin Brown
Tom Lane wrote:
 Stephan Szabo [EMAIL PROTECTED] writes:
  I think theoretically in serializable the cases where the difference
  between the snapshot from this statement and the standard snapshot for the
  transaction are noticable we probably have a serialization failure
 
 Hmm, that is a good point.  It would be cleaner to throw a can't
 serialize failure than have the RI triggers run under a different
 snapshot.  I am not sure if we can implement that behavior easily,
 though.  Can you think of a way to detect whether there's an RI conflict
 against a later-started transaction?

Just some thoughts on this that, of course, could be wrong.  So please
don't be too hard on me if I'm full of it.  :-)

By a later-started transaction I assume you mean a later-started
transaction that commits before yours does?

I don't see how RI is any different than dealing with straight SQL
in this regard.  The effect of RI is to read/write/delete rows from a
related table that you otherwise wouldn't read or modify, and that means
that the RI mechanism needs to be treated in exactly the same way that
the equivalent SELECT/UPDATE/DELETE would be.

So the question I have is: what would PG do in the case that you SELECT
the same row(s) that the RI triggers are reading implicitly?  For
instance, suppose we have two tables:

CREATE TABLE corps (id integer PRIMARY KEY, name varchar(32));
CREATE TABLE widgets (id integer PRIMARY KEY, name varchar(32),
corpid integer REFERENCES corps(id) ON DELETE CASCADE);

When, within a transaction, I do:

INSERT INTO widgets (id, name, corpid) VALUES (1, 'cpu', 3);

the RI mechanism will automatically check to make sure that the value
3 is in the id column of the corps table.  Put another way, it will do
an implicit SELECT id FROM corps WHERE id = 3, right?  So suppose
that for the purposes of testing the serialization code I remove the RI
triggers and then actually do the following:

SELECT id FROM corps WHERE id = 3;
INSERT INTO widgets (id, name, corpid) VALUES (1, 'cpu', 3);

If my transaction is serializable then clearly, when another transaction
does

UPDATE corps SET id = 4 WHERE id = 3;

and commits before my transaction commits, either the updating
transaction is in violation of serializability rules or the inserting
transaction is.  Serialization is maintained if either of those
transactions aborts with a serialization error.

But note that whether or not RI is involved should be entirely
irrelevant.  What matters is what rows each transacion sees and
modifies.  How the row gets looked at doesn't matter; the only thing
that matters is that the row *does* get looked at.

The important thing here is that the effect of the RI mechanism MUST be
the same as if the equivalent manual SQL statements were executed within
the same transaction.  If it's not, then the RI mechanism is broken and
needs to be fixed at that level.

But if PG exhibits exactly the same bug this thread refers to regardless
of whether a row is examined/modified via directly issued SQL or via
the RI mechanism then the problem lies not within the RI code at all,
but within the serialization code.


I just hope I'm not merely stating the obvious here...


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Killing the backend to cancel a long waiting query

2003-09-20 Thread Kevin Brown
Tom Lane wrote:
 Paulo Scardine [EMAIL PROTECTED] writes:
  I trust when you say kill, you really mean send SIGINT ...
 
  I'm sending a SIGTERM. Would SIGINT be more appropriate?
 
 Yes --- that would actually cancel the query, not cause the backend to
 shut down.

Ahh...this is very useful information.

Wouldn't it be useful, though, to implement a KILL or CANCEL SQL
command that takes a backend ID as its argument (and, of course, does
the appropriate checks of whether you're a superuser or the owner of
the backend) and sends the appropriate signal to the target backend?

That would make it possible for users to kill their own runaway queries
without having to run as whatever user PG is running as.
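The permission rule being proposed is simple enough to sketch (everything
here is hypothetical -- the names, the backend table, and the signal
numbers stand in for state a real server-side command would look up
internally):

```python
SIGINT = 2    # cancels the current query
SIGTERM = 15  # would shut the whole backend down instead

# hypothetical view of the running backends: id -> owner and OS pid
backends = {4242: {"owner": "alice", "pid": 12345}}

def cancel_backend(requester, is_superuser, backend_id):
    """Return (pid, signal) to deliver, enforcing the ownership rule."""
    backend = backends.get(backend_id)
    if backend is None:
        raise KeyError("no such backend")
    if not (is_superuser or requester == backend["owner"]):
        raise PermissionError("must be superuser or own the backend")
    return (backend["pid"], SIGINT)
```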


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] New thoughts about indexing cross-type comparisons

2003-09-17 Thread Kevin Brown
Tom Lane wrote:
 Dave Smith [EMAIL PROTECTED] writes:
  My point was that it was inconsistent behaviour. What exactly are you 
  comparing with int2? To me the case without the cast should throw 
  the same error as the statement with the cast.
 
  select * from test where f=1981928928921;
 
 I contend not.  The above is perfectly well defined.  It will always
 return false if f is int2, but that does not mean it should throw an
 error instead.  In any standard programming language, you'd resolve
 the operator by up-converting f to the type of the constant, not by
 trying to down-convert the wider value.  PG happens to have
 implementation reasons to wish to use the variable's datatype instead
 of the constant's, but that doesn't excuse us from obeying the ordinary
 laws of arithmetic.

Hmm...but what if the cast were to return NULL in the event that the cast
fails or cannot be done?  Would that even be reasonable?  I don't know
what the standard says about this so my suggestion may be unreasonable
(and it may break a lot of things as well).

In a way, this would be consistent with the meaning of NULL: no value,
and would also yield the desired effect in the example select (no
matches).
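The semantics being floated can be modeled in a few lines (this is the
hypothetical "failed cast yields NULL" behavior suggested above, not what
PostgreSQL actually does; the function names are invented):

```python
INT2_MIN, INT2_MAX = -32768, 32767

def cast_int2(value):
    """Down-cast to int2, yielding None (SQL NULL) when out of range."""
    return value if INT2_MIN <= value <= INT2_MAX else None

def eq(a, b):
    # SQL three-valued logic: comparison with NULL is unknown,
    # so the row would not match the WHERE clause
    if a is None or b is None:
        return None
    return a == b

# f = 100 (an int2); the constant 1981928928921 cannot be down-cast,
# so the comparison is unknown and the row is filtered out:
no_match = eq(100, cast_int2(1981928928921))
```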


Of course, I could always be off my rocker here.  :-)

-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Stats Collector Error 7.4beta1 and 7.4beta2

2003-09-09 Thread Kevin Brown
Tom Lane wrote:
 [EMAIL PROTECTED] writes:
  This analysis makes sense - I think using memcmp is clearly wrong here.
 
 Yeah, now that I think about it, we're betting on the kernel to
 faithfully zero all unused bits in addrinfo structures.  In an ideal
 world, all kernels would do that, but in the real world it seems like
 a losing bet.

Yeah, I've always been under the impression that it's a bad idea in
general to memcmp() structs, if only because in doing so you make a
lot of implicit assumptions about the structs in question that aren't
necessarily true, especially when dealing with multiple architectures.

Makes me wonder if there are other parts of the code where we're
vulnerable to the same sort of issue...

 I could go for Jan's idea of putting a random key into the messages,
 if anyone feels that we should not trust to the kernel to enforce the
 packet source address restriction.  But the memcmp() test seems a clear
 loser given today's discussions.

The test in the 7.3.x code looked reasonable to me, especially if it's
possible to make it work with IPV6 (if it doesn't already).  It's doing
basically the right thing, at any rate: directly comparing the actual
fields that are relevant.  Does this test represent a significant
performance hit?
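The padding hazard is easy to demonstrate without writing any C, using
ctypes to mirror a struct layout (a sketch assuming a platform where int
is 4-byte aligned, which is true of common ABIs; the struct is a
simplified stand-in for a sockaddr-like type):

```python
import ctypes

class Addr(ctypes.Structure):
    # a char followed by an int leaves padding between the fields
    # on common ABIs, just like the C struct it mirrors
    _fields_ = [("family", ctypes.c_char), ("port", ctypes.c_int)]

size = ctypes.sizeof(Addr)

a = Addr(b"\x02", 5432)            # fresh instance: padding is zeroed

raw = (ctypes.c_ubyte * size)(*([0xFF] * size))  # garbage everywhere
b = Addr.from_buffer(raw)          # struct overlaid on the dirty buffer
b.family = b"\x02"                 # set only the real fields,
b.port = 5432                      # leaving 0xFF in the padding bytes

bytewise_equal = bytes(a) == bytes(b)            # what memcmp() sees
fieldwise_equal = (a.family, a.port) == (b.family, b.port)
```

Both structs hold identical field values, yet the byte-for-byte
comparison fails because of the uninitialized padding -- exactly the
trap a field-by-field test avoids.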




-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] set constraints docs page

2003-09-04 Thread Kevin Brown
Bruce Momjian wrote:
 
 Added to TODO:
 
   * Print table names with constraint names in error messages, or make
 constraint names unique within a schema


Should the TODO also include adding ALTER TABLE x ALTER CONSTRAINT
y RENAME TO z functionality if we don't make constraint names unique
within a schema?


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Date input changed in 7.4 ?

2003-08-28 Thread Kevin Brown
Tom Lane wrote:
 Mendola Gaetano [EMAIL PROTECTED] writes:
  I noticed that some date are not anymore accepted:
  test=# select '18/03/71'::date;
  ERROR:  invalid input syntax for date: 18/03/71
  is this the intended behaviour?
 
 If it does not match your DateStyle setting, then yes.

Umm...I hope this is controllable with a GUC variable then.

There are some cases where it's extremely useful for PostgreSQL to
accept dates of any format it knows about (ambiguities should be
resolved either by looking at the current DateStyle or, failing that, by
applying the recognition in a well-defined order).  In my case I can
probably code around it but it does require some extra effort.  But I
can easily imagine situations in which that wouldn't be an option.


Whatever happened to "be liberal in what you accept and conservative in
what you send"?  :-)





-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] set constraints docs page

2003-08-27 Thread Kevin Brown
Bruce Momjian wrote:
 Kevin Brown wrote:
  The two approaches aren't necessarily mutually exclusive (though SQL99
  compliance on constraint names would obviously make it unnecessary to
  specify a tablename along with a constraint name), so I see little
  problem here.  But the current arrangement is obviously untenable,
  because it allows you to create a situation (multiple constraints by
  the same name) that you can't reasonably extricate yourself from.
 
 Well, it seems if we want to continue to allow the same constraint name
 to be used by different tables in the same schema, we have to print the
 tablename in the error message.  Would someone actually be looking for a
 standards-compliant error string?  We have already extended the standard
 --- either we revert that, or we have to go the entire way and print the
 table name.

If PG were configurable in terms of how it manages constraint names,
then it would depend on how the DBA had the database configured.  With it
configured to disallow name collisions, it would obviously be unnecessary
to report the table name, though I still think it would be useful (if
only because it gives a little extra context to work with).  But if it's
configured to allow name collisions, then it doesn't make sense not to
print the table name in an error message, because that's the only way to
guarantee that the DBA can identify which constraint is being referred to.


The problem as things stand now is that even if we printed the table name
involved, the DBA is placed in a difficult position if the constraint in
question isn't uniquely named -- which is the only case where printing
the table name would really matter.  That's because he can't actually
refer to the constraint in any unique way short of playing with the
system tables; he'd have to rename the constraint first before being
able to really do something with it (is this even possible for him to
do without manipulating system tables?  Is there an ALTER CONSTRAINT?).


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] set constraints docs page

2003-08-19 Thread Kevin Brown
Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
  I object to creating gratuitous incompatibilities with the SQL standard,
  which will obstruct legitimate features down the road.  The SQL standard
  says it is schema.constraint.
 
  Is there a case for enforcing uniqueness on constraint names, then?
 
 Other than SQL92 says so?  Very little.  This seems to me to be a
 design error in the spec.  Per-table constraint names are easier to
 work with --- if they're global across a schema then you have a serious
 problem avoiding collisions.

I assume that SQL99 and later don't specify anything different than what
SQL92 calls for in this regard?

Without any meaningful guidance from the spec, the best we can do is
support per-table constraint names and provide optional (via a GUC
variable) support for SQL92-compliant constraint names.  Let the DBA
decide which (if not both) is best for his situation.  Inasmuch as
one of our selling points is our compliance with the SQL spec, I see
little reason to entirely avoid compliance with the spec on this issue --
just make it possible to do something else when/if necessary.

The two approaches aren't necessarily mutually exclusive (though SQL99
compliance on constraint names would obviously make it unnecessary to
specify a tablename along with a constraint name), so I see little
problem here.  But the current arrangement is obviously untenable,
because it allows you to create a situation (multiple constraints by
the same name) that you can't reasonably extricate yourself from.
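
A minimal SQL sketch of the problem (the table and constraint names here are hypothetical):

```sql
CREATE TABLE parent (id integer PRIMARY KEY);

-- Under per-table naming, two tables can each carry a constraint
-- named fk_parent:
CREATE TABLE orders (parent_id integer
    CONSTRAINT fk_parent REFERENCES parent DEFERRABLE);
CREATE TABLE items  (parent_id integer
    CONSTRAINT fk_parent REFERENCES parent DEFERRABLE);

BEGIN;
-- SQL92 assumes schema-wide unique constraint names, so the standard
-- considers this unambiguous; with per-table names it matches two
-- distinct constraints:
SET CONSTRAINTS fk_parent DEFERRED;
COMMIT;
```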



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] contrib compilation probs

2003-08-14 Thread Kevin Brown
Peter Eisentraut wrote:
 Tom Lane writes:
 
  I was afraid it was something like that.  Can we leave the directory
  structure as-is and just make the .o (and .d) files get built in the
  upper directory, that is
  gcc ... -o english_stem.o snowball/english_stem.c
 
 That will fail for a more basic reason: not all compilers support the -o
 option.

...jaw drops to ground...

I'm astounded.  There are such compilers still in use??  Which ones?



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] reprise on Linux overcommit handling

2003-07-23 Thread Kevin Brown
Bruce Momjian wrote:
 
 Thanks.  Interesting.  Hard to imagine what they were thinking when they
 put this code in.

Way back in the day, when dinosaurs ruled the earth, or at least the
server room, many applications were written with rather bad memory
allocation semantics: they'd grab a bunch of memory and not necessarily use
it for anything.  Typically you could specify a maximum memory
allocation amount for the program, but the problem was that it would
grab exactly that amount up front, when it's obviously better for the
allocation to be a bit more dynamic.

That in itself isn't a terribly bad thing ... if you have enough actual
memory to deal with it.

Problem is, back then most systems didn't have enough memory to deal
with multiple programs behaving that way.

Overcommit was designed to account for that behavior.  It's not ideal at
all but it's better to have that option than not.


Overcommit isn't really necessary today because of the huge amount of
memory that you can put into a system for cheap (HP servers excluded,
they want some serious cash for memory).
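
For what it's worth, the Linux overcommit policy has since become tunable; a sketch of the knob (assuming a 2.6-era or later kernel; see the kernel's overcommit-accounting documentation for the authoritative values):

```conf
# /etc/sysctl.conf
# vm.overcommit_memory = 0   heuristic overcommit (the default)
# vm.overcommit_memory = 1   always overcommit, never refuse an allocation
# vm.overcommit_memory = 2   strict accounting: refuse allocations that
#                            would exceed swap + overcommit_ratio% of RAM
vm.overcommit_memory = 2
```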


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Two weeks to feature freeze

2003-06-26 Thread Kevin Brown
The Hermit Hacker wrote:
 On Wed, 25 Jun 2003, Kevin Brown wrote:
 
  So...would it make sense to create a gborg project to which people who
  have written their own test suites can contribute whatever code and data
  they feel comfortable releasing?  As a gborg project, it would be
  separate from the main PG distribution and would thus have no impact on
  the build process or anything like that.  But at the same time, if there
  are any ideas on testing that people have had, they could be shared with
  others through that mechanism.
 
  And any tests which prove to be particularly useful could make their way
  into the PG distribution if people here wish.
 
  Of course, like anything else this could be a bad (or perhaps redundant)
  idea.  :-)
 
 It doesn't sound like a bad idea ... but, it pretty much comes down to the
 original thread: are you willing to step up and maintain such a project?

Yes, I am ("How hard can it be?" he asks himself, knowing all the
while that it's a really bad idea to be asking that question.  :-).
But I haven't the faintest idea of how or where to even start, so
pointers would be appreciated.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Two weeks to feature freeze

2003-06-25 Thread Kevin Brown
Tom Lane wrote:
 Josh Berkus [EMAIL PROTECTED] writes:
  3) Dann is proposing not just a feature but sweeping changes to the way our 
  community works, despite having been a member of this community for about 3 
  weeks total.
 
 In Dann's defense, I didn't think I heard him proposing that we get rid
 of our existing testing methods, but rather that we see if we can't
 supplement them with something more formal.  This strikes me as a
 perfectly reasonable proposal.  However, he hasn't succeeded in
 convincing anyone else to put their time into it (for reasons that
 are also perfectly reasonable, namely that we find that our existing
 methods do pretty well, and we don't have the manpower to create a large
 formal testing structure ... even if we thought it would repay the effort,
 which many of us doubt).  So it's his itch to scratch, or not.
 
 Unless there's something more profitable to be said on the subject,
 could we drop the thread?

One thing that came out of the thread is the fact that many people who
use PostgreSQL do testing in many different ways, and that much of the
stability of PostgreSQL can be attributed to that.

It occurs to me that there may be (perhaps) a lot of duplication of
effort there that could be reduced a bit.

So...would it make sense to create a gborg project to which people who
have written their own test suites can contribute whatever code and
data they feel comfortable releasing?  As a gborg project, it would be
separate from the main PG distribution and would thus have no impact on
the build process or anything like that.  But at the same time, if there
are any ideas on testing that people have had, they could be shared with
others through that mechanism.

And any tests which prove to be particularly useful could make their
way into the PG distribution if people here wish.


Of course, like anything else this could be a bad (or perhaps redundant)
idea.  :-)


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Updating psql for features of new FE/BE protocol

2003-06-25 Thread Kevin Brown
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I think 'is_superuser' is more appropriate.
 
 Okay, fine.
 
 I forgot one other thing that is available from the recent libpq
 additions and needs to be exposed by psql: error message verbosity
 setting.
 
 What's there now is described in
 http://candle.pha.pa.us/main/writings/pgsql/sgml/libpq-control.html
 to wit, terse, default, and verbose options.
 
 We have the choice of exposing this as a backslash command or as a
 special variable in psql --- any preferences?

My preference for such things is to use variables.  It seems to me that
backslash commands should be reserved for actual actions, e.g. "show
me the list of tables" or "import data from stdin".  It seems to
me that variables are a natural way of representing the state of psql,
and that changing that state should be accomplished through the standard
mechanisms, i.e. \set.
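
For instance, the variable-based interface might look something like this in a psql session (the variable name is illustrative only; nothing here is settled):

```
-- accepted values might be terse, default, verbose
\set VERBOSITY verbose
SELECT * FROM no_such_table;
\set VERBOSITY terse
```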

 Also, I would like to provide the same set of options w.r.t. messages
 logged in the server log.  Here there is an additional frammish that
 could be imagined, ie, more detail for more-serious errors.  Any
 opinions about what it should look like?

Not sure exactly what you're asking for here.  If you're asking what
additional detail should be included for more serious errors, I'd say
it should be things like the actual text of the query being executed
and perhaps the file and line number of the code that threw the error.
A stack trace could be useful in the most extreme cases (and, obviously,
only when verbosity is maximized), too, but that may be too much to
ask for.  :-)


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Updating psql for features of new FE/BE protocol

2003-06-25 Thread Kevin Brown
[EMAIL PROTECTED] wrote:
  [EMAIL PROTECTED] writes:
   Is it too late to suggest that there be a way to have output displayed
   on screen AND output to a file?  
  
  tee perhaps?
 
 Tee ALMOST does it.  Try doing a \d while tee'ing the output, for example.

Try using script (start it from the shell before invoking psql).
It sounds like it'll do much of what you're after.

Screen also has a logging option which may work just as well as, if
not better than, script, and has the additional advantage that the
session will continue (and can be reattached to) even if your terminal
window dies for whatever reason.
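
A sketch of the script approach (the database name and log file are placeholders; this is util-linux script syntax, and BSD script takes its arguments differently):

```shell
# Record everything shown on the terminal, including psql's \d output,
# which a plain "psql | tee" pipeline mangles.
script -q -c "psql mydb" psql-session.log
```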


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Two weeks to feature freeze

2003-06-21 Thread Kevin Brown
Tom Lane wrote:
 I have been through crash-me in some detail, and it left a very bad
 taste in my mouth.  Don't bother holding it up as an example of good
 practice.

You seem to miss Dann's point.  The specific implementation of crashme
is undoubtedly flawed in a number of ways, but the idea is very useful
as part of an acceptance testing suite.  In short, it would probably
be beneficial to us to fix crashme so that it tests the proper,
standards-compliant things and reports the actual results, and then
include it in the test suite.

Indeed, we could even go so far as to use it for our own marketing
purposes!  Have it cite, for each test, which part of the SQL spec it's
testing and what the result should be.



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Two weeks to feature freeze

2003-06-21 Thread Kevin Brown
I wrote:
 Tom Lane wrote:
  I have been through crash-me in some detail, and it left a very bad
  taste in my mouth.  Don't bother holding it up as an example of good
  practice.
 
 You seem to miss Dann's point.  The specific implementation of crashme
 is undoubtedly flawed in a number of ways, but the idea is very useful
 as part of an acceptance testing suite.  In short, it would probably
 be beneficial to us to fix crashme so that it tests the proper,
 standards-compliant things and reports the actual results, and then
 include it in the test suite.

Actually, now that I think about it, it would probably be more beneficial
to merge any correct tests that we aren't already performing into our
existing regression test framework, provided that the end result doesn't
take too long to run (as you pointed out elsewhere, regression tests
that take a really long time to run simply won't be run by most people,
except perhaps in a tinderbox type of environment).

Overall, it might be of some benefit to mark individual regression tests
with a priority, and then make it possible to run only those tests of
a specified priority or higher.  That way, the individual developer may
decide for himself which group of regression tests to run based on the
amount of time he's willing to let it take and how much hardware he has
to throw at it.  And at the same time, it would make it easier for new
tests to be included in the suite without worrying about the impact it
would have on people running the tests.
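
A toy sketch of how such priority filtering could work over a schedule file (the file format, names, and priorities are all invented for illustration):

```shell
# Hypothetical schedule: one "priority<TAB>testname" pair per line.
printf '1\tbasic_types\n3\tbig_joins\n2\tindexes\n' > test_schedule

# Run only the tests at or above the requested minimum priority.
min_prio=2
awk -F'\t' -v min="$min_prio" '$1+0 >= min+0 { print $2 }' test_schedule
# prints big_joins and indexes, skipping the priority-1 test
```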


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Two weeks to feature freeze

2003-06-20 Thread Kevin Brown
Dann Corbit wrote:
  -Original Message-
  From: Jason Earl [mailto:[EMAIL PROTECTED] 
  Sent: Friday, June 20, 2003 10:45 AM
  To: The Hermit Hacker
  Cc: Robert Treat; Tom Lane; Christopher Kings-Lynne; Bruce 
  Momjian; PostgreSQL-development
  Subject: Re: [HACKERS] Two weeks to feature freeze

[...]

  Why couldn't you just release the win32 version of 7.4 when 
  it was finished.  If it takes an extra month then that just 
  gives you guys the chance to circulate *two* press releases.  
  The Native Win32 port is likely to make a big enough splash 
  all by itself.
 
 A formal release needs a big testing effort.  Two separate releases will
 double the work of validation.

That's true in the general case.

But in this case we're talking about releasing for a completely new
platform.  That's much different than doing another release for the same
platform set.

The testing that needs to be done for the Win32 release has to be done
separately *anyway*, so there's nothing lost by releasing the Win32 port
separately.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] contrib and licensing

2003-04-06 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  But if both of these paragraphs are simultaneously true, then why put
  *anything* in contrib?
 
 Don't say that too loudly, or Marc may take it upon himself to make it
 happen ;-).

Well, I hope he's not so eager to do so that he does it before this
licensing thing is hammered out.  :-)

I think having contrib is a very good idea, but that's only *because*
it currently allows non-BSD licenses.  Take that away, and I think
you've taken away the entire purpose for contrib (versus simply
including something in the main source tree with a configure switch to
enable/disable it).
 
 There are a number of reasons to keep things in contrib.  One is that
 the code may be too tightly tied to backend innards to be appropriate to
 maintain separately (the GIST extension modules are a good example, and
 most of the modules that include server-side code are easier to maintain
 with the server than not).  

Shouldn't this stuff (if we decide to make contrib BSD-only) become
part of the main source tree, then, with a configure option to
enable/disable it at compile time?

 Another is that small modules may not have
 enough critical mass to get maintained at all, if they're kicked out to
 live or die on their own.

That's certainly a problem, but only if the modules are also tightly
tied to the backend.  But then, does that mean that anything which is
strongly tied to the backend must be included in contrib, no matter
how unpopular?  Aside from that and licensing, what other criteria
would you use to decide on such inclusion?

  Otherwise, perhaps you're more concerned about the licensing issues in
  contrib than you need to be?
 
 The way I see it, the only BSD stuff in contrib rule is designed
 precisely to save us from having to think too hard about licensing
 issues.  I'm not interested in getting into lawyeristic arguments
 about how it's okay to distribute something with a different license
 if only we don't do XYZ with it.

Yeah, but what I'm saying is that *we* don't have to think about this
lawyeristic stuff regardless.  All we have to care about is whether or
not the contrib item in question has a license that allows source
distribution.  The rest of it is a problem for whoever wants to build
binary distributions, and such people *already* know what the issues
are and know that they have to think about them.

The only case where that might not be true is the case of the
individual who is building a binary distribution for use within his
own group.  But such a person is easily covered by simply disabling
compilation of contrib by default.  If he has to go to the trouble of
enabling that explicitly, then he can be warned that the stuff in
contrib might be covered by a different license.  But such a person is
likely to be in a role where he's had to deal with licensing issues
elsewhere, so it's more likely than not that he'll be aware of such
things.  The only thing he needs to be made aware of is that contrib
contains items that fall under different licenses.  That alone isn't,
IMO, justification for removing non-BSD items from contrib.

So in the end, keeping contrib BSD-only doesn't help *us*, it only
helps the people who build binary distributions.  But because they're
already used to dealing with the problem, they don't need our help on
this.

And that means that kicking non-BSD stuff out of contrib doesn't
really help anyone very much, if at all ... but it does hurt us in that
some potentially very valuable things will no longer be considered for
inclusion in the distribution.  So from here it looks like there's
more (perhaps much more) to be lost by making contrib BSD-only than
there is to be gained.


It would be one thing if we had a lot of people clamoring for removal
of non-BSD stuff from contrib because they'd actually been burned by
licensing issues.  But I haven't seen anything to that effect on this
list, at least, and we've had at least one GPL item in there
(pgcrypto) since late 2000.





-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] Deadlock while doing VACUUM??

2003-04-04 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  When a heavy INSERT or UPDATE load on a table is occurring (lots of
  quick INSERTs or UPDATEs within a single transaction), a VACUUM
  ANALYZE (or just straight VACUUM) has a really good chance (10% or so)
  of causing either the INSERT/UPDATE or the VACUUM to fail with a
  deadlock detected error.
 
 I was unable to replicate this in CVS tip, using pgbench -c 10 -t 1000
 as the load generator.

I finally got 7.3.2 installed, and confirmed that the problem does not
exist on that version.  So this is something that's limited to the
7.2.x tree.  Which, I guess, means that it's not going to get fixed
for that tree (I assume that 7.2.x is effectively end-of-lifed)...

On 7.3.2, a concurrent VACUUM appears to slow inserts down A LOT, but
it won't deadlock them.


-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] Detecting corrupted pages earlier

2003-04-04 Thread Kevin Brown
Andrew Sullivan wrote:
 On Thu, Apr 03, 2003 at 02:39:17PM -0500, Tom Lane wrote:
  just not listing zero_damaged_pages in postgresql.conf.sample?  We
  already have several variables deliberately not listed there ...
 
 Hey, that might be a good solution.  Of course, it doesn't solve the
 doomsday device problem, but nobody who uses it can complain that
 they didn't know what the thing would do.

Shouldn't each variable listed in postgresql.conf.sample have comments
right above it explaining what it does anyway?  A self-documenting
configuration file is a really handy thing to have.

If it's documented that way in postgresql.conf.sample (and adjacent to
the variable itself, even) then nobody who changed it would have
grounds to complain about not knowing what the variable did.

I'm much more in favor of being lucid and upfront about everything
than hiding things just because they might be dangerous.


That said, all sorts of warnings and such should be in that bit of
documentation in postgresql.conf.sample, so that it's made abundantly
clear that this particular option is not one to be messing with except
when you know exactly what you're doing...
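
As a concrete sketch, the postgresql.conf.sample entry for the setting under discussion could read something like this (the wording is illustrative):

```conf
# zero_damaged_pages (boolean, default off)
#
# WARNING: When on, the server zeroes out and discards any page whose
# header is found to be corrupted, DESTROYING whatever data was on it.
# Use only as a last resort to salvage data from an already-corrupt
# table, and turn it back off immediately afterward.
#zero_damaged_pages = false
```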


-- 
Kevin Brown   [EMAIL PROTECTED]



