date:20030622

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Mike Mascari

- Original Message -
From: Bruce Momjian [EMAIL PROTECTED]

 Tom said that our low-hanging fruit is gone and only hard
items are
 left.  This is certainly true.  What is hard to accept is that
those big
 items take _weeks_ of focused development, and we just don't
have enough
 full-time developers who can spend that amount of time to do
them.  The
 sad truth is that there is alway something _else_ to do,
rather than
 block out weeks to code a complex feature.  And these are
usually
 features that can't be done incrementally, but require a huge
input of
 time before there is any payback.

 I tried with Win32, and spent a few weeks getting us closer,
but my
 other work of housecleaning (email/patches/cleanup), and
marketing
 (speaking and tutorial preparation) just make it impossible to
spend the
 time needed to complete a big item.  And people were rightly
upset that
 the patches weren't getting applied or cleanup done in a
timely manner.

 It is depressing.

I was disappointed that Satoshi Nagayasu's two-phase commit
patches seemed to be implicitly rejected by lack of an
enthusiastic response by any of the core members. Distributed
query (not replication) would have been a very nice feature.
It's what separates, in part, Oracle Enterprise Edition from the
Standard Edition, and it appeared someone (Satoshi Nagayasu) was
more than willing to get the ball rolling. But the flight path
bothered some I guess so we got nothin'

Mike Mascari
[EMAIL PROTECTED]



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Andreas Pflug

Tom Lane wrote:

I spent weeks doing hash aggregates, weeks doing IN-subselect
optimization, and am in the middle of many weeks on FE/BE protocol
improvement.  I am sorry that you don't see these as killer features
... but they are all things that we desperately needed to do.
 

For me, the 7.4 enhancements are essential, the join optimizations make 
the difference between app works and app doesn't work, because 
queries (that can't be changed) that previously ran for ages for 
non-obvious reasons now speed up to 1 second. The planner reached a 
new level of maturity.

I'd recommend continuing enhancement work on pgsql, and if a majority 
feels that features are so killing the version is bumped up one major.
I wouldn't see a yet another platform as a reason for this, rather 
something that vastly extends the field of operation (2PC was mentioned, 
maybe PITR).

Regards,
Andreas
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Mike Mascari wrote:
 I was disappointed that Satoshi Nagayasu's two-phase commit
 patches seemed to be implicitly rejected by lack of an
 enthusiastic response by any of the core members. Distributed
 query (not replication) would have been a very nice feature.
 It's what separates, in part, Oracle Enterprise Edition from the
 Standard Edition, and it appeared someone (Satoshi Nagayasu) was
 more than willing to get the ball rolling. But the flight path
 bothered some I guess so we got nothin'

I sure want two-phase commit.  I don't remember it as being rejected,
and we certainly need it, independent of replication.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

Tom Lane wrote:
BTW, I would not approve of a response along the lines of can't you
#ifdef to the point that there are no code changes in the Unix builds?
No you can't, unless you want to end up with an unmaintainable mess 
of #ifdef spaghetti.  The thing that makes this hard is the tradeoff
between making the code readable and maintainable (which requires
sharing as much code as possible across platforms) vs isolating
platform-specific considerations.  Programming at this level is not
a science but an art form, and it's very hard to get it right the first
time --- especially when none of us have access to all the platforms
that the code must ultimately work on.
Exactly my point and the reason I am doing the entire fork+exec stuff 
over again. Bruce nagged me endlessly to commit the broken parts I had 
and fix them later. I never agreed with that philosophy because in my 
experience the worst workarounds live forever.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

Dann Corbit wrote:
PostgreSQL's regression tests (IMHO) are much better than 
MySQL's crash-me scripts.
They are less thorough in coverage, but not too bad.
Right, we are still missing this test that proves 10,000 CREATE+DROP 
TABLE will ensure that PostgreSQL is slower than MySQL, if you don't 
VACUUM the catalog ...

Here is what I suggest:

PostgreSQL has an excellent programming team.  Why not try to recruit a
similar testing team?  I think it would strongly differentiate the tool
set from similar free stuff.
Perhaps all that is needed is some sort of automated, formal reporting
procedure.  For example, a large test set might be created that runs a
thorough regression feature list.  When the test completes, a data file
is emailed to some central repository, parsed, and stored in a database.
Here is what I suggest:

Get started :-)

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

Dann Corbit wrote:
-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 

Are you volunteering to create it?  Step right up.
No.  And as an outsider, I rather doubt if any procedures I developed
would be taken very seriously.  If such procedures are to be developed,
I suspect that they will have to be developed from within if they are to
be successful.
What do you think is the way to become an insider?

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

Alvaro Herrera wrote:
Also they don't test things they don't support.  Is there a test for
subselects?  What about concurrency?  Transactional issues?  What about
performance when they have their transaction support enabled?
Sure don't they. Like their NUMERIC data type, that can *store* any 
precision, but when you actually calculate with it, it converts to 
floating point internally ... that's not only a spec violation, in many 
countries that's a violation of law if used by a bookkeeping system.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Rod Taylor

On Sun, 2003-06-22 at 07:44, Bruce Momjian wrote:
 Mike Mascari wrote:
  I was disappointed that Satoshi Nagayasu's two-phase commit
  patches seemed to be implicitly rejected by lack of an
  enthusiastic response by any of the core members. Distributed

They weren't ready to be committed at the time, nor are they now.

The hardest parts are still to come (resume, forget, etc.).

I believe he is still working on the third phase:

http://snaga.org/pgsql/


-- 
Rod Taylor [EMAIL PROTECTED]

PGP Key: http://www.rbt.ca/rbtpub.asc


signature.asc
Description: This is a digitally signed message part

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Jan Wieck wrote:
 Tom Lane wrote:
  BTW, I would not approve of a response along the lines of can't you
  #ifdef to the point that there are no code changes in the Unix builds?
  No you can't, unless you want to end up with an unmaintainable mess 
  of #ifdef spaghetti.  The thing that makes this hard is the tradeoff
  between making the code readable and maintainable (which requires
  sharing as much code as possible across platforms) vs isolating
  platform-specific considerations.  Programming at this level is not
  a science but an art form, and it's very hard to get it right the first
  time --- especially when none of us have access to all the platforms
  that the code must ultimately work on.
 
 Exactly my point and the reason I am doing the entire fork+exec stuff 
 over again. Bruce nagged me endlessly to commit the broken parts I had 
 and fix them later. I never agreed with that philosophy because in my 
 experience the worst workarounds live forever.

I wouldn't say nagging ... I would say NAGGING.  :-)

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sun, 22 Jun 2003, Bruce Momjian wrote:

 Mike Mascari wrote:
  I was disappointed that Satoshi Nagayasu's two-phase commit
  patches seemed to be implicitly rejected by lack of an
  enthusiastic response by any of the core members. Distributed
  query (not replication) would have been a very nice feature.
  It's what separates, in part, Oracle Enterprise Edition from the
  Standard Edition, and it appeared someone (Satoshi Nagayasu) was
  more than willing to get the ball rolling. But the flight path
  bothered some I guess so we got nothin'

 I sure want two-phase commit.  I don't remember it as being rejected,
 and we certainly need it, independent of replication.

I don't recall the patch itself :(

Mike, do you recall the date(s) for this?  Reasons for rejections?


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sat, 21 Jun 2003, Bruce Momjian wrote:

 That's a tough call.  I do worry about readability.  We have made Win32
 changes, and they aren't ifdefs, and we still have a running system, and
 I think we can do that for PITR too. I think the big issue, which may be
 your point, is to get incremental work into CVS as soon as possible so
 we continue to take small steps.

Exactly ... its gotten to the point that the changes needed are large, so
ppl are waiting until its all finished before submitting it ...


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

The reason I mention it is that Postgres already supports
O_DIRECT I think on some other platforms (for whatever
reason).
   
   [ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
   source tree is in TODO:
   
   * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
   
   I personally disagree with this TODO item for the same reason
   Sean cited: Postgres is designed and tuned to rely on OS-level
   disk caching, and bypassing that cache is far more likely to
   hurt our performance than help it.
  
  DB2 and Oracle, from memory, allow users to pass hints to the
  planner to use/not use file system caching. This could be useful
  if you had an application retrieving a large amount of data on an
  adhoc basis. The large retrieval would empty out the disk cache
  there by negatively impacting upon other applications operating on
  data which should be cached.
  
 I've recently been bitten by this. On DB2, I could change what
 bufferpool the large tables were using and set it fairly small, but
 obviously not an option with PGSQL. But, if pgsql could stop caching
 from occuring on user-specified queries, large table or index scans,
 etc., it would be very helpful.

Actually, now that I think about this, if the planner is going to read
more than X number of bytes as specified in a GUC, it would be useful
to have the fd marked as O_DIRECT to avoid polluting the disk
cache... I have a few tables with about 300M rows (~9GB on disk) that
I have to perform nightly seq scans over for reports and it does wipe
out some of the other fast movers that come through and depend on the
disk cache to be there for their speed.  Because they're performed in
the middle of the night, I don't care that much, but my avg query
times during that period of time are slower... whether it's load or
the disk buffer being emptied and having to be refilled, I'm not sure,
but thinking about it, use of a GUC threshold to have an FD marked as
O_DIRECT does make sense (0 == disabled and the default, but tunable
in Kbytes as an admin sees fit) and could be nice for big queries that
have lots of smaller queries running around at the same time.

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] compile failure on cvs tip --with-krb5

2003-06-22 Thread Sean Chittenden

  This change (I'm sure this will wrap poorly -- sorry):
  http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/include/libpq/pqcomm.h.diff?r1=1.85r2=1.86
  
  modified SockAddr, but no corresponding change was made here 
  (fe-auth.c:612):
  
case AUTH_REQ_KRB5:
  #ifdef KRB5
if (pg_krb5_sendauth(PQerrormsg, conn-sock, conn-laddr.in,
 conn-raddr.in,
 hostname) != STATUS_OK)
  
  It's not obvious to me what the change ought to be though.
 
 Please try the attached patch.
 
 I'll try to change kerberos 4 later if I can find some
 documentation about it.  Especially the krb_sendauth() function.
 
 Does Kerberos 4 support other protocols than ipv4?

Not that I'm aware of.  -sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Fri, 20 Jun 2003, Dann Corbit wrote:

 Designing tests is busywork.  Desiging tests is boring.  Nobody wants to
 design tests, let alone interpret the results and define correct
 baselines.  But testing is very, very important.

But we do do testing ... we even design testing (in the form of the
regression tests) ... we just don't do testing that you personally approve
of ... and, from what I've seen so far, you aren't willing to actually put
*your* time where your mouth is ... design some tests and submit them to
us ... if they are valid, they will get used ...

If you feel that crash-me is a valid starting point, start there and see
where it takes you ...

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sat, 21 Jun 2003, Dann Corbit wrote:

 It seems pretty clear that there are warts on the Crashme test.
 Perhaps 70% or so is truly useful.  Maybe the useful subset could be
 approximated or modified to be useful as a general tool set.

And we all wait with baited breath for you to develop and submit such
tests ...


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Mike Mascari

The Hermit Hacker wrote:
 On Sun, 22 Jun 2003, Bruce Momjian wrote:
 
Mike Mascari wrote:

I was disappointed that Satoshi Nagayasu's two-phase commit
patches seemed to be implicitly rejected by lack of an
enthusiastic response by any of the core members. Distributed
query (not replication) would have been a very nice feature.
It's what separates, in part, Oracle Enterprise Edition from the
Standard Edition, and it appeared someone (Satoshi Nagayasu) was
more than willing to get the ball rolling. But the flight path
bothered some I guess so we got nothin'

I sure want two-phase commit.  I don't remember it as being rejected,
and we certainly need it, independent of replication.
 
 I don't recall the patch itself :(
 
 Mike, do you recall the date(s) for this?  Reasons for rejections?

I choose my words poorly. A discussion arose regarding the 7.4
protocol changes. I suggested looking forward to allow for a 2PC
implementation. Satoshi Nagayasu remarked about the work done on 2PC
and posted a link to patches:

http://snaga.org/pgsql/pgsql-20021025.tgz

The thread was here:

http://archives.postgresql.org/pgsql-hackers/2002-11/msg00143.php

Various people critiqued the work that had been done - protocol change
instead of a purely statement-driven implementation, the use of 2PC
for sync. replication, etc. And that was the last (and first, IIRC)
post from Satoshi Nagayasu. I was worried that PostgreSQL lost the
opportunity to have a 2PC implementation, because no one followed up,
allowing it to die on the vine.

I have learned from Rod Taylor that lack of posts on hackers doesn't
mean lack of work:

They weren't ready to be committed at the time, nor are they now.
The hardest parts are still to come (resume, forget, etc.).
I believe he is still working on the third phase:

http://snaga.org/pgsql/

-- Rod Taylor [EMAIL PROTECTED]

So I stand corrected.

Mike Mascari
[EMAIL PROTECTED]






---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

The Hermit Hacker wrote:
On Fri, 20 Jun 2003, Dann Corbit wrote:

Designing tests is busywork.  Desiging tests is boring.  Nobody wants to
design tests, let alone interpret the results and define correct
baselines.  But testing is very, very important.
But we do do testing ... we even design testing (in the form of the
regression tests) ... we just don't do testing that you personally approve
of ... and, from what I've seen so far, you aren't willing to actually put
*your* time where your mouth is ... design some tests and submit them to
us ... if they are valid, they will get used ...
If you feel that crash-me is a valid starting point, start there and see
where it takes you ...
Not that fast! I didn't take the time to check but it wouldn't surprise 
me if MySQL's crash-me is GPL'd and copyright MySQL AB. That's not an 
optimal point to start PostgreSQL test code from, is it?

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sun, 22 Jun 2003, Jan Wieck wrote:

 The Hermit Hacker wrote:
  On Fri, 20 Jun 2003, Dann Corbit wrote:
 
  Designing tests is busywork.  Desiging tests is boring.  Nobody wants to
  design tests, let alone interpret the results and define correct
  baselines.  But testing is very, very important.
 
  But we do do testing ... we even design testing (in the form of the
  regression tests) ... we just don't do testing that you personally approve
  of ... and, from what I've seen so far, you aren't willing to actually put
  *your* time where your mouth is ... design some tests and submit them to
  us ... if they are valid, they will get used ...
 
  If you feel that crash-me is a valid starting point, start there and see
  where it takes you ...

 Not that fast! I didn't take the time to check but it wouldn't surprise
 me if MySQL's crash-me is GPL'd and copyright MySQL AB. That's not an
 optimal point to start PostgreSQL test code from, is it?

I didn't say to copy it, but if the format is what Dann feels is required
to be taken seriously, it does give a starting point to work from ...

the thing is, as I think it was Tom that pointed out, the crash-me is more
a feature tester then anything ... but Dann appears to be stuck on it as
the 'be all, end all of testing suites' ...

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Peter Eisentraut

Bruce Momjian writes:

 I have added a cleaned up version of this to CVS as src/tools/pgtest.

This seems to be a platform-specific reimplementation of 'make clean; make
check'.  Why bother?

-- 
Peter Eisentraut   [EMAIL PROTECTED]


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Access to transaction status

2003-06-22 Thread Christian Plattner


- Original Message - 
From: Jeroen T. Vermeulen [EMAIL PROTECTED]

 I ran into the same need (Bruce, we discussed this at FOSDEM in Brussels
 this February) for libpqxx.  My code tries to compensate for the
 possibility that the backend connection is lost while waiting for a reply
 to a COMMIT.  The way I worked around it was to create a special record
 at the beginning of the transaction, in a dedicated table that's
 effectively a custom transaction log.  If the record is still there when
 I reconnect, the transaction committed.  If not, it didn't.

 Obviously this leaves some garbage collection issues, so I'd be really
 happy with a way to go back to the server after my connection is lost
 and find out whether my transaction committed or not.

I see a race condition in this approach: if you reconnect too fast, and the
backend which actually should commit is still in progress (assume it takes a
while to commit for whatever reasons) you get the impression that it did not
commit - and a short time later the backend will commit... (before noticing
that the client connection was lost).

A safe method would be to shut down all backends (or is there another method
to safely abort all transactions?), then start the backends again, and then
read the table with the special records. In this way you would be sure that
your transaction is not in progress while you're inspecting the table.
Ofcourse, this approach is not very fast and may abort alot of
transactions - but if consistency is more important for you than anything
else...

- Christian



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: tsearch V2 (Was: Re: [HACKERS] Two weeks to feature freeze)

2003-06-22 Thread Bruce Momjian

The Hermit Hacker wrote:
 
 And, actually, for some reason I hadn't thought of the tsearch as being
 another 'INDEX' type ... I crawl back over and be quiet now :)
 
 Oleg, as far as commits are concerned, I have no problems with extending
 the privileges to one of your guys for this, just email me seperately who,
 and I'll get it setup ...

That would help me too.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] interval's and printing...

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Larry Rosenman [EMAIL PROTECTED] writes:
  Why does the interval type not print seconds when they are zero?
 
 Seems like a bug to me too.  Anyone think it's not?
 
 Note this only occurs with DateStyle = ISO, the other datestyles
 use a different format for intervals.

Clearly seems like a bug to me.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian


What you really want is Solaris's free-behind, where it detects if a
scan is exceeding a certain percentage of the OS cache and moves the
pages to the _front_ of the to-be-reused list.  I am not sure what other
OS's support this, but we need this on our own buffer manager code as
well.

Our TODO already has:

* Add free-behind capability for large sequential scans (Bruce)

Basically, I think we need free-behind rather than O_DIRECT.

---

Sean Chittenden wrote:
 The reason I mention it is that Postgres already supports
 O_DIRECT I think on some other platforms (for whatever
 reason).

[ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
source tree is in TODO:

* Consider use of open/fcntl(O_DIRECT) to minimize OS caching

I personally disagree with this TODO item for the same reason
Sean cited: Postgres is designed and tuned to rely on OS-level
disk caching, and bypassing that cache is far more likely to
hurt our performance than help it.
   
   DB2 and Oracle, from memory, allow users to pass hints to the
   planner to use/not use file system caching. This could be useful
   if you had an application retrieving a large amount of data on an
   adhoc basis. The large retrieval would empty out the disk cache
   there by negatively impacting upon other applications operating on
   data which should be cached.
   
  I've recently been bitten by this. On DB2, I could change what
  bufferpool the large tables were using and set it fairly small, but
  obviously not an option with PGSQL. But, if pgsql could stop caching
  from occuring on user-specified queries, large table or index scans,
  etc., it would be very helpful.
 
 Actually, now that I think about this, if the planner is going to read
 more than X number of bytes as specified in a GUC, it would be useful
 to have the fd marked as O_DIRECT to avoid polluting the disk
 cache... I have a few tables with about 300M rows (~9GB on disk) that
 I have to perform nightly seq scans over for reports and it does wipe
 out some of the other fast movers that come through and depend on the
 disk cache to be there for their speed.  Because they're performed in
 the middle of the night, I don't care that much, but my avg query
 times during that period of time are slower... whether it's load or
 the disk buffer being emptied and having to be refilled, I'm not sure,
 but thinking about it, use of a GUC threshold to have an FD marked as
 O_DIRECT does make sense (0 == disabled and the default, but tunable
 in Kbytes as an admin sees fit) and could be nice for big queries that
 have lots of smaller queries running around at the same time.
 
 -sc
 
 -- 
 Sean Chittenden
 
 ---(end of broadcast)---
 TIP 5: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faqs/FAQ.html
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Peter Eisentraut wrote:
 Bruce Momjian writes:
 
  I have added a cleaned up version of this to CVS as src/tools/pgtest.
 
 This seems to be a platform-specific reimplementation of 'make clean; make
 check'.  Why bother?

Marc requested it.  Is there anything platform-specific except the greps?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

 What you really want is Solaris's free-behind, where it detects if a
 scan is exceeding a certain percentage of the OS cache and moves the
 pages to the _front_ of the to-be-reused list.  I am not sure what
 other OS's support this, but we need this on our own buffer manager
 code as well.
 
 Our TODO already has:
 
   * Add free-behind capability for large sequential scans (Bruce)
 
 Basically, I think we need free-behind rather than O_DIRECT.

I suppose, but you've already polluted the cache by the time the above
mentioned mechanism kicks in and takes effect.  Given that the planner
has an idea of how much data it's going to read in in order to
complete the query, seems easier/better to mark the fd O_DIRECT.
*shrug*

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Sean Chittenden wrote:
  What you really want is Solaris's free-behind, where it detects if a
  scan is exceeding a certain percentage of the OS cache and moves the
  pages to the _front_ of the to-be-reused list.  I am not sure what
  other OS's support this, but we need this on our own buffer manager
  code as well.
  
  Our TODO already has:
  
  * Add free-behind capability for large sequential scans (Bruce)
  
  Basically, I think we need free-behind rather than O_DIRECT.
 
 I suppose, but you've already polluted the cache by the time the above
 mentioned mechanism kicks in and takes effect.  Given that the planner
 has an idea of how much data it's going to read in in order to
 complete the query, seems easier/better to mark the fd O_DIRECT.
 *shrug*

_That_ is an excellent point.  However, do we know at the time we open
the file descriptor if we will be doing this?  What about cache
coherency problems with other backends not opening with O_DIRECT?  And
finally, how do we deal with the fact that writes to O_DIRECT files will
wait until the data hits the disk because there is no kernel buffer cache?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

   What you really want is Solaris's free-behind, where it detects
   if a scan is exceeding a certain percentage of the OS cache and
   moves the pages to the _front_ of the to-be-reused list.  I am
   not sure what other OS's support this, but we need this on our
   own buffer manager code as well.
   
   Our TODO already has:
   
 * Add free-behind capability for large sequential scans (Bruce)
   
   Basically, I think we need free-behind rather than O_DIRECT.
  
  I suppose, but you've already polluted the cache by the time the
  above mentioned mechanism kicks in and takes effect.  Given that
  the planner has an idea of how much data it's going to read in in
  order to complete the query, seems easier/better to mark the fd
  O_DIRECT.  *shrug*
 
 _That_ is an excellent point.  However, do we know at the time we
 open the file descriptor if we will be doing this?

Doesn't matter, it's an option to fcntl().

 What about cache coherency problems with other backends not opening
 with O_DIRECT?

That's a problem for the kernel VM, if you mean cache coherency in the
VM.  If you mean inside of the backend, that could be a stickier
issue, I think.  I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider.  Is the cache a write behind cache or is it a read
through cache?  If it's a read through cache, which I think it is,
then the backend would have to dirty all cache entries pertaining to
the relations being opened with O_DIRECT.  The use case for that
being:

1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison or
   update the cache, actually, and act as a write through cache if the
   data is in the cache)
5) the same few rows are read in again
6) transaction is committed

Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem.  Please enlighten if there's a
different case that would need to be taken into account.  I can't
imagine ever wanting to write out data using O_DIRECT and think that
it's a read only optimization in an attempt to minimize the turn over
in the OS's cache.  From fcntl(2):

 O_DIRECT Minimize or eliminate the cache effects of reading and writ-
  ing.  The system will attempt to avoid caching the data you
  read or write.  If it cannot avoid caching the data, it will
  minimize the impact the data has on the cache.  Use of this
  flag can drastically reduce performance if not used with
  care.


 And finally, how do we deal with the fact that writes to O_DIRECT
 files will wait until the data hits the disk because there is no
 kernel buffer cache?

Well, two things.

1) O_DIRECT should never be used on writes... I can't think of a case
   where you'd want it off, even when COPY'ing data and restoring a
   DB, it just doesn't make sense to use it.  The write buffer is
   emptied as soon as the pages hit the disk unless something is
   reading those bits, but I'd imagine the write buffer would be used
   to make sure that as much writing is done to the platter in a
   single write by the OS as possible, circumventing that would be
   insane (though useful possibly for embedded devices with low RAM,
   solid state drives, or some super nice EMC fiber channel storage
   device that basically has its own huge disk cache).

2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
   non-blocking IO.  The backend would block until the call returns,
   where's the problem?  :)

If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
code, if there are any.  -sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Sailesh Krishnamurthy


Let me add my two cents ..

I think something like PostgreSQL needs two test suites - an
acceptance test (to ensure that checkins don't break functionality)
and a regression test suite.

What we call the regression test suite is really an acceptance
test. Tom Lane is absolutely right in asserting that a test suite that
takes a week to run will mean that people won't test at
all. Personally, I can (and have for many years) tolerate acceptance
test suites that take upto an hour. 

The existence of such an acceptance test should not however obviate
the presence of a wider regression test suite. It should be fine to
have an entire suite of regression test buckets take a week to run,
because you only start running them once you have a release candidate
or equivalent. 

Of course, setting up a regression test suite takes effort. There is
no need, however, to spend umpteen amounts of time writing the
buckets. What can be done is to incrementally build it up. So whenever
we have a significant new feature, somebody (preferably not the key
developers) could take the time to set up a set of test cases that try
to test it thoroughly. It's okay if such a test bucket takes 10-15
minutes to run. Then this can get rolled up into the regression suite
while a very small representative test case makes it to the
acceptance test suite. 

Of course, in the open source world, these things take resources and
are not easy to do. 

I certainly think that the base regression test suite is great. We
have clearing the pgsql regression test a checkin requirement for
TelegraphCQ developers as our goal is to not break pgsql
functionality.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sun, 22 Jun 2003, Bruce Momjian wrote:

 Peter Eisentraut wrote:
  Bruce Momjian writes:
 
   I have added a cleaned up version of this to CVS as src/tools/pgtest.
 
  This seems to be a platform-specific reimplementation of 'make clean; make
  check'.  Why bother?

 Marc requested it.  Is there anything platform-specific except the greps?

Ya, the script looked like it did a bit more then just a 'make clean; make
check' ... doesn't it?


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian


Basically, we don't know when we read a buffer whether this is a
read-only or read/write.  In fact, we could read it in, and another
backend could write it for us.

The big issue is that when we do a write, we don't wait for it to get to
disk.

It seems to use O_DIRECT, we would have to read the buffer in a special
way to mark it as read-only, which seems kind of strange.  I see no
reason we can't use free-behind in the PostgreSQL buffer cache to handle
most of the benefits of O_DIRECT, without the read-only buffer restriction.

---

Sean Chittenden wrote:
  _That_ is an excellent point.  However, do we know at the time we
  open the file descriptor if we will be doing this?
 
 Doesn't matter, it's an option to fcntl().
 
  What about cache coherency problems with other backends not opening
  with O_DIRECT?
 
 That's a problem for the kernel VM, if you mean cache coherency in the
 VM.  If you mean inside of the backend, that could be a stickier
 issue, I think.  I don't know enough of the internals yet to know if
 this is a problem or not, but you're right, it's certainly something
 to consider.  Is the cache a write behind cache or is it a read
 through cache?  If it's a read through cache, which I think it is,
 then the backend would have to dirty all cache entries pertaining to
 the relations being opened with O_DIRECT.  The use case for that
 being:
 
 1) a transaction begins
 2) a few rows out of the huge table are read
 3) a huge query is performed that triggers the use of O_DIRECT
 4) the rows selected in step 2 are updated (this step should poison or
update the cache, actually, and act as a write through cache if the
data is in the cache)
 5) the same few rows are read in again
 6) transaction is committed
 
 Provided the cache is poisoned or updated in step 4, I can't see how
 or where this would be a problem.  Please enlighten if there's a
 different case that would need to be taken into account.  I can't
 imagine ever wanting to write out data using O_DIRECT and think that
 it's a read only optimization in an attempt to minimize the turn over
 in the OS's cache.  From fcntl(2):
 
  O_DIRECT Minimize or eliminate the cache effects of reading and writ-
   ing.  The system will attempt to avoid caching the data you
   read or write.  If it cannot avoid caching the data, it will
   minimize the impact the data has on the cache.  Use of this
   flag can drastically reduce performance if not used with
   care.
 
 
  And finally, how do we deal with the fact that writes to O_DIRECT
  files will wait until the data hits the disk because there is no
  kernel buffer cache?
 
 Well, two things.
 
 1) O_DIRECT should never be used on writes... I can't think of a case
where you'd want it off, even when COPY'ing data and restoring a
DB, it just doesn't make sense to use it.  The write buffer is
emptied as soon as the pages hit the disk unless something is
reading those bits, but I'd imagine the write buffer would be used
to make sure that as much writing is done to the platter in a
single write by the OS as possible, circumventing that would be
insane (though useful possibly for embedded devices with low RAM,
solid state drives, or some super nice EMC fiber channel storage
device that basically has its own huge disk cache).
 
 2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
non-blocking IO.  The backend would block until the call returns,
where's the problem?  :)
 
 If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
 code, if there are any.  -sc
 
 -- 
 Sean Chittenden
 
 ---(end of broadcast)---
 TIP 7: don't forget to increase your free space map settings
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] PlPython

2003-06-22 Thread Tom Lane

Kevin Jacobs [EMAIL PROTECTED] writes:
 Attached is a patch that removes all of the RExec code from plpython from
 the current PostgreSQL CVS.  In addition, plpython needs to be changed to an
 untrusted language in createlang.

I am inclined to rename plpython to plpythonu, by analogy to pltclu.
The advantage of doing so is that (a) the name change makes it somewhat
more obvious that there's a fundamental behavioral change, and (b)
assuming that the Python folk someday figure out a secure version of
RExec, we'd want to reinstitute the trusted version of plpython, but
perhaps not take away the untrusted one.

On the other hand, this would create headaches for people who are trying
to load dump files that declare plpython or contain plpython-language
functions.  I can't think of any non-kluge solution to this (kluge
solutions would include putting special-case code into CREATE FUNCTION
to change 'plpython' to 'plpythonu' ...)

Comments?

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

 Basically, we don't know when we read a buffer whether this is a
 read-only or read/write.  In fact, we could read it in, and another
 backend could write it for us.

Um, wait.  The cache is shared between backends?  I don't think so,
but it shouldn't matter because there has to be a semaphore locking
the cache to prevent the coherency issue you describe.  If PostgreSQL
didn't, it'd be having problems with this now.  I'd also think that
MVCC would handle the case of updated data in the cache as that has to
be a common case.  At what point is the cached result invalidated and
fetched from the OS?

 The big issue is that when we do a write, we don't wait for it to
 get to disk.

Only in the case when fsync() is turned off, but again, that's up to
the OS to manage that can of worms, which I think BSD takes care of
that.  From conf/NOTES:

# Attempt to bypass the buffer cache and put data directly into the
# userland buffer for read operation when O_DIRECT flag is set on the
# file.  Both offset and length of the read operation must be
# multiples of the physical media sector size.
#
#optionsDIRECTIO

The offsets and length bit kinda bothers me though, but I thin that's
stuff that the ernel has to take into account, not the userland calls,
I wonder if that's actually accurate any more or affects userland
calls...  seems like that'd be a bit too much work to have the user
do, esp given the lack of documentation on the flag... should be just
drop in additional flag, afaict.

 It seems to use O_DIRECT, we would have to read the buffer in a
 special way to mark it as read-only, which seems kind of strange.  I
 see no reason we can't use free-behind in the PostgreSQL buffer
 cache to handle most of the benefits of O_DIRECT, without the
 read-only buffer restriction.

I don't see how this'd be an issue as buffers populated via a read(),
that are updated, and then written out, would occupy a new chunk of
disk to satisfy MVCC.  Why would we need to mark a buffer as read only
and carry around/check its state?

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Sean Chittenden wrote:
  Basically, we don't know when we read a buffer whether this is a
  read-only or read/write.  In fact, we could read it in, and another
  backend could write it for us.
 
 Um, wait.  The cache is shared between backends?  I don't think so,
 but it shouldn't matter because there has to be a semaphore locking
 the cache to prevent the coherency issue you describe.  If PostgreSQL
 didn't, it'd be having problems with this now.  I'd also think that
 MVCC would handle the case of updated data in the cache as that has to
 be a common case.  At what point is the cached result invalidated and
 fetched from the OS?

Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
lock pages only while we are reading/writing them, not for the duration
they are in the cache.

  The big issue is that when we do a write, we don't wait for it to
  get to disk.
 
 Only in the case when fsync() is turned off, but again, that's up to
 the OS to manage that can of worms, which I think BSD takes care of
 that.  From conf/NOTES:

Nope.  When you don't have a kernel buffer cache, and you do a write,
where do you expect it to go?  I assume it goes to the drive, and you
have to wait for that.

 
 # Attempt to bypass the buffer cache and put data directly into the
 # userland buffer for read operation when O_DIRECT flag is set on the
 # file.  Both offset and length of the read operation must be
 # multiples of the physical media sector size.
 #
 #optionsDIRECTIO
 
 The offsets and length bit kinda bothers me though, but I thin that's
 stuff that the ernel has to take into account, not the userland calls,
 I wonder if that's actually accurate any more or affects userland
 calls...  seems like that'd be a bit too much work to have the user
 do, esp given the lack of documentation on the flag... should be just
 drop in additional flag, afaict.
 
  It seems to use O_DIRECT, we would have to read the buffer in a
  special way to mark it as read-only, which seems kind of strange.  I
  see no reason we can't use free-behind in the PostgreSQL buffer
  cache to handle most of the benefits of O_DIRECT, without the
  read-only buffer restriction.
 
 I don't see how this'd be an issue as buffers populated via a read(),
 that are updated, and then written out, would occupy a new chunk of
 disk to satisfy MVCC.  Why would we need to mark a buffer as read only
 and carry around/check its state?

We update the expired flags on the tuple during update/delete.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 I sure want two-phase commit.  I don't remember it as being rejected,
 and we certainly need it, independent of replication.

Is 2PC a real-world solution to any real-world problem?  I have never
seen a satisfactory explanation of what you do when you've reported that
you're ready to commit and no confirmation ever comes back.  Sooner or
later you must violate the protocol in one direction or the other (ie,
commit without confirmation or roll back in violation of your promise
of being able to commit).

I think it's a cool-sounding phrase that does not actually work in
practice.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Basically, I think we need free-behind rather than O_DIRECT.

There are two separate issues here --- one is what's happening in our
own cache, and one is what's happening in the kernel disk cache.
Implementing our own free-behind code would help in our own cache but
does nothing for the kernel cache.

My thought on this is that for large seqscans we could think about
doing reads through a file descriptor that's opened with O_DIRECT.
But writes should never go through O_DIRECT.  In some scenarios this
would mean having two FDs open for the same relation file.  This'd
require moderately extensive changes to the smgr-related APIs, but
it doesn't seem totally out of the question.  I'd kinda like to see
some experimental evidence that it's worth doing though.  Anyone
care to make a quick-hack prototype and do some measurements?

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 _That_ is an excellent point.  However, do we know at the time we open
 the file descriptor if we will be doing this?

We'd have to say on a per-read basis whether we want O_DIRECT or not,
and fd.c would need to provide a suitable file descriptor.

 What about cache
 coherency problems with other backends not opening with O_DIRECT?

If O_DIRECT introduces cache coherency problems against other processes
not using O_DIRECT then the whole idea is a nonstarter, but I can't
imagine any kernel hackers would have been stupid enough to allow that
to happen ...

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] a problem with index and user define type

2003-06-22 Thread Weiping He

Tom Lane wrote:

Wang Mike [EMAIL PROTECTED] writes:
 

but  this query: select * from test_uuid where id = 
'df2b10aa-a31d-11d7-9867-0050babb6029'::uuid   dosn't use index
   

 

 QUERY PLAN
---
Seq Scan on test_uuid  (cost=0.00..22.50 rows=500 width=140)
  Filter: (id = 'df2b10aa-a31d-11d7-9867-0050babb6029'::uuid)
   

 

why ??
   

The rows estimate looks pretty fishy --- I think you are getting the
0.5 default selectivity estimate for an operator that has no restriction
estimator.  Most likely you should have created the operator using eqsel
and eqjoinsel as the restriction/join estimators.
			regards, tom lane

 

Hi, Tom,

   I'm trying to test it, but don't know if I understood you correctly,
   you mean we should  try to create the operator using 
eqsel/eqjoinsel  estimators, right?
   But after we added those estimators like this:

CREATE OPERATOR = (
   LEFTARG = uuid,
   RIGHTARG = uuid,
   COMMUTATOR = =,
   NEGATOR = ,
   PROCEDURE = uuid_eq,
   RESTRICT = eqsel,
   JOIN = eqjoinsel
);
   the situation trun worse: now the explain shows the query using the 
index,
   the we can't select out the match row! Any hint about what's wrong 
with us?

Thanks and Reagards

Laser



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Basically, I think we need free-behind rather than O_DIRECT.
 
 There are two separate issues here --- one is what's happening in our
 own cache, and one is what's happening in the kernel disk cache.
 Implementing our own free-behind code would help in our own cache but
 does nothing for the kernel cache.

Right.

 My thought on this is that for large seqscans we could think about
 doing reads through a file descriptor that's opened with O_DIRECT.
 But writes should never go through O_DIRECT.  In some scenarios this
 would mean having two FDs open for the same relation file.  This'd
 require moderately extensive changes to the smgr-related APIs, but
 it doesn't seem totally out of the question.  I'd kinda like to see
 some experimental evidence that it's worth doing though.  Anyone
 care to make a quick-hack prototype and do some measurements?

True, it is a cost/benefit issue.  My assumption was that once we have
free-behind in the PostgreSQL shared buffer cache, the kernel cache
issues would be minimal, but I am willing to be found wrong.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  _That_ is an excellent point.  However, do we know at the time we open
  the file descriptor if we will be doing this?
 
 We'd have to say on a per-read basis whether we want O_DIRECT or not,
 and fd.c would need to provide a suitable file descriptor.

OK

  What about cache
  coherency problems with other backends not opening with O_DIRECT?
 
 If O_DIRECT introduces cache coherency problems against other processes
 not using O_DIRECT then the whole idea is a nonstarter, but I can't
 imagine any kernel hackers would have been stupid enough to allow that
 to happen ...

Seeing how the buffer exists in user space, I would be interested how
they prevent coherency problems with good performance --- maybe they map
the same page into multiple processes --- that would be interesting,
though it would require some locking.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

  Basically, I think we need free-behind rather than O_DIRECT.
 
 There are two separate issues here --- one is what's happening in
 our own cache, and one is what's happening in the kernel disk cache.
 Implementing our own free-behind code would help in our own cache
 but does nothing for the kernel cache.
 
 My thought on this is that for large seqscans we could think about
 doing reads through a file descriptor that's opened with O_DIRECT.
 But writes should never go through O_DIRECT.  In some scenarios this
 would mean having two FDs open for the same relation file.  This'd
 require moderately extensive changes to the smgr-related APIs, but
 it doesn't seem totally out of the question.  I'd kinda like to see
 some experimental evidence that it's worth doing though.  Anyone
 care to make a quick-hack prototype and do some measurements?

What would you like to measure?  Overall system performance when a
query is using O_DIRECT or are you looking for negative/postitve
impact of read() not using the FS cache?  The latter is much easier to
do than the former...  recreating a valid load environment that'd let
any O_DIRECT benchmark be useful isn't trivial.

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I sure want two-phase commit.  I don't remember it as being rejected,
  and we certainly need it, independent of replication.
 
 Is 2PC a real-world solution to any real-world problem?  I have never
 seen a satisfactory explanation of what you do when you've reported that
 you're ready to commit and no confirmation ever comes back.  Sooner or
 later you must violate the protocol in one direction or the other (ie,
 commit without confirmation or roll back in violation of your promise
 of being able to commit).
 
 I think it's a cool-sounding phrase that does not actually work in
 practice.

I think 2PC can be used to build more complex features, like using dblink
for cross-db modification queries.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

   What about cache coherency problems with other backends not
   opening with O_DIRECT?
  
  If O_DIRECT introduces cache coherency problems against other
  processes not using O_DIRECT then the whole idea is a nonstarter,
  but I can't imagine any kernel hackers would have been stupid
  enough to allow that to happen ...

Cache coherency isn't an issue, read() called on a fd that has
O_DIRECT set on it will pull data from the FS cache if a write has
occurred to the file pointed to by the reading/writing fd (though the
write of that page gets done immediately and isn't postponed,
rationale for that is still pending).

 Seeing how the buffer exists in user space, I would be interested
 how they prevent coherency problems with good performance --- maybe
 they map the same page into multiple processes --- that would be
 interesting, though it would require some locking.

It's in src/kern/vfs_bio.c, src/ufs/ffs_vnops.c, src/ufs/ffs_rawread.c

http://lists.freebsd.org/pipermail/cvs-all/2003-March/000226.html

phk's likely to revamp some of the FS buffer cache in FreeBSD 6, but
the userland API will remain stable (it'll just get more efficient in
6, not that it's not fast as is).

http://lists.freebsd.org/pipermail/cvs-all/2003-March/000261.html

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 True, it is a cost/benefit issue.  My assumption was that once we have
 free-behind in the PostgreSQL shared buffer cache, the kernel cache
 issues would be minimal, but I am willing to be found wrong.

If you are running on the small-shared-buffers-and-large-kernel-cache
theory, then getting the kernel cache to behave right is much more
important than making the PG cache behave right.  If you favor the
other theory then free-behind in the PG cache is the important thing.
However, I've not yet seen any convincing evidence that large PG
cache with small kernel cache is the way to go.

The scenario that I'm envisioning would require passing a this-is-a-
seqscan-read flag down through bufmgr to smgr to fd.c.  So the info
would be available to bufmgr to implement a free-behind policy.  But
I suspect that's not really where any big win would come from.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sun, 22 Jun 2003, Bruce Momjian wrote:

 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   I sure want two-phase commit.  I don't remember it as being rejected,
   and we certainly need it, independent of replication.
 
  Is 2PC a real-world solution to any real-world problem?  I have never
  seen a satisfactory explanation of what you do when you've reported that
  you're ready to commit and no confirmation ever comes back.  Sooner or
  later you must violate the protocol in one direction or the other (ie,
  commit without confirmation or roll back in violation of your promise
  of being able to commit).
 
  I think it's a cool-sounding phrase that does not actually work in
  practice.

 I think 2PC can be used to build more complex features, like using
 dblink for cross-db modification queries.

That was my understanding too ... as soon as you try and go distributed,
you need some method of ensuring that the data is constant across them all
...

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane

Sean Chittenden [EMAIL PROTECTED] writes:
 it doesn't seem totally out of the question.  I'd kinda like to see
 some experimental evidence that it's worth doing though.  Anyone
 care to make a quick-hack prototype and do some measurements?

 What would you like to measure?  Overall system performance when a
 query is using O_DIRECT or are you looking for negative/postitve
 impact of read() not using the FS cache?  The latter is much easier to
 do than the former...  recreating a valid load environment that'd let
 any O_DIRECT benchmark be useful isn't trivial.

If this stuff were easy, we'd have done it already ;-).  The first
problem is to figure out what makes sense to measure.

Given that the request is for a quick-and-dirty test, I'd be willing to
cut you some slack on the measurement process.  That is, it's okay to
pick something easier to measure over something harder to measure, as
long as you can make a fair argument that what you're measuring is of
any interest at all...

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I think it's a cool-sounding phrase that does not actually work in
 practice.

 I think 2PC can be used to build more complex features,

Only if it works to begin with.  If it's unreliable, what are you gonna
build on it?

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  I think it's a cool-sounding phrase that does not actually work in
  practice.
 
  I think 2PC can be used to build more complex features,
 
 Only if it works to begin with.  If it's unreliable, what are you gonna
 build on it?

The question was whether 2PC is useful.  The question wasn't if an
unreliable 2PC was useful.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] a problem with index and user define type

2003-06-22 Thread Tom Lane

Weiping He [EMAIL PROTECTED] writes:
 the situation trun worse: now the explain shows the query using the 
 index,
 the we can't select out the match row! Any hint about what's wrong 
 with us?

My bet: either your operators are broken or your operator class
definition is wrong.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

  True, it is a cost/benefit issue.  My assumption was that once we have
  free-behind in the PostgreSQL shared buffer cache, the kernel cache
  issues would be minimal, but I am willing to be found wrong.
 
 If you are running on the
 small-shared-buffers-and-large-kernel-cache theory, then getting the
 kernel cache to behave right is much more important than making the
 PG cache behave right.  If you favor the other theory then
 free-behind in the PG cache is the important thing.  However, I've
 not yet seen any convincing evidence that large PG cache with small
 kernel cache is the way to go.

Nor could it ever be a win unless the cache was populated via
O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once in
the kernel and once in PG.  Doing caching at the kernel level, however
means only one copy of data (for the most part).  Only problem with
this being that it's not always that easy or an option to reconfig a
kernel to have a bigger FS cache.  That said, tripple copying a chunk
of mem is generally faster than even a single disk read.  If
PostgreSQL ever wanted to have a platform agnostic way of doing
efficient caching, it'd likely have to be in the userland and would
require the use of O_DIRECT.

-sc


PS Tripple copy == disk buffer into kernel (data is normally DMA'ed,
   not technically a copy), fs cache into user land, userland into PG
   cache, pg cache into application.  O_DIRECT eliminates one of these
   copies: nevermind the doubling up of data in RAM.

-- 
Sean Chittenden

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Sean Chittenden wrote:
 Nor could it ever be a win unless the cache was populated via
 O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once in
 the kernel and once in PG.  Doing caching at the kernel level, however
 means only one copy of data (for the most part).  Only problem with
 this being that it's not always that easy or an option to reconfig a
 kernel to have a bigger FS cache.  That said, tripple copying a chunk
 of mem is generally faster than even a single disk read.  If
 PostgreSQL ever wanted to have a platform agnostic way of doing
 efficient caching, it'd likely have to be in the userland and would
 require the use of O_DIRECT.

Actually, I think of O_DIRECT as platform-dependent.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 The question was whether 2PC is useful.  The question wasn't if an
 unreliable 2PC was useful.

My question is whether there is such a thing as reliable 2PC.  I sure
don't see how you build that.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Bruce Momjian

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  The question was whether 2PC is useful.  The question wasn't if an
  unreliable 2PC was useful.
 
 My question is whether there is such a thing as reliable 2PC.  I sure
 don't see how you build that.

Other databases use 2PC --- are you saying none of them are reliable?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Jan Wieck

Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
I sure want two-phase commit.  I don't remember it as being rejected,
and we certainly need it, independent of replication.
Is 2PC a real-world solution to any real-world problem?  I have never
seen a satisfactory explanation of what you do when you've reported that
you're ready to commit and no confirmation ever comes back.  Sooner or
later you must violate the protocol in one direction or the other (ie,
commit without confirmation or roll back in violation of your promise
of being able to commit).
I think it's a cool-sounding phrase that does not actually work in
practice.
The other problem I was missing being addressed is what happens if one 
promised I can commit and crashes? Not exactly at the time he crashes, 
but more at the time he restarts? Doesn't he have to restart into 
exactly that state of I can commit, with all locks in place and yet 
being able to rollback and then again ask and what now? I would be 
surprised if said patch does that ... very *positively* surprised!

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Other databases use 2PC --- are you saying none of them are reliable?

Perhaps they're just smarter than I am.  My question stands: what do
you do when the controller doesn't respond after you promise to commit?
Without a believable answer to that, I have no confidence in the idea.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] ECPG still having thread problems - follow-up

2003-06-22 Thread Philip Yarra

On Wed, 18 Jun 2003 02:58 pm, Philip Yarra wrote:
 Hi all, it looks like Lee's ECPG (and libpq) thread-safety patches
 have been applied, and configure --with-threads is also added. I
 have been doing some testing, and I still encounter a
 threading problem.

For those interested, I tested this on FreeBSD 4.8, and it works fine (as in, 
no threads hang). I'm not sure what to make of that. Sounds like something 
more might be required for pthreads on Linux.

Regards, Philip.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Sailesh Krishnamurthy

 Bruce == Bruce Momjian [EMAIL PROTECTED] writes:

Bruce Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:  The question
 was whether 2PC is useful.  The question wasn't if an 
 unreliable 2PC was useful.
 
 My question is whether there is such a thing as reliable 2PC.
 I sure don't see how you build that.

Bruce Other databases use 2PC --- are you saying none of them are
Bruce reliable?

And they use them for both federated read/write (what you refer to as
distributed access through dblink) and for clustered configurations. 

I'm not sure if I understand Tom's beef - I think he is concerned
about what happens if a subordinate does not respond to a prepare
message. I would assume that the co-ordinator would not let the commit
go through until it has received confirmations from every
subordinate. The application's commit will remain blocked against the
co-ordinator when this takes place.

That said, I agree that 2PC (and variants) is rather heavy weight when
in widely distributed configurations. 

(Although I guess in practice, many people use Presumed Abort and not
vanilla 2PC as PA results in fewer log flushes for read-only
transactions.)

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] $PostgreSQL$ for revision info

2003-06-22 Thread Bruce Momjian


Added to TODO:

* Change CVS $Id$ to $PostgreSQL$


---

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Is this a dead issue?  I know several people said they wanted it.
 
 I'm willing to do it if people want it.  Some of my Red Hat coworkers
 say it'd make their lives easier.
 
 I'd actually suggest that we standardize on $PostgreSQL$ working like
 $Id$ not $Header$, ie omitting the full-path info.  While I'd rather
 have the path in it myself, this would have the advantage that checkouts
 from the master and anoncvs servers would look the same.  Several people
 have griped about differences between distribution tarballs and anoncvs
 checkouts due to the repository path not being the same.
 
 Plan B would be to get the repository path to be the same on both
 servers, but that would cause a flag day for either committers or
 noncommitters ...
 
   regards, tom lane
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

  it doesn't seem totally out of the question.  I'd kinda like to
  see some experimental evidence that it's worth doing though.
  Anyone care to make a quick-hack prototype and do some
  measurements?
 
  What would you like to measure?  Overall system performance when a
  query is using O_DIRECT or are you looking for negative/postitve
  impact of read() not using the FS cache?  The latter is much
  easier to do than the former...  recreating a valid load
  environment that'd let any O_DIRECT benchmark be useful isn't
  trivial.
 
 If this stuff were easy, we'd have done it already ;-).

What do you mean?  Bits don't just hit the tree randomly because of a
possible speed improvement hinted at by a man page reference?  :-]

 The first problem is to figure out what makes sense to measure.

Egh, yeah, and this isn't trivial either benchmarking around vfs
caching makes it hard to get good results (been down that prim rose
path before with sendfile() happiness).

 Given that the request is for a quick-and-dirty test, I'd be willing
 to cut you some slack on the measurement process.  That is, it's
 okay to pick something easier to measure over something harder to
 measure, as long as you can make a fair argument that what you're
 measuring is of any interest at all...

hrm, well, given the easy part is thumping out the code, how's the
following sound as a test procedure:

1) Write out several files at varying sizes using O_DIRECT (512KB,
   1MB, 5MB, 10MB, 50MB, 100MB, 512MB, 1GB) to avoid having the FS
   cache polluted by the writes.

2) Open two new procs that read the above created files with and
   without O_DIRECT (each test iteration must rewrite the files
   above).

3) Before each read() call (does PostgreSQL use fread(3) or read(2)?),
   use gettimeofday(2) to get high resolution timing of time required
   to perform each system call.

4) Perform each of the tests above 4 times, averaging the last three
   and throwing out the 1st case (though reporting its value may be of
   interest).


I'm not that wild about writing anything threaded unless there's
strong enough interest in a write() to an O_DIRECT'ed fd to see what
happens.  I'm not convinced we'll see anything worth while unless I
setup an example that's doing a ton of write disk io.

As things stand, because O_DIRECT is an execution fast path through
the vfs subsystem, I expect the speed difference to be greater on
faster HDDs with high RPMs than on slower IDE machines at only
5400RPM... thus trivializing any benchmark I'll do on my laptop.  And
actually, if the app can't keep up with the disk, I bet the fs cache
case will be faster.  If the read()'s are able to keep up at the rate
of the HDD, however, this could be a big win in the speed dept, but if
things lag for an instant, the platter will have to make another
rotation before the call comes back to the userland.

Now that I think about it, the optimal case would be to anonymously
mmap() a private buffer that does the read() writes into that way the
HDD could just DMA the data into the mmap()'ed buffer making it a
zero-copy read operation though stirring any interest with my
mmap() benchmarks from a while back seems to me have been lost in the
fray.  :)

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Jan Wieck [EMAIL PROTECTED] writes:
 The other problem I was missing being addressed is what happens if one 
 promised I can commit and crashes? Not exactly at the time he crashes, 
 but more at the time he restarts? Doesn't he have to restart into 
 exactly that state of I can commit, with all locks in place

Yes, I think he does --- which adds a whole 'nother layer of complexity
and performance penalty to the thing, because all those held locks etc
have to be recorded on disk before you promise to commit.

That part is soluble in theory though, ie, I believe that it can be
done (not efficiently, but it can be done).  I don't see what to do
about the no-commit-ack problem.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

Sailesh Krishnamurthy [EMAIL PROTECTED] writes:
 I'm not sure if I understand Tom's beef - I think he is concerned
 about what happens if a subordinate does not respond to a prepare
 message. I would assume that the co-ordinator would not let the commit
 go through until it has received confirmations from every
 subordinate.

No.  I want to know what the subordinate does when it's promised to
commit and the co-ordinator never responds.  AFAICS the subordinate
is screwed --- it can't commit, and it can't abort, and it can't expect
to make progress indefinitely on other work while it's holding locks
for the not-quite-committed transaction.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden

  Nor could it ever be a win unless the cache was populated via
  O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once
  in the kernel and once in PG.  Doing caching at the kernel level,
  however means only one copy of data (for the most part).  Only
  problem with this being that it's not always that easy or an
  option to reconfig a kernel to have a bigger FS cache.  That said,
  tripple copying a chunk of mem is generally faster than even a
  single disk read.  If PostgreSQL ever wanted to have a platform
  agnostic way of doing efficient caching, it'd likely have to be in
  the userland and would require the use of O_DIRECT.
 
 Actually, I think of O_DIRECT as platform-dependent.

FreeBSD, IRIX, and AIX, implement it, and ... *smiles with pride*
looks like Linux does too given the number of security vulnerabilities
associated with the call.  :-]

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Christopher Kings-Lynne

 No.  I want to know what the subordinate does when it's promised to
 commit and the co-ordinator never responds.  AFAICS the subordinate
 is screwed --- it can't commit, and it can't abort, and it can't expect
 to make progress indefinitely on other work while it's holding locks
 for the not-quite-committed transaction.

It takes itself offline and you need to resync it later??

Chris


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Sean Chittenden wrote:
   Nor could it ever be a win unless the cache was populated via
   O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once
   in the kernel and once in PG.  Doing caching at the kernel level,
   however means only one copy of data (for the most part).  Only
   problem with this being that it's not always that easy or an
   option to reconfig a kernel to have a bigger FS cache.  That said,
   tripple copying a chunk of mem is generally faster than even a
   single disk read.  If PostgreSQL ever wanted to have a platform
   agnostic way of doing efficient caching, it'd likely have to be in
   the userland and would require the use of O_DIRECT.
  
  Actually, I think of O_DIRECT as platform-dependent.
 
 FreeBSD, IRIX, and AIX, implement it, and ... *smiles with pride*
 looks like Linux does too given the number of security vulnerabilities
 associated with the call.  :-]

OK, that's 4 of 15 platforms.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Sun, 22 Jun 2003, Bruce Momjian wrote:

 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   Tom Lane wrote:
   I think it's a cool-sounding phrase that does not actually work in
   practice.
 
   I think 2PC can be used to build more complex features,
 
  Only if it works to begin with.  If it's unreliable, what are you gonna
  build on it?

 The question was whether 2PC is useful.  The question wasn't if an
 unreliable 2PC was useful.

I have to back Bruce up on this one ... is there a reason why 2PC couldn't
be made reliable?  I'm guessin that Oracle supports 2PC ... ?  If so, is
it unreliable?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Mon, 23 Jun 2003, Sailesh Krishnamurthy wrote:

 I'm not sure if I understand Tom's beef - I think he is concerned about
 what happens if a subordinate does not respond to a prepare message. I
 would assume that the co-ordinator would not let the commit go through
 until it has received confirmations from every subordinate. The
 application's commit will remain blocked against the co-ordinator when
 this takes place.

Wouldn't 2PC have some sort of 'heartbeat' between the co-ordinator and
subordinate?  Like, if you had multiple subordinates and one crashed, the
co-ordinator would have to know that and be able to continue on, no?


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread The Hermit Hacker

On Mon, 23 Jun 2003, Christopher Kings-Lynne wrote:

  No.  I want to know what the subordinate does when it's promised to
  commit and the co-ordinator never responds.  AFAICS the subordinate
  is screwed --- it can't commit, and it can't abort, and it can't expect
  to make progress indefinitely on other work while it's holding locks
  for the not-quite-committed transaction.

 It takes itself offline and you need to resync it later??

Hrmmm, I see Tom's point (I think!) ... but what if, for instance, the
co-ordinator crashes?  From the subordinates point of view, it has the
complete transaction to commit, but the co-ordinator has gone down without
telling it to do so ...

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Tom Lane

The Hermit Hacker [EMAIL PROTECTED] writes:
 Hrmmm, I see Tom's point (I think!) ... but what if, for instance, the
 co-ordinator crashes?

Or you just lose the network connection for awhile.  The worst case
scenario I think is where the co-ordinator got everyone's promise to
commit, and told some of the subordinates to commit, but your own
response gets lost due to network failure.  Now what?  If you time
out and decide to abort, you're inconsistent with the other
subordinates.  On the other hand, you can't commit after a timeout
either, because that loses in the other scenario (where the coordinator
didn't decide to commit).  Basically, the subordinate must be willing
to hold its breath *forever*.  I don't see how this can work.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

[HACKERS] 2Q implementaion for PostgreSQL buffer replacement.

2003-06-22 Thread Yutaka tanida

Hi.

I implement 2Q algorithm to PostgreSQL for buffer management , instead
of LRU.
It's known as low overhead and high performance than LRU. If you have
some interests , see following URL.

http://www.vldb.org/conf/1994/P439.PDF

In my test (pgbench -S) , it improves 4% cache hit rate and 2% up
performance comparing from LRU.

Do you have any interest about this patch?

-- 
Yutaka tanida [EMAIL PROTECTED]
http://www.nonsensecorner.com/


qq_1.patch
Description: Binary data


buf_init.c.diff
Description: Binary data


freelist.c.diff
Description: Binary data

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Sailesh Krishnamurthy

 Tom == Tom Lane [EMAIL PROTECTED] writes:

Tom Sailesh Krishnamurthy [EMAIL PROTECTED] writes:
 I'm not sure if I understand Tom's beef - I think he is
 concerned about what happens if a subordinate does not respond
 to a prepare message. I would assume that the co-ordinator
 would not let the commit go through until it has received
 confirmations from every subordinate.

Tom No.  I want to know what the subordinate does when it's
Tom promised to commit and the co-ordinator never responds.
Tom AFAICS the subordinate is screwed --- it can't commit, and it
Tom can't abort, and it can't expect to make progress
Tom indefinitely on other work while it's holding locks for the
Tom not-quite-committed transaction.

Okay I understand what you mean now.

AFAIK the general way things happen is that each site has a recovery
procedure that kicks in after a crash. If the co-ordinator crashes
(which could be before or after it sends out COMMIT messages to some
of the subordinates), its recovery manager will bring the system up,
read the log and ready information about all uncommitted transactions
in virtual storage.

If a Xact is in the PREPARE stage it will periodically send a message
to the co-ordinator asking about what happened to the transaction in
question. Once the co-ordinator has come back online it can respond to
the query.

Of course in the case of a co-ordinator going out of action totally
and remaining unconnected this is not a viable solution. 

If you're making the case that 2PC is not viable on very wide area
networks with intermitted connectivity, I agree whole-heartedly. 

That said, 2PC (and its children, PA and PC) have their place, and are
indeed used in many systems. 

For instance, say you are rigging up a message queueing infrastructure
(like MQ-series) to your database (say with NOTIFY), you'd at least
like to have the db act as a co-ordinator with the MQ.

Or the parallel cluster example I gave earlier. Clustered linux boxes
are definitely here although no open-source DBMS offers a parallel
solution.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Two weeks to feature freeze

2003-06-22 Thread Mike Mascari

Tom Lane wrote:

 The Hermit Hacker [EMAIL PROTECTED] writes:
 
Hrmmm, I see Tom's point (I think!) ... but what if, for instance, the
co-ordinator crashes?
 
 Or you just lose the network connection for awhile.  The worst case
 scenario I think is where the co-ordinator got everyone's promise to
 commit, and told some of the subordinates to commit, but your own
 response gets lost due to network failure.  Now what?  If you time
 out and decide to abort, you're inconsistent with the other
 subordinates.  On the other hand, you can't commit after a timeout
 either, because that loses in the other scenario (where the coordinator
 didn't decide to commit).  Basically, the subordinate must be willing
 to hold its breath *forever*.  

Yep. And if the cohort crashes while waiting for the coordinator to
come back on-line, if I understand the world correctly, it must be
capable of committing the database changes associated with the
COMMIT-VOTE response it supplied to the coordinator's PREPARE. It
seems this would require REDO? And yet there are thousands of
installed distributed databases running enterprises every day.

A paper on a A New Presumed Commit Optimization for Two Phase Commit
describes the cohort as:

If a prepared cohort does not receive a transaction outcome message
promptly, or crashes without remembering the outcome, the cohort asks
the coordinator for the outcome. It keeps on asking until it gets an
answer. (This is the blocking aspect of 2PC.)

I'd just like to point out that:

1) The XA interface defines a 2PC protocol library which allows
transaction managers, such as BEAS Tuxedo (and Oracle, for that
matter) to use the database in a distributed transaction. Lack of an
XA interface for PostgreSQL prohibits its use in major enterprise
applications. BEAS Tuxedo can talk to PostgreSQL, but won't allow it
to participate in a distributed tx.

2) The users of distributed databases will/should/can know that a
cohort will block waiting for the coordinator. We're not talking
asynchronous multi-master replication of 4 databases distributed over
low-speed communication lines across the country. We're talking about
the sales dept. database having a few linked tables to the accounting
dept. database, where inserts into the one result in inserts into the
other.

Mike Mascari
[EMAIL PROTECTED]




---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [GENERAL] [HACKERS] large objects

2003-06-22 Thread Bruce Momjian


Add to TODO:

* Allow SSL-enabled clients to turn off SSL transfers


---

Tom Lane wrote:
 Nigel J. Andrews [EMAIL PROTECTED] writes:
  Ok, I tried to try this but I can not get SSL to _not_ be used when
  connecting via any tcp connection, unless the client hasn't been built
  with ssl support of course.
 
 Yeah, I believe that's the way it works.  It seems overly fascist of the
 SSL code to not have the option to turn it off ... but that's how libpq
 is set up at the moment.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 2: you can get off all lists at once with the unregister command
 (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

71 matches

Mail list logo