Re: [HACKERS] SQL/MED estimated time of arrival?

2010-11-06 Thread Hitoshi Harada
2010/11/5 Shigeru HANADA han...@metrosystems.co.jp:
 On Fri, 5 Nov 2010 16:27:49 +0900
 Itagaki Takahiro itagaki.takah...@gmail.com wrote:
 PL/Proxy has similar functionality with RUN ON ALL to start queries
 in parallel. So, I think it's an infrastructure that is commonly required.
 I noticed the lack of consideration of cache invalidation while reading
 the PL/Proxy source; thanks for mentioning PL/Proxy. :-)

And if we really make this async query come true, I suggest designing
resource (i.e. remote connection) management very carefully. When the
executor fails in the middle of its execution, it may fail to release
its own resources; close() in ExecutorEnd() will never be called. As
far as I know, files and memory are released automatically in the
current mechanism, but MED APIs will use their own resources beyond
those.

Regards,


-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make versus effective stack limit in regression tests

2010-11-06 Thread Heikki Linnakangas

On 06.11.2010 00:39, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 11/05/2010 05:45 PM, Tom Lane wrote:

Anyway, what this points up is that we are making a very conservative
assumption about what to do when getrlimit() returns RLIM_INFINITY.
It does not seem real reasonable to interpret that as 100kB on any
modern platform.  I'm inclined to interpret it as 4MB, which is the
same default stack limit that we use on Windows.



+1.


After looking a bit closer, I think the real problem is that
get_stack_depth_rlimit's API fails to distinguish between unknown and
unlimited.  In the first case we ought to have a conservative default,
whereas in the second case not.  It's already the case that (a)
max_stack_depth is a SUSET parameter, and (b) for either unknown or
unlimited RLIMIT_STACK, we will let a superuser set whatever value he
wants, and it's on his head whether that value is safe or not.  That
part of the behavior seems OK.  What's not OK is using the same
built-in default value in both cases.  We need to fix it so that
InitializeGUCOptions can tell the difference.  If it can, I think the
current default of 2MB is OK --- most people will be fine with that,
and those who aren't can select some other value.


Yeah, I bumped into this two years ago but it didn't lead to a patch:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00918.php

+1 on choosing 2MB for RLIM_INFINITY.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Gurjeet Singh
A customer of ours is quite bothered about finding zero pages in an index
after a system crash. The task now is to improve the diagnosability of such
an issue and to be able to definitively point to the source of the zero
pages.

The proposed solution below has been vetted in-house at EnterpriseDB, and I
am posting it here to see whether we missed any problems, and also whether
the community would be interested in incorporating this capability.

Background:
---
SUSE Linux, ATCA board, 4 dual core CPUs = 8 cores, 24 GB RAM, 140 GB disk,
PG 8.3.11. RAID-1 SAS with SCSIinfo reporting that write-caching is
disabled.

The corrupted index's file contents, based on hexdump:

It has a total of 525 pages (cluster block size is 8K, per pg_controldata).
Blocks 0 to 278 look sane.
Blocks 279 to 518 are full of zeroes.
Blocks 519 to 522 look sane.
Block 523 is filled with zeroes.
Block 524 looks sane.

The tail ends of blocks 278 and 522 contain some non-zero data, meaning that
those index pages have valid 'Special space' contents. Also, the heads of
blocks 519 and 524 look sane. These two findings imply that the zeroing
happened at an 8K page boundary. This is a standard ext3 FS with a 4K block
size, so this raises the question of how we can ascertain that this was
indeed a hardware/FS malfunction. And if it was a hardware/FS problem, why
didn't we see zeroes at a 512-byte boundary (generally the disk's sector
size) or a 4K boundary (the default ext3 FS block size) that does not align
with an 8K boundary?

The backup from before the crash does not have these zero-pages.

Disk Page Validity Check Using Magic Number
===

Requirement:

We have encountered quite a few zero pages in an index after a machine
crash, causing the index to be unusable. Although REINDEX is an option, we
have no way of telling whether these zero pages were caused by the hardware,
the filesystem, or Postgres. Code analysis shows that Postgres being the
culprit is a very low probability; similarly, since our hardware is
considered to be of good quality, with hardware-level RAID-1 over 2 disks,
it is difficult to consider the hardware to be the problem. The ext3
filesystem being used is also quite a time-tested piece of software, hence
it becomes very difficult to point fingers at any of these 3 components for
this corruption.

Postgres is being deployed as a component of a carrier-grade platform, and
it is required to run unattended as much as possible. There is a High
Availability monitoring component that is tasked with performing switchover
to a standby node in the event of any problem with the primary node. This
HA component needs to perform regular checks on the health of all the other
components, including Postgres, and take corrective actions.

With the zero pages comes the difficulty of ascertaining whether they are
legitimate zero pages (Postgres considers zero pages valid, since they may
be left over from a previous extend-file followed by a crash) or the result
of an FS/hardware failure.

We are required to definitively differentiate zero pages written by
Postgres from zero pages caused by hardware failure. Obviously this is not
possible by the very nature of the problem, so we explored a few ideas,
including per-block checksums (in-block or in a checksum fork), S.M.A.R.T.
monitoring of disk drives, PageInit() before smgrextend() in
ReadBuffer_common(), and an additional member in PageHeader for a magic
number.

The following approach is the one we think is least invasive and does not
threaten code breakage, yet provides definitive detection of
corruption/data loss outside Postgres with the least performance penalty.

Implementation:
---

.) The basic idea is to have a magic number in every PageHeader before it is
   written to disk, and to check for this magic number when performing
   page-validity checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we
   reuse an existing member of the structure.

.) We exploit the following facts and assumptions:
  -) Relations/files are extended 8 KB (BLCKSZ) at a time.
  -) Every I/O unit contains the PageHeader structure (table/index/fork
     files), which in turn contains pd_lsn as the first member.
  -) Every newly written block is considered to be zero-filled.
  -) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
  -) PageHeaderIsValid() allows zero-filled pages to be considered valid.
  -) Anyone wishing to use a new page has to do PageInit() on the page.
  -) PageInit() does a MemSet(0) on the whole page.
  -) XLogRecPtr={x,0} is considered invalid.
  -) XLogRecPtr={x, ~((uint32)0)} is not valid either (i.e. the last byte
     of an xlog file (not segment)); we'll use this as the magic number.

  ... The above is my assumption, since it is not mentioned anywhere in
  the code. The XLogFileSize calculation seems to support this assumption.

  ... If this assumption doesn't hold good, 

Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Martijn van Oosterhout
On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote:
 On Fri, Nov 5, 2010 at 4:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  The latter is an intentional security feature and will not get changed.
 
 I see that there could be a problem here with SECURITY DEFINER
 functions, but I'm not clear whether it goes beyond that?

IIRC it's because even unprivileged users can make things in
the pg_temp schema and it's implicitly at the front of the search_path.
There was a CVE about this a while back, no?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] improved parallel make support

2010-11-06 Thread Peter Eisentraut
On ons, 2010-11-03 at 16:34 +0200, Peter Eisentraut wrote:
 On tis, 2010-11-02 at 10:21 -0400, Tom Lane wrote:
  Do we have a handle on how many buildfarm members this will break?
 
 I suppose we don't.  One way to find out would be to commit just this
 bit
 
 +# We need the $(eval) function, which is available in GNU make 3.80.
 +# That also happens to be the version where the .VARIABLES variable
 +# was introduced, so this is a simple check.
 +ifndef .VARIABLES
 +$(error GNU make 3.80 or newer is required)
 +endif
 
 with a $(warning) instead, and let it run for a bit.
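
For comparison, the non-fatal probe would presumably be just the following
(a sketch; the committed check would keep the $(error) form above):

# hypothetical variant: warn about old make versions instead of failing
ifndef .VARIABLES
$(warning GNU make 3.80 or newer will be required)
endif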

So far, two machines have reported an older make version:

dawn_bat
narwhal

both of the mingw type.  Andrew, Dave, could you see about upgrading the
GNU make installation there?

There are a few machines that haven't build in five days or more, but
based on their operating system version, it is fairly safe to assume
that they have an up-to-date version.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can we tell how far behind the standby is?

2010-11-06 Thread Brendan Jurd
On 6 November 2010 05:46, Josh Berkus j...@agliodbs.com wrote:
 I'm continuing in my efforts now to document how to deploy and manage
 replication on our wiki.  One of the things a DBA needs to do is to use
 pg_current_xlog_location() (and related functions) to check how far
 behind the master the standby is.

 However, there are some serious problems with that:

 (1) comparing these numbers is quite mathematically complex -- and, for
 that matter, undocumented.


Our solution to this was to strip the slash out of the numbers and
then feed them to `bc` for comparison.  The shell script for our
zabbix item looks something like this:


#!/bin/bash
errval=-1
primary=$(psql -At -h $1 -p $2 -c "SELECT
replace(pg_current_xlog_location(), '/', '');" postgres)
standby=$(psql -At -h $3 -p $4 -c "SELECT
replace(pg_last_xlog_receive_location(), '/', '');" postgres)

if [ -n "$primary" -a -n "$standby" ]
then
echo $(echo "ibase=16; obase=10; $primary-$standby" | bc)
else
echo $errval
fi


I'm posting this snippet a) in the hopes that it might help others,
and b) by way of agreement with Josh's point.  Requiring every user
who wants to monitor replication to set something like this up for
themselves is ... not awesome.
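
For reference, the arithmetic can also be done entirely server-side.  A
sketch, assuming the hi/lo split of the location string and 0xFF000000
usable bytes per xlog id (an assumption based on the XLogFileSize
calculation in sources of this era; wal_location_to_bytes is a hypothetical
helper, not a built-in):

CREATE OR REPLACE FUNCTION wal_location_to_bytes(loc text)
RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
  -- 'X/Y' -> byte offset; 4278190080 = 0xFF000000, the assumed number
  -- of usable bytes per xlog id (255 segments of 16 MB)
  SELECT ('x' || lpad(split_part($1, '/', 1), 16, '0'))::bit(64)::bigint::numeric
         * 4278190080
       + ('x' || lpad(split_part($1, '/', 2), 16, '0'))::bit(64)::bigint::numeric;
$$;

-- usage, with illustrative location values:
-- SELECT wal_location_to_bytes('67E/AFE198C8')
--      - wal_location_to_bytes('67E/AFE19000');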

Cheers,
BJ

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-06 Thread Alexander Korotkov
Do you think the patch is now ready for committer, or does it require
further review by you or somebody else?


With best regards,
Alexander Korotkov.


[HACKERS] SQL functions that can be inlined

2010-11-06 Thread Jim Nasby
Is there any way to have the database tell you if a particular SQL function can 
be inlined?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-06 Thread Yeb Havinga

Alexander Korotkov wrote:
Do you think the patch is now ready for committer, or does it require
further review by you or somebody else?
It's probably ready for committer; however, the code doesn't currently
mention any reference or bit of information showing that it is faster than
the original one. I was wondering how you discovered this, or is there any
reference to, e.g., a GiST paper or other work where this is researched? If
the info is available, some comments in the code might help future GiST
developers pick the right algorithm for other datatypes.


I don't think further review is required, but I would very much welcome
further exploration of which picksplit algorithm matches which datatype and
which distribution best.


regards,
Yeb Havinga


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL/MED estimated time of arrival?

2010-11-06 Thread Tom Lane
Hitoshi Harada umi.tan...@gmail.com writes:
 And if we really make this async query come true, I suggest designing
 resource (i.e. remote connection) management very carefully. When the
 executor fails in the middle of its execution, it may fail to release
 its own resources; close() in ExecutorEnd() will never be called. As
 far as I know, files and memory are released automatically in the
 current mechanism, but MED APIs will use their own resources beyond
 those.

The way to fix that is for the FDW to hook into the ResourceOwner
mechanism (via RegisterResourceReleaseCallback).  Then it can track
and clean up things it knows about just as automatically as anything
else is.
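
For illustration, a minimal sketch of that hookup (the callback signature
and RegisterResourceReleaseCallback come from utils/resowner.h; the FDW
names are hypothetical):

#include "postgres.h"
#include "utils/resowner.h"

/* hypothetical per-connection state an FDW might track */
typedef struct MyFdwConn
{
    void   *remote_conn;
} MyFdwConn;

static void
my_fdw_release_callback(ResourceReleasePhase phase,
                        bool isCommit, bool isTopLevel, void *arg)
{
    MyFdwConn  *conn = (MyFdwConn *) arg;

    /* runs at every resource-release, including error aborts */
    if (phase == RESOURCE_RELEASE_AFTER_LOCKS && conn->remote_conn != NULL)
    {
        /* hypothetical: close the remote connection here */
        conn->remote_conn = NULL;
    }
}

/* called once, e.g. where the FDW first opens its connection */
static void
my_fdw_register(MyFdwConn *conn)
{
    RegisterResourceReleaseCallback(my_fdw_release_callback, conn);
}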

Of course, if you lose your network connection to the remote DB,
you have to assume it will clean up of its own accord.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PL/pgSQL and shared_preload_libraries

2010-11-06 Thread Bruce Momjian
Since we now install PL/pgSQL by default, should we configure
shared_preload_libraries to preload PL/pgSQL?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote:
 I see that there could be a problem here with SECURITY DEFINER
 functions, but I'm not clear whether it goes beyond that?

 IIRC it's because even unprivileged users can make things in
 the pg_temp schema and it's implicitly at the front of the search_path.
 There was a CVE about this a while back, no?

Yeah, we changed that behavior as part of the fix for CVE-2007-2138.
You'd need either SECURITY DEFINER functions or very careless use of
SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 .) The basic idea is to have a magic number in every PageHeader before it is
 written to disk, and to check for this magic number when performing
 page-validity checks.

Um ... and exactly how does that differ from the existing behavior?

 .) To avoid adding a new field to PageHeader, and any code breakage, we
 reuse an existing member of the structure.

The amount of fragility introduced by the assumptions you have to make
for this seems to me to be vastly riskier than the risk you are trying
to respond to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL functions that can be inlined

2010-11-06 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 Is there any way to have the database tell you if a particular SQL function 
 can be inlined?

Easiest way is to EXPLAIN a query using it and see if it did get inlined.
For example,

regression=# create function foo(int) returns int as
regression-# 'select $1 + 1' language sql;
CREATE FUNCTION
regression=# explain verbose select foo(f1) from int4_tbl;
  QUERY PLAN   
---
 Seq Scan on public.int4_tbl  (cost=0.00..1.06 rows=5 width=4)
   Output: (f1 + 1)
(2 rows)

regression=# create function foo2(int) returns int as
'select $1 + 1 limit 1' language sql;
CREATE FUNCTION
regression=# explain verbose select foo2(f1) from int4_tbl;
  QUERY PLAN   
---
 Seq Scan on public.int4_tbl  (cost=0.00..2.30 rows=5 width=4)
   Output: foo2(f1)
(2 rows)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL and shared_preload_libraries

2010-11-06 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Since we now install PL/pgSQL by default, should we configure
 shared_preload_libraries to preload PL/pgSQL?

I don't think that follows.  The fact that it's there doesn't mean
everyone is using it.  In any case, I've seen no evidence that says
you'd get a performance win this way.  The preload feature is meant
for modules that do a substantial amount of work at load time, which
plpgsql does not.
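
(For reference, the setting in question is a one-line postgresql.conf entry,
applied at server restart; a sketch of what Bruce proposes:

shared_preload_libraries = '$libdir/plpgsql'

The only saving would be the dynamic-load cost at first use in each backend,
which, per the above, is small for plpgsql.)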

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Should we use make -k on the buildfarm?

2010-11-06 Thread Andrew Dunstan


Peter Eisentraut has suggested that we should run make -k instead of 
plain make for most or all of the buildfarm steps. This flag 
essentially instructs make to keep going rather than fail at the first 
error. We haven't done that for the last five or six years that the 
buildfarm has been running, and nobody up to now has complained (that I 
recall). I don't have any great objection, but before I make this change 
I thought it might be as well to canvass a wider range of opinion.


So, does anyone else have thoughts about it?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psycopg and two phase commit

2010-11-06 Thread Daniele Varrazzo
On Fri, Nov 5, 2010 at 5:16 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2010/11/4 Daniele Varrazzo daniele.varra...@gmail.com:

 Just wanted to warn you that I have implemented the 2pc protocol in psycopg.

 I read the notice, but I didn't find a download link; where is it, please?

We have just released a beta package including this and other
features. All the details at
http://initd.org/psycopg/articles/2010/11/06/psycopg-230-beta1-released/.

Cheers

-- Daniele

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simplifying replication

2010-11-06 Thread Dimitri Fontaine
Hannu Krosing ha...@2ndquadrant.com writes:
 To make pg_basebackup.py self-sufficient, it should also open a 2nd
 connection to the same master and make sure that all WAL files are
 copied for the duration of the base copy.

Excellent idea, will make that happen soon'ish.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we use make -k on the buildfarm?

2010-11-06 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Peter Eisentraut has suggested that we should run make -k instead of 
 plain make for most or all of the buildfarm steps. This flag 
 essentially instructs make to keep going rather than fail at the first 
 error. We haven't done that for the last five or six years that the 
 buildfarm has been running, and nobody up to now has complained (that I 
 recall). I don't have any great objection, but before I make this change 
 I thought it might be as well to canvass a wider range of opinion.

 So, does anyone else have thoughts about it?

I don't really care about make -k as such.  What I *have* occasionally
wished for is that the buildfarm script would act more like make -k with
respect to the various test stages.  That is, not abandon the whole test
after one stage fails, but allow stages that don't logically depend on
the failed one to proceed.  But I'm not sure how hard that would be ---
quite aside from coding complexity, it would mean that you could have
more than one failing stage, and I don't know how you'd show that in the
dashboard.

Anyway, no objection to Peter's request; but there's another TODO item
for your buildfarm list, if you want to accept it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Robert Haas
On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Martijn van Oosterhout klep...@svana.org writes:
 On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote:
 I see that there could be a problem here with SECURITY DEFINER
 functions, but I'm not clear whether it goes beyond that?

 IIRC it's because even unprivileged users can make things in
 the pg_temp schema and it's implicitly at the front of the search_path.
 There was a CVE about this a while back, no?

 Yeah, we changed that behavior as part of the fix for CVE-2007-2138.
 You'd need either SECURITY DEFINER functions or very careless use of
 SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable.

Would it be practical to let foo() potentially mean pg_temp.foo()
outside of any SECURITY DEFINER context?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query Plan Columns

2010-11-06 Thread David Fetter
On Fri, Nov 05, 2010 at 01:39:07PM -0700, Josh Berkus wrote:
 
  Of course, there are containers too, which are not in your list at all.
  How do you intend to represent the tree-ish structure in a flat table?
 
 Andrew: we'll use a proximity tree.

Adjacency list?

If so, in my experience, it's best to separate the node descriptions
from the adjacency list that links them together.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Tom Lane
Sergey was kind enough to lend me use of buildfarm member dugong
(IA64, Debian Etch) so I could poke into why its behavior in the
recursion-related regression tests was so odd.  I had previously
tried and failed to reproduce the behavior on a Red Hat IA64 test
machine (running RHEL of course) so I was feeling a bit baffled.
Here's what I found out:

1. Debian Etch has the make-resets-the-stack-rlimit bug that I reported
about yesterday, whereas the RHEL version I was testing had the fix
for that.  So that's why I couldn't reproduce max_stack_depth getting
set to 100kB.

2. IA64 is a very weird architecture: it has two separate hardware
stacks.  One is reserved for saving registers, which IA64 has got
a lot of, and the other normal stack holds everything else.
The method we use in check_stack_depth (ie, measure the difference
in addresses of local variables) effectively measures the depth of
the normal stack.  I don't know of any simple way to find out the
depth of the register stack.  You can get gdb to tell you about
both stacks, though.  I found out that with PG HEAD, the recursion
distance for the infinite_recurse() regression test is 160 bytes
of normal stack and 928 bytes of register stack per fmgr_sql call
level.  This is with gcc (I got identical numbers on dugong and the
RHEL machine).  But, if you build PG with icc as the buildfarm
critter is doing, that bloats to 3232 bytes of normal stack and
2832 bytes of register stack.  For comparison, my x86_64 Fedora 13 box
uses 704 bytes of stack per recursion level.
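
(For reference, a minimal sketch of the address-difference technique
described above; illustrative only, not the actual check_stack_depth()
source:

#include <stdio.h>

static char *stack_base_ptr;    /* captured near the top of main() */

static long
current_stack_depth(void)
{
    char    stack_top_loc;      /* its address marks the current stack top */
    long    depth = stack_base_ptr - &stack_top_loc;

    return (depth >= 0) ? depth : -depth;  /* stack may grow either way */
}

int
main(void)
{
    char    base;

    stack_base_ptr = &base;
    printf("stack depth here: %ld bytes\n", current_stack_depth());
    return 0;
}

As noted, this sees only the normal stack; the register stack is invisible
to it.)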

I don't know why icc is so much worse than gcc on this measure of
stack depth consumption, but clearly the combination of that and
the 100kB max_stack_depth explains why dugong is failing to do
very many levels of recursion before erroring out.  Fixing
get_stack_depth_rlimit as I proposed yesterday should give it
a reasonable stack depth.

However, we're not out of the woods yet.  Because check_stack_depth is
only checking the normal stack depth, and the two stacks don't grow at
the same rate, it's possible for a crash to occur due to running out of
register stack space.  We haven't seen that happen on dugong because,
as shown above, with icc the register stack grows more slowly than the
normal stack (at least for the specific functions we care about here).
But with gcc, the same code eats register stack a lot faster than normal
stack --- and in fact I observed a crash in the infinite_recurse() test
when building with gcc and testing in a manually-started postmaster.
The manually-started postmaster was under ulimit -s 8MB, which
apparently Debian interprets as 8MB for normal stack and another 8MB
for register stack.  Even though check_stack_depth was trying to
constrain the normal stack to just 2MB, the register stack grew 5.8
times faster and so blew through 8MB before check_stack_depth thought
there was a problem.  Raising ulimit -s allowed it to work.

(Curiously, I did *not* see the same type of crash on the RHEL machine.
I surmise that Red Hat has tweaked the kernel to allow the register
stack to grow more than the normal stack, but I haven't tried to verify
that.)

So this means we have a problem.  To some extent it's new in HEAD:
before the changes I made last week to not keep a local
FunctionCallInfoData in ExecMakeFunctionResult, there would have been at
least another 900 bytes of normal stack per recursion level, so even
with gcc the register stack would grow slower than normal stack in this
test, and you wouldn't have seen any crash in the regression tests.
But I'm sure there are lots of other potentially recursive routines in
PG where register stack could grow faster than normal stack, so we
shouldn't suppose that this fmgr_sql recursion is the only trouble spot.

As I said above, I don't know of any good way to measure register stack
depth directly.  It's probably possible to find out by asking the kernel
or something like that, but we surely do not want to introduce a kernel
call into check_stack_depth().  So a good solution for this is hard to
see.  The best idea I have at the moment is to reduce the reported stack
limit by some arbitrary factor, ie do something like

#ifdef __IA64__
val /= 8;
#endif

in get_stack_depth_rlimit().  Anyone have a better idea?

BTW, this also suggests to me that it'd be a real good idea to have
a buildfarm critter for IA64+gcc --- the differences between gcc and
icc are clearly pretty significant on this hardware.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, we changed that behavior as part of the fix for CVE-2007-2138.
 You'd need either SECURITY DEFINER functions or very careless use of
 SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable.

 Would it be practical to let foo() potentially mean pg_temp.foo()
 outside of any SECURITY DEFINER context?

Doesn't seem like a particularly good idea for the search semantics
to be randomly different inside a SECURITY DEFINER function.  In fact,
I'll bet you could construct an attack in the reverse direction:
S.D. function thinks it is calling a temp function (using syntax that
works fine when not S.D.), but control gets sent to a non-temp function
belonging to $badguy instead.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simplifying replication

2010-11-06 Thread Hannu Krosing
On Sat, 2010-11-06 at 18:02 +0100, Dimitri Fontaine wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  To make pg_basebackup.py self-sufficient, it should also open a 2nd
  connection to the same master and make sure that all WAL files are
  copied for the duration of the base copy.
 
 Excellent idea, will make that happen soon'ish.

Until I learned better, I thought that this is how SR is supposed to work ;)

btw, as a next step you could backport this to 8.x and have most of the
benefits of SR. It should not be very hard to keep track of the WAL position
inside a pl/pythonu function and send one or more records back in the form of

(walfile_name text, start_pos int, data bytea)

and then call this function from the client every second to keep possible
data loss down to 1 sec.

This lets you set up a warm standby with nothing more than PostgreSQL
superuser access to the master (assuming the right defaults in
postgresql.conf).
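
A rough sketch of such a function (hypothetical and untested; it must run as
superuser since it reads pg_xlog directly, and bytea marshalling through
plpythonu may need extra escaping on older releases):

CREATE OR REPLACE FUNCTION wal_chunks_since(
    IN  last_file text, IN  last_pos integer,
    OUT walfile_name text, OUT start_pos integer, OUT data bytea)
RETURNS SETOF record AS $$
import os
last_file, last_pos = args[0], args[1]
results = []
# pg_xlog is relative to $PGDATA, the server's working directory
for fname in sorted(os.listdir('pg_xlog')):
    if len(fname) != 24 or fname < last_file:
        continue        # skip non-WAL files and already-copied segments
    if fname == last_file:
        pos = last_pos
    else:
        pos = 0
    f = open(os.path.join('pg_xlog', fname), 'rb')
    f.seek(pos)
    chunk = f.read()
    f.close()
    if chunk:
        results.append((fname, pos, chunk))
return results
$$ LANGUAGE plpythonu;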

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query Plan Columns

2010-11-06 Thread David E. Wheeler
On Nov 5, 2010, at 1:42 PM, David E. Wheeler wrote:

 http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/commands/explain.c;h=f494ec98e510c23120e072bd5ee8821ea12738a4;hb=HEAD#l617
 
 Ah, great, thanks.

So based on this, I've come up with:

"Node Type"             TEXT,
"Strategy"              TEXT,
"Operation"             TEXT,
"Startup Cost"          FLOAT,
"Total Cost"            FLOAT,
"Plan Rows"             FLOAT,
"Plan Width"            INTEGER,
"Actual Startup Time"   FLOAT,
"Actual Total Time"     FLOAT,
"Actual Rows"           FLOAT,
"Actual Loops"          FLOAT,
"Parent Relationship"   TEXT,
"Sort Key"              TEXT[],
"Sort Method"           TEXT[],
"Sort Space Used"       BIGINT,
"Sort Space Type"       TEXT,
"Join Type"             TEXT,
"Join Filter"           TEXT,
"Hash Cond"             TEXT,
"Relation Name"         NAME,
"Alias"                 NAME,
"Scan Direction"        TEXT,
"Index Name"            TEXT,
"Index Cond"            TEXT,
"Recheck Cond"          TEXT,
"TID Cond"              TEXT,
"Merge Cond"            TEXT,
"Subplan Name"          TEXT,
"Function Name"         TEXT,
"Function Call"         TEXT,
"Filter"                TEXT,
"One-Time Filter"       TEXT,
"Command"               TEXT,
"Shared Hit Blocks"     BIGINT,
"Shared Read Blocks"    BIGINT,
"Shared Written Blocks" BIGINT,
"Local Hit Blocks"      BIGINT,
"Local Read Blocks"     BIGINT,
"Local Written Blocks"  BIGINT,
"Temp Read Blocks"      BIGINT,
"Temp Written Blocks"   BIGINT,
"Output"                TEXT[],
"Hash Buckets"          BIGINT,
"Hash Batches"          BIGINT,
"Original Hash Batches" BIGINT,
"Peak Memory Usage"     BIGINT,
"Schema"                TEXT,
"CTE Name"              TEXT

Does that seem reasonable? Am I missing anything obvious?

Thanks,

David
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we use make -k on the buildfarm?

2010-11-06 Thread Andrew Dunstan



On 11/06/2010 01:07 PM, Tom Lane wrote:

What I *have* occasionally
wished for is that the buildfarm script would act more like make -k with
respect to the various test stages.  That is, not abandon the whole test
after one stage fails, but allow stages that don't logically depend on
the failed one to proceed.  But I'm not sure how hard that would be ---
quite aside from coding complexity, it would mean that you could have
more than one failing stage, and I don't know how you'd show that in the
dashboard.


That would be a significant architectural change. I'm not sure how many 
steps we could run this way.


Here's the list of tests from a recent run, leaving out stopping and 
starting the installed postmaster, and locale specifiers:


   SCM-checkout
   configure
   make
   check
   make-contrib
   make-install
   install-contrib
   initdb
   install-check
   pl-install-check
   contrib-install-check
   ecpg-check

Currently, the implied dependency list is in this order. We could have
"make-contrib" depend only on "make" rather than "check", have
"pl-install-check" and "contrib-install-check" depend on "initdb", and have
"ecpg-check" depend on "make" rather than anything that comes after. I
think that's about the limit of what we could sensibly relax.


I'm not sure that would be a great advance. Certainly, right now I'm 
going to be putting effort into the FTS stuff which I think should be 
much higher up your list of wants.


cheers

andrew


Re: [HACKERS] Should we use make -k on the buildfarm?

2010-11-06 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 11/06/2010 01:07 PM, Tom Lane wrote:
 What I *have* occasionally
 wished for is that the buildfarm script would act more like make -k with
 respect to the various test stages.

 I'm not sure that would be a great advance. Certainly, right now I'm 
 going to be putting effort into the FTS stuff which I think should be 
 much higher up your list of wants.

Agreed, that would be far more useful.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] knngist questions

2010-11-06 Thread Robert Haas
I'm gradually slogging my way through the KNNGIST patches which were
posted here:

http://archives.postgresql.org/pgsql-hackers/2010-07/msg01183.php

I have a couple of conceptual questions.

1. Is KNNGIST intended to work if there's more than one pathkey?  If
so, how?  Example:

SELECT * FROM tab ORDER BY this_point <-> '(0,0)', this_point <-> '(1,1)'

As far as I can see, there's nothing in match_pathkey_to_index() which
would prevent lists of pathkeys and sortclauses of length 2 from being
constructed, but can GIST actually handle that?  It seems hard.  If it
can't, which I suspect is the case, then we can make this logic a lot
simpler by removing one of the loops from match_pathkey_to_index() and
bailing out quickly whenever list_length(root->query_pathkeys) != 1.

2. I'm concerned by the fact that the new consistent() methods only
return 0 or -1.0 for non-leaf pages.  If we have a query like:

SELECT * FROM tab WHERE this_point <-> '(0,0)'

...it seems that every non-leaf page will be consistent and we'll end
up loading every key in the entire index into an RBTree, which doesn't
seem practical from a memory-usage perspective.  Maybe I'm
misunderstanding something, but it seems like in a case like this
you'd need to have the consistent function for each page return a
minimum and maximum distance to '(0,0)'.  If you have one page that
has a minimum distance of 0 and a maximum distance of 100 and another
page which has a minimum distance of 101 and a maximum distance of
200, you don't even need to look at the second page until all the keys
from the first one have been processed.  If there's a third page with
a minimum distance of 50 and a maximum distance of 150, you can return
the tuples from the first page with distances less than 50, in order;
then you can do a merge between the remaining keys on that page and
the keys on the third page with values <= 100; then you can do a merge
between the remaining keys on that page and those on the second page
with values <= 150; and finally you can return the remaining keys on
the second page in order.

That still doesn't seem to provide any particularly tight bound on
memory usage, but it's certainly way better than traversing the entire
tree up front.  If you are already doing something like this somewhere
in the code, please point me in the right direction...
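
(For what it's worth, the usual answer to this is exactly the lazy,
best-first traversal described above: keep one min-heap holding both leaf
keys, keyed by exact distance, and unvisited pages, keyed by minimum
distance, and read a page only when it reaches the top of the heap.  A toy,
self-contained illustration of the idea (not GiST code), using the
0-100 / 101-200 / 50-150 pages from the example:

#include <stdio.h>

typedef struct Entry
{
    double  dist;       /* exact distance (leaf) or min distance (page) */
    int     isleaf;
    int     page;       /* child page number, for non-leaf entries */
} Entry;

/* a tiny binary min-heap on Entry.dist */
static Entry heap[256];
static int   heapsize = 0;

static void
heap_push(Entry e)
{
    int     i = heapsize++;

    heap[i] = e;
    while (i > 0 && heap[(i - 1) / 2].dist > heap[i].dist)
    {
        Entry   tmp = heap[i];

        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static Entry
heap_pop(void)
{
    Entry   top = heap[0];
    int     i = 0;

    heap[0] = heap[--heapsize];
    for (;;)
    {
        int     l = 2 * i + 1,
                r = 2 * i + 2,
                m = i;
        Entry   tmp;

        if (l < heapsize && heap[l].dist < heap[m].dist)
            m = l;
        if (r < heapsize && heap[r].dist < heap[m].dist)
            m = r;
        if (m == i)
            break;
        tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
    return top;
}

static void
read_page(int page)
{
    /* toy pages mirroring the example above */
    static const double keys[3][3] = {
        {0, 60, 100},       /* page 1: distances 0..100 */
        {101, 150, 200},    /* page 2: distances 101..200 */
        {50, 120, 150},     /* page 3: distances 50..150 */
    };
    int     i;

    for (i = 0; i < 3; i++)
    {
        Entry   e = {keys[page - 1][i], 1, 0};

        heap_push(e);
    }
}

int
main(void)
{
    Entry   c1 = {0, 0, 1},
            c2 = {101, 0, 2},
            c3 = {50, 0, 3};

    heap_push(c1);
    heap_push(c2);
    heap_push(c3);
    while (heapsize > 0)
    {
        Entry   e = heap_pop();

        if (e.isleaf)
            printf("return key at distance %g\n", e.dist);
        else
            read_page(e.page);  /* a page is expanded only when nothing
                                 * closer remains outstanding */
    }
    return 0;
}

Memory then scales with the number of entries whose distance ranges overlap,
rather than with the whole index.)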

3. I've been scratching my head over the following bit of code and it
doesn't make any sense to me.  As far as I can tell, this is
effectively comparing the number of columns in the ORDER BY clause to
the number of restriction clauses applicable to the relation being
scanned.  Those two quantities don't seem to have much to do with each
other, so either I'm confused or the code is.  It doesn't seem like it
should matter anyway, since I don't think we're planning on any AMs
being both amoptionalkey and amcanorderbyop.

+   if (list_length(restrictclauses) <= indexcol && !index->amoptionalkey)
+       break;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Robert Haas
On Sat, Nov 6, 2010 at 1:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, we changed that behavior as part of the fix for CVE-2007-2138.
 You'd need either SECURITY DEFINER functions or very careless use of
 SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable.

 Would it be practical to let foo() potentially mean pg_temp.foo()
 outside of any SECURITY DEFINER context?

 Doesn't seem like a particularly good idea for the search semantics
 to be randomly different inside a SECURITY DEFINER function.  In fact,
 I'll bet you could construct an attack in the reverse direction:
 S.D. function thinks it is calling a temp function (using syntax that
 works fine when not S.D.), but control gets sent to a non-temp function
 belonging to $badguy instead.

I guess.  If you search pg_temp always then it's pretty much
impossible to avoid having a security hole, if you use any non-trivial
SQL.  But if you search pg_temp for non-SD only then you'll only have
a security hole if you assume (presumably without testing) that the
behavior is the same in that case.  If an SD function is calling
temporary functions they'd best be ones it created, otherwise your
security is pretty much nonexistent anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporary functions (and other object types)

2010-11-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I guess.  If you search pg_temp always then it's pretty much
 impossible to avoid having a security hole, if you use any non-trivial
 SQL.  But if you search pg_temp for non-SD only then you'll only have
 a security hole if you assume (presumably without testing) that the
 behavior is the same in that case.  If an SD function is calling
 temporary functions they'd best be ones it created, otherwise your
 security is pretty much nonexistent anyway.

In general I don't see a lot of use for calling temp functions that
you don't know are temp functions.  So I see nothing much wrong with
having to use the pg_temp. prefix --- and the possibility of security
issues definitely pushes me over the line to being happy with requiring
that.
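
To illustrate the rule that has been in place since the CVE-2007-2138 fix:

CREATE FUNCTION pg_temp.foo() RETURNS int AS 'SELECT 1' LANGUAGE sql;
SELECT pg_temp.foo();   -- works: explicitly qualified
SELECT foo();           -- ERROR: pg_temp is never searched implicitly
                        -- for function or operator names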

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Greg Stark
On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As I said above, I don't know of any good way to measure register stack
 depth directly.  It's probably possible to find out by asking the kernel
 or something like that, but we surely do not want to introduce a kernel
 call into check_stack_depth().

It seems more likely it would be some kind of asm than a trap. This
might be wishful thinking but is it too much to hope that glibc
already exposes it through some function?

It looks like the relevant registers are ar.bsp and ar.bspstore. Just
taking the difference apparently gives you the amount of memory used
in the current backing store.

However, some of the comments I'm reading seem to imply that the OS can
allocate discontiguous backing store partitions; presumably, if the
backing store pointer reaches an unmapped address, there has to be some
way to trap to the OS to allocate more, and maybe then it has a chance
to tweak the bsp address?

This was quite interesting (especially the "The Register Stack Engine"
section of the second one):

http://msdn.microsoft.com/en-us/magazine/cc301708.aspx
http://msdn.microsoft.com/en-us/magazine/cc301711.aspx

Also I found the following:

(lists some registers)
http://www.cs.clemson.edu/~mark/subroutines/itanium.html

(helper functions in glibc asm includes that calculate bspstore-bsp to
count the number of registers used)
http://www.koders.com/c/fidE15CABBBA63E7C24928D7F7C9A95653D101451D2.aspx?s=queue

Also I came across http://www.nongnu.org/libunwind/man/libunwind(3).html,
which I found cool though not really relevant. The ia64 implementation
fiddles with the RSE registers as well, of course.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As I said above, I don't know of any good way to measure register stack
 depth directly.  It's probably possible to find out by asking the kernel
 or something like that, but we surely do not want to introduce a kernel
 call into check_stack_depth().

 It seems more likely it would be some kind of asm than a trap. This
 might be wishful thinking but is it too much to hope that glibc
 already exposes it through some function?

Yeah, I suppose some asm might be a possible solution, but I was a bit
discouraged after reading some Intel documentation that said that the
register-stack top wasn't exposed in the architectural model.  You
apparently can only find out what's been spilled to memory.  (But
perhaps that's close enough, for the purposes here?)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 It seems more likely it would be some kind of asm than a trap.

I seem to be getting plausible results from this bit of crockery:


#include <asm/ia64regs.h>

static __inline__ void *
get_bsp(void)
{
  void *ret;
#ifndef __INTEL_COMPILER
  __asm__ __volatile__(
   ";;\n\tmov %0=ar.bsp\n"
   : "=r"(ret));
#else
  ret = (void *) __getReg(_IA64_REG_AR_BSP);
#endif
  return ret;
}


I'll clean this up and commit, assuming it actually fixes the problem.
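
For context, a sketch of how such a helper might plug into the depth check
(hypothetical variable and function names; the committed code may differ):

static char *register_stack_base_ptr;   /* captured once at process start */

static void
check_register_stack_depth(long max_stack_depth_bytes)
{
    long    depth = (char *) get_bsp() - register_stack_base_ptr;

    /* same response as for a normal-stack overrun */
    if (depth > max_stack_depth_bytes)
        ereport(ERROR,
                (errcode(ERRCODE_STATEMENT_TOO_COMPLEX),
                 errmsg("stack depth limit exceeded")));
}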

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query Plan Columns

2010-11-06 Thread David E. Wheeler
On Nov 6, 2010, at 11:44 AM, David E. Wheeler wrote:

 On Nov 5, 2010, at 1:42 PM, David E. Wheeler wrote:
 
 http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/commands/explain.c;h=f494ec98e510c23120e072bd5ee8821ea12738a4;hb=HEAD#l617
 
 Ah, great, thanks.
 
 So based on this, I've come up with:

Updated:

CREATE TEMPORARY TABLE plans (
"Node Type"             TEXT NOT NULL,
"Strategy"              TEXT,
"Operation"             TEXT,
"Startup Cost"          FLOAT,
"Total Cost"            FLOAT,
"Plan Rows"             FLOAT,
"Plan Width"            INTEGER,
"Actual Startup Time"   FLOAT,
"Actual Total Time"     FLOAT,
"Actual Rows"           FLOAT,
"Actual Loops"          FLOAT,
"Parent Relationship"   TEXT,
"Sort Key"              TEXT[],
"Sort Method"           TEXT[],
"Sort Space Used"       BIGINT,
"Sort Space Type"       TEXT,
"Join Type"             TEXT,
"Join Filter"           TEXT,
"Hash Cond"             TEXT,
"Relation Name"         TEXT,
"Alias"                 TEXT,
"Scan Direction"        TEXT,
"Index Name"            TEXT,
"Index Cond"            TEXT,
"Recheck Cond"          TEXT,
"TID Cond"              TEXT,
"Merge Cond"            TEXT,
"Subplan Name"          TEXT,
"Function Name"         TEXT,
"Function Call"         TEXT,
"Filter"                TEXT,
"One-Time Filter"       TEXT,
"Command"               TEXT,
"Shared Hit Blocks"     BIGINT,
"Shared Read Blocks"    BIGINT,
"Shared Written Blocks" BIGINT,
"Local Hit Blocks"      BIGINT,
"Local Read Blocks"     BIGINT,
"Local Written Blocks"  BIGINT,
"Temp Read Blocks"      BIGINT,
"Temp Written Blocks"   BIGINT,
"Output"                TEXT[],
"Hash Buckets"          BIGINT,
"Hash Batches"          BIGINT,
"Original Hash Batches" BIGINT,
"Peak Memory Usage"     BIGINT,
"Schema"                TEXT,
"CTE Name"              TEXT
);

Would I be right that "Node Type" is the only column that can be NOT NULL?

Also, I'm thinking of making the Actual Startup Time and Actual Total Time 
columns into INTERVALs. The times are expressed in milliseconds, yes? I'm 
wondering if INTERVAL would be more convenient for querying…

Thanks,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simplifying replication

2010-11-06 Thread Dimitri Fontaine
Hannu Krosing ha...@2ndquadrant.com writes:
  To make pg_basebackup.py self-sufficient, it should also open a 2nd
  connection to the same master and make sure that all WAL files are
  copied for the duration of the base copy.

Done now, please have a look and try it if possible:

  https://github.com/dimitri/pg_basebackup

 btw, as next step you could backport this to 8.x and have most of the
 benefits of SR. It should not be very hard to keep track of wal position
 inside a pl/pythonu function and send one or more records back in form of

You now have the -x and -D options to set that up, but I didn't check
the backport part: it still depends on the pg_bb_list_files() function
to get the recursive listing of the pg_xlog directory, and it does that
using a WITH RECURSIVE query.

The way I did it is to only copy the (whole) WAL file again if its ctime
has changed since the last loop. Also, pg_basebackup won't start a backup
if you run it on its own; I don't think that's what you want here.

Oh, as I needed to fork() a process to take care of pg_xlog in a loop
while the base backup is still ongoing, I added a -j/--jobs option so
that you can hammer the master some more by having more than one process
doing the copying.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sorted writes for checkpoints

2010-11-06 Thread Jeff Janes
On Fri, Oct 29, 2010 at 6:17 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
 itagaki.takah...@gmail.com wrote:
 On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Simon's argument in the thread that the todo item points to
 (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
 basically that we don't know what the best algorithm is yet and benchmarking
 is a lot of work, so let's just let people do whatever they feel like until
 we settle on the best approach. I think we need to bite the bullet and do
 some benchmarking, and commit one carefully vetted patch to the backend.

 When I submitted the patch, I tested it on a disk-based RAID-5 machine:
 http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
 But there were no additional benchmarking reports at that time. We still
 need benchmarking before we re-examine the feature. For example, SSDs and
 SSD RAID were not popular at that time, but now they might be worth
 considering.

 There are really two separate things here:

 (1) trying to do all the writes to file A before you start doing
 writes to file B, and
 (2) trying to write out blocks to each file in ascending logical block
 number order

 I'm much more convinced of the value of #1 than I am of the value of
 #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
 meaningful way without fearing that we'll need to fsync the same file
 a second time for the same checkpoint.

If the OS/FS is behaving such that it is important to spread out
fsyncs, then wouldn't that same behavior also make it less important
to avoid fsync a second time?

If the OS is squirreling away a preposterous amount of dirty buffers
in its cache, and then panicking to dump them all when it gets the
fsync, then I think you would need to spread out the fsyncs within a
file, and not just between files.

 We've gotten some pretty
 specific reports of problems in this area recently, so it seems likely
 that there is some value to be had there.  On the other hand, #2 is
 only a win if sorting the blocks in numerical order causes the OS to
 write them in a better order than it would otherwise have done.

Assuming the ordering is useful, the only way the OS can do as good a
job as the checkpoint code can, is if the OS stores the entire
checkpoint worth of data as dirty blocks and doesn't start writing
until an fsync comes in.  This strikes me as a pathologically
configured OS/FS.  (And would explain problems with fsyncs)

 We've
 had recent reports that our block-at-a-time relation extension policy
 is leading to severe fragmentation on certain filesystems, so I'm a
 bit skeptical about the value of this (though, of course, that can be
 overturned if we can collect meaningful evidence).

Some filesystems are better about that than others.  It would probably
depend on the exact workload, and pgbench would probably favor large
contiguous extents to an unrealistic degree.  So I don't know the best
way to gather that evidence.



Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Tom Lane
I wrote:
 I don't know why icc is so much worse than gcc on this measure of
 stack depth consumption, but clearly the combination of that and
 the 100kB max_stack_depth explains why dugong is failing to do
 very many levels of recursion before erroring out.

I figured out why icc looked so much worse here: I had accidentally
built with optimization disabled.  Selecting -O2 causes its numbers
to come a lot closer to gcc's.  In particular, it flips around from
using more normal stack than register stack to using more register
stack than normal.  (This might be the case for gcc as well; I did
not test an unoptimized gcc build.)

This means that, at least for icc, *an optimized build is unsafe*
without code to check for register stack growth.  It turns out that
buildfarm member dugong has been building without optimization all
along, which is why we'd not noticed the issue.

I think it'd be a good idea for dugong to turn on optimization
so it's testing something closer to a production build.  However,
at this moment only HEAD is likely to pass regression tests with
that turned on.  We'd have to back-patch the just-committed code
for checking register stack growth before the back branches would
survive that.

I'm normally hesitant to back-patch code that might create portability
issues, but in this case perhaps it's a good idea.  Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improved parallel make support

2010-11-06 Thread Andrew Dunstan



On 11/06/2010 07:35 AM, Peter Eisentraut wrote:

So far, two machines have reported an older make version:

dawn_bat
narwhal

both of the mingw type.  Andrew, Dave, could you see about upgrading the
GNU make installation there?



dawn_bat is done.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sorted writes for checkpoints

2010-11-06 Thread Robert Haas
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 There are really two separate things here:

 (1) trying to do all the writes to file A before you start doing
 writes to file B, and
 (2) trying to write out blocks to each file in ascending logical block
 number order

 I'm much more convinced of the value of #1 than I am of the value of
 #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
 meaningful way without fearing that we'll need to fsync the same file
 a second time for the same checkpoint.

 If the OS/FS is behaving such that it is important to spread out
 fsyncs, then wouldn't that same behavior also make it less important
 to avoid fsync a second time?

 If the OS is squirreling away a preposterous amount of dirty buffers
 in its cache, and then panicking to dump them all when it gets the
 fsync, then I think you would need to spread out the fsyncs within a
 file, and not just between files.

Well, presumably, there's some amount of dirty data that the OS can
write out in one shot without causing a perceptible stall (otherwise
we're hosed no matter what).  The question is where that threshold is.
 With one fsync per file, we'll try to write at most 1 GB at once, and
often less.  As you say, there's a possibility that that's still too
much, but right now we could be trying to dump 30+ GB to disk if the
OS has a lot of dirty pages in the buffer cache, so getting down to 1
GB or less at a time should be a big improvement even if it doesn't
solve the problem completely.

 We've gotten some pretty
 specific reports of problems in this area recently, so it seems likely
 that there is some value to be had there.  On the other hand, #2 is
 only a win if sorting the blocks in numerical order causes the OS to
 write them in a better order than it would otherwise have done.

 Assuming the ordering is useful, the only way the OS can do as good a
 job as the checkpoint code can, is if the OS stores the entire
 checkpoint worth of data as dirty blocks and doesn't start writing
 until an fsync comes in.  This strikes me as a pathologically
 configured OS/FS.  (And would explain problems with fsyncs)

The OS would only need to store and reorder one file's worth of
blocks, if we wrote the data for one file and called fsync, wrote the
data for another file and called fsync, etc.
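
In rough code terms, idea #1 amounts to sorting the pending writes by file
and fsyncing at each file boundary.  A sketch, not bufmgr code
(write_buffer and fsync_file are hypothetical stand-ins):

#include <stdlib.h>

typedef struct DirtyBuf
{
    int     fileid;     /* stands in for the buffer tag's file identity */
    int     blocknum;
    /* ... page contents ... */
} DirtyBuf;

static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *da = (const DirtyBuf *) a;
    const DirtyBuf *db = (const DirtyBuf *) b;

    if (da->fileid != db->fileid)
        return (da->fileid < db->fileid) ? -1 : 1;
    /* idea #2 would additionally order by block number within a file */
    if (da->blocknum != db->blocknum)
        return (da->blocknum < db->blocknum) ? -1 : 1;
    return 0;
}

static void
checkpoint_write(DirtyBuf *bufs, int n,
                 void (*write_buffer) (DirtyBuf *),
                 void (*fsync_file) (int))
{
    int     i;

    qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);
    for (i = 0; i < n; i++)
    {
        write_buffer(&bufs[i]);
        /* last block of this file: fsync it (and optionally sleep here
         * to spread the fsyncs out) before starting on the next file */
        if (i == n - 1 || bufs[i + 1].fileid != bufs[i].fileid)
            fsync_file(bufs[i].fileid);
    }
}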

 We've
 had recent reports that our block-at-a-time relation extension policy
 is leading to severe fragmentation on certain filesystems, so I'm a
 bit skeptical about the value of this (though, of course, that can be
 overturned if we can collect meaningful evidence).

 Some filesystems are better about that than others.  It would probably
 depend on the exact workload, and pgbench would probably favor large
 contiguous extents to an unrealistic degree.  So I don't know the best
 way to gather that evidence.

Well, the basic idea would be to try some different workloads on
different filesystems and try go get some feeling for how often
sorting by block number wins and how often it loses.  But as I say I
think the biggest problem is that we're often trying to write too much
dirty data to disk at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IA64 versus effective stack limit

2010-11-06 Thread Robert Haas
On Sat, Nov 6, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 I don't know why icc is so much worse than gcc on this measure of
 stack depth consumption, but clearly the combination of that and
 the 100kB max_stack_depth explains why dugong is failing to do
 very many levels of recursion before erroring out.

 I figured out why icc looked so much worse here: I had accidentally
 built with optimization disabled.  Selecting -O2 causes its numbers
 to come a lot closer to gcc's.  In particular, it flips around from
 using more normal stack than register stack to using more register
 stack than normal.  (This might be the case for gcc as well; I did
 not test an unoptimized gcc build.)

 This means that, at least for icc, *an optimized build is unsafe*
 without code to check for register stack growth.  It turns out that
 buildfarm member dugong has been building without optimization all
 along, which is why we'd not noticed the issue.

 I think it'd be a good idea for dugong to turn on optimization
 so it's testing something closer to a production build.  However,
 at this moment only HEAD is likely to pass regression tests with
 that turned on.  We'd have to back-patch the just-committed code
 for checking register stack growth before the back branches would
 survive that.

 I'm normally hesitant to back-patch code that might create portability
 issues, but in this case perhaps it's a good idea.  Comments?

Yeah, I think it might be a good idea.  Crashing is bad.
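
For readers who haven't followed the earlier messages: IA64 keeps a
second, upward-growing register stack alongside the ordinary
downward-growing one, so a depth check based only on a local variable's
address misses register-stack growth entirely. A simplified sketch of
the kind of check being discussed follows; the identifier names
(stack_base_ptr, register_stack_base_ptr, and so on) are illustrative,
not necessarily the committed ones:

#include <stdbool.h>

/* Bases recorded near the top of main(); NULL until initialized. */
static char *stack_base_ptr = NULL;          /* ordinary stack base */
static char *register_stack_base_ptr = NULL; /* IA64 register stack base */

#if defined(__ia64__) || defined(__ia64)
/* Read ar.bsp, the backing-store pointer of the register stack. */
static __inline__ char *
ia64_get_bsp(void)
{
    char   *ret;

    __asm__ __volatile__(";;\n\tmov %0=ar.bsp\n" : "=r"(ret));
    return ret;
}
#endif

static bool
stack_is_too_deep(long max_stack_depth_bytes)
{
    char    stack_top_loc;
    long    depth;

    /* The ordinary stack grows downward from its base. */
    depth = (long) (stack_base_ptr - &stack_top_loc);
    if (depth < 0)
        depth = -depth;
    if (stack_base_ptr != NULL && depth > max_stack_depth_bytes)
        return true;

#if defined(__ia64__) || defined(__ia64)
    /* The register stack grows upward, so it must be checked too. */
    depth = (long) (ia64_get_bsp() - register_stack_base_ptr);
    if (register_stack_base_ptr != NULL && depth > max_stack_depth_bytes)
        return true;
#endif

    return false;
}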

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ALTER TABLE ... IF EXISTS feature?

2010-11-06 Thread Robert Haas
On Fri, Nov 5, 2010 at 7:49 PM, Daniel Farina drfar...@acm.org wrote:
 On Fri, Nov 5, 2010 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote:
 Can you give us a self-contained example of the problem you're talking about?

 Sure. Consider the following:

 CREATE TABLE t1 (
    id integer PRIMARY KEY
 );

 CREATE TABLE t2 (
    id integer PRIMARY KEY,
    fk integer
 );

 ALTER TABLE ONLY t2
    ADD CONSTRAINT t2_constr FOREIGN KEY (fk) REFERENCES t1(id);

 Try something like this:

 createdb foo
 psql -1f this_ddl.sql foo
 pg_dump --clean foo > cleaning_backup.sql
 # db wipe
 dropdb foo
 createdb foo
 psql -1f cleaning_backup.sql foo

 The last command will return non-zero and abort the xact early on,
 because of the following stanza in pg_dump --clean's output:

 ALTER TABLE ONLY public.t2 DROP CONSTRAINT t2_constr;
 ALTER TABLE ONLY public.t2 DROP CONSTRAINT t2_pkey;
 ALTER TABLE ONLY public.t1 DROP CONSTRAINT t1_pkey;
 DROP TABLE public.t2;
 DROP TABLE public.t1;

 Since there's no public.t1/t2, it's not possible to ALTER them.

 I'm not entirely sure why the DROP CONSTRAINTs on pkeys are being
 done, as they only introduce an internal (or is it auto?) style
 self-dependency. It is more obvious why foreign keys are dropped,
 which is to break up the dependencies so that tables can be dropped
 without CASCADE.

If we're going to try to fix this, we probably ought to try to make
sure that we are fixing it fairly completely.  How confident are you
that this is the only problem?

With respect to the syntax itself, I have mixed feelings.  On the one
hand, I'm a big fan of CREATE IF NOT EXISTS and DROP IF EXISTS
precisely because I believe they handle many common cases that people
want in real life without much hullabaloo.  But, there's clearly some
limit to what can reasonably be done this way.  At some point, what
you really want is some kind of meta-language where you can write
things like:

IF EXISTS TABLE t1 THEN
   ALTER TABLE t1 DROP CONSTRAINT IF EXISTS t1_constr;
END IF;

...or possibly something much more complicated, like checking whether
a table foo has a column called bar and if so doing nothing; but if
not, and a column called baz exists, renaming it to bar; and otherwise
adding a column called bar.

Since we now have PL/pgsql by default, we could possibly fix pg_dump
--clean by emitting a DO block, although the syntax for checking
existence of a table is none too pretty, and it would make pg_dump
--clean rely for correctness on plpgsql being installed, which might
be none too desirable.  It would actually be sort of spiffy to be able
to have some of the PL/pgsql control constructs available in straight
SQL, but I'm not expecting that to happen any time in the foreseeable
future.
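
For illustration, one possible shape for such a DO block, guarding one
of the problematic drops from the example above (a hand-written sketch,
not actual pg_dump output, and a demonstration of why the existence
check is none too pretty):

DO $$
BEGIN
    IF EXISTS (SELECT 1
               FROM pg_catalog.pg_class c
               JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
               WHERE n.nspname = 'public' AND c.relname = 't2') THEN
        ALTER TABLE ONLY public.t2 DROP CONSTRAINT t2_constr;
    END IF;
END
$$;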

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ALTER TABLE ... IF EXISTS feature?

2010-11-06 Thread Daniel Farina
On Sat, Nov 6, 2010 at 6:09 PM, Robert Haas robertmh...@gmail.com wrote:
 If we're going to try to fix this, we probably ought to try to make
 sure that we are fixing it fairly completely.  How confident are you
 that this is the only problem?

I haven't tried to isolate problems on really complicated schemas yet,
but I can tell you what I did do: I went through pg_dump and tried to
find as many parts of the code that added dropStmt to DumpableObject
nodes as possible (in the dumpFoo family of functions). ALTER seemed
like the only interesting bit so far.

But we can give the mechanism a try with, say, a PL/pgsql hack for a
little while to see if it basically gets the job done in sort-of real
life. I think that's not a bad idea; I'll see what I can do about
that. (Of course, continue to share your suggestions and revelations;
I'd appreciate it.)

fdr



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Gurjeet Singh
On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh singh.gurj...@gmail.com writes:
  .) The basic idea is to have a magic number in every PageHeader before
  it is written to disk, and check for this magic number when performing
  page validity checks.

 Um ... and exactly how does that differ from the existing behavior?


Right now a zero-filled page is considered valid, and is treated as a new
page; see the /* Check all-zeroes case */ branch in PageHeaderIsValid(),
and PageIsNew(). This means that looking at a zero-filled page on disk
(say, after a crash) does not give us any clue whether it was indeed left
zeroed by Postgres, or whether the FS/storage failed to do its job.

With the proposed change, if it is a valid page (a page actually written by
Postgres) it will either have a sensible LSN or the magic LSN; the LSN will
never be zero. OTOH, if we encounter a zero-filled page (i.e., LSN = {0,0})
it clearly would implicate elements outside Postgres in making that page
zero.


 The amount of fragility introduced by the assumptions you have to make
 for this seems to me to be vastly riskier than the risk you are trying
 to respond to.


I understand that it is a pretty low-level change, but IMHO the change is
minimal and is being applied in well-understood places. All the assumptions
listed have held for quite a while, and I don't see them being affected in
the near future. The most crucial assumptions we have to work with are that
XLogPtr{n, 0x} will never be used, and that mdextend() is the only place
that extends a relation (until we implement an md.c sibling, say flash.c or
tape.c; the last change to md.c regarding mdextend() was in January 2007).

Only mdextend() and PageHeaderIsValid() need to know about this change in
behaviour; all the other APIs behave the same as they do now.

This change would increase the diagnosability of zero-page issues, and help
users point fingers at the right places.

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Greg Stark
On Sun, Nov 7, 2010 at 4:23 AM, Gurjeet Singh singh.gurj...@gmail.com wrote:
 I understand that it is a pretty low-level change, but IMHO the change is
 minimal and is being applied in well-understood places. All the assumptions
 listed have held for quite a while, and I don't see them being affected in
 the near future. The most crucial assumptions we have to work with are that
 XLogPtr{n, 0x} will never be used, and that mdextend() is the only place
 that extends a relation (until we implement an md.c sibling, say flash.c or
 tape.c; the last change to md.c regarding mdextend() was in January 2007).

I think the assumption that isn't tested here is what happens if the
server crashes. The logic may work fine as long as nothing goes wrong,
but if something does, it has to be fool-proof.

I think having zero-filled blocks at the end of the file if it has
been extended but hasn't been fsynced is an expected failure mode of a
number of filesystems. The log replay can't assume seeing such a block
is a problem since that may be precisely the result of the crash that
caused the replay. And if you disable checking for this during WAL
replay then you've lost your main chance to actually detect the
problem.

Another issue -- though I think a manageable one -- is that I expect
we'll want to be using posix_fallocate() sometime soon. That will
allow efficient, guaranteed pre-allocated space with better contiguous
layout than we get currently. But ext4 can only pretend to give
zero-filled blocks, not any arbitrary bitpattern we request. I can see
this being an optional feature that is just not compatible with using
posix_fallocate(), though.

It does seem like this is kind of part and parcel of adding checksums
to blocks. It's arguably kind of silly to add checksums to blocks but
have a commonly produced bitpattern in corruption cases go
undetected.

-- 
greg
