Re: [HACKERS] SQL/MED estimated time of arrival?
2010/11/5 Shigeru HANADA han...@metrosystems.co.jp: On Fri, 5 Nov 2010 16:27:49 +0900 Itagaki Takahiro itagaki.takah...@gmail.com wrote: PL/Proxy has similar functionality with RUN ON ALL to start queries in parallel. So, I think it's an infrastructure commonly required. Reading the PL/Proxy source, I noticed its lack of consideration for cache invalidation; thanks for mentioning PL/Proxy. :-) And if we really make this async query come true, I suggest designing resource (i.e. remote connection) management very carefully. When the executor fails in the middle of its execution, it may fail to release its own resources; close() in ExecutorEnd() will never be called. As far as I know, files and memory are released automatically in the current mechanism, but the MED APIs will use their own resources beyond those. Regards, -- Hitoshi Harada
Re: [HACKERS] Make versus effective stack limit in regression tests
On 06.11.2010 00:39, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 11/05/2010 05:45 PM, Tom Lane wrote: Anyway, what this points up is that we are making a very conservative assumption about what to do when getrlimit() returns RLIM_INFINITY. It does not seem real reasonable to interpret that as 100kB on any modern platform. I'm inclined to interpret it as 4MB, which is the same default stack limit that we use on Windows. +1. After looking a bit closer, I think the real problem is that get_stack_depth_rlimit's API fails to distinguish between unknown and unlimited. In the first case we ought to have a conservative default, whereas in the second case not. It's already the case that (a) max_stack_depth is a SUSET parameter, and (b) for either unknown or unlimited RLIMIT_STACK, we will let a superuser set whatever value he wants, and it's on his head whether that value is safe or not. That part of the behavior seems OK. What's not OK is using the same built-in default value in both cases. We need to fix it so that InitializeGUCOptions can tell the difference. If it can, I think the current default of 2MB is OK --- most people will be fine with that, and those who aren't can select some other value. Yeah, I bumped into this two years ago but it didn't lead to a patch: http://archives.postgresql.org/pgsql-hackers/2008-07/msg00918.php +1 on choosing 2MB for RLIM_INFINITY. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
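To make the unknown-vs-unlimited distinction concrete, here is a rough sketch of the shape such a fix could take; this is an illustration of the idea under discussion, not the committed change:

    #include <limits.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /*
     * Sketch: return the stack rlimit in bytes, LONG_MAX if unlimited,
     * or -1 if it cannot be determined.  The caller would apply a
     * conservative built-in default only in the -1 (unknown) case,
     * and let the superuser take responsibility in the unlimited case.
     */
    static long
    get_stack_depth_rlimit(void)
    {
    #if defined(RLIMIT_STACK)
        struct rlimit rlim;

        if (getrlimit(RLIMIT_STACK, &rlim) < 0)
            return -1;                  /* unknown */
        if (rlim.rlim_cur == RLIM_INFINITY)
            return LONG_MAX;            /* unlimited */
        return (long) rlim.rlim_cur;
    #else
        return -1;                      /* no getrlimit: unknown */
    #endif
    }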
[HACKERS] Protecting against unexpected zero-pages: proposal
A customer of ours is quite bothered about finding zero pages in an index after a system crash. The task now is to improve the diagnosability of such an issue and be able to definitively point to the source of zero pages. The proposed solution below has been vetted in-house at EnterpriseDB, and I am posting it here to see any possible problems we missed, and also to see if the community would be interested in incorporating this capability.

Background:
-----------
SUSE Linux, ATCA board, 4 dual-core CPUs = 8 cores, 24 GB RAM, 140 GB disk, PG 8.3.11. RAID-1 SAS, with SCSI info reporting that write-caching is disabled. The corrupted index's file contents, based on hexdump: it has a total of 525 pages (cluster block size is 8K, per pg_controldata).

Blocks 0 to 278 look sane.
Blocks 279 to 518 are full of zeroes.
Blocks 519 to 522 look sane.
Block 523 is filled with zeroes.
Block 524 looks sane.

The tail ends of blocks 278 and 522 have some non-zero data, meaning that those index pages have some valid 'Special space' contents. Also, the heads of blocks 519 and 524 look sane. These two findings imply that the zeroing happened at an 8K page boundary. This is a standard ext3 FS with 4K block size, so this raises the question of how we can ascertain that this was indeed a hardware/FS malfunction. And if it was a hardware/FS problem, then why didn't we see zeroes at a 1/2K boundary (generally the disk's sector size) or a 4K boundary (the default ext3 FS block size) that does not align with an 8K boundary? The backup from before the crash does not have these zero pages.

Disk Page Validity Check Using Magic Number
===========================================
Requirement: We have encountered quite a few zero pages in an index after a machine crash, causing this index to be unusable. Although REINDEX is an option, we have no way of telling whether these zero pages were caused by the hardware, the filesystem, or Postgres. Postgres code analysis shows that Postgres being the culprit is a very low probability; similarly, since our hardware is also considered good quality, with hardware-level RAID-1 over 2 disks, it is difficult to consider the hardware to be the problem. The ext3 filesystem is also quite a time-tested piece of software, hence it becomes very difficult to point fingers at any of these 3 components for this corruption. Postgres is being deployed as a component of a carrier-grade platform, and it is required to run unattended as much as possible. There is a High Availability monitoring component that is tasked with performing switchover to a standby node in the event of any problem with the primary node. This HA component needs to perform regular checks on the health of all the other components, including Postgres, and take corrective actions. With the zero pages comes the difficulty of ascertaining whether these are legitimate zero pages, since Postgres considers zero pages valid (maybe leftover from a previous extend-file followed by a crash), or whether these zero pages are the result of an FS/hardware failure. We are required to definitively differentiate between zero pages from Postgres vs. zero pages caused by hardware failure. Obviously this is not possible given the very nature of the problem, so we explored a few ideas, including per-block checksums (in-block or in a checksum fork), S.M.A.R.T. monitoring of disk drives, PageInit() before smgrextend() in ReadBuffer_common(), and an additional member in PageHeader for a magic number.
Following is the approach which we think is least invasive and does not threaten code breakage, yet provides definitive detection of corruption/data loss outside Postgres with the least performance penalty.

Implementation:
---------------
.) The basic idea is to have a magic number in every PageHeader before it is written to disk, and to check for this magic number when performing page validity checks.
.) To avoid adding a new field to PageHeader, and any code breakage, we reuse an existing member of the structure.
.) We exploit the following facts and assumptions:
  -) Relations/files are extended 8 KB (BLCKSZ) at a time.
  -) Every I/O unit contains a PageHeader structure (table/index/fork files), which in turn contains pd_lsn as its first member.
  -) Every newly written block is considered to be zero-filled.
  -) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
  -) PageHeaderIsValid() allows zero-filled pages to be considered valid.
  -) Anyone wishing to use a new page has to do PageInit() on the page.
  -) PageInit() does a MemSet(0) on the whole page.
  -) XLogRecPtr={x,0} is considered invalid.
  -) XLogRecPtr={x, ~((uint32)0)} is not valid either (i.e., the last byte of an xlog file (not segment)); we'll use this as the magic number. ... The above is my assumption, since it is not mentioned anywhere in the code. The XLogFileSize calculation seems to support this assumption. ... If this assumption doesn't hold good,
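For concreteness, a rough sketch of the two touch points the proposal implies; this is my illustration of the idea, not a tested patch, and the function names are invented:

    #include "postgres.h"
    #include "storage/bufpage.h"

    /*
     * Sketch only: the proposed magic LSN, the last byte of a logical
     * xlog file, which the proposal assumes normal WAL activity never
     * produces.
     */
    #define MAGIC_XRECOFF  (~((uint32) 0))

    /* mdextend() would stamp every new page it writes out: */
    static void
    stamp_magic_lsn(Page page)
    {
        PageHeader  p = (PageHeader) page;

        p->pd_lsn.xlogid = 0;
        p->pd_lsn.xrecoff = MAGIC_XRECOFF;
    }

    /*
     * ...and PageHeaderIsValid() could then distinguish "new page we
     * wrote ourselves" from "zeroes somebody else put there":
     */
    static bool
    page_written_by_postgres(PageHeader p)
    {
        /* carries the magic stamp, or a real LSN set at WAL-flush time */
        return p->pd_lsn.xrecoff == MAGIC_XRECOFF ||
               p->pd_lsn.xlogid != 0 ||
               p->pd_lsn.xrecoff != 0;
    }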
Re: [HACKERS] temporary functions (and other object types)
On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote: On Fri, Nov 5, 2010 at 4:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: The latter is an intentional security feature and will not get changed. I see that there could be a problem here with SECURITY DEFINER functions, but I'm not clear whether it goes beyond that? IIRC it's because even unprivileged users can make things in the pg_temp schema and it's implicitly at the front of the search_path. There was a CVE about this a while back, no? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle
Re: [HACKERS] improved parallel make support
On ons, 2010-11-03 at 16:34 +0200, Peter Eisentraut wrote: On tis, 2010-11-02 at 10:21 -0400, Tom Lane wrote: Do we have a handle on how many buildfarm members this will break? I suppose we don't. One way to find out would be to commit just this bit

+# We need the $(eval) function, which is available in GNU make 3.80.
+# That also happens to be the version where the .VARIABLES variable
+# was introduced, so this is a simple check.
+ifndef .VARIABLES
+$(error GNU make 3.80 or newer is required)
+endif

with a $(warning) instead, and let it run for a bit. So far, two machines have reported an older make version: dawn_bat and narwhal, both of the mingw type. Andrew, Dave, could you see about upgrading the GNU make installation there? There are a few machines that haven't built in five days or more, but based on their operating system version, it is fairly safe to assume that they have an up-to-date version.
Re: [HACKERS] How can we tell how far behind the standby is?
On 6 November 2010 05:46, Josh Berkus j...@agliodbs.com wrote: I'm continuing in my efforts now to document how to deploy and manage replication on our wiki. One of the things a DBA needs to do is to use pg_current_xlog_location() (and related functions) to check how far behind the master the standby is. However, there are some serious problems with that: (1) comparing these numbers is quite mathematically complex -- and, for that matter, undocumented. Our solution to this was to strip the slash out of the numbers and then feed them to `bc` for comparison. The shell script for our zabbix item looks something like this:

    #!/bin/bash
    errval=-1
    primary=$(psql -At -h $1 -p $2 -c "SELECT replace(pg_current_xlog_location(), '/', '');" postgres)
    standby=$(psql -At -h $3 -p $4 -c "SELECT replace(pg_last_xlog_receive_location(), '/', '');" postgres)
    if [ -n "$primary" -a -n "$standby" ]
    then
        # set obase before ibase; otherwise "10" is read as hex 10 = 16
        # and the difference comes back in hex rather than decimal
        echo $(echo "obase=10; ibase=16; $primary-$standby" | bc)
    else
        echo $errval
    fi

I'm posting this snippet a) in the hopes that it might help others, and b) by way of agreement with Josh's point. Requiring every user who wants to monitor replication to set something like this up for themselves is ... not awesome. Cheers, BJ
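For what it's worth, the arithmetic can also be made explicit rather than done by string surgery; the concatenation trick additionally assumes the part after the slash is zero-padded to eight hex digits, which the server's %X/%X output format does not guarantee. A standalone sketch, assuming each logical xlog file holds 0xFF000000 usable bytes (the XLogFileSize convention noted in the zero-pages thread above):

    #include <stdio.h>

    /*
     * Compute the byte lag between two xlog locations of the form "X/Y".
     * Assumes the layout in which each logical xlog file holds
     * 0xFF000000 bytes (the last 16MB segment of each file is unused),
     * which naive 2^32-based concatenation glosses over.
     */
    static long long
    xlog_lag_bytes(const char *primary, const char *standby)
    {
        unsigned int p_hi, p_lo, s_hi, s_lo;

        if (sscanf(primary, "%X/%X", &p_hi, &p_lo) != 2 ||
            sscanf(standby, "%X/%X", &s_hi, &s_lo) != 2)
            return -1;

        return ((long long) p_hi - s_hi) * 0xFF000000LL +
               ((long long) p_lo - (long long) s_lo);
    }

    int
    main(void)
    {
        /* example: one 16MB segment of lag */
        printf("%lld\n", xlog_lag_bytes("2/A0000000", "2/9F000000"));
        return 0;
    }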
Re: [HACKERS] Fix for seg picksplit function
Do you think the patch is now ready for committer, or does it require further review by you or somebody else? With best regards, Alexander Korotkov.
[HACKERS] SQL functions that can be inlined
Is there any way to have the database tell you if a particular SQL function can be inlined? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Fix for seg picksplit function
Alexander Korotkov wrote: Do you think the patch is now ready for committer, or does it require further review by you or somebody else? It's probably ready for committer; however, the code doesn't yet mention any reference or other information showing that it is faster than the original. I was wondering how you discovered this, or is there any reference to e.g. a GiST paper or other work where this is researched? If the info is available, some comments in the code might help future GiST developers pick the right algorithm for other datatypes. I don't think further review is required, but I would very much welcome further exploration of which picksplit algorithms match which datatypes in which distributions best. regards, Yeb Havinga
Re: [HACKERS] SQL/MED estimated time of arrival?
Hitoshi Harada umi.tan...@gmail.com writes: And if we really make this async query come true, I suggest designing resource (i.e. remote connection) management very carefully. When the executor fails in the middle of its execution, it possibly fails to release its own resource; close() in ExecutorEnd() will never be called. As far as I know files and memory are released automatically in the current mechanism, but MED APIs will use their own resources other than them. The way to fix that is for the FDW to hook into the ResourceOwner mechanism (via RegisterResourceReleaseCallback). Then it can track and clean up things it knows about just as automatically as anything else is. Of course, if you lose your network connection to the remote DB, you have to assume it will clean up of its own accord. regards, tom lane
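As a rough illustration of Tom's suggestion (a sketch only; the connection type and disconnect call are invented placeholders, not any real FDW's API):

    #include "postgres.h"
    #include "utils/resowner.h"

    /* Hypothetical FDW connection handle; remote_disconnect() stands in
     * for whatever the remote library actually provides. */
    typedef struct RemoteConn RemoteConn;
    extern void remote_disconnect(RemoteConn *conn);

    static RemoteConn *fdw_conn = NULL;

    /*
     * Invoked by the resource-owner machinery at (sub)transaction end,
     * commit or abort, so the connection is reclaimed even when an error
     * prevents ExecutorEnd() from ever running.  A real FDW would track
     * per-ResourceOwner state instead of a single static pointer.
     */
    static void
    fdw_resource_callback(ResourceReleasePhase phase,
                          bool isCommit, bool isTopLevel, void *arg)
    {
        if (phase == RESOURCE_RELEASE_AFTER_LOCKS &&
            !isCommit && fdw_conn != NULL)
        {
            remote_disconnect(fdw_conn);
            fdw_conn = NULL;
        }
    }

    /* registered once, e.g. when the FDW module is loaded */
    void
    _PG_init(void)
    {
        RegisterResourceReleaseCallback(fdw_resource_callback, NULL);
    }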
[HACKERS] PL/pgSQL and shared_preload_libraries
Since we now install PL/pgSQL by default, should we configure shared_preload_libraries to preload PL/pgSQL? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] temporary functions (and other object types)
Martijn van Oosterhout klep...@svana.org writes: On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote: I see that there could be a problem here with SECURITY DEFINER functions, but I'm not clear whether it goes beyond that? IIRC it's because even unprivileged users can make things in the pg_temp schema and it's implicitly at the front of the search_path. There was a CVE about this a while back, no? Yeah, we changed that behavior as part of the fix for CVE-2007-2138. You'd need either SECURITY DEFINER functions or very careless use of SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable. regards, tom lane
Re: [HACKERS] Protecting against unexpected zero-pages: proposal
Gurjeet Singh singh.gurj...@gmail.com writes: .) The basic idea is to have a magic number in every PageHeader before it is written to disk, and check for this magic number when performing page validity checks. Um ... and exactly how does that differ from the existing behavior? .) To avoid adding a new field to PageHeader, and any code breakage, we reuse an existing member of the structure. The amount of fragility introduced by the assumptions you have to make for this seems to me to be vastly riskier than the risk you are trying to respond to. regards, tom lane
Re: [HACKERS] SQL functions that can be inlined
Jim Nasby j...@nasby.net writes: Is there any way to have the database tell you if a particular SQL function can be inlined? Easiest way is to EXPLAIN a query using it and see if it did get inlined. For example,

    regression=# create function foo(int) returns int as
    regression-# 'select $1 + 1' language sql;
    CREATE FUNCTION
    regression=# explain verbose select foo(f1) from int4_tbl;
                               QUERY PLAN
    ---------------------------------------------------------------
     Seq Scan on public.int4_tbl  (cost=0.00..1.06 rows=5 width=4)
       Output: (f1 + 1)
    (2 rows)

    regression=# create function foo2(int) returns int as
    'select $1 + 1 limit 1' language sql;
    CREATE FUNCTION
    regression=# explain verbose select foo2(f1) from int4_tbl;
                               QUERY PLAN
    ---------------------------------------------------------------
     Seq Scan on public.int4_tbl  (cost=0.00..2.30 rows=5 width=4)
       Output: foo2(f1)
    (2 rows)

regards, tom lane
Re: [HACKERS] PL/pgSQL and shared_preload_libraries
Bruce Momjian br...@momjian.us writes: Since we now install PL/pgSQL by default, should we configure shared_preload_libraries to preload PL/pgSQL? I don't think that follows. The fact that it's there doesn't mean everyone is using it. In any case, I've seen no evidence that says you'd get a performance win this way. The preload feature is meant for modules that do a substantial amount of work at load time, which plpgsql does not. regards, tom lane
[HACKERS] Should we use make -k on the buildfarm?
Peter Eisentraut has suggested that we should run make -k instead of plain make for most or all of the buildfarm steps. This flag essentially instructs make to keep going rather than fail at the first error. We haven't done that for the last five or six years that the buildfarm has been running, and nobody up to now has complained (that I recall). I don't have any great objection, but before I make this change I thought it might be as well to canvass a wider range of opinion. So, does anyone else have thoughts about it? cheers andrew
Re: [HACKERS] psycopg and two phase commit
On Fri, Nov 5, 2010 at 5:16 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2010/11/4 Daniele Varrazzo daniele.varra...@gmail.com: Just wanted to warn you that I have implemented the 2pc protocol in psycopg. I read the notice, but I didn't find a download link; where is it, please? We have just released a beta package including this and other features. All the details at http://initd.org/psycopg/articles/2010/11/06/psycopg-230-beta1-released/. Cheers -- Daniele
Re: [HACKERS] Simplifying replication
Hannu Krosing ha...@2ndquadrant.com writes: To make pg_basebackup.py self-sufficient it should also open 2nd connection to the same master and make sure that all WAL files are copied for the duration of base copy. Excellent idea, will make that happen soon'ish. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Should we use make -k on the buildfarm?
Andrew Dunstan and...@dunslane.net writes: Peter Eisentraut has suggested that we should run make -k instead of plain make for most or all of the buildfarm steps. This flag essentially instructs make to keep going rather than fail at the first error. We haven't done that for the last five or six years that the buildfarm has been running, and nobody up to now has complained (that I recall). I don't have any great objection, but before I make this change I thought it might be as well to canvass a wider range of opinion. So, does anyone else have thoughts about it? I don't really care about make -k as such. What I *have* occasionally wished for is that the buildfarm script would act more like make -k with respect to the various test stages. That is, not abandon the whole test after one stage fails, but allow stages that don't logically depend on the failed one to proceed. But I'm not sure how hard that would be --- quite aside from coding complexity, it would mean that you could have more than one failing stage, and I don't know how you'd show that in the dashboard. Anyway, no objection to Peter's request; but there's another TODO item for your buildfarm list, if you want to accept it. regards, tom lane
Re: [HACKERS] temporary functions (and other object types)
On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote: Martijn van Oosterhout klep...@svana.org writes: On Fri, Nov 05, 2010 at 09:01:50PM -0400, Robert Haas wrote: I see that there could be a problem here with SECURITY DEFINER functions, but I'm not clear whether it goes beyond that? IIRC it's because even unprivileged users can make things in the pg_temp schema and it's implicitly at the front of the search_path. There was a CVE about this a while back, no? Yeah, we changed that behavior as part of the fix for CVE-2007-2138. You'd need either SECURITY DEFINER functions or very careless use of SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable. Would it be practical to let foo() potentially mean pg_temp.foo() outside of any SECURITY DEFINER context? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Query Plan Columns
On Fri, Nov 05, 2010 at 01:39:07PM -0700, Josh Berkus wrote: Of course, there are containers too, which are not in your list at all. How do you intend to represent the tree-ish structure in a flat table? Andrew: we'll use a proximity tree. Adjacency list? If so, in my experience, it's best to separate the node descriptions from the adjacency list that links them together. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
[HACKERS] IA64 versus effective stack limit
Sergey was kind enough to lend me use of buildfarm member dugong (IA64, Debian Etch) so I could poke into why its behavior in the recursion-related regression tests was so odd. I had previously tried and failed to reproduce the behavior on a Red Hat IA64 test machine (running RHEL of course) so I was feeling a bit baffled. Here's what I found out: 1. Debian Etch has the make-resets-the-stack-rlimit bug that I reported about yesterday, whereas the RHEL version I was testing had the fix for that. So that's why I couldn't reproduce max_stack_depth getting set to 100kB. 2. IA64 is a very weird architecture: it has two separate hardware stacks. One is reserved for saving registers, which IA64 has got a lot of, and the other normal stack holds everything else. The method we use in check_stack_depth (ie, measure the difference in addresses of local variables) effectively measures the depth of the normal stack. I don't know of any simple way to find out the depth of the register stack. You can get gdb to tell you about both stacks, though. I found out that with PG HEAD, the recursion distance for the infinite_recurse() regression test is 160 bytes of normal stack and 928 bytes of register stack per fmgr_sql call level. This is with gcc (I got identical numbers on dugong and the RHEL machine). But, if you build PG with icc as the buildfarm critter is doing, that bloats to 3232 bytes of normal stack and 2832 bytes of register stack. For comparison, my x86_64 Fedora 13 box uses 704 bytes of stack per recursion level. I don't know why icc is so much worse than gcc on this measure of stack depth consumption, but clearly the combination of that and the 100kB max_stack_depth explains why dugong is failing to do very many levels of recursion before erroring out. Fixing get_stack_depth_rlimit as I proposed yesterday should give it a reasonable stack depth. However, we're not out of the woods yet. Because check_stack_depth is only checking the normal stack depth, and the two stacks don't grow at the same rate, it's possible for a crash to occur due to running out of register stack space. We haven't seen that happen on dugong because, as shown above, with icc the register stack grows more slowly than the normal stack (at least for the specific functions we care about here). But with gcc, the same code eats register stack a lot faster than normal stack --- and in fact I observed a crash in the infinite_recurse() test when building with gcc and testing in a manually-started postmaster. The manually-started postmaster was under ulimit -s 8MB, which apparently Debian interprets as 8MB for normal stack and another 8MB for register stack. Even though check_stack_depth was trying to constrain the normal stack to just 2MB, the register stack grew 5.8 times faster and so blew through 8MB before check_stack_depth thought there was a problem. Raising ulimit -s allowed it to work. (Curiously, I did *not* see the same type of crash on the RHEL machine. I surmise that Red Hat has tweaked the kernel to allow the register stack to grow more than the normal stack, but I haven't tried to verify that.) So this means we have a problem. To some extent it's new in HEAD: before the changes I made last week to not keep a local FunctionCallInfoData in ExecMakeFunctionResult, there would have been at least another 900 bytes of normal stack per recursion level, so even with gcc the register stack would grow slower than normal stack in this test, and you wouldn't have seen any crash in the regression tests. 
But I'm sure there are lots of other potentially recursive routines in PG where register stack could grow faster than normal stack, so we shouldn't suppose that this fmgr_sql recursion is the only trouble spot. As I said above, I don't know of any good way to measure register stack depth directly. It's probably possible to find out by asking the kernel or something like that, but we surely do not want to introduce a kernel call into check_stack_depth(). So a good solution for this is hard to see. The best idea I have at the moment is to reduce the reported stack limit by some arbitrary factor, ie do something like

    #ifdef __IA64__
        val /= 8;
    #endif

in get_stack_depth_rlimit(). Anyone have a better idea? BTW, this also suggests to me that it'd be a real good idea to have a buildfarm critter for IA64+gcc --- the differences between gcc and icc are clearly pretty significant on this hardware. regards, tom lane
Re: [HACKERS] temporary functions (and other object types)
Robert Haas robertmh...@gmail.com writes: On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote: Yeah, we changed that behavior as part of the fix for CVE-2007-2138. You'd need either SECURITY DEFINER functions or very careless use of SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable. Would it be practical to let foo() potentially mean pg_temp.foo() outside of any SECURITY DEFINER context? Doesn't seem like a particularly good idea for the search semantics to be randomly different inside a SECURITY DEFINER function. In fact, I'll bet you could construct an attack in the reverse direction: S.D. function thinks it is calling a temp function (using syntax that works fine when not S.D.), but control gets sent to a non-temp function belonging to $badguy instead. regards, tom lane
Re: [HACKERS] Simplifying replication
On Sat, 2010-11-06 at 18:02 +0100, Dimitri Fontaine wrote: Hannu Krosing ha...@2ndquadrant.com writes: To make pg_basebackup.py self-sufficient it should also open a 2nd connection to the same master and make sure that all WAL files are copied for the duration of the base copy. Excellent idea, will make that happen soon'ish. Until I learned better, I thought that this is how SR is supposed to work ;) btw, as a next step you could backport this to 8.x and have most of the benefits of SR. It should not be very hard to keep track of the WAL position inside a pl/pythonu function and send one or more records back in the form of (walfile_name text, start_pos int, data bytea), and then call this function from the client every second to keep possible data loss down to 1 sec. This lets you set up a warm standby with nothing more than postgresql superuser access to the master (assuming the right defaults in postgresql.conf). -- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
Re: [HACKERS] Query Plan Columns
On Nov 5, 2010, at 1:42 PM, David E. Wheeler wrote: http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/commands/explain.c;h=f494ec98e510c23120e072bd5ee8821ea12738a4;hb=HEAD#l617 Ah, great, thanks. So based on this, I've come up with: Node Type TEXT, Strategy TEXT, Operation TEXT, Startup Cost FLOAT, Total Cost FLOAT, Plan Rows FLOAT, Plan Width INTEGER, Actual Startup Time FLOAT, Actual Total Time FLOAT, Actual Rows FLOAT, Actual Loops FLOAT, Parent Relationship TEXT, Sort Key TEXT[], Sort Method TEXT[], Sort Space Used BIGINT, Sort Space Type TEXT, Join Type TEXT, Join Filter TEXT, Hash Cond TEXT, Relation Name NAME, Alias NAME, Scan Direction TEXT, Index Name TEXT, Index Cond TEXT, Recheck Cond TEXT, TID Cond TEXT, Merge Cond TEXT, Subplan Name TEXT, Function Name TEXT, Function Call TEXT, Filter TEXT, One-Time Filter TEXT, Command TEXT, Shared Hit Blocks BIGINT, Shared Read Blocks BIGINT, Shared Written Blocks BIGINT, Local Hit Blocks BIGINT, Local Read Blocks BIGINT, Local Written Blocks BIGINT, Temp Read Blocks BIGINT, Temp Written Blocks BIGINT, Output TEXT[], Hash Buckets BIGINT, Hash Batches BIGINT, Original Hash Batches BIGINT, Peak Memory Usage BIGINT, Schema TEXT, CTE Name TEXT. Does that seem reasonable? Am I missing anything obvious? Thanks, David
Re: [HACKERS] Should we use make -k on the buildfarm?
On 11/06/2010 01:07 PM, Tom Lane wrote: What I *have* occasionally wished for is that the buildfarm script would act more like make -k with respect to the various test stages. That is, not abandon the whole test after one stage fails, but allow stages that don't logically depend on the failed one to proceed. But I'm not sure how hard that would be --- quite aside from coding complexity, it would mean that you could have more than one failing stage, and I don't know how you'd show that in the dashboard. That would be a significant architectural change. I'm not sure how many steps we could run this way. Here's the list of tests from a recent run, leaving out stopping and starting the installed postmaster, and locale specifiers:

    SCM-checkout
    configure
    make
    check
    make-contrib
    make-install
    install-contrib
    initdb
    install-check
    pl-install-check
    contrib-install-check
    ecpg-check

Currently, the implied dependency list is in this order. We could have make-contrib depend only on make rather than check, pl-install-check and contrib-install-check depend only on initdb, and ecpg-check depend on make rather than anything that comes after. I think that's about the limit of what we could sensibly relax. I'm not sure that would be a great advance. Certainly, right now I'm going to be putting effort into the FTS stuff, which I think should be much higher up your list of wants. cheers andrew
Re: [HACKERS] Should we use make -k on the buildfarm?
Andrew Dunstan and...@dunslane.net writes: On 11/06/2010 01:07 PM, Tom Lane wrote: What I *have* occasionally wished for is that the buildfarm script would act more like make -k with respect to the various test stages. I'm not sure that would be a great advance. Certainly, right now I'm going to be putting effort into the FTS stuff which I think should be much higher up your list of wants. Agreed, that would be far more useful. regards, tom lane
[HACKERS] knngist questions
I'm gradually slogging my way through the KNNGIST patches which were posted here: http://archives.postgresql.org/pgsql-hackers/2010-07/msg01183.php I have a couple of conceptual questions.

1. Is KNNGIST intended to work if there's more than one pathkey? If so, how? Example:

    SELECT * FROM tab ORDER BY this_point <-> '(0,0)', this_point <-> '(1,1)'

As far as I can see, there's nothing in match_pathkey_to_index() which would prevent lists of pathkeys and sortclauses of length 2 from being constructed, but can GIST actually handle that? It seems hard. If it can't, which I suspect is the case, then we can make this logic a lot simpler by removing one of the loops from match_pathkey_to_index() and bailing out quickly whenever list_length(root->query_pathkeys) != 1.

2. I'm concerned by the fact that the new consistent() methods only return 0 or -1.0 for non-leaf pages. If we have a query like:

    SELECT * FROM tab WHERE this_point <-> '(0,0)'

...it seems that every non-leaf page will be consistent and we'll end up loading every key in the entire index into an RBTree, which doesn't seem practical from a memory-usage perspective. Maybe I'm misunderstanding something, but it seems like in a case like this you'd need to have the consistent function for each page return a minimum and maximum distance to '(0,0)'. If you have one page that has a minimum distance of 0 and a maximum distance of 100 and another page which has a minimum distance of 101 and a maximum distance of 200, you don't even need to look at the second page until all the keys from the first one have been processed. If there's a third page with a minimum distance of 50 and a maximum distance of 150, you can return the tuples from the first page with distances less than 50, in order; then you can do a merge between the remaining keys on that page and the keys on the third page with values <= 100; then you can do a merge between the remaining keys on that page and those on the second page with values <= 150; and finally you can return the remaining keys on the second page in order. That still doesn't seem to provide any particularly tight bound on memory usage, but it's certainly way better than traversing the entire tree up front. If you are already doing something like this somewhere in the code, please point me in the right direction...

3. I've been scratching my head over the following bit of code and it doesn't make any sense to me. As far as I can tell, this is effectively comparing the number of columns in the ORDER BY clause to the number of restriction clauses applicable to the relation being scanned. Those two quantities don't seem to have much to do with each other, so either I'm confused or the code is. It doesn't seem like it should matter anyway, since I don't think we're planning on any AMs being both amoptionalkey and amcanorderbyop.

    + if (list_length(restrictclauses) < indexcol && !index->amoptionalkey)
    +     break;

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
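As an aside, the ordered traversal described in point 2 is essentially classic best-first search. Here is a toy, self-contained model of the strategy (my illustration, not code from the patch): pages and tuples share one frontier ordered by minimum possible distance, so tuples stream out in sorted order while memory stays bounded by the frontier rather than the whole index:

    #include <stdio.h>
    #include <string.h>

    /* frontier entry: a tuple (page == -1) or a page with its minimum distance */
    typedef struct { double dist; int page; } Entry;

    static Entry frontier[64];
    static int   nfrontier = 0;

    /* insert, keeping the frontier sorted by ascending distance */
    static void push(double dist, int page)
    {
        int i = nfrontier++;

        while (i > 0 && frontier[i - 1].dist > dist)
        {
            frontier[i] = frontier[i - 1];
            i--;
        }
        frontier[i].dist = dist;
        frontier[i].page = page;
    }

    int main(void)
    {
        /* the three pages from the example: (min,max) 0-100, 101-200, 50-150 */
        static const double pages[3][3] = {
            {0.0, 60.0, 100.0}, {101.0, 160.0, 200.0}, {50.0, 90.0, 150.0}
        };

        push(0.0, 0);       /* each page enters with its minimum distance */
        push(101.0, 1);
        push(50.0, 2);

        while (nfrontier > 0)
        {
            Entry e = frontier[0];

            memmove(frontier, frontier + 1, --nfrontier * sizeof(Entry));
            if (e.page >= 0)
            {
                /* expanding a page pushes its tuples; page 1 is not touched
                 * until everything at a smaller distance has been emitted */
                int j;

                for (j = 0; j < 3; j++)
                    push(pages[e.page][j], -1);
            }
            else
                printf("emit tuple at distance %g\n", e.dist);
        }
        return 0;
    }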
Re: [HACKERS] temporary functions (and other object types)
On Sat, Nov 6, 2010 at 1:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sat, Nov 6, 2010 at 11:36 AM, Tom Lane t...@sss.pgh.pa.us wrote: Yeah, we changed that behavior as part of the fix for CVE-2007-2138. You'd need either SECURITY DEFINER functions or very careless use of SET ROLE/SET SESSION AUTHORIZATION for the issue to be exploitable. Would it be practical to let foo() potentially mean pg_temp.foo() outside of any SECURITY DEFINER context? Doesn't seem like a particularly good idea for the search semantics to be randomly different inside a SECURITY DEFINER function. In fact, I'll bet you could construct an attack in the reverse direction: S.D. function thinks it is calling a temp function (using syntax that works fine when not S.D.), but control gets sent to a non-temp function belonging to $badguy instead. I guess. If you search pg_temp always then it's pretty much impossible to avoid having a security hole, if you use any non-trivial SQL. But if you search pg_temp for non-SD only then you'll only have a security hole if you assume (presumably without testing) that the behavior is the same in that case. If an SD function is calling temporary functions they'd best be ones it created, otherwise your security is pretty much nonexistent anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] temporary functions (and other object types)
Robert Haas robertmh...@gmail.com writes: I guess. If you search pg_temp always then it's pretty much impossible to avoid having a security hole, if you use any non-trivial SQL. But if you search pg_temp for non-SD only then you'll only have a security hole if you assume (presumably without testing) that the behavior is the same in that case. If an SD function is calling temporary functions they'd best be ones it created, otherwise your security is pretty much nonexistent anyway. In general I don't see a lot of use for calling temp functions that you don't know are temp functions. So I see nothing much wrong with having to use the pg_temp. prefix --- and the possibility of security issues definitely pushes me over the line to being happy with requiring that. regards, tom lane
Re: [HACKERS] IA64 versus effective stack limit
On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: As I said above, I don't know of any good way to measure register stack depth directly. It's probably possible to find out by asking the kernel or something like that, but we surely do not want to introduce a kernel call into check_stack_depth(). It seems more likely it would be some kind of asm than a trap. This might be wishful thinking, but is it too much to hope that glibc already exposes it through some function? It looks like the relevant registers are ar.bsp and ar.bspstore. Just taking the difference apparently gives you the amount of memory used in the current backing store. However, some of the comments I'm reading seem to imply that the OS can allocate discontiguous backing store partitions; presumably if the backing store pointer reaches an unmapped address there has to be some way to trap to the OS to allocate more, and maybe then it has a chance to tweak the bsp address? This was quite interesting (especially the "The Register Stack Engine" section of the second one): http://msdn.microsoft.com/en-us/magazine/cc301708.aspx http://msdn.microsoft.com/en-us/magazine/cc301711.aspx Also I found the following: (lists some registers) http://www.cs.clemson.edu/~mark/subroutines/itanium.html (helper functions in glibc asm includes that calculate bspstore-bsp to count the number of registers used) http://www.koders.com/c/fidE15CABBBA63E7C24928D7F7C9A95653D101451D2.aspx?s=queue Also I found http://www.nongnu.org/libunwind/man/libunwind(3).html which I found cool though not really relevant. The ia64 implementation fiddles with the RSE registers as well, of course. -- greg
Re: [HACKERS] IA64 versus effective stack limit
Greg Stark gsst...@mit.edu writes: On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: As I said above, I don't know of any good way to measure register stack depth directly. It's probably possible to find out by asking the kernel or something like that, but we surely do not want to introduce a kernel call into check_stack_depth(). It seems more likely it would be some kind of asm than a trap. This might be wishful thinking but is it too much to hope that glibc already exposes it through some function? Yeah, I suppose some asm might be a possible solution, but I was a bit discouraged after reading some Intel documentation that said that the register-stack top wasn't exposed in the architectural model. You apparently can only find out what's been spilled to memory. (But perhaps that's close enough, for the purposes here?) regards, tom lane
Re: [HACKERS] IA64 versus effective stack limit
Greg Stark gsst...@mit.edu writes: It seems more likely it would be some kind of asm than a trap. I seem to be getting plausible results from this bit of crockery:

    #include <asm/ia64regs.h>

    static __inline__ void *
    get_bsp(void)
    {
        void       *ret;

    #ifndef __INTEL_COMPILER
        __asm__ __volatile__(
            ";;\n"
            "    mov %0=ar.bsp\n"
            : "=r"(ret));
    #else
        ret = (void *) __getReg(_IA64_REG_AR_BSP);
    #endif
        return ret;
    }

I'll clean this up and commit, assuming it actually fixes the problem. regards, tom lane
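For context, a sketch of how such a helper might plug into the depth check; this is an assumption about the shape of the fix, not the committed code. It supposes a register_stack_base_ptr captured at backend start, analogous to stack_base_ptr:

    #include "postgres.h"

    /* Sketch only: assumed to be captured at process start, the same
     * way stack_base_ptr is; not necessarily the patch's actual names. */
    extern char *register_stack_base_ptr;
    extern long max_stack_depth_bytes;   /* as in tcop/postgres.c */

    static void
    check_register_stack_depth(void)
    {
        /* the IA64 register stack grows upward from its base */
        long        depth = (char *) get_bsp() - register_stack_base_ptr;

        if (depth > max_stack_depth_bytes)
            ereport(ERROR,
                    (errcode(ERRCODE_STATEMENT_TOO_COMPLEX),
                     errmsg("stack depth limit exceeded"),
                     errhint("Increase max_stack_depth.")));
    }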
Re: [HACKERS] Query Plan Columns
On Nov 6, 2010, at 11:44 AM, David E. Wheeler wrote: On Nov 5, 2010, at 1:42 PM, David E. Wheeler wrote: http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/commands/explain.c;h=f494ec98e510c23120e072bd5ee8821ea12738a4;hb=HEAD#l617 Ah, great, thanks. So based on this, I've come up with: Updated:

    CREATE TEMPORARY TABLE plans (
        "Node Type"              TEXT NOT NULL,
        "Strategy"               TEXT,
        "Operation"              TEXT,
        "Startup Cost"           FLOAT,
        "Total Cost"             FLOAT,
        "Plan Rows"              FLOAT,
        "Plan Width"             INTEGER,
        "Actual Startup Time"    FLOAT,
        "Actual Total Time"      FLOAT,
        "Actual Rows"            FLOAT,
        "Actual Loops"           FLOAT,
        "Parent Relationship"    TEXT,
        "Sort Key"               TEXT[],
        "Sort Method"            TEXT[],
        "Sort Space Used"        BIGINT,
        "Sort Space Type"        TEXT,
        "Join Type"              TEXT,
        "Join Filter"            TEXT,
        "Hash Cond"              TEXT,
        "Relation Name"          TEXT,
        "Alias"                  TEXT,
        "Scan Direction"         TEXT,
        "Index Name"             TEXT,
        "Index Cond"             TEXT,
        "Recheck Cond"           TEXT,
        "TID Cond"               TEXT,
        "Merge Cond"             TEXT,
        "Subplan Name"           TEXT,
        "Function Name"          TEXT,
        "Function Call"          TEXT,
        "Filter"                 TEXT,
        "One-Time Filter"        TEXT,
        "Command"                TEXT,
        "Shared Hit Blocks"      BIGINT,
        "Shared Read Blocks"     BIGINT,
        "Shared Written Blocks"  BIGINT,
        "Local Hit Blocks"       BIGINT,
        "Local Read Blocks"      BIGINT,
        "Local Written Blocks"   BIGINT,
        "Temp Read Blocks"       BIGINT,
        "Temp Written Blocks"    BIGINT,
        "Output"                 TEXT[],
        "Hash Buckets"           BIGINT,
        "Hash Batches"           BIGINT,
        "Original Hash Batches"  BIGINT,
        "Peak Memory Usage"      BIGINT,
        "Schema"                 TEXT,
        "CTE Name"               TEXT
    );

Would I be right that Node Type is the only column that can be NOT NULL? Also, I'm thinking of making the Actual Startup Time and Actual Total Time columns into INTERVALs. The times are expressed in milliseconds, yes? I'm wondering if INTERVAL would be more convenient for querying… Thanks, David
Re: [HACKERS] Simplifying replication
Hannu Krosing ha...@2ndquadrant.com writes: To make pg_basebackup.py self-sufficient it should also open 2nd connection to the same master and make sure that all WAL files are copied for the duration of base copy. Done now, please have a look and try it if possible: https://github.com/dimitri/pg_basebackup btw, as next step you could backport this to 8.x and have most of the benefits of SR. It should not be very hard to keep track of wal position inside a pl/pythonu function and send one or more records back in form of You now have the -x and -D options to set that up, but I didn't check the backport part: it still depends on the pg_bb_list_files() function to get the recursive listing of the pg_xlog directory, and it does that using a WITH RECURSIVE query. The way I did it is to only copy the (whole) WAL again if its ctime changed since the last loop. Also, pg_basebackup won't start a backup if you run it on its own; I don't think that's what you want here. Oh, and as I needed to fork() a process to take care of pg_xlog in a loop while the base backup is still ongoing, I added a -j / --jobs option so that you can hammer the master some more by having more than one process doing the copying. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] sorted writes for checkpoints
On Fri, Oct 29, 2010 at 6:17 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon's argument in the thread that the todo item points to (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is basically that we don't know what the best algorithm is yet and benchmarking is a lot of work, so let's just let people do whatever they feel like until we settle on the best approach. I think we need to bite the bullet and do some benchmarking, and commit one carefully vetted patch to the backend. When I submitted the patch, I tested it on a disk-based RAID-5 machine: http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php But there were no additional benchmarking reports at that time. We still need benchmarking before we re-examine the feature. For example, SSDs and SSD RAID were not popular at that time, but now they might be worth considering. There are really two separate things here: (1) trying to do all the writes to file A before you start doing writes to file B, and (2) trying to write out blocks to each file in ascending logical block number order. I'm much more convinced of the value of #1 than I am of the value of #2. If we do #1, we can then spread out the checkpoint fsyncs in a meaningful way without fearing that we'll need to fsync the same file a second time for the same checkpoint. If the OS/FS is behaving such that it is important to spread out fsyncs, then wouldn't that same behavior also make it less important to avoid fsyncing a second time? If the OS is squirreling away a preposterous amount of dirty buffers in its cache, and then panicking to dump them all when it gets the fsync, then I think you would need to spread out the fsyncs within a file, and not just between files. We've gotten some pretty specific reports of problems in this area recently, so it seems likely that there is some value to be had there. On the other hand, #2 is only a win if sorting the blocks in numerical order causes the OS to write them in a better order than it would otherwise have done. Assuming the ordering is useful, the only way the OS can do as good a job as the checkpoint code can is if the OS stores the entire checkpoint's worth of data as dirty blocks and doesn't start writing until an fsync comes in. This strikes me as a pathologically configured OS/FS. (And it would explain problems with fsyncs.) We've had recent reports that our block-at-a-time relation extension policy is leading to severe fragmentation on certain filesystems, so I'm a bit skeptical about the value of this (though, of course, that can be overturned if we can collect meaningful evidence). Some filesystems are better about that than others. It would probably depend on the exact workload, and pgbench would probably favor large contiguous extents to an unrealistic degree. So I don't know the best way to gather that evidence. Cheers, Jeff
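To make points (1) and (2) concrete, here is a sketch of the sort step (under invented stand-in fields, not the submitted patch): order the dirty buffers by file first and block number second, so each file's writes are contiguous and can be followed by a single fsync:

    #include <stdlib.h>

    /* simplified stand-in for a buffer tag: which file, which block */
    typedef struct
    {
        unsigned int relfilenode;   /* identifies the file */
        unsigned int forknum;
        unsigned int blocknum;
    } DirtyTag;

    /* order by file first, then ascending block number within the file */
    static int
    dirty_tag_cmp(const void *a, const void *b)
    {
        const DirtyTag *ta = (const DirtyTag *) a;
        const DirtyTag *tb = (const DirtyTag *) b;

        if (ta->relfilenode != tb->relfilenode)
            return ta->relfilenode < tb->relfilenode ? -1 : 1;
        if (ta->forknum != tb->forknum)
            return ta->forknum < tb->forknum ? -1 : 1;
        if (ta->blocknum != tb->blocknum)
            return ta->blocknum < tb->blocknum ? -1 : 1;
        return 0;
    }

    /* a checkpoint would then do something like:
     *     qsort(tags, ntags, sizeof(DirtyTag), dirty_tag_cmp);
     * and write each file's run of blocks before moving to the next,
     * which is what makes one fsync per file sufficient. */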
Re: [HACKERS] IA64 versus effective stack limit
I wrote: I don't know why icc is so much worse than gcc on this measure of stack depth consumption, but clearly the combination of that and the 100kB max_stack_depth explains why dugong is failing to do very many levels of recursion before erroring out. I figured out why icc looked so much worse here: I had accidentally built with optimization disabled. Selecting -O2 causes its numbers to come a lot closer to gcc's. In particular, it flips around from using more normal stack than register stack to using more register stack than normal. (This might be the case for gcc as well; I did not test an unoptimized gcc build.) This means that, at least for icc, *an optimized build is unsafe* without code to check for register stack growth. It turns out that buildfarm member dugong has been building without optimization all along, which is why we'd not noticed the issue. I think it'd be a good idea for dugong to turn on optimization so it's testing something closer to a production build. However, at this moment only HEAD is likely to pass regression tests with that turned on. We'd have to back-patch the just-committed code for checking register stack growth before the back branches would survive that. I'm normally hesitant to back-patch code that might create portability issues, but in this case perhaps it's a good idea. Comments? regards, tom lane
Re: [HACKERS] improved parallel make support
On 11/06/2010 07:35 AM, Peter Eisentraut wrote: So far, two machines have reported an older make version: dawn_bat and narwhal, both of the mingw type. Andrew, Dave, could you see about upgrading the GNU make installation there? dawn_bat is done. cheers andrew
Re: [HACKERS] sorted writes for checkpoints
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes jeff.ja...@gmail.com wrote: There are really two separate things here: (1) trying to do all the writes to file A before you start doing writes to file B, and (2) trying to write out blocks to each file in ascending logical block number order I'm much more convinced of the value of #1 than I am of the value of #2. If we do #1, we can then spread out the checkpoint fsyncs in a meaningful way without fearing that we'll need to fsync the same file a second time for the same checkpoint. If the OS/FS is behaving such that it is important to spread out fsyncs, then wouldn't that same behavior also make it less important to avoid fsync a second time? If the OS is squirreling away a preposterous amount of dirty buffers in its cache, and then panicking to dump them all when it gets the fsync, then I think you would need to spread out the fsyncs within a file, and not just between files. Well, presumably, there's some amount of dirty data that the OS can write out in one shot without causing a perceptible stall (otherwise we're hosed no matter what). The question is where that threshold is. With one fsync per file, we'll try to write at most 1 GB at once, and often less. As you say, there's a possibility that that's still too much, but right now we could be trying to dump 30+ GB to disk if the OS has a lot of dirty pages in the buffer cache, so getting down to 1 GB or less at a time should be a big improvement even if it doesn't solve the problem completely. We've gotten some pretty specific reports of problems in this area recently, so it seems likely that there is some value to be had there. On the other hand, #2 is only a win if sorting the blocks in numerical order causes the OS to write them in a better order than it would otherwise have done. Assuming the ordering is useful, the only way the OS can do as good a job as the checkpoint code can, is if the OS stores the entire checkpoint worth of data as dirty blocks and doesn't start writing until an fsync comes in. This strikes me as a pathologically configured OS/FS. (And would explain problems with fsyncs) The OS would only need to store and reorder one file's worth of blocks, if we wrote the data for one file and called fsync, wrote the data for another file and called fsync, etc. We've had recent reports that our block-at-a-time relation extension policy is leading to severe fragmentation on certain filesystems, so I'm a bit skeptical about the value of this (though, of course, that can be overturned if we can collect meaningful evidence). Some FS are better about that than others at that. It would probably depend on the exact workload, and pgbench would probably favor large contiguous extents to an unrealistic degree. So I don't know the best way to gather that evidence. Well, the basic idea would be to try some different workloads on different filesystems and try go get some feeling for how often sorting by block number wins and how often it loses. But as I say I think the biggest problem is that we're often trying to write too much dirty data to disk at once. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
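Continuing the sketch from upthread (same invented DirtyTag stand-in), the per-file write-then-fsync loop that this spreading depends on might look like:

    /* Sketch, reusing the hypothetical sorted DirtyTag array from the
     * earlier note; write_block() and fsync_relfile() are placeholders. */
    extern void write_block(const DirtyTag *tag);
    extern void fsync_relfile(unsigned int relfilenode);

    static void
    write_sorted_and_sync(const DirtyTag *tags, int ntags)
    {
        int         i;

        for (i = 0; i < ntags; i++)
        {
            write_block(&tags[i]);

            /* at the end of each file's contiguous run, fsync that file
             * before touching the next, so no file needs a second fsync;
             * a checkpoint could also sleep here to spread the I/O out */
            if (i + 1 == ntags ||
                tags[i + 1].relfilenode != tags[i].relfilenode)
                fsync_relfile(tags[i].relfilenode);
        }
    }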
Re: [HACKERS] IA64 versus effective stack limit
On Sat, Nov 6, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I don't know why icc is so much worse than gcc on this measure of stack depth consumption, but clearly the combination of that and the 100kB max_stack_depth explains why dugong is failing to do very many levels of recursion before erroring out. I figured out why icc looked so much worse here: I had accidentally built with optimization disabled. Selecting -O2 causes its numbers to come a lot closer to gcc's. In particular, it flips around from using more normal stack than register stack to using more register stack than normal. (This might be the case for gcc as well; I did not test an unoptimized gcc build.) This means that, at least for icc, *an optimized build is unsafe* without code to check for register stack growth. It turns out that buildfarm member dugong has been building without optimization all along, which is why we'd not noticed the issue. I think it'd be a good idea for dugong to turn on optimization so it's testing something closer to a production build. However, at this moment only HEAD is likely to pass regression tests with that turned on. We'd have to back-patch the just-committed code for checking register stack growth before the back branches would survive that. I'm normally hesitant to back-patch code that might create portability issues, but in this case perhaps it's a good idea. Comments? Yeah, I think it might be a good idea. Crashing is bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE ... IF EXISTS feature?
On Fri, Nov 5, 2010 at 7:49 PM, Daniel Farina drfar...@acm.org wrote: On Fri, Nov 5, 2010 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: Can you give us a self-contained example of the problem you're talking about? Sure. Consider the following:

    CREATE TABLE t1 (
        id integer PRIMARY KEY
    );

    CREATE TABLE t2 (
        id integer PRIMARY KEY,
        fk integer
    );

    ALTER TABLE ONLY t2
        ADD CONSTRAINT t2_constr FOREIGN KEY (fk) REFERENCES t1(id);

Try something like this:

    createdb foo
    psql -1f this_ddl.sql foo
    pg_dump --clean foo > cleaning_backup.sql
    # db wipe
    dropdb foo
    createdb foo
    psql -1f cleaning_backup.sql foo

The last command will return non-zero and abort the xact early on, because of the following stanza in pg_dump --clean's output:

    ALTER TABLE ONLY public.t2 DROP CONSTRAINT t2_constr;
    ALTER TABLE ONLY public.t2 DROP CONSTRAINT t2_pkey;
    ALTER TABLE ONLY public.t1 DROP CONSTRAINT t1_pkey;
    DROP TABLE public.t2;
    DROP TABLE public.t1;

Since there's no public.t1/t2, it's not possible to ALTER them. I'm not entirely sure why the DROP CONSTRAINTs on pkeys are being done, as they only introduce an internal (or is it auto?) style self-dependency. It is more obvious why foreign keys are dropped, which is to break up the dependencies so that tables can be dropped without CASCADE. If we're going to try to fix this, we probably ought to try to make sure that we are fixing it fairly completely. How confident are you that this is the only problem? With respect to the syntax itself, I have mixed feelings. On the one hand, I'm a big fan of CREATE IF NOT EXISTS and DROP IF EXISTS precisely because I believe they handle many common cases that people want in real life without much hullabaloo. But, there's clearly some limit to what can reasonably be done this way. At some point, what you really want is some kind of meta-language where you can write things like:

    IF EXISTS TABLE t1 THEN
        ALTER TABLE t1 DROP CONSTRAINT IF EXISTS t1_constr;
    END IF;

...or possibly something much more complicated, like checking whether a table foo has a column called bar and if so doing nothing, but if not and a column called baz exists then renaming it to bar, and otherwise adding a column called bar. Since we now have PL/pgSQL by default, we could possibly fix pg_dump --clean by emitting a DO block, although the syntax for checking existence of a table is none too pretty, and it would make pg_dump --clean rely for correctness on plpgsql being installed, which might be none too desirable. It would actually be sort of spiffy to be able to have some of the PL/pgSQL control constructs available in straight SQL, but I'm not expecting that to happen any time in the foreseeable future. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE ... IF EXISTS feature?
On Sat, Nov 6, 2010 at 6:09 PM, Robert Haas robertmh...@gmail.com wrote: If we're going to try to fix this, we probably ought to try to make sure that we are fixing it fairly completely. How confident are you that this is the only problem?

I haven't tried to isolate problems on really complicated schemas yet, but I can tell you what I did do: I went through pg_dump and tried to find as many parts of the code that add a dropStmt to DumpableObject nodes as possible (in the dumpFoo family of functions). ALTER seemed like the only interesting bit so far.

But we could give the mechanism a try with, say, a PL/pgSQL hack for a little while, to see if it basically gets the job done in sort-of real life.

I think that's not a bad idea; I'll see what I can do about that. (Of course, continue to share your suggestions and revelations; I'd appreciate it.)

fdr
Re: [HACKERS] Protecting against unexpected zero-pages: proposal
On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Gurjeet Singh singh.gurj...@gmail.com writes: The basic idea is to have a magic number in every PageHeader before it is written to disk, and check for this magic number when performing page validity checks.

Um ... and exactly how does that differ from the existing behavior?

Right now a zero-filled page is considered valid, and is treated as a new page; see the /* Check all-zeroes case */ branch in PageHeaderIsValid(), and PageIsNew(). This means that looking at a zero-filled page on disk (say, after a crash) does not give us any clue whether it was indeed left zeroed by Postgres, or whether the FS/storage failed to do its job.

With the proposed change, if it is a valid page (a page actually written by Postgres) it will have either a sensible LSN or the magic LSN; the LSN will never be zero. OTOH, if we encounter a zero-filled page (i.e., LSN = {0,0}), that clearly implicates elements outside Postgres in zeroing the page.

The amount of fragility introduced by the assumptions you have to make for this seems to me to be vastly riskier than the risk you are trying to respond to.

I understand that it is a pretty low-level change, but IMHO the change is minimal and is being applied in well-understood places. All the assumptions listed have held for quite a while, and I don't see them being affected in the near future. The most crucial assumptions we have to work with are that XLogPtr{n, 0x...} will never be used as a real LSN, and that mdextend() is the only place that extends a relation (until we implement an md.c sibling, say flash.c or tape.c; the last change to md.c regarding mdextend() was in January 2007). Only mdextend() and PageHeaderIsValid() need to know about this change in behaviour, and all the other APIs work and behave the same as they do now.

This change would increase the diagnosability of zero-page issues, and help users point fingers at the right places.

Regards, -- gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company http://www.EnterpriseDB.com singh.gurj...@{ gmail | yahoo }.com Twitter/Skype: singh_gurjeet Mail sent from my BlackLaptop device
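As a footnote on diagnosability as things stand today: with the pageinspect contrib module one can already enumerate a relation's zero-LSN pages, though without the proposed magic LSN a zero LSN by itself remains ambiguous (legitimately-new page vs. storage-level zeroing), which is exactly the ambiguity the proposal aims to remove. A rough sketch, assuming a reasonably recent server, an 8K block size, and an illustrative index name:

-- Requires the pageinspect contrib module (superuser).
-- Lists blocks whose page LSN is 0/0; under the proposal, a page
-- actually written by Postgres would never carry a zero LSN.
SELECT blkno
FROM generate_series(0, pg_relation_size('some_index') / 8192 - 1) AS blkno
WHERE (page_header(get_raw_page('some_index', blkno::int))).lsn = '0/0';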
Re: [HACKERS] Protecting against unexpected zero-pages: proposal
On Sun, Nov 7, 2010 at 4:23 AM, Gurjeet Singh singh.gurj...@gmail.com wrote: I understand that it is a pretty low-level change, but IMHO the change is minimal and is being applied in well-understood places. All the assumptions listed have held for quite a while, and I don't see them being affected in the near future. The most crucial assumptions we have to work with are that XLogPtr{n, 0x...} will never be used, and that mdextend() is the only place that extends a relation (until we implement an md.c sibling, say flash.c or tape.c; the last change to md.c regarding mdextend() was in January 2007).

I think the assumption that isn't tested here is what happens when the server crashes. The logic may work fine as long as nothing goes wrong, but if something does, it has to be fool-proof. Zero-filled blocks at the end of a file that has been extended but not yet fsynced are an expected failure mode of a number of filesystems. Log replay can't assume that seeing such a block is a problem, since it may be precisely the result of the crash that caused the replay. And if you disable this check during WAL replay, then you've lost your main chance to actually detect the problem.

Another issue -- though I think a manageable one -- is that I expect we'll want to be using posix_fallocate() sometime soon. That will allow efficient guaranteed pre-allocated space with better contiguous layout than we get currently. But ext4 can only pretend to give zero-filled blocks (unwritten extents read back as zeroes), not any arbitrary bit pattern we request. I can see this being an optional feature that is just not compatible with using posix_fallocate(), though.

It does seem like this is kind of part and parcel of adding checksums to blocks. It's arguably kind of silly to add checksums to blocks but let a commonly produced bit pattern in corruption cases go undetected.

-- greg