Re: [HACKERS] [PATCH] binary heap implementation
On 2012-11-14 18:41:12 +0530, Abhijit Menon-Sen wrote: There are two or three places in the Postgres source that implement heap sort (e.g. tuplesort.c, nodeMergeAppend.c), and it's also needed by the BDR code. It seemed reasonable to factor out the functionality. This replaces the simpleheap.[ch] I had in the last submitted series. I've attached a patch (binaryheap.diff) that contains a straightforward implementation of a binary heap (originally written by Andres, with a few bugfixes and cleanups by me). I want to emphasise that the only good thing about my submitted version was that it actually compiled. So quite a bit of the work here is from Abhijit... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
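For readers who haven't seen the attachment, the core of such a module is small. The following is a generic stand-alone sketch, not the API from binaryheap.diff (which isn't reproduced here): a min-heap of ints with the usual sift-up/sift-down maintenance, the same shape of code that tuplesort.c and nodeMergeAppend.c each reimplement today.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative min-heap; names and layout are this example's, not the patch's. */
typedef struct
{
    int    *values;
    size_t  size;
    size_t  capacity;
} binheap;

static void
binheap_swap(binheap *h, size_t a, size_t b)
{
    int tmp = h->values[a];

    h->values[a] = h->values[b];
    h->values[b] = tmp;
}

/* Restore the heap property upward from slot i after an insertion. */
static void
binheap_sift_up(binheap *h, size_t i)
{
    while (i > 0)
    {
        size_t parent = (i - 1) / 2;

        if (h->values[parent] <= h->values[i])
            break;
        binheap_swap(h, parent, i);
        i = parent;
    }
}

/* Restore the heap property downward from slot i after removing the top. */
static void
binheap_sift_down(binheap *h, size_t i)
{
    for (;;)
    {
        size_t left = 2 * i + 1;
        size_t right = left + 1;
        size_t smallest = i;

        if (left < h->size && h->values[left] < h->values[smallest])
            smallest = left;
        if (right < h->size && h->values[right] < h->values[smallest])
            smallest = right;
        if (smallest == i)
            break;
        binheap_swap(h, i, smallest);
        i = smallest;
    }
}

void
binheap_add(binheap *h, int v)
{
    assert(h->size < h->capacity);
    h->values[h->size++] = v;
    binheap_sift_up(h, h->size - 1);
}

int
binheap_remove_first(binheap *h)
{
    int top = h->values[0];

    h->values[0] = h->values[--h->size];
    binheap_sift_down(h, 0);
    return top;
}
```

Both operations are O(log n), which is what makes the heap attractive for the k-way merge in merge append and tuplesort.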
[HACKERS] S_ISLNK
Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in basebackup.c are. Could you merge the attached fix? Thanks - Nick [Attachment: exec-symlink-ifdef.patch]
Re: [HACKERS] S_ISLNK
Nicholas White n.j.wh...@gmail.com writes: Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in basebackup.c are. That whole function is guarded by HAVE_READLINK, so I'm not seeing the problem (and neither are the Windows members of the buildfarm). What environment are you in that has readlink() and not S_ISLNK? regards, tom lane
Re: [HACKERS] Why does delete from table where not exists (select 1 from ... LIMIT 1) perform badly?
Palle Girgensohn gir...@pingpong.net writes: How come the planner treats the delete from table where not exists (select 1 from table2 where ... LIMIT 1) so differently, and usually badly, when the LIMIT 1 is there? Because it can't optimize it into an antijoin. In older versions of PostgreSQL, I remember that the effect was the opposite: a LIMIT 1 would actually perform substantially better. Hence we have old code (and old habits) where the LIMIT 1 is still used. Well, you're basically forcing it into the same type of plan you would have gotten before antijoins were implemented (circa 8.4), so I don't see that this is a regression. But I'd get rid of the LIMIT 1 if I were you. It's been a *very* long time since that was a net benefit in an EXISTS subquery, if indeed it ever was --- AFAIR, even the earliest PG versions that understood about optimizing for fast-start plans would do so in an EXISTS subquery, with or without any LIMIT. regards, tom lane
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote: On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian br...@momjian.us wrote: At 64k I see pg_upgrade taking 12% of the duration time, if I subtract out the dump/restore times. My percentage numbers only included CPU time and I used SSD storage. For the most part there was no IO wait to speak of, but it's completely expected that thousands of link calls are not free. Agreed. I was looking at wall clock time so I could see the total impact of everything pg_upgrade does. Postgres time itself breaks down with 10% for shutdown checkpoint and 90% for regular running, consisting of 16% parsing, 13% analyze, 20% plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network. That SVG graph was quite impressive. I used perf and Gprof2Dot for this. I will probably do a blog post on how to generate these graphs. It's much more useful for me than a plain flat profile as I don't know by heart which functions are called by which. Yes, please share that information. It looks to me that most benefit could be had from introducing more parallelism. Are there any large roadblocks to pipelining the dump and restore to have them happen in parallel? I talked to Andrew Dunstan about parallelization in pg_restore. First, we currently use pg_dumpall, which isn't in the custom format required for parallel restore, but if we changed to custom format, create table isn't done in parallel, only create index/check constraints, and trigger creation, etc. Not sure if it's worth pursuing this just for pg_upgrade. I agree that parallel restore for schemas is a hard problem. But I didn't mean parallelism within the restore, I meant that we could start both postmasters and pipe the output from dump directly to restore. This way the times for dumping and restoring can overlap. Wow, that is a very creative idea. The current code doesn't do that, but this has the potential of doubling pg_upgrade's speed, without adding a lot of complexity. 
Here are the challenges of this approach: * I would need to log the output of pg_dumpall as it is passed to psql so users can debug problems * pg_upgrade never runs the old and new clusters at the same time for fear that it will run out of resources, e.g. shared memory, or if they are using the same port number. We can make this optional and force different port numbers. Let me work up a prototype in the next few days and see how it performs. Thanks for the great idea. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On Wed, Nov 14, 2012 at 8:30 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I wrote: From: Tom Lane [mailto:t...@sss.pgh.pa.us] I wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: I have a question. I think it would be also better to extend the syntax for the SQL COPY command in the same way, ie, COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format 'csv' Yeah, sure --- that case is already superuser-only, so why not give it the option of being a popen instead of just fopen. BTW, one thought that comes to mind is that such an operation is extremely likely to fail under environments such as SELinux. That's not necessarily a reason not to do it, but we should be wary of promising that it will work everywhere. Probably a documentation note about this would be enough. OK I'll revise the patch. I've revised the patch. In this version a user can specify hooks for pre- and post-processor executables for COPY and \copy in the following way: $ echo '/bin/gunzip -c $1' > decompress.sh $ chmod +x decompress.sh In the case of the COPY command, postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz' WITH (format 'csv'); Also, in the case of the \copy instruction, postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz' with (format 'csv') As shown in the example above, I've assumed that the syntax for this option for e.g., the COPY command is: COPY table_name FROM 'progname filename' WITH ... COPY table_name TO 'progname filename' WITH ... Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. What about further extending the COPY IN syntax to the following? COPY table_name FROM 'progname [ option, ... ]' WITH ... I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... Also, progname for COPY OUT is the user-supplied program that reads standard input and writes to filename taken as its argument. 
This makes the identification and verification of progname and filename simple. Todo: * Documentation, including a note about the limitation for environments such as SELinux mentioned by Tom. * More tests Any comments and suggestions are welcome. Isn't it dangerous to allow a user to execute an external program on the server side via SQL? Regards, -- Fujii Masao
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL
On Tuesday, November 13, 2012 9:29 AM Amit kapila wrote: On Monday, November 12, 2012 12:07 PM Greg Smith wrote: On 11/9/12 11:59 PM, Amit kapila wrote: 1) Change to add scanning a .conf directory in the default configuration using include-dir. This is a quick fix. I predict most of the headaches around it will end up being for packagers rather than the core code to deal with. You could submit this as a small thing to be evaluated on its own. How it's done is going to be controversial. Might as well get that fighting focused against a sample implementation as soon as possible. As per my understanding, a. during initdb, new conf directory can be created and also create .auto file in it. b. use include_dir at end of postgresql.conf to include directory created in above step. c. server start/sighup will take care of above include_dir WIP patch to address above point is attached. It needs cleanup w.r.t moving function for absolute path to common place where initdb as well as server code can use it. 2) Get familiar with navigating the GUC data and figuring out what, exactly, needs to be written out. This should include something that navigates like things appear after a RESET, ignoring per-user or per-session changes when figuring out what goes there. It seems inevitable that some amount of validating against the source information--what pg_settings labels source, sourcefile, and sourceline will be needed. An example is the suggestion Magnus made for confirming that the include-dir is still active before writing something there. Okay, what I will do for this is that I shall explain in next mail the way to do by navigating GUC's. One basic idea to do execution of SQL Command with GUC's is described as below: 1. Take Global level lock so that no other session can run the ALTER SYSTEM/built-in Command to change config parameter 2. Navigate through GUC variable's and remember all GUC's (of .auto file ) reset_val. 3. 
Compare the config parameter to be changed with the parameters stored in step-2 and if it exists, replace its value, else add the new variable-value pair. 4. Flush the file with all parameters computed in Step-3. 5. Signal all other backends to update this value in their GUC's reset_val, so that all backends always have a recent copy. 6. When all backends have updated, change the corresponding reset_val in the current session as well. 7. Release the Global level lock. Some problems with the above approach: a. When the signal is sent in step-5, if another backend is also waiting on the global lock, it can make signal handling a little tricky; special handling may need to be done for this situation b. If rollback happens after step-5 or 6, it might be difficult to roll back. In general, if this command executes in a transaction block, the same issue can arise. c. Updating reset_val for parameters which cannot be reset without restart is wrong. For such parameters, I think we can give a warning to users. I think this is the initial idea to check if I am thinking on the lines you have in mind. Comments/Suggestions? With Regards, Amit Kapila. [Attachment: change_pgconf_for_config_dir.patch]
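Step 4 above (flushing the rewritten .auto file) is the part where crash safety matters most: a failure mid-write must not leave a half-written config file for the next SIGHUP to parse. A minimal sketch of the usual write-temp-then-fsync-then-rename pattern, assuming a POSIX filesystem; the function name write_auto_conf and the one-setting-per-line layout are illustrative, not the patch's:

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Write all (name, value) pairs to a temporary file, force it to disk,
 * then rename() it over the real path. Readers either see the complete
 * old file or the complete new one, never a partial write.
 */
int
write_auto_conf(const char *path, const char **names,
                const char **values, int n)
{
    char    tmp[1024];
    FILE   *fp;
    int     i;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if ((fp = fopen(tmp, "w")) == NULL)
        return -1;
    for (i = 0; i < n; i++)
        fprintf(fp, "%s = '%s'\n", names[i], values[i]);
    if (fflush(fp) != 0 || fsync(fileno(fp)) != 0)
    {
        fclose(fp);
        unlink(tmp);
        return -1;
    }
    if (fclose(fp) != 0)
    {
        unlink(tmp);
        return -1;
    }
    /* rename() within a filesystem is atomic on POSIX */
    return rename(tmp, path);
}
```

The global lock in steps 1 and 7 then only needs to serialize writers; readers re-parsing the file on SIGHUP never need to coordinate with the writer.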
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Wed, Nov 14, 2012 at 5:53 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Nov 13, 2012 at 11:46 AM, Fujii Masao masao.fu...@gmail.com wrote: Without this utility, it's difficult to calculate the maximum LSN of data page, so basically we needed to take a backup when starting the standby. In the future, thanks to this utility, we can calculate the maximum LSN, and can skip a backup if that LSN is less than the master (i.e., last applied LSN, IOW, timeline switch LSN). Doesn't the minimum recovery point give us that? Yes, but only in the standby. The master doesn't record the minimum recovery point at all. So, when we start the pre-master as new standby after failover, we need this utility to know that LSN. Or we need to change the master so that it records the minimum recovery point like the standby. BTW, it might be useful to introduce a new replication option that makes the data page flush wait for its corresponding WAL to be replicated. By using this option, we can ensure that any data page in the master always precedes the standby. Regards, -- Fujii Masao
Re: [HACKERS] Further pg_upgrade analysis for many tables
On 11/14/2012 10:08 AM, Bruce Momjian wrote: On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote: I agree that parallel restore for schemas is a hard problem. But I didn't mean parallelism within the restore, I meant that we could start both postmasters and pipe the output from dump directly to restore. This way the times for dumping and restoring can overlap. Wow, that is a very creative idea. The current code doesn't do that, but this has the potential of doubling pg_upgrade's speed, without adding a lot of complexity. Here are the challenges of this approach: * I would need to log the output of pg_dumpall as it is passed to psql so users can debug problems Instead of piping it directly, have pg_upgrade work as a tee, pumping bytes both to psql and a file. This doesn't seem terribly hard. * pg_upgrade never runs the old and new clusters at the same time for fear that it will run out of resources, e.g. shared memory, or if they are using the same port number. We can make this optional and force different port numbers. Right. cheers andrew
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. What about further extending the COPY IN syntax to the following? COPY table_name FROM 'progname [ option, ... ]' WITH ... I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... I think we should be using FDWs/SRFs here, not inventing new syntax/architectures for executing external code, so -1 from me. We can already do INSERT INTO table SELECT * FROM fdw; with any logic for generating data living inside an FDW or SRF. If we want it in COPY we can have syntax like this... COPY table FROM (SELECT * FROM fdw) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Wed, Nov 14, 2012 at 10:25:24AM -0500, Andrew Dunstan wrote: On 11/14/2012 10:08 AM, Bruce Momjian wrote: On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote: I agree that parallel restore for schemas is a hard problem. But I didn't mean parallelism within the restore, I meant that we could start both postmasters and pipe the output from dump directly to restore. This way the times for dumping and restoring can overlap. Wow, that is a very creative idea. The current code doesn't do that, but this has the potential of doubling pg_upgrade's speed, without adding a lot of complexity. Here are the challenges of this approach: * I would need to log the output of pg_dumpall as it is passed to psql so users can debug problems Instead of piping it directly, have pg_upgrade work as a tee, pumping bytes both to psql and a file. This doesn't seem terribly hard. Right. It isn't hard. * pg_upgrade never runs the old and new clusters at the same time for fear that it will run out of resources, e.g. shared memory, or if they are using the same port number. We can make this optional and force different port numbers. Right. OK. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
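Andrew's tee idea is straightforward to sketch: pg_upgrade would sit between pg_dumpall's output and psql's input, copying every byte into a log file as it relays it, so a failed restore can be debugged from the log. A hedged illustration; the function name and stream wiring are made up for this example, and real code would hook up the actual pipes to the two running postmasters:

```c
#include <stdio.h>

/*
 * Relay everything from the dump stream to the restore stream,
 * duplicating each chunk into a log file. Returns 0 on success,
 * -1 if the source errors out or the restore side stops accepting data.
 */
int
tee_stream(FILE *from_dump, FILE *to_restore, FILE *logfile)
{
    char    buf[8192];
    size_t  n;

    while ((n = fread(buf, 1, sizeof(buf), from_dump)) > 0)
    {
        if (fwrite(buf, 1, n, to_restore) != n)
            return -1;                  /* restore side failed */
        fwrite(buf, 1, n, logfile);     /* best-effort log copy */
    }
    return ferror(from_dump) ? -1 : 0;
}
```

Because the loop blocks on whichever side is slower, dump and restore naturally overlap in time, which is the source of the hoped-for speedup.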
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Wed, Nov 14, 2012 at 5:35 PM, Amit Kapila amit.kap...@huawei.com wrote: On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote: On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com wrote: On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote: Robert Haas escribió: On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote: I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic? That thought did cross my mind too. We might be able to use this utility to decide whether we need to take a fresh backup from the master onto the standby, to start old master as new standby after failover. When starting new standby after failover, any data page in the standby must not precede the master. Otherwise, the standby cannot catch up with the master consistently. But, the master might write the data page corresponding to the WAL which has not been replicated to the standby yet. So, if failover happens before that WAL has been replicated, the data page in old master would precede new master (i.e., old standby), and in this case the backup is required. OTOH, if maximum LSN in data page in the standby is less than the master, the backup is not required. When new standby will start the replication (RequestXLogStreaming()), it will send the startpoint, so won't in above scenario that startpoint will be ahead of new master (or new master won't have that LSN) and replication will not be established? The startpoint is the heading LSN of the WAL file including the latest checkpoint record. Yes, there can be the case where the startpoint is ahead of new master. In this case, replication would fail to be established because of lack of requested WAL file. 
OTOH, there can be the case where new master has already been ahead of the startpoint. So now user may not be able to decide whether he needs to do incremental or full backup from new master, is this the case you are trying to point? Sorry, I could not parse this comment. Could you elaborate your concern again? Regards, -- Fujii Masao
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On Thu, Nov 15, 2012 at 12:31 AM, Simon Riggs si...@2ndquadrant.com wrote: On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. What about further extending the COPY IN syntax to the following? COPY table_name FROM 'progname [ option, ... ]' WITH ... I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... I think we should be using FDWs/SRFs here, not inventing new syntax/architectures for executing external code, so -1 from me. We can already do INSERT INTO table SELECT * FROM fdw; with any logic for generating data living inside an FDW or SRF. If we want it in COPY we can have syntax like this... COPY table FROM (SELECT * FROM fdw) New syntax looks attractive to me because it's easy to use. It's not easy to implement an FDW for the external program which a user wants to execute. Of course, if someone implements something like any_external_program_fdw, I would change my mind.. Regards, -- Fujii Masao
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Simon Riggs escribió: On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. What about further extending the COPY IN syntax to the following? COPY table_name FROM 'progname [ option, ... ]' WITH ... I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... I think we should be using FDWs/SRFs here, not inventing new syntax/architectures for executing external code, so -1 from me. Hmm, but then you are forced to write C code, whereas the external program proposal could have you writing only a shell script instead. So there is some merit to this idea ... though we could have a pipe_fdw that could let you specify an arbitrary program to run. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Wednesday, November 14, 2012 9:12 PM Fujii Masao wrote: On Wed, Nov 14, 2012 at 5:35 PM, Amit Kapila amit.kap...@huawei.com wrote: On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote: On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com wrote: On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote: Robert Haas escribió: On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote: I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic? That thought did cross my mind too. We might be able to use this utility to decide whether we need to take a fresh backup from the master onto the standby, to start old master as new standby after failover. When starting new standby after failover, any data page in the standby must not precede the master. Otherwise, the standby cannot catch up with the master consistently. But, the master might write the data page corresponding to the WAL which has not been replicated to the standby yet. So, if failover happens before that WAL has been replicated, the data page in old master would precede new master (i.e., old standby), and in this case the backup is required. OTOH, if maximum LSN in data page in the standby is less than the master, the backup is not required. When new standby will start the replication (RequestXLogStreaming()), it will send the startpoint, so won't in above scenario that startpoint will be ahead of new master (or new master won't have that LSN) and replication will not be eastablished? The startpoint is the heading LSN of the WAL file including the latest checkpoint record. Yes, there can be the case where the startpoint is ahead of new master. In this case, replication would fail to be established because of lack of requested WAL file. 
Now the user can use this utility to decide: if the new standby has a max LSN greater than the max LSN of the new master, he needs to take a full backup on the new standby. Is my understanding right? OTOH, there can be the case where new master has already been ahead of the startpoint. But in this case, there is no need for this utility. Right? So now user may not be able to decide whether he needs to do incremental or full backup from new master, is this the case you are trying to point? Sorry, I could not parse this comment. Could you elaborate your concern again? I wanted to understand the use case you mentioned for this utility. As far as I can understand, it will be used to decide on the new standby (old master) whether a full backup is needed from the new master (old standby). And that situation can occur when the new standby has a startpoint LSN greater than the new master? With Regards, Amit Kapila.
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs escribió: On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. I think we should be using FDWs/SRFs here, not inventing new syntax/architectures for executing external code, so -1 from me. Hmm, but then you are forced to write C code, whereas the external program proposal could have you writing only a shell script instead. I disagree with Simon's objection also, because neither reading from nor writing to an external program is likely to fit the model of reading/updating a table very well. For instance, there's no good reason to suppose that reading twice will give the same results. So force-fitting this usage into the FDW model is not going to work well. Nor do I really see the argument why a pipe_fdw module is cleaner than a COPY TO/FROM pipe feature. regards, tom lane
Re: [HACKERS] foreign key locks
Alvaro Herrera wrote: * In heap_lock_tuple's XMAX_IS_MULTI case [snip] why is it membermode > mode and not membermode >= mode? Uh, that's a bug. Fixed. As noticed in the comment above that snippet, there was a deadlock possible here. Maybe I should add a test to ensure this doesn't happen. Done: https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 (I don't think this is worth a v24 submission). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] S_ISLNK
Ah - OK. It turns out I'd run ./configure in the postgresql directory before running it in my build directory, so I had two (different) pg_config.hs! The below builds for me from a clean source tree now. Thanks - Nick ./path/to/build/configure CC=x86_64-w64-mingw32-gcc --target=x86_64-w64-mingw32 --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --without-readline --without-zlib --disable-thread-safety On 14 November 2012 06:25, Tom Lane t...@sss.pgh.pa.us wrote: Nicholas White n.j.wh...@gmail.com writes: Hi - I'm cross-compiling the master branch (cygwin/mingw) and have found a reference to S_ISLNK that isn't guarded by #ifndef WIN32 like the ones in basebackup.c are. That whole function is guarded by HAVE_READLINK, so I'm not seeing the problem (and neither are the Windows members of the buildfarm). What environment are you in that has readlink() and not S_ISLNK? regards, tom lane
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/2012 11:20 AM, Tom Lane wrote: I disagree with Simon's objection also, because neither reading from nor writing to an external program is likely to fit the model of reading/updating a table very well. For instance, there's no good reason to suppose that reading twice will give the same results. So force-fitting this usage into the FDW model is not going to work well. Nor do I really see the argument why a pipe_fdw module is cleaner than a COPY TO/FROM pipe feature. Yeah, I agree, although the syntax looks a bit unclean. Maybe something like COPY foo FROM wherever WITH (FILTER '/path/to/program') might work better. You'd hook up the source to the filter as its stdin and read its stdout. Not sure what we'd do for \copy though. cheers andrew
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 14 November 2012 16:20, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs escribió: On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. I think we should be using FDWs/SRFs here, not inventing new syntax/architectures for executing external code, so -1 from me. Hmm, but then you are forced to write C code, whereas the external program proposal could have you writing only a shell script instead. I disagree with Simon's objection also, because neither reading from nor writing to an external program is likely to fit the model of reading/updating a table very well. For instance, there's no good reason to suppose that reading twice will give the same results. So force-fitting this usage into the FDW model is not going to work well. Nor do I really see the argument why a pipe_fdw module is cleaner than a COPY TO/FROM pipe feature. Perhaps not cleaner, but we do need COPY table FROM (SELECT * FROM foo) So we will then have both ways. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: What about further extending the COPY IN syntax to the following? COPY table_name FROM 'progname [ option, ... ]' WITH ... I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... If we go ahead with this, I think it needs an additional keyword to indicate that we will execute the file rather than read from it. I don't think we should rely on whether the file is executable or not to determine how we should treat it. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Andrew Dunstan and...@dunslane.net writes: Yeah, I agree, although the syntax looks a bit unclean. Oh, I had not looked at the syntax closely. I agree, that basically sucks: it's overcomplicated and under-featured, because you can't control the actual program command line very conveniently. Nor do I see a reason to force this into the model of a program filtering a specific file. What happened to the previous proposal of treating the COPY target as a pipe specification, ie COPY table FROM 'some command line |'; COPY table TO '| some command line'; Not sure what we'd do for \copy though. Adding a pipe symbol to the target works exactly the same for \copy. regards, tom lane
Re: [HACKERS] Enabling Checksums
On Tue, Nov 13, 2012 at 4:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: What happens when you get an I/O failure on the checksum fork? Assuming you're using 8K pages there, that would mean you can no longer verify the integrity of between one and four thousand pages of data. True... but you'll have succeeded in your central aim of determining whether your hardware has crapped out. Answer: yes. The existing code doesn't have any problem reporting back to the user those hardware failures which are reported to it by the OS. The only reason for the feature is for the database to be able to detect hardware failures in situations where the OS claims that everything is working just fine. Not to mention the race condition problems associated with trying to be sure the checksum updates hit the disk at the same time as the data-page updates. I think you really have to store the checksums *with* the data they're supposedly protecting. If torn pages didn't exist, I'd agree with you, but they do. Any checksum feature is going to need to cope with the fact that, prior to reaching consistency, there will be blocks on disk with checksums that don't match, because 8kB writes are not atomic. We fix that by unconditionally overwriting the possibly-torn pages with full-page images, and we could simply update the checksum fork at the same time. We don't have to do anything special to make sure that the next checkpoint cycle successfully flushes both pages to disk before declaring the checkpoint a success and moving the redo pointer; that logic already exists. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Simon Riggs si...@2ndquadrant.com writes: On 14 November 2012 15:09, Fujii Masao masao.fu...@gmail.com wrote: I'd just like to execute COPY vmstat_table FROM 'vmstat' WITH ... If we go ahead with this, I think it needs additional keyword to indicate that we will execute the file rather than read from it. I don't think we should rely on whether the file is executable or not to determine how we should treat it. Agreed, and there's also the question of passing switches etc to the program, so the string can't be a bare file name anyway. I proposed pipe symbols (|) in the string previously, but if you find that too Unix-centric I suppose we could do COPY TABLE FROM PROGRAM 'command line'; COPY TABLE TO PROGRAM 'command line'; regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/2012 11:39 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Yeah, I agree, although the syntax looks a bit unclean. Oh, I had not looked at the syntax closely. I agree, that basically sucks: it's overcomplicated and under-featured, because you can't control the actual program command line very conveniently. Nor do I see a reason to force this into the model of program filtering a specific file. What happened to the previous proposal of treating the COPY target as a pipe specification, ie COPY table FROM 'some command line |'; COPY table TO '| some command line'; I'd like to be able to filter STDIN if possible - with this syntax how is COPY going to know to hook up STDIN to the program? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Andrew Dunstan and...@dunslane.net writes: On 11/14/2012 11:39 AM, Tom Lane wrote: What happened to the previous proposal of treating the COPY target as a pipe specification, ie I'd like to be able to filter STDIN if possible - with this syntax how is COPY going to know to hook up STDIN to the program? Huh? That's fairly nonsensical for the backend-side case; there's no way that stdin (or stdout) of a backend is going to connect anywhere useful for this purpose. As for doing it on the psql side (\copy), I think it would be more or less automatic. If you do say foo | psql -c \copy tab from 'bar |' dbname then bar is going to inherit psql's stdin, which is coming from foo. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/2012 11:56 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 11/14/2012 11:39 AM, Tom Lane wrote: What happened to the previous proposal of treating the COPY target as a pipe specification, ie I'd like to be able to filter STDIN if possible - with this syntax how is COPY going to know to hook up STDIN to the program? Huh? That's fairly nonsensical for the backend-side case; there's no way that stdin (or stdout) of a backend is going to connect anywhere useful for this purpose. As for doing it on the psql side (\copy), I think it would be more or less automatic. If you do say foo | psql -c \copy tab from 'bar |' dbname then bar is going to inherit psql's stdin, which is coming from foo. Why does it make less sense on the backend than COPY foo FROM STDIN ? Why shouldn't I want to be able to filter or transform the input? I have a client with a pretty complex backend-driven ETL tool. One of the annoying things about it is that we have to transfer the file to the backend before we can process it. I can imagine this leading to a similar annoyance. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Mon, Nov 12, 2012 at 04:42:57PM -0800, Josh Berkus wrote: Jeff, OK, so here's my proposal for a first patch (changes from Simon's patch):
* Add a flag to the postgres executable indicating that it should use checksums on everything. This would only be valid if bootstrap mode is also specified.
* Add a multi-state checksums flag in pg_control, that would have three states: OFF, ENABLING, and ON. It would only be set to ON during bootstrap, and in this first patch, it would not be possible to set ENABLING.
* Remove GUC and use this checksums flag everywhere.
* Use the TLI field rather than the version field of the page header.
* Incorporate page number into checksum calculation (already done).
Does this satisfy the requirements for a first step? Does it interfere with potential future work? So the idea of this implementation is that checksums is something you set at initdb time, and if you want checksums on an existing database, it's a migration process (e.g. dump and reload)? I think that's valid as a first cut at this. pg_upgrade will need to check for the checksum flag and throw an error if it is present in the new cluster but not the old one. -- Bruce Momjian br...@momjian.us  http://momjian.us EnterpriseDB  http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: Are you sure the server you are dumping out of is head? I experimented a bit with dumping/restoring 16000 tables matching Bruce's test case (ie, one serial column apiece). The pg_dump profile seems fairly flat, without any easy optimization targets. But restoring the dump script shows a rather interesting backend profile:

samples   %        image name   symbol name
30861     39.6289  postgres     AtEOXact_RelationCache
 9911     12.7268  postgres     hash_seq_search
...

The hash_seq_search time is probably mostly associated with AtEOXact_RelationCache, which is run during transaction commit and scans the relcache hashtable looking for tables created in the current transaction. So that's about 50% of the runtime going into that one activity. There are at least three ways we could whack that mole: * Run the psql script in --single-transaction mode, as I was mumbling about the other day. If we were doing AtEOXact_RelationCache only once, rather than once per CREATE TABLE statement, it wouldn't be a problem. Easy but has only a narrow scope of applicability. * Keep a separate list (or data structure of your choice) so that relcache entries created in the current xact could be found directly rather than having to scan the whole relcache. That'd add complexity though, and could perhaps be a net loss for cases where the relcache isn't so bloated. Maybe a static list that can overflow, like the ResourceOwner/Lock table one recently added. The overhead of that should be very low. Are the three places where need_eoxact_work = true; the only places where things need to be added to the new structure? It seems like there is no need to remove things from the list, because the things done in AtEOXact_RelationCache are idempotent. * Limit the size of the relcache (eg by aging out not-recently-referenced entries) so that we aren't incurring O(N^2) costs for scripts touching N tables.
Again, this adds complexity and could be counterproductive in some scenarios. I made the crude hack of just dumping the relcache whenever it was > 1000 at eox. The time to load 100,000 tables went from 62 minutes without the patch to 12 minutes with it. (loading with -1 -f took 23 minutes). The next quadratic behavior is in init_sequence. Cheers, Jeff

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8c9ebe0..3941c98 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2260,6 +2260,8 @@ AtEOXact_RelationCache(bool isCommit)
 		)
 		return;
 
+	if (hash_get_num_entries(RelationIdCache) > 1000)
+		{RelationCacheInvalidate();}
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Thu, Nov 15, 2012 at 12:55 AM, Amit Kapila amit.kap...@huawei.com wrote: Now user can use this utility to decide if new-standby has max LSN greater than max LSN of new-master he needs to use a full backup on new-standby. Is my understanding right? No. The maximum LSN of data pages in new-standby should be compared with the last replayed LSN (IOW, the last valid LSN of previous timeline) of new-master. OTOH, there can be the case where new master has already been ahead of the startpoint. But in this case, there is no need for this utility. Right? No. The above comparison is required in this case. So now user may not be able to decide whether he needs to do incremental or full backup from new master, is this the case you are trying to point? Sorry, I could not parse this comment. Could you elaborate your concern again? I wanted to understand the usecase mentioned by you for this utility. As far as I can understand is that it will be used to decide that on new-standby (old-master) whether a full backup is needed from New-master(old-standby). Yes. And that situation can occur when new-standby has startpoint LSN greater than new-master? Whether the backup is required has nothing to do with the startpoint. The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. Without the backup, there is no way to revert the data which is ahead of new-master. Regards, -- Fujii Masao
Re: [HACKERS] WIP checksums patch
On Mon, Nov 12, 2012 at 12:39:17AM -0500, Greg Smith wrote: On 11/9/12 6:14 PM, Jeff Davis wrote: On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote: Yeah. I definitely think that we could shed an enormous amount of complexity by deciding that this is, for now, an option that can only be selected at initdb time. That would remove approximately 85% of everything I've ever disliked about this patch - without, I think, precluding the possibility of improving things later. That's certainly true, but it introduces one large problem: upgrading would not work, which (in the past few releases) we've treated as a major showstopper for many features. If you have pages that were written with one datetime storage format, and you create a cluster using the other one, that will fail. If checksums are done as an initdb time option, I see checksummed as being a block change on that level, and the precedent for not supporting it defensible. pg_upgrade will need to check for a mismatch--new cluster has checksums turned on, old one doesn't--and abort out if that happens. That can be lumped in with the other pg_controldata tests easily enough. Yes, pg_upgrade can do that easily. What I think this argues for, though, is being precise about naming what goes into pg_controldata. Let's say the initial commit target is an initdb time switch, but later finer grained ones are expected to be available. I think the output and source code labels on the initdb controldata part should refer to this as something like checksums available then. The word enabled could be misleading when there's finer grained enabling coming later. We don't want people to run pg_controldata, see checksums: enabled/on, and later discover they're not fully operational in that cluster yet. Saying checksums: available suggests there's somewhere else that should be checked, to tell if they're available on a given database/table or not. 
The sort of text I'm thinking of for the manual and/or warning is: You can't use pg_upgrade to migrate from a cluster where checksums are not available to one where they are. This limitation may be removed by a future version. "available" sounds like it is compiled in, but in this case, it means it is active. I think we are just going to need to rename it as we make it finer grained. -- Bruce Momjian br...@momjian.us  http://momjian.us EnterpriseDB  http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 11/11/12 6:59 PM, Andrew Dunstan wrote: I haven't followed this too closely, but I did wonder several days ago why this wasn't being made an initdb-time decision. One problem I see with this is that it would make regression testing much more cumbersome. Basically, to do a proper job, you'd have to run all the tests twice, once against each initdb setting. Either we automate this, which would mean everyone's tests are now running almost twice as long, or we don't, which would mean that some critical piece of low-level code would likely not get wide testing. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Peter Eisentraut escribió: On 11/11/12 6:59 PM, Andrew Dunstan wrote: I haven't followed this too closely, but I did wonder several days ago why this wasn't being made an initdb-time decision. One problem I see with this is that it would make regression testing much more cumbersome. Basically, to do a proper job, you'd have to run all the tests twice, once against each initdb setting. Either we automate this, which would mean everyone's tests are now running almost twice as long, or we don't, which would mean that some critical piece of low-level code would likely not get wide testing. We already have that problem with the isolation tests regarding transaction isolation levels: the tests are only run with whatever is the default_transaction_isolation setting, which is read committed in all buildfarm installs; so repeatable read and serializable are only tested when someone gets around to tweaking an installation manually. A proposal has been floated to fix that, but it needs someone to actually implement it. I wonder if something similar could be used to handle this case as well. I also wonder, though, if the existing test frameworks are really the best mechanisms to verify block layer functionality. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/12 11:50 AM, Andrew Dunstan wrote: COPY table FROM 'some command line |'; COPY table TO '| some command line'; I'd like to be able to filter STDIN if possible - with this syntax how is COPY going to know to hook up STDIN to the program? Why don't you filter the data before it gets to stdin? Some program is feeding the data to stdin on the client side. Why doesn't that do the filtering? I don't see a large advantage in having the data be sent unfiltered to the server and having the server do the filtering. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 11/14/2012 02:01 PM, Alvaro Herrera wrote: Peter Eisentraut escribió: On 11/11/12 6:59 PM, Andrew Dunstan wrote: I haven't followed this too closely, but I did wonder several days ago why this wasn't being made an initdb-time decision. One problem I see with this is that it would make regression testing much more cumbersome. Basically, to do a proper job, you'd have to run all the tests twice, once against each initdb setting. Either we automate this, which would mean everyone's tests are now running almost twice as long, or we don't, which would mean that some critical piece of low-level code would likely not get wide testing. We already have that problem with the isolation tests regarding transaction isolation levels: the tests are only run with whatever is the default_transaction_isolation setting, which is read committed in all buildfarm installs; so repeatable read and serializable are only tested when someone gets around to tweaking an installation manually. A proposal has been floated to fix that, but it needs someone to actually implement it. I wonder if something similar could be used to handle this case as well. I also wonder, though, if the existing test frameworks are really the best mechanisms to verify block layer functionality. There is nothing to prevent a buildfarm owner from using different settings - there is a stanza in the config file that provides for them to do so in fact. Maybe a saner thing to do though would be to run the isolation tests two or three times with different PGOPTIONS settings. Maybe we need two or three targets in the isolation test Makefile for that. Regarding checksums, I can add an option for the initdb that the buildfarm script runs. We already run different tests for different encodings. Of course, constantly expanding like this won't scale, so we need to pick the options we want to exercise carefully.
cheers andrew
Re: [HACKERS] WIP patch for hint bit i/o mitigation
On Wed, Nov 7, 2012 at 9:47 PM, Merlin Moncure mmonc...@gmail.com wrote: On Wed, Nov 7, 2012 at 6:01 AM, Amit Kapila amit.kap...@huawei.com wrote: Following the sig is a first cut at a patch (written by Atri) that attempts to mitigate hint bit i/o penalty when many pages worth of tuples are sequentially written out with the same transaction id. There have been other attempts to deal with this problem that fit niche cases (especially those that create the table in the same transaction as the one inserting) that work but don't really solve the problem generally. As we are now dealing with only the last xid (please refer to the details of the patch attached), the invalidation issues are not significant any more. I think you are right, but please check below the scenario I have in mind due to which I got confused: Session -1 1. let's say for tup-1 on page, cachedVisibilityXid is not set, and it goes inside SetHintBits and sets it to the xid of the tuple, which is let's say = 708 2. now for all consecutive tuples which have same xmin (708), it can directly refer cached id and cached status, even if hint bit is not set. Other Sessions 3. now from other sessions, large operations happened on all other tables except this table. 4. The situation can reach where xid can wrap around. Session -1 5. It again tries to query the same table, at this point it will compare the xid in tuple with cachedVisibilityXid. Now if all tuple's xid's for that particular table are frozen in all cases then it seems to be okay, otherwise it might be a problem. I am not fully aware about this wrap around and frozen xid concept, that's why I had doubted that it might create a problem; on further thought, it appears that I was wrong. Well there's that. But more to the point for the cache to fail you'd have to have a scenario where a table didn't scan any records for 1 billion+ transactions. See note [3] above for reasoning why this is implausible. Also we're already relying on this effect in transam.c.
merlin PFA below the sig the updated patch for the same. It maintains a cache cachedVisibilityXid which holds the results of clog visibility check. cachedVisibilityXidStatus holds the commit status of the XID in cachedVisibilityXid. In each visibility function (except HeapTupleSatisfiesVacuum() ), an additional check has been added to check if the commit status of Xmin or Xmax of a tuple can be retrieved from the cache. So, in place of

    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))

the condition is now

    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
        !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple) &&
          cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))

This checks if the commit status can be known from the cache or not before proceeding. I will be posting the patch to commit fest. Thoughts/Feedback? Atri -- Regards, Atri *l'apprenant*

patch:

*** tqual--unchanged.c	2012-09-20 03:17:58.0 +0530
--- tqual.c	2012-11-14 23:27:30.470499857 +0530
***************
*** 75,80 ****
--- 75,88 ----
  /* local functions */
  static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  
+ /*
+  * Single-item cache for results of clog visibility check. It's worth having
+  * such a cache to help reduce the amount of hint bit traffic when
+  * many sequentially touched tuples have the same XID.
+  */
+ static TransactionId cachedVisibilityXid;
+ /* Visibility status cache stores the commit status of the XID in cachedVisibilityXid */
+ static uint16 cachedVisibilityXidStatus;
  
  /*
   * SetHintBits()
***************
*** 117,122 ****
--- 125,133 ----
  		if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
  			return;	/* not flushed yet, so don't set hint */
+ 
+ 		cachedVisibilityXid = xid;
+ 		cachedVisibilityXidStatus = infomask;
  	}
  
  	tuple->t_infomask |= infomask;
***************
*** 164,170 ****
  bool
  HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
  {
! 	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
  	{
  		if (tuple->t_infomask & HEAP_XMIN_INVALID)
  			return false;
--- 175,183 ----
  bool
  HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
  {
! 	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
! 		!(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple) &&
! 		  cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED))
  	{
  		if (tuple->t_infomask & HEAP_XMIN_INVALID)
  			return false;
***************
*** 247,253 ****
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return true;
! 	if (tuple->t_infomask & HEAP_XMAX_COMMITTED)
  	{
  		if (tuple->t_infomask & HEAP_IS_LOCKED)
  			return true;
--- 260,268 ----
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/2012 02:05 PM, Peter Eisentraut wrote: On 11/14/12 11:50 AM, Andrew Dunstan wrote: COPY table FROM 'some command line |'; COPY table TO '| some command line'; I'd like to be able to filter STDIN if possible - with this syntax how is COPY going to know to hook up STDIN to the program? Why don't you filter the data before it gets to stdin? Some program is feeding the data to stdin on the client side. Why doesn't that do the filtering? I don't see a large advantage in having the data be sent unfiltered to the server and having the server do the filtering. Centralization of processing would be one obvious reason. I don't really see why the same reasoning doesn't apply on the backend. You could just preprocess the input before calling COPY (via a plperlu function for example). If we're going to have filtering functionality then it should be as general as possible, ISTM. But I seem to be alone in this, so I won't push it. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Andrew Dunstan and...@dunslane.net writes: On 11/14/2012 02:05 PM, Peter Eisentraut wrote: Why don't you filter the data before it gets to stdin? Some program is feeding the data to stdin on the client side. Why doesn't that do the filtering? I don't see a large advantage in having the data be sent unfiltered to the server and having the server do the filtering. Centralization of processing would be one obvious reason. If I understand correctly, what you're imagining is that the client sources data to a COPY FROM STDIN type of command, then the backend pipes that out to stdin of some filtering program, which it then reads the stdout of to get the data it processes and stores. We could in principle make that work, but there are some pretty serious implementation problems: popen doesn't do this so we'd have to cons up our own fork and pipe setup code, and we would have to write a bunch of asynchronous processing logic to account for the possibility that the filter program doesn't return data in similar-size chunks to what it reads. (IOW, it will never be clear when to try to read data from the filter and when to try to write data to it.) I think it's way too complicated for the amount of functionality you'd get. As Peter says, there's no strong reason not to do such processing on the client side. In fact there are pretty strong reasons to prefer to do it there, like not needing database superuser privilege to invoke the filter program. What I'm imagining is a very very simple addition to COPY that just allows it to execute popen() instead of fopen() to read or write the data source/sink. What you suggest would require hundreds of lines and create many opportunities for new bugs. regards, tom lane
Re: [HACKERS] Further pg_upgrade analysis for many tables
Jeff Janes jeff.ja...@gmail.com writes: On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: There are at least three ways we could whack that mole: ... * Keep a separate list (or data structure of your choice) so that relcache entries created in the current xact could be found directly rather than having to scan the whole relcache. That'd add complexity though, and could perhaps be a net loss for cases where the relcache isn't so bloated. Maybe a static list that can overflow, like the ResourceOwner/Lock table one recently added. The overhead of that should be very low. Are the three places where need_eoxact_work = true; the only places where things need to be added to the new structure? Yeah. The problem is not so much the number of places that do that, as that places that flush entries from the relcache would need to know to remove them from the separate list, else you'd have dangling pointers. It's certainly not impossible, I was just unsure how much of a pain in the rear it might be. The next quadratic behavior is in init_sequence. Yeah, that's another place that is using a linear list that perhaps should be a hashtable. OTOH, probably most sessions don't touch enough different sequences for that to be a win. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
Tom Lane escribió: Jeff Janes jeff.ja...@gmail.com writes: The next quadratic behavior is in init_sequence. Yeah, that's another place that is using a linear list that perhaps should be a hashtable. OTOH, probably most sessions don't touch enough different sequences for that to be a win. Could we use some adaptive mechanism here? Say we use a list for the first ten entries, and if an eleventh one comes in, we create a hash table for that one and all subsequent ones. All future calls would have to examine both the list for the first few and then the hash table. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
Andrew Dunstan and...@dunslane.net writes: Regarding checksums, I can add an option for the initdb that the buildfarm script runs. We already run different tests for different encodings. Of course, constantly expanding like this won't scale, so we need to pick the options we want to exercise carefully. I thought the whole point of the buildfarm was to provide a scalable way of exercising different combinations of options that individual developers couldn't practically test. We might need a little more coordination among buildfarm owners to ensure we get full coverage, of course. regards, tom lane
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/14/2012 02:37 PM, Tom Lane wrote: What I'm imagining is a very very simple addition to COPY that just allows it to execute popen() instead of fopen() to read or write the data source/sink. What you suggest would require hundreds of lines and create many opportunities for new bugs. That's certainly a better answer than any I've had. I accept the reasoning. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 11/14/2012 03:06 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Regarding checksums, I can add an option for the initdb that the buildfarm script runs. We already run different tests for different encodings. Of course, constant expanding like this won't scale, so we need to pick the options we want to exrecise carefully. I thought the whole point of the buildfarm was to provide a scalable way of exercising different combinations of options that individual developers couldn't practically test. We might need a little more coordination among buildfarm owners to ensure we get full coverage, of course. Yes, true. So lets' wait and see how the checksums thing works out and then we can tackle the buildfarm end. At any rate, I don't think the buildfarm is a reason not to have this as an initdb setting. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for hint bit i/o mitigation
On Thu, Nov 15, 2012 at 12:42 AM, Atri Sharma atri.j...@gmail.com wrote: On Wed, Nov 7, 2012 at 9:47 PM, Merlin Moncure mmonc...@gmail.com wrote: On Wed, Nov 7, 2012 at 6:01 AM, Amit Kapila amit.kap...@huawei.com wrote: Following the sig is a first cut at a patch (written by Atri) that attempts to mitigate the hint bit i/o penalty when many pages worth of tuples are sequentially written out with the same transaction id. There have been other attempts to deal with this problem that fit niche cases (especially those that create the table in the same transaction as the one inserting) that work but don't really solve the problem generally. As we are now dealing with only the last xid (please refer to the details of the patch attached), the invalidation issues are not significant any more. I think you are right, but please check below the scenario I have in mind due to which I got confused: Session 1 1. Let's say for tup-1 on a page, cachedVisibilityXid is not set, so it goes inside SetHintBits and sets it to the xid of the tuple, let's say = 708. 2. Now for all consecutive tuples which have the same xmin (708), it can directly refer to the cached id and cached status, even if the hint bit is not set. Other sessions 3. Now from other sessions, large operations happen on all other tables except this table. 4. The situation can reach where the xid can wrap around. Session 1 5. It again tries to query the same table; at this point it will compare the xid in the tuple with cachedVisibilityXid. Now if all tuples' xids for that particular table are frozen in all cases then it seems to be okay, otherwise it might be a problem. I am not fully aware of this wraparound and frozen xid concept; that's why I had doubted that it might create a problem. On further thought, it appears that I was wrong. Well there's that. But more to the point, for the cache to fail you'd have to have a scenario where a table didn't scan any records for 1 billion+ transactions.
See note [3] above for reasoning why this is implausible. Also we're already relying on this effect in transam.c. merlin PFA below the sig the updated patch for the same. It maintains a cache cachedVisibilityXid which holds the result of the clog visibility check. cachedVisibilityXidStatus holds the commit status of the XID in cachedVisibilityXid. In each visibility function (except HeapTupleSatisfiesVacuum() ), an additional check has been added to see if the commit status of Xmin or Xmax of a tuple can be retrieved from the cache. So, in place of if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)) the condition is now if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) && !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple) && cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED)) This checks whether the commit status can be known from the cache before proceeding. I will be posting the patch to the commitfest. Thoughts/Feedback? Atri -- Regards, Atri *l'apprenant *patch: *** tqual--unchanged.c2012-09-20 03:17:58.0 +0530 --- tqual.c2012-11-14 23:27:30.470499857 +0530 *** *** 75,80 --- 75,88 /* local functions */ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot); + /* + * Single-item cache for results of clog visibility check. It's worth having + * such a cache to help reduce the amount of hint bit traffic when + * many sequentially touched tuples have the same XID. + */ + static TransactionId cachedVisibilityXid; + /* Visibility status cache stores the commit status of the XID in cachedVisibilityXid */ + static uint16 cachedVisibilityXidStatus; /* * SetHintBits() *** *** 117,122 --- 125,133 if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer)) return; /* not flushed yet, so don't set hint */ + + cachedVisibilityXid = xid; + cachedVisibilityXidStatus = infomask; } tuple->t_infomask |= infomask; *** *** 164,170 bool HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer) { !
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)) { if (tuple->t_infomask & HEAP_XMIN_INVALID) return false; --- 175,183 bool HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer) { ! if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) && ! !(cachedVisibilityXid == HeapTupleHeaderGetXmin(tuple) && ! cachedVisibilityXidStatus == HEAP_XMIN_COMMITTED)) { if (tuple->t_infomask & HEAP_XMIN_INVALID) return false; *** *** 247,253 if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */ return true; ! if (tuple->t_infomask & HEAP_XMAX_COMMITTED) {
Re: [HACKERS] Doc patch, further describe and-mask nature of the permission system
On 11/13/2012 08:50:55 PM, Peter Eisentraut wrote: On Sat, 2012-09-29 at 01:16 -0500, Karl O. Pinc wrote: This patch makes some sweeping statements. Unfortunately, they are wrong. I will see if anything can be salvaged. Regards, Karl k...@meme.com Free Software: You don't pay back, you pay forward. -- Robert A. Heinlein -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().
Return 0 for INT_MIN % -1 (64-bit) instead of throwing an exception. This patch complements commit f9ac414c that fixed int4mod(). --- src/backend/utils/adt/int8.c |4 src/include/c.h |7 +++ 2 files changed, 11 insertions(+) diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c index 0e59956..9da651b 100644 --- a/src/backend/utils/adt/int8.c +++ b/src/backend/utils/adt/int8.c @@ -649,6 +649,10 @@ int8mod(PG_FUNCTION_ARGS) PG_RETURN_NULL(); } + /* SELECT ((-9223372036854775808)::int8) % (-1); causes a floating point exception */ + if (arg1 == INT64_MIN && arg2 == -1) + PG_RETURN_INT64(0); + /* No overflow is possible */ PG_RETURN_INT64(arg1 % arg2); diff --git a/src/include/c.h b/src/include/c.h index a6c0e6e..d20ba8c 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -294,6 +294,13 @@ typedef unsigned long long int uint64; #define UINT64CONST(x) ((uint64) x) #endif +#ifndef INT64_MAX +#define INT64_MAX INT64CONST(9223372036854775807) +#endif + +#ifndef INT64_MIN +#define INT64_MIN (-INT64_MAX-1) +#endif /* Select timestamp representation (float8 or int64) */ #ifdef USE_INTEGER_DATETIMES -- 1.7.10.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Doc patch making firm recommendation for setting the value of commit_delay
Some of you will be aware that I've tried to formulate a good general recommendation for setting the value of commit_delay, in light of the improvements made in 9.3. I previously asked for help finding a methodology for optimising commit_delay [1]. The documentation related to commit_delay still says that we don't know what might work well, though I don't think that's true any more. I found the time to do some benchmarks of my own - Greg Smith made available a server that he frequently uses for benchmarks. It was previously my observation that half of the raw single-page sync time as reported by pg_test_fsync for your wal_sync_method worked best for commit_delay. I went so far as to modify pg_test_fsync to output average time per op in microseconds for each operation with commit_delay in mind, which is a patch that has already been committed [2]. This general approach worked really well on my laptop, which has a slowish fsync of about 8 milliseconds (8,000 microseconds), for which a commit_delay setting of 4,000 (microseconds) seemed to clearly work best, on both insert and tpc-b benchmarks [3]. This server has an Intel 320 SSD series drive. Greg has already written quite a bit about this drive [4] for unrelated reasons. For the SSD, results of an insert-based pgbench-tools run are shown, with commit_delay at 0, 20 and 59 (determined by following my general advice above): http://commit-delay-results-ssd-insert.staticloud.com I also looked at a 3-disk RAID0 of 7200RPM drives connected through a 512MB battery-backed write cache (Areca controller), again using insert.sql: http://commit-delay-stripe-insert.staticloud.com In addition to a tpc-b benchmark with the same data directory on the same 3-disk stripe: http://commit-delay-results-stripe-tpcb.staticloud.com I used the same postgresql.conf in all cases, which you can see for each report, and did the usual median-of-three thing (though each run lasted 120 seconds, not the default 60 seconds).
Settings used for one particular pgbench run can be viewed here (though they're all exactly the same anyway): http://commit-delay-results-ssd-insert.staticloud.com/19/pg_settings.txt Attached is a doc-patch that makes recommendations that are consistent with my observations about what works best here. I'd like to see us making *some* recommendation - for sympathetic cases, setting commit_delay appropriately can make a very large difference to transaction throughput. Such sympathetic cases - many small write transactions - are something that tends to be seen relatively frequently with web applications, that disproportionately use cloud hosting. It isn't at all uncommon for these cases to be highly bound by their commit rate, and so it is compelling to try to amortize the cost of a flush as effectively as possible there. It would be unfortunate if no one was even aware that commit_delay is now useful for these cases, since the setting allows cloud hosting providers to help these cases quite a bit, without having to do something like compromise durability, which in general isn't acceptable. The point of all these benchmarks isn't to show how effective commit_delay now is, or can be - we already had that discussion months ago, before the alteration to its behaviour was committed. The point is to put my proposed doc changes in the appropriate context, so that I can build confidence that they're balanced and helpful, by showing cases that are not so sympathetic. Thoughts? 
[1] http://archives.postgresql.org/pgsql-performance/2012-08/msg3.php [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=82e429794b348cd80c1d1b011e21ffac98bc6e88 [3] http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html [4] http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/ -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services commit_delay_doc.2012_11_14.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().
Xi Wang xi.w...@gmail.com writes: Return 0 for INT_MIN % -1 (64-bit) instead of throwing an exception. This patch complements commit f9ac414c that fixed int4mod(). Meh. I didn't care for the explicit dependency on INT_MIN in the previous patch, and I like introducing INT64_MIN even less. I think we should be able to reduce the test to just if (arg2 == -1) return 0; since zero is the correct result for any value of arg1, not only INT_MIN. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().
On 11/14/12 4:46 PM, Tom Lane wrote: Meh. I didn't care for the explicit dependency on INT_MIN in the previous patch, and I like introducing INT64_MIN even less. I think we should be able to reduce the test to just if (arg2 == -1) return 0; since zero is the correct result for any value of arg1, not only INT_MIN. I agree. Will send a v2. Thanks. :) - xi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().
Xi Wang xi.w...@gmail.com writes: I agree. Will send a v2. Thanks. :) No need, I'm already patching it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH 1/2] Fix INT_MIN % -1 overflow in int8mod().
Return 0 for x % -1 instead of throwing an exception (e.g., when x is INT_MIN). Suggested by Tom Lane. --- src/backend/utils/adt/int8.c |4 1 file changed, 4 insertions(+) diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c index 0e59956..a30ab36 100644 --- a/src/backend/utils/adt/int8.c +++ b/src/backend/utils/adt/int8.c @@ -649,6 +649,10 @@ int8mod(PG_FUNCTION_ARGS) PG_RETURN_NULL(); } + /* SELECT ((-9223372036854775808)::int8) % (-1); causes a floating point exception */ + if (arg2 == -1) + PG_RETURN_INT64(0); + /* No overflow is possible */ PG_RETURN_INT64(arg1 % arg2); -- 1.7.10.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH 2/2] Clean up INT_MIN % -1 overflow in int4mod().
Since x % -1 returns 0 for any x, we don't need the check x == INT_MIN. Suggested by Tom Lane. --- src/backend/utils/adt/int.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c index 4be3901..3e423fe 100644 --- a/src/backend/utils/adt/int.c +++ b/src/backend/utils/adt/int.c @@ -1096,7 +1096,7 @@ int4mod(PG_FUNCTION_ARGS) } /* SELECT ((-2147483648)::int4) % (-1); causes a floating point exception */ - if (arg1 == INT_MIN && arg2 == -1) + if (arg2 == -1) PG_RETURN_INT32(0); /* No overflow is possible */ -- 1.7.10.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Fix INT_MIN % -1 overflow in int8mod().
On 11/14/12 5:03 PM, Tom Lane wrote: No need, I'm already patching it. Oops. Sorry. Ignore my patches. - xi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP checksums patch
On Tue, Oct 9, 2012 at 1:51 AM, Amit Kapila amit.kap...@huawei.com wrote: On Monday, October 01, 2012 11:11 PM Jeff Davis wrote: On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote: You are missing large parts of the previous thread, giving various opinions on what the UI should look like for enabling checksums. I read through all of the discussion that I could find. There was quite a lot, so perhaps I have forgotten pieces of it. But that one section in the docs does look out of date and/or confusing to me. I remember there was discussion about a way to ensure that checksums are set cluster-wide with some kind of special command (perhaps related to VACUUM) and a magic file to let recovery know whether checksums are set everywhere or not. That doesn't necessarily conflict with the GUC though (the GUC could be a way to write checksums lazily, while this new command could be a way to write checksums eagerly). If some consensus was reached on the exact user interface, can you please send me a link? AFAICT complete consensus has not been reached but one of the discussions can be found on below link: http://archives.postgresql.org/pgsql-hackers/2012-03/msg00279.php Here Robert has given suggestions and then further there is more discussion based on that points. According to me, the main points where more work for this patch is required as per previous discussions is as follows: 1. Performance impact of WAL log for hint-bits needs to be verified for scenario's other than pg_bench (Such as bulk data load (which I feel there is some way to optimize, but I don't know if that’s part of this patch)). Atri has a patch posted which (if it passes muster) would eliminate the i/o impact of WAL logging hint bits following a bulk load or any scenario where many pages worth of tuples were sequentially written out with the same XID. Atri's patch was written with the checksum patch in mind. 
http://archives.postgresql.org/message-id/CAOeZVid7zd9V-9WsEtf9wRCd0H4yw_1OgCLm7%3DCdvGPCxZcJJw%40mail.gmail.com merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
Patch applied to git head. Thanks to Ants Aasma for the analysis that led to the patch. --- On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote: On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote: On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian br...@momjian.us wrote: You can see a significant speedup with those loops removed. The 16k case is improved, but still not linear. The 16k dump/restore scale looks fine, so it must be something in pg_upgrade, or in the kernel. I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade for 8k and 64k tables are attached. pg_upgrade itself is now taking negligible time. I generated these timings from the attached test script.

             -- 9.3 normal --   - binary_upgrade -   --- pg_upgrade ---
    tables     dmp      res       dmp       res        git      patch
         1     0.12     0.07      0.13      0.07      11.06     11.02
      1000     2.20     2.46      3.57      2.82      19.15     18.61
      2000     4.51     5.01      8.22      5.80      29.12     26.89
      4000     8.97    10.88     14.76     12.43      45.87     43.08
      8000    15.30    24.72     30.57     27.10     100.31     79.75
     16000    36.14    54.88     62.27     61.69     248.03    167.94
     32000    55.29   162.20    115.16    179.15     695.05    376.84
     64000   149.86   716.46    265.77    724.32    2323.73   1122.38

You can see the speedup of the patch, particularly for a greater number of tables, e.g. 2x faster for 64k tables. The 64k profile shows the AtEOXact_RelationCache scaling problem. For the 8k profile nothing really pops out as a clear bottleneck. CPU time distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7% to pg_upgrade. At 64k I see pg_upgrade taking 12% of the duration time, if I subtract out the dump/restore times. I am attaching an updated pg_upgrade patch, which I believe is ready for application for 9.3. Postgres time itself breaks down with 10% for shutdown checkpoint and 90% for regular running, consisting of 16% parsing, 13% analyze, 20% plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network. That SVG graph was quite impressive. It looks to me that most benefit could be had from introducing more parallelism.
Are there any large roadblocks to pipelining the dump and restore to have them happen in parallel? I talked to Andrew Dunstan about parallelization in pg_restore. First, we currently use pg_dumpall, which isn't in the custom format required for parallel restore, but if we changed to custom format, create table isn't done in parallel; only create index/check constraints, trigger creation, etc. are. Not sure if it's worth pursuing this just for pg_upgrade. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] My first patch! (to \df output)
On Fri, Nov 9, 2012 at 3:22 PM, Jon Erdman postgre...@thewickedtribe.net wrote: Oops! Here it is in the proper diff format. I didn't have my env set up correctly :( Better add it here so it doesn't get lost: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add contrib module functions to docs' function index
On Tue, Nov 13, 2012 at 7:10 PM, Craig Ringer cr...@2ndquadrant.com wrote: I'm talking about making sure that contrib module functions (and settings) appear in the documentation index ( http://www.postgresql.org/docs/current/static/bookindex.html) so it's easy to find a function by name whether it's in core or contrib. This is what I want to add to TODO. +1. Separately, it might also be nice to add the contrib functions to the section 9 tables with an extra column showing their origin, but that's less clearly a good thing. Even if there's a column saying intarray for intarray functions in the array functions list, people will still try to use them without loading the extension and get confused when they're not found. It'll also bloat the listings of core functions. Rather than do that, I'd probably prefer to add a note to relevant sections. For example, in array functions I'd want to add Additional functions that operate only on arrays of integers are available in the a href=...intarray extension/a. The second approach seems better, and maybe only in cases where it's particularly relevant. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Hmm... what if we took this a step further and actually stored the checksums in a separate relation fork? That would make it pretty simple to support enabling/disabling checksums for particular relations. It would also allow us to have a wider checksum, like 32 or 64 bits rather than 16. I'm not scoffing at a 16-bit checksum, because even that's enough to catch a very high percentage of errors, but it wouldn't be terrible to be able to support a wider one, either. I don't remember exactly why this idea was sidelined before, but I don't think there were any showstoppers. It does have some desirable properties; most notably the ability to add checksums without a huge effort, so perhaps the idea can be revived. But there are some practical issues, as Tom points out. Another one is that it's harder for external utilities (like pg_basebackup) to verify checksums. And I just had another thought: these pages of checksums would be data pages, with an LSN. But as you clean ordinary data pages, you need to constantly bump the LSN of the very same checksum page (because it represents 1000 ordinary data pages); making it harder to actually clean the checksum page and finish a checkpoint. Is this a practical concern or am I borrowing trouble? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade speed for many tables
On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote: On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian br...@momjian.us wrote: Magnus reported that a customer with a million tables was finding pg_upgrade slow. I had never considered many tables to be a problem, but decided to test it. I created a database with 2k tables like this: CREATE TABLE test1990 (x SERIAL); Running the git version of pg_upgrade on that took 203 seconds. Using synchronous_commit=off dropped the time to 78 seconds. This was tested on magnetic disks with a write-through cache. (No change on an SSD with a super-capacitor.) I don't see anything unsafe about having pg_upgrade use synchronous_commit=off. I could set it just for the pg_dump reload, but it seems safe to just use it always. We don't write to the old cluster, and if pg_upgrade fails, you have to re-initdb the new cluster anyway. Patch attached. I think it should be applied to 9.2 as well. Is turning off synchronous_commit enough? What about turning off fsync? I did some testing with the attached patch on a magnetic disk with no BBU that turns off fsync; I got these results:

    tables   sync_com=off   fsync=off
         1       15.90        13.51
      1000       26.09        24.56
      2000       33.41        31.20
      4000       57.39        57.74
      8000      102.84       116.28
     16000      189.43       207.84

It shows fsync=off faster below 4k tables, and slower above 4k. Not sure why this is the case, but perhaps the buffering of the fsync is actually faster than doing a no-op fsync. I don't think fsync=off makes sense, except for testing; let me know if I should test something else. When I'm doing a pg_upgrade with thousands of tables, the shutdown checkpoint after restoring the dump to the new cluster takes a very long time, as the writer drains its operation table by opening and individually fsync-ing thousands of files. This takes about 40 ms per file, which I assume is a combination of a slow laptop disk drive, and a strange deal with ext4 which makes fsyncing a recently created file very slow.
But even with a faster hdd, this would still be a problem if it works the same way, with every file needing 4 rotations to be fsynced, and this happens in serial. Is this with the current code that does synchronous_commit=off? If not, can you test to see if this is still a problem? Worse, the shutdown only waits for the default of 60 seconds for the shutdown to take place before it throws an error and the entire pg_upgrade gives up. It seems to me that either the -t setting should be increased, or it should be an option to pg_upgrade. My workaround was to invoke a system-wide sync a couple seconds after the 'pg_ctl stop' is initiated. Flushing the files wholesale seems to work to make the checkpoint writer rapidly find it has nothing to do when it tries to flush them retail. Anyway, the reason I think turning fsync off might be reasonable is that as soon as the new cluster is shut down, pg_upgrade starts overwriting most of those just-fsynced files with other files from the old cluster, and AFAICT makes no effort to fsync them. So until there is a system-wide sync after the pg_upgrade finishes, your new cluster is already in mortal danger anyway. pg_upgrade does a cluster shutdown before overwriting those files. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/contrib/pg_upgrade/server.c b/contrib/pg_upgrade/server.c new file mode 100644 index 49d4c8f..01e0dd3 *** a/contrib/pg_upgrade/server.c --- b/contrib/pg_upgrade/server.c *** start_postmaster(ClusterInfo *cluster) *** 219,225 (cluster->controldata.cat_ver >= BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? "-b" : "-c autovacuum=off -c autovacuum_freeze_max_age=20", ! (cluster == new_cluster) ? "-c synchronous_commit=off" : "", cluster->pgopts ? cluster->pgopts : "", socket_string); /* --- 219,225 (cluster->controldata.cat_ver >= BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? "-b" : "-c autovacuum=off -c autovacuum_freeze_max_age=20", !
(cluster == new_cluster) ? "-c fsync=off" : "", cluster->pgopts ? cluster->pgopts : "", socket_string); /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] logical changeset generation v3
Hi, In response to this you will soon find the 14 patches that currently implement $subject. I'll go over each one after showing off for a bit: Start a postgres instance (with pg_hba.conf allowing replication connections): $ postgres -D ~/tmp/pgdev-lcr \ -c wal_level=logical \ -c max_wal_senders=10 \ -c max_logical_slots=10 \ -c wal_keep_segments=100 \ -c log_line_prefix=[%p %x] Start the changelog receiver: $ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v Generate changes: $ psql -h /tmp postgres EOF DROP TABLE IF EXISTS replication_example; CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); -- plain insert INSERT INTO replication_example(somedata, text) VALUES (1, 1); -- plain update UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq')); -- plain delete DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq')); -- wrapped in a transaction BEGIN; INSERT INTO replication_example(somedata, text) VALUES (1, 1); UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq')); DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq')); COMMIT; -- don't write out aborted data BEGIN; INSERT INTO replication_example(somedata, text) VALUES (2, 1); UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq')); DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq')); ROLLBACK; -- add a column BEGIN; INSERT INTO replication_example(somedata, text) VALUES (3, 1); ALTER TABLE replication_example ADD COLUMN bar int; INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1); COMMIT; -- once more outside INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1); -- DDL with table rewrite BEGIN; INSERT INTO replication_example(somedata, text) VALUES (5, 1); ALTER
TABLE replication_example RENAME COLUMN text TO somenum; INSERT INTO replication_example(somedata, somenum) VALUES (5, 2); ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4); INSERT INTO replication_example(somedata, somenum) VALUES (5, 3); COMMIT; EOF And the results printed by llog: BEGIN 16556826 COMMIT 16556826 BEGIN 16556827 table replication_example_id_seq: INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f COMMIT 16556827 BEGIN 16556828 table replication_example: INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1 COMMIT 16556828 BEGIN 16556829 table replication_example: UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1 COMMIT 16556829 BEGIN 16556830 table replication_example: DELETE (pkey): id[int4]:1 COMMIT 16556830 BEGIN 16556833 table replication_example: INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1 table replication_example: INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1 COMMIT 16556833 BEGIN 16556834 table replication_example: INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1 COMMIT 16556834 BEGIN 16556835 table replication_example: INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null) table replication_example: INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null) table pg_temp_74943: INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null) table pg_temp_74943: INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1 table pg_temp_74943: INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1 table pg_temp_74943: INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null) table pg_temp_74943: INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null) table replication_example: INSERT: id[int4]:9 somedata[int4]:5 
somenum[int4]:3 bar[int4]:(null) COMMIT 16556835 As you can see above we can decode WAL in the presence of nearly all forms of DDL. The plugin that outputted these changes is supposed to be added to contrib and is fairly small and uncomplicated. An interesting piece of information might be that in the very preliminary benchmarking I have done on this even the textual decoding could keep up with a full tilt pgbench -c16 -j16 -M prepared on my (somewhat larger) workstation. The wal space overhead was less than 1% between two freshly initdb'ed clusters, comparing wal_level=hot_standby with =logical. With a custom pgbench script I can saturate the decoding to the effect that it lags a second or so, but once I write out the data in a binary format it can keep up again. The biggest overhead is currently the more slowly increasing Global/RecentXmin, but that can be greatly improved by logging
[HACKERS] [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
--- src/backend/utils/cache/relmapper.c | 53 + src/include/catalog/indexing.h | 4 +-- src/include/utils/relmapper.h | 2 ++ 3 files changed, 57 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c index 6f21495..771f34d 100644 --- a/src/backend/utils/cache/relmapper.c +++ b/src/backend/utils/cache/relmapper.c @@ -180,6 +180,59 @@ RelationMapOidToFilenode(Oid relationId, bool shared) return InvalidOid; } +/* RelationMapFilenodeToOid + * + * Do the reverse of the normal direction of mapping done in + * RelationMapOidToFilenode. + * + * This is not supposed to be used during normal running but rather for + * informational purposes when looking at the filesystem or the xlog. + * + * Returns InvalidOid if the OID is not known, which can easily happen if the + * filenode is not of a relation that is nailed or shared or if it simply + * doesn't exist anywhere. + */ +Oid +RelationMapFilenodeToOid(Oid filenode, bool shared) +{ + const RelMapFile *map; + int32 i; + + /* If there are active updates, believe those over the main maps */ + if (shared) + { + map = active_shared_updates; + for (i = 0; i < map->num_mappings; i++) + { + if (filenode == map->mappings[i].mapfilenode) +return map->mappings[i].mapoid; + } + map = shared_map; + for (i = 0; i < map->num_mappings; i++) + { + if (filenode == map->mappings[i].mapfilenode) +return map->mappings[i].mapoid; + } + } + else + { + map = active_local_updates; + for (i = 0; i < map->num_mappings; i++) + { + if (filenode == map->mappings[i].mapfilenode) +return map->mappings[i].mapoid; + } + map = local_map; + for (i = 0; i < map->num_mappings; i++) + { + if (filenode == map->mappings[i].mapfilenode) +return map->mappings[i].mapoid; + } + } + + return InvalidOid; +} + /* * RelationMapUpdateMap * diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h index c3db3ff..81811f1 100644 --- a/src/include/catalog/indexing.h +++ b/src/include/catalog/indexing.h @@ -106,8 +106,8 @@
 DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_ops));
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
-DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
-#define ClassTblspcRelfilenodeIndexId  3171
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId  3455
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId  3164

diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 111a05c..4e56508 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -36,6 +36,8 @@ typedef struct xl_relmap_update
 
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
+extern Oid	RelationMapFilenodeToOid(Oid filenode, bool shared);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 					 bool immediate);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH 01/14] Add minimal binary heap implementation
Will be replaced by the binaryheap.[ch] from Abhijit once it's been reviewed.
---
 src/backend/lib/Makefile     |   3 +-
 src/backend/lib/simpleheap.c | 255 ++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 348 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 98ce3d7..c2bc5ba 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = ilist.o stringinfo.o
+OBJS = ilist.o simpleheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk

diff --git a/src/backend/lib/simpleheap.c b/src/backend/lib/simpleheap.c
new file mode 100644
index 000..825d0a8
--- /dev/null
+++ b/src/backend/lib/simpleheap.c
@@ -0,0 +1,255 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleheap.c
+ *	  A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/simpleheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <math.h>
+
+#include "lib/simpleheap.h"
+
+static inline int
+simpleheap_left_off(size_t i)
+{
+	return 2 * i + 1;
+}
+
+static inline int
+simpleheap_right_off(size_t i)
+{
+	return 2 * i + 2;
+}
+
+static inline int
+simpleheap_parent_off(size_t i)
+{
+	return floor((i - 1) / 2);
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off);
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off);
+
+static inline void
+simpleheap_swap(simpleheap *heap, size_t a, size_t b)
+{
+	simpleheap_kv swap;
+
+	swap.value = heap->values[a].value;
+	swap.key = heap->values[a].key;
+
+	heap->values[a].value = heap->values[b].value;
+	heap->values[a].key = heap->values[b].key;
+
+	heap->values[b].key = swap.key;
+	heap->values[b].value = swap.value;
+}
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion */
+	while (true)
+	{
+		size_t		left_off = simpleheap_left_off(node_off);
+		size_t		right_off = simpleheap_right_off(node_off);
+		size_t		swap_off = 0;
+
+		/* only one child can violate the heap property after a change */
+
+		/* check left child */
+		if (left_off < heap->size &&
+			heap->compare(&heap->values[left_off],
+						  &heap->values[node_off]) < 0)
+		{
+			/* heap condition violated */
+			swap_off = left_off;
+		}
+
+		/* check right child */
+		if (right_off < heap->size &&
+			heap->compare(&heap->values[right_off],
+						  &heap->values[node_off]) < 0)
+		{
+			/* heap condition violated */
+
+			/* swap with the smaller child */
+			if (!swap_off ||
+				heap->compare(&heap->values[right_off],
+							  &heap->values[left_off]) < 0)
+			{
+				swap_off = right_off;
+			}
+		}
+
+		if (!swap_off)
+		{
+			/* heap condition fulfilled, abort */
+			break;
+		}
+
+		/* swap node with the child violating the property */
+		simpleheap_swap(heap, swap_off, node_off);
+
+		/* recurse, check child subtree */
+		node_off = swap_off;
+	}
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion */
+	while (true)
+	{
+		size_t		parent_off = simpleheap_parent_off(node_off);
+
+		if (heap->compare(&heap->values[parent_off],
+						  &heap->values[node_off]) > 0)
+		{
+			/* heap property violated */
+			simpleheap_swap(heap, node_off, parent_off);
+
+			/* recurse */
+			node_off = parent_off;
+		}
+		else
+			break;
+	}
+}
+
+simpleheap *
+simpleheap_allocate(size_t allocate)
+{
+	simpleheap *heap = palloc(sizeof(simpleheap));
+
+	heap->values = palloc(sizeof(simpleheap_kv) * allocate);
+	heap->size = 0;
+	heap->space = allocate;
+	return heap;
+}
+
+void
+simpleheap_free(simpleheap *heap)
+{
+	pfree(heap->values);
+	pfree(heap);
+}
+
+/* initial building of a heap */
+void
+simpleheap_build(simpleheap *heap)
+{
+	int			i;
+
+	for (i = simpleheap_parent_off(heap->size - 1); i >= 0; i--)
+	{
+		simpleheap_sift_down(heap, i);
+	}
+}
+
+/*
+ * Change the key of the topmost element.
+ */
+void
+simpleheap_change_key(simpleheap *heap, void *key)
+{
+	size_t		next_off = 0;
+	int			ret;
+	simpleheap_kv *kv;
+
+	heap->values[0].key = key;
+
+	/* no need to do anything if there is only one element */
+	if (heap->size == 1)
+	{
+		return;
+	}
+	else if (heap->size == 2)
+	{
+		next_off = 1;
+	}
+	else
+	{
+		ret = heap->compare(&heap->values[simpleheap_left_off(0)],
+							&heap->values[simpleheap_right_off(0)]);
+
+		if (ret == -1)
+			next_off = simpleheap_left_off(0);
+		else
+			next_off = simpleheap_right_off(0);
+	}
+
+	/*
+	 * compare with the next key. If we're still smaller we can skip
+	 *
[HACKERS] [PATCH 06/14] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
This requires the RELFILENODE syscache and the RelationMapFilenodeToOid function added in previous commits.
---
 doc/src/sgml/func.sgml         | 23 +++++++++++-
 src/backend/utils/adt/dbsize.c | 79 ++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f8f63d8..708da35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15170,7 +15170,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <para>
     The functions shown in <xref linkend="functions-admin-dblocation"> assist
-    in identifying the specific disk files associated with database objects.
+    in identifying the specific disk files associated with database objects,
+    or doing the reverse.
    </para>
 
@@ -15179,6 +15179,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <indexterm>
     <primary>pg_relation_filepath</primary>
    </indexterm>
+   <indexterm>
+    <primary>pg_relation_by_filenode</primary>
+   </indexterm>
 
    <table id="functions-admin-dblocation">
     <title>Database Object Location Functions</title>
@@ -15207,6 +15210,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         File path name of the specified relation
        </entry>
       </row>
+      <row>
+       <entry>
+        <literal><function>pg_relation_by_filenode(<parameter>tablespace</parameter> <type>oid</type>, <parameter>filenode</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>regclass</type></entry>
+       <entry>
+        Find the relation associated with a filenode
+       </entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -15230,6 +15242,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    the relation.
   </para>
 
+  <para>
+   <function>pg_relation_by_filenode</> is the reverse of
+   <function>pg_relation_filenode</>. Given a <quote>tablespace</> OID and
+   a <quote>filenode</> it returns the associated relation. The default
+   tablespace for user tables can be replaced with 0.
+   Check the
+   documentation of <function>pg_relation_filenode</> for an explanation of
+   why this cannot always easily be answered by querying
+   <structname>pg_class</>.
+  </para>
+
  </sect2>
 
  <sect2 id="functions-admin-genfile">

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..ec26291 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -741,6 +741,85 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Get the relation via (reltablespace, relfilenode)
+ *
+ * This is expected to be used when somebody wants to match an individual file
+ * on the filesystem back to its table. That's not trivially possible via
+ * pg_class because that doesn't contain the relfilenodes of shared and nailed
+ * tables.
+ *
+ * We don't fail but return NULL if we cannot find a mapping.
+ *
+ * Instead of knowing DEFAULTTABLESPACE_OID you can pass 0.
+ */
+Datum
+pg_relation_by_filenode(PG_FUNCTION_ARGS)
+{
+	Oid			reltablespace = PG_GETARG_OID(0);
+	Oid			relfilenode = PG_GETARG_OID(1);
+	Oid			lookup_tablespace = reltablespace;
+	Oid			result = InvalidOid;
+	HeapTuple	tuple;
+
+	if (reltablespace == 0)
+		reltablespace = DEFAULTTABLESPACE_OID;
+
+	/* pg_class stores 0 instead of DEFAULTTABLESPACE_OID */
+	if (reltablespace == DEFAULTTABLESPACE_OID)
+		lookup_tablespace = 0;
+
+	tuple = SearchSysCache2(RELFILENODE,
+							lookup_tablespace,
+							relfilenode);
+
+	/* found it in the system catalog, so it cannot be a shared/nailed table */
+	if (HeapTupleIsValid(tuple))
+	{
+		result = HeapTupleHeaderGetOid(tuple->t_data);
+		ReleaseSysCache(tuple);
+	}
+	else
+	{
+		if (reltablespace == GLOBALTABLESPACE_OID)
+		{
+			result = RelationMapFilenodeToOid(relfilenode, true);
+		}
+		else
+		{
+			Form_pg_class relform;
+
+			result = RelationMapFilenodeToOid(relfilenode, false);
+
+			if (result != InvalidOid)
+			{
+				/* check that we found the correct relation */
+				tuple = SearchSysCache1(RELOID, result);
+
+				if (!HeapTupleIsValid(tuple))
+				{
+					elog(ERROR, "could not refind previously looked up relation with oid %u",
+						 result);
+				}
+
+				relform = (Form_pg_class) GETSTRUCT(tuple);
+
+				if (relform->reltablespace != reltablespace &&
+					relform->reltablespace != lookup_tablespace)
+					result = InvalidOid;
+
+				ReleaseSysCache(tuple);
+			}
+		}
+	}
+
+	if (!OidIsValid(result))
+		PG_RETURN_NULL();
+	else
+		PG_RETURN_OID(result);
+}
+
+/*
  * Get the pathname (relative to $PGDATA) of a relation
  *
  * See comments for pg_relation_filenode.

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 16033c7..d28db63 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3436,6 +3436,8 @@ DATA(insert OID = 2998 ( pg_indexes_size PGNSP PGUID 12 1 0 0 0 f f f f
[HACKERS] [PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
This cache is somewhat problematic because, formally, indexes used by syscaches need to be unique; this one is not. That is only because of the 0/InvalidOid values stored in pg_class.relfilenode for nailed/shared catalog relations. The syscache will never be queried for InvalidOid relfilenodes though, so it seems to be safe even if it violates the rules somewhat.

It might be nicer to add infrastructure to do this properly, like using a partial index; it's not clear what the best way to do that is though.

Needs a CATVERSION bump.
---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)

diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ca22efd..9d2f6b7 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -613,6 +613,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		1024
 	},
+	{RelationRelationId,		/* RELFILENODE */
+		ClassTblspcRelfilenodeIndexId,
+		2,
+		{
+			Anum_pg_class_reltablespace,
+			Anum_pg_class_relfilenode,
+			0,
+			0
+		},
+		1024
+	},
 	{RewriteRelationId,			/* RULERELNAME */
 		RewriteRelRulenameIndexId,
 		2,

diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 238fe58..c3db3ff 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_ops));
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId  3171
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId  3164

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f935eb1..16033c7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4673,6 +4673,7 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
+
 /*
  * Symbolic values for provolatile column: these indicate whether the result
  * of a function is dependent *only* on the values of its explicit arguments,

diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d1a9855..9a39077 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,7 @@ enum SysCacheIdentifier
 	RANGETYPE,
 	RELNAMENSP,
 	RELOID,
+	RELFILENODE,
 	RULERELNAME,
 	STATRELATTINH,
 	TABLESPACEOID,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids
To avoid complicating the logic we store both the toplevel xids and the subxids in ->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids. Also skip logging any subxids if the snapshot is suboverflowed; they aren't useful in that case anyway.

This makes some operations cheaper and allows faster startup for the future logical decoding feature, because that doesn't care about subtransactions/suboverflow'edness.
---
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/storage/ipc/procarray.c | 65 +++++++++++++++++++++----------
 src/backend/storage/ipc/standby.c   |  8 +++--
 src/include/storage/standby.h       |  2 ++
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1faf666..1749f46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5629,6 +5629,7 @@ StartupXLOG(void)
 		 * subxids are listed with their parent prepared transactions.
 		 */
 		running.xcnt = nxids;
+		running.subxcnt = 0;
 		running.subxid_overflow = false;
 		running.nextXid = checkPoint.nextXid;
 		running.oldestRunningXid = oldestActiveXID;
@@ -7813,6 +7814,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		 * with their parent prepared transactions.
 		 */
 		running.xcnt = nxids;
+		running.subxcnt = 0;
 		running.subxid_overflow = false;
 		running.nextXid = checkPoint.nextXid;
 		running.oldestRunningXid = oldestActiveXID;

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8c0d7b0..a98358d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -501,6 +501,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Remove stale transactions, if any.
 	 */
 	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
+
+	/*
+	 * Remove stale locks, if any.
+	 *
+	 * Locks are always assigned to the toplevel xid so we don't need to care
+	 * about subxcnt/subxids (and by extension not about ->suboverflowed).
+	 */
 	StandbyReleaseOldLocks(running->xcnt, running->xids);
 
 	/*
@@ -581,13 +588,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Allocate a temporary array to avoid modifying the array passed as
 	 * argument.
 	 */
-	xids = palloc(sizeof(TransactionId) * running->xcnt);
+	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
 
 	/*
 	 * Add to the temp array any xids which have not already completed.
 	 */
 	nxids = 0;
-	for (i = 0; i < running->xcnt; i++)
+	for (i = 0; i < running->xcnt + running->subxcnt; i++)
 	{
 		TransactionId xid = running->xids[i];
 
@@ -1627,15 +1634,13 @@ GetRunningTransactionData(void)
 	oldestRunningXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray collecting all xids
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		volatile PGPROC *proc = &allProcs[pgprocno];
 		volatile PGXACT *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
-		int			nxids;
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = pgxact->xid;
@@ -1652,30 +1657,46 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		/*
-		 * Save subtransaction XIDs. Other backends can't add or remove
-		 * entries while we're holding XidGenLock.
-		 */
-		nxids = pgxact->nxids;
-		if (nxids > 0)
-		{
-			memcpy(&xids[count], (void *) proc->subxids.xids,
-				   nxids * sizeof(TransactionId));
-			count += nxids;
-			subcount += nxids;
+		if (pgxact->overflowed)
+			suboverflowed = true;
+	}
 
-			if (pgxact->overflowed)
-				suboverflowed = true;
+	/*
+	 * Spin over procArray collecting all subxids, but only if there hasn't
+	 * been a suboverflow.
+	 */
+	if (!suboverflowed)
+	{
+		for (index = 0; index < arrayP->numProcs; index++)
+		{
+			int			pgprocno = arrayP->pgprocnos[index];
+			volatile PGPROC *proc = &allProcs[pgprocno];
+			volatile PGXACT *pgxact = &allPgXact[pgprocno];
+			int			nxids;
 
 			/*
-			 * Top-level XID of a transaction is always less than any of its
-			 * subxids, so we don't need to check if any of the subxids are
-			 * smaller than oldestRunningXid
+			 * Save subtransaction XIDs. Other backends can't add or remove
+			 * entries while we're holding XidGenLock.
 			 */
+			nxids = pgxact->nxids;
+			if (nxids > 0)
+			{
+				memcpy(&xids[count], (void *) proc->subxids.xids,
+					   nxids * sizeof(TransactionId));
+				count += nxids;
+				subcount += nxids;
+
+				/*
+				 * Top-level XID of a transaction is always less than any of
+				 * its subxids, so we don't need to check if any of the
+				 * subxids are smaller than oldestRunningXid
+				 */
+			}
 		}
 	}
 
-	CurrentRunningXacts->xcnt = count;
+	CurrentRunningXacts->xcnt = count - subcount;
+	CurrentRunningXacts->subxcnt = subcount;
 	CurrentRunningXacts->subxid_overflow =
[HACKERS] [PATCH 10/14] Allow walsenders to connect to a specific database
Currently the decision whether to connect to a database or not is made by checking whether the passed dbname parameter is replication. Unfortunately this makes it impossible to connect a to a database named replication... This is useful for future walsender commands which need database interaction. --- src/backend/postmaster/postmaster.c| 7 -- .../libpqwalreceiver/libpqwalreceiver.c| 4 ++-- src/backend/replication/walsender.c| 27 ++ src/backend/utils/init/postinit.c | 5 src/bin/pg_basebackup/pg_basebackup.c | 4 ++-- src/bin/pg_basebackup/pg_receivexlog.c | 4 ++-- src/bin/pg_basebackup/receivelog.c | 4 ++-- 7 files changed, 41 insertions(+), 14 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b223fee..05048bc 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1806,10 +1806,13 @@ retry1: if (strlen(port-user_name) = NAMEDATALEN) port-user_name[NAMEDATALEN - 1] = '\0'; - /* Walsender is not related to a particular database */ - if (am_walsender) + /* Generic Walsender is not related to a particular database */ + if (am_walsender strcmp(port-database_name, replication) == 0) port-database_name[0] = '\0'; + if (am_walsender) + elog(WARNING, connecting to %s, port-database_name); + /* * Done putting stuff in TopMemoryContext. 
*/ diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index bfaebea..c39062b 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -114,7 +114,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint) the primary server: %s, PQerrorMessage(streamConn; } - if (PQnfields(res) != 3 || PQntuples(res) != 1) + if (PQnfields(res) != 4 || PQntuples(res) != 1) { int ntuples = PQntuples(res); int nfields = PQnfields(res); @@ -122,7 +122,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint) PQclear(res); ereport(ERROR, (errmsg(invalid response from primary server), - errdetail(Expected 1 tuple with 3 fields, got %d tuples with %d fields., + errdetail(Expected 1 tuple with 4 fields, got %d tuples with %d fields., ntuples, nfields))); } primary_sysid = PQgetvalue(res, 0, 0); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 8774d7e..6452c34 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -40,6 +40,7 @@ #include access/transam.h #include access/xlog_internal.h #include catalog/pg_type.h +#include commands/dbcommands.h #include funcapi.h #include libpq/libpq.h #include libpq/pqformat.h @@ -202,10 +203,12 @@ IdentifySystem(void) char tli[11]; char xpos[MAXFNAMELEN]; XLogRecPtr logptr; + char*dbname = NULL; /* - * Reply with a result set with one row, three columns. First col is - * system ID, second is timeline ID, and third is current xlog location. + * Reply with a result set with one row, four columns. First col is system + * ID, second is timeline ID, third is current xlog location and the fourth + * contains the database name if we are connected to one. 
*/ snprintf(sysid, sizeof(sysid), UINT64_FORMAT, @@ -216,9 +219,14 @@ IdentifySystem(void) snprintf(xpos, sizeof(xpos), %X/%X, (uint32) (logptr 32), (uint32) logptr); + if (MyDatabaseId != InvalidOid) + dbname = get_database_name(MyDatabaseId); + else + dbname = (none); + /* Send a RowDescription message */ pq_beginmessage(buf, 'T'); - pq_sendint(buf, 3, 2); /* 3 fields */ + pq_sendint(buf, 4, 2); /* 4 fields */ /* first field */ pq_sendstring(buf, systemid); /* col name */ @@ -246,17 +254,28 @@ IdentifySystem(void) pq_sendint(buf, -1, 2); pq_sendint(buf, 0, 4); pq_sendint(buf, 0, 2); + + /* fourth field */ + pq_sendstring(buf, dbname); + pq_sendint(buf, 0, 4); + pq_sendint(buf, 0, 2); + pq_sendint(buf, TEXTOID, 4); + pq_sendint(buf, -1, 2); + pq_sendint(buf, 0, 4); + pq_sendint(buf, 0, 2); pq_endmessage(buf); /* Send a DataRow message */ pq_beginmessage(buf, 'D'); - pq_sendint(buf, 3, 2); /* # of columns */ + pq_sendint(buf, 4, 2); /* # of columns */ pq_sendint(buf, strlen(sysid), 4); /* col1 len */ pq_sendbytes(buf, (char *) sysid, strlen(sysid)); pq_sendint(buf, strlen(tli), 4); /* col2 len */ pq_sendbytes(buf, (char *) tli, strlen(tli)); pq_sendint(buf, strlen(xpos), 4); /* col3 len */ pq_sendbytes(buf, (char *) xpos, strlen(xpos)); + pq_sendint(buf, strlen(dbname), 4); /* col4 len */ + pq_sendbytes(buf, (char *) dbname, strlen(dbname)); pq_endmessage(buf); } diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index
[HACKERS] [PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
For the regular satisfies routines this is needed in prepareation of logical decoding. I changed the non-regular ones for consistency as well. The naming between htup, tuple and similar is rather confused, I could not find any consistent naming anywhere. This is preparatory work for the logical decoding feature which needs to be able to get to a valid relfilenode from when checking the visibility of a tuple. --- contrib/pgrowlocks/pgrowlocks.c | 2 +- src/backend/access/heap/heapam.c | 13 ++ src/backend/access/heap/pruneheap.c | 16 ++-- src/backend/catalog/index.c | 2 +- src/backend/commands/analyze.c | 3 ++- src/backend/commands/cluster.c | 2 +- src/backend/commands/vacuumlazy.c| 3 ++- src/backend/storage/lmgr/predicate.c | 2 +- src/backend/utils/time/tqual.c | 50 +--- src/include/utils/snapshot.h | 4 +-- src/include/utils/tqual.h| 20 +++ 11 files changed, 83 insertions(+), 34 deletions(-) diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c index 20beed2..8f9db55 100644 --- a/contrib/pgrowlocks/pgrowlocks.c +++ b/contrib/pgrowlocks/pgrowlocks.c @@ -120,7 +120,7 @@ pgrowlocks(PG_FUNCTION_ARGS) /* must hold a buffer lock to call HeapTupleSatisfiesUpdate */ LockBuffer(scan-rs_cbuf, BUFFER_LOCK_SHARE); - if (HeapTupleSatisfiesUpdate(tuple-t_data, + if (HeapTupleSatisfiesUpdate(tuple, GetCurrentCommandId(false), scan-rs_cbuf) == HeapTupleBeingUpdated) { diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 64aecf2..d025ff7 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -276,6 +276,7 @@ heapgetpage(HeapScanDesc scan, BlockNumber page) HeapTupleData loctup; bool valid; + loctup.t_tableOid = RelationGetRelid(scan-rs_rd); loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp); loctup.t_len = ItemIdGetLength(lpp); ItemPointerSet((loctup.t_self), page, lineoff); @@ -1590,7 +1591,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, 
heapTuple-t_data = (HeapTupleHeader) PageGetItem(dp, lp); heapTuple-t_len = ItemIdGetLength(lp); - heapTuple-t_tableOid = relation-rd_id; + heapTuple-t_tableOid = RelationGetRelid(relation); heapTuple-t_self = *tid; /* @@ -1638,7 +1639,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, * transactions. */ if (all_dead *all_dead - !HeapTupleIsSurelyDead(heapTuple-t_data, RecentGlobalXmin)) + !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin)) *all_dead = false; /* @@ -2418,12 +2419,13 @@ heap_delete(Relation relation, ItemPointer tid, lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid)); Assert(ItemIdIsNormal(lp)); + tp.t_tableOid = RelationGetRelid(relation); tp.t_data = (HeapTupleHeader) PageGetItem(page, lp); tp.t_len = ItemIdGetLength(lp); tp.t_self = *tid; l1: - result = HeapTupleSatisfiesUpdate(tp.t_data, cid, buffer); + result = HeapTupleSatisfiesUpdate(tp, cid, buffer); if (result == HeapTupleInvisible) { @@ -2788,6 +2790,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, lp = PageGetItemId(page, ItemPointerGetOffsetNumber(otid)); Assert(ItemIdIsNormal(lp)); + oldtup.t_tableOid = RelationGetRelid(relation); oldtup.t_data = (HeapTupleHeader) PageGetItem(page, lp); oldtup.t_len = ItemIdGetLength(lp); oldtup.t_self = *otid; @@ -2800,7 +2803,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, */ l2: - result = HeapTupleSatisfiesUpdate(oldtup.t_data, cid, buffer); + result = HeapTupleSatisfiesUpdate(oldtup, cid, buffer); if (result == HeapTupleInvisible) { @@ -3502,7 +3505,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, tuple-t_tableOid = RelationGetRelid(relation); l3: - result = HeapTupleSatisfiesUpdate(tuple-t_data, cid, *buffer); + result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer); if (result == HeapTupleInvisible) { diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 97a2868..edb3a09 100644 --- a/src/backend/access/heap/pruneheap.c 
+++ b/src/backend/access/heap/pruneheap.c @@ -340,6 +340,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, OffsetNumber chainitems[MaxHeapTuplesPerPage]; int nchain = 0, i; + HeapTupleData tup; + + tup.t_tableOid = RelationGetRelid(relation); rootlp = PageGetItemId(dp, rootoffnum); @@ -349,6 +352,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, if (ItemIdIsNormal(rootlp)) { htup = (HeapTupleHeader) PageGetItem(dp, rootlp); + + tup.t_data = htup; + tup.t_len = ItemIdGetLength(rootlp); + ItemPointerSet((tup.t_self), BufferGetBlockNumber(buffer), rootoffnum); + if (HeapTupleHeaderIsHeapOnly(htup)) { /*
[HACKERS] [PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'
--- contrib/Makefile | 1 + contrib/test_decoding/Makefile| 16 +++ contrib/test_decoding/test_decoding.c | 192 ++ 3 files changed, 209 insertions(+) create mode 100644 contrib/test_decoding/Makefile create mode 100644 contrib/test_decoding/test_decoding.c diff --git a/contrib/Makefile b/contrib/Makefile index d230451..8709992 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -48,6 +48,7 @@ SUBDIRS = \ tablefunc \ tcn \ test_parser \ + test_decoding \ tsearch2 \ unaccent \ vacuumlo diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile new file mode 100644 index 000..2ac9653 --- /dev/null +++ b/contrib/test_decoding/Makefile @@ -0,0 +1,16 @@ +# contrib/test_decoding/Makefile + +MODULE_big = test_decoding +OBJS = test_decoding.o + + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/test_decoding +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c new file mode 100644 index 000..f3d90e3 --- /dev/null +++ b/contrib/test_decoding/test_decoding.c @@ -0,0 +1,192 @@ +/*- + * + * test_deocding.c + * example output plugin for the logical replication functionality + * + * Copyright (c) 2012, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/test_decoding/test_decoding.c + * + *- + */ +#include postgres.h + +#include catalog/pg_class.h +#include catalog/pg_type.h +#include catalog/index.h + +#include replication/output_plugin.h +#include replication/snapbuild.h + +#include utils/lsyscache.h +#include utils/memutils.h +#include utils/rel.h +#include utils/relcache.h +#include utils/syscache.h +#include utils/typcache.h + + +PG_MODULE_MAGIC; + +void _PG_init(void); + +void WalSndWriteData(XLogRecPtr lsn, const char *data, Size len); + +extern void pg_decode_init(void **private); + +extern bool 
pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn); +extern bool pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn); +extern bool pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn, Oid tableoid, ReorderBufferChange *change); + + +void +_PG_init(void) +{ +} + +void +pg_decode_init(void **private) +{ + AssertVariableIsOfType(pg_decode_init, LogicalDecodeInitCB); + *private = AllocSetContextCreate(TopMemoryContext, + text conversion context, + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + ALLOCSET_DEFAULT_MAXSIZE); +} + +bool +pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn) +{ + AssertVariableIsOfType(pg_decode_begin_txn, LogicalDecodeBeginCB); + + appendStringInfo(out, BEGIN %d, txn-xid); + return true; +} + +bool +pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn) +{ + AssertVariableIsOfType(pg_decode_commit_txn, LogicalDecodeCommitCB); + + appendStringInfo(out, COMMIT %d, txn-xid); + return true; +} + +static void +tuple_to_stringinfo(StringInfo s, TupleDesc tupdesc, HeapTuple tuple) +{ + int i; + HeapTuple typeTuple; + Form_pg_type pt; + + for (i = 0; i tupdesc-natts; i++) + { + Oid typid, typoutput; + bool typisvarlena; + Datum origval, val; + char*outputstr; + boolisnull; + if (tupdesc-attrs[i]-attisdropped) + continue; + if (tupdesc-attrs[i]-attnum 0) + continue; + + typid = tupdesc-attrs[i]-atttypid; + + typeTuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid)); + if (!HeapTupleIsValid(typeTuple)) + elog(ERROR, cache lookup failed for type %u, typid); + pt = (Form_pg_type) GETSTRUCT(typeTuple); + + appendStringInfoChar(s, ' '); + appendStringInfoString(s, NameStr(tupdesc-attrs[i]-attname)); + appendStringInfoChar(s, '['); + appendStringInfoString(s, NameStr(pt-typname)); + appendStringInfoChar(s, ']'); + + getTypeOutputInfo(typid, + typoutput, typisvarlena); + + 
ReleaseSysCache(typeTuple); + + origval = fastgetattr(tuple, i + 1, tupdesc, isnull); + + if (typisvarlena !isnull) + val = PointerGetDatum(PG_DETOAST_DATUM(origval)); + else + val = origval; + + if (isnull) + outputstr = (null); + else + outputstr = OidOutputFunctionCall(typoutput, val); + + appendStringInfoChar(s, ':'); + appendStringInfoString(s, outputstr); + } +} + +/* This is is just for demonstration, don't ever use this code for anything real! */ +bool +pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn, + Oid tableoid, ReorderBufferChange *change) +{ + Relation relation = RelationIdGetRelation(tableoid); + Form_pg_class class_form =
[HACKERS] [PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
--- src/bin/pg_basebackup/Makefile | 7 +- src/bin/pg_basebackup/pg_receivellog.c | 717 + src/bin/pg_basebackup/streamutil.c | 3 +- src/bin/pg_basebackup/streamutil.h | 1 + 4 files changed, 725 insertions(+), 3 deletions(-) create mode 100644 src/bin/pg_basebackup/pg_receivellog.c diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile index 5a2a46a..3775c44 100644 --- a/src/bin/pg_basebackup/Makefile +++ b/src/bin/pg_basebackup/Makefile @@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS) OBJS=receivelog.o streamutil.o $(WIN32RES) -all: pg_basebackup pg_receivexlog +all: pg_basebackup pg_receivexlog pg_receivellog pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport $(CC) $(CFLAGS) pg_basebackup.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) @@ -28,6 +28,9 @@ pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport pg_receivexlog: pg_receivexlog.o $(OBJS) | submake-libpq submake-libpgport $(CC) $(CFLAGS) pg_receivexlog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) +pg_receivellog: pg_receivellog.o $(OBJS) | submake-libpq submake-libpgport + $(CC) $(CFLAGS) pg_receivellog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) + install: all installdirs $(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)' $(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)' @@ -40,4 +43,4 @@ uninstall: rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)' clean distclean maintainer-clean: - rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o + rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o diff --git a/src/bin/pg_basebackup/pg_receivellog.c b/src/bin/pg_basebackup/pg_receivellog.c new file mode 100644 index 000..1a95991 --- /dev/null +++ b/src/bin/pg_basebackup/pg_receivellog.c @@ -0,0 +1,717 @@ +/*- + * + * pg_receivellog.c - 
receive streaming logical log data and write it + * to a local file. + * + * Author: Magnus Hagander mag...@hagander.net + * + * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/bin/pg_basebackup/pg_receivellog.c + *- + */ + +/* + * We have to use postgres.h not postgres_fe.h here, because there's so much + * backend-only stuff in the XLOG include files we need. But we need a + * frontend-ish environment otherwise. Hence this ugly hack. + */ +#define FRONTEND 1 +#include postgres.h +#include libpq-fe.h +#include libpq/pqsignal.h +#include access/xlog_internal.h +#include utils/datetime.h +#include utils/timestamp.h + +#include receivelog.h +#include streamutil.h + +#include dirent.h +#include sys/stat.h +#include sys/time.h +#include sys/types.h +#include unistd.h + +#include getopt_long.h + +/* Time to sleep between reconnection attempts */ +#define RECONNECT_SLEEP_TIME 5 + +/* Global options */ +char *outfile = NULL; +int outfd = -1; +int verbose = 0; +int noloop = 0; +int standby_message_timeout = 10 * 1000; /* 10 sec = default */ +volatile bool time_to_abort = false; + + +static void usage(void); +static void StreamLog(); + +static void +usage(void) +{ + printf(_(%s receives PostgreSQL streaming transaction logs.\n\n), + progname); + printf(_(Usage:\n)); + printf(_( %s [OPTION]...\n), progname); + printf(_(\nOptions:\n)); + printf(_( -f, --file=FILEreceive log into this file\n)); + printf(_( -n, --no-loop do not loop on connection lost\n)); + printf(_( -v, --verbose output verbose messages\n)); + printf(_( -V, --version output version information, then exit\n)); + printf(_( -?, --help show this help, then exit\n)); + printf(_(\nConnection options:\n)); + printf(_( -h, --host=HOSTNAMEdatabase server host or socket directory\n)); + printf(_( -p, --port=PORTdatabase server port number\n)); + printf(_( -d, --database=DBNAME database to connect to\n)); + printf(_( -s, --status-interval=INTERVAL\n + time between 
status packets sent to server (in seconds)\n)); + printf(_( -U, --username=NAMEconnect as specified database user\n)); + printf(_( -w, --no-password never prompt for password\n)); + printf(_( -W, --password force password prompt (should happen automatically)\n)); + printf(_(\nReport bugs to pgsql-b...@postgresql.org.\n)); +} + + +/* + * Local version of GetCurrentTimestamp(), since we are not linked with + * backend code. The protocol always uses integer timestamps, regardless of + * server setting. + */ +static int64 +localGetCurrentTimestamp(void) +{ + int64 result; + struct timeval tp; + + gettimeofday(tp,
[HACKERS] [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
Features: - streaming reading/writing - filtering - reassembly of records Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code which doesn't seem to be a good idea. Missing: - compressing the stream when removing uninteresting records - writing out correct CRCs - separating reader/writer --- src/backend/access/transam/Makefile |2 +- src/backend/access/transam/xlogreader.c | 1032 +++ src/include/access/xlogreader.h | 264 3 files changed, 1297 insertions(+), 1 deletion(-) create mode 100644 src/backend/access/transam/xlogreader.c create mode 100644 src/include/access/xlogreader.h diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile index 700cfd8..eb6cfc5 100644 --- a/src/backend/access/transam/Makefile +++ b/src/backend/access/transam/Makefile @@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \ timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \ - xlogutils.o + xlogreader.o xlogutils.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c new file mode 100644 index 000..71e7d52 --- /dev/null +++ b/src/backend/access/transam/xlogreader.c @@ -0,0 +1,1032 @@ +/*- + * + * xlogreader.c + * Generic xlog reading facility + * + * Portions Copyright (c) 2012, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/access/transam/readxlog.c + * + * NOTES + * Documentation about how do use this interface can be found in + * xlogreader.h, more specifically in the definition of the + * XLogReaderState struct where all parameters are documented. 
+ * + * TODO: + * * more extensive validation of read records + * * separation of reader/writer + * * customizable error response + * * usable without backend code around + *- + */ + +#include postgres.h + +#include access/xlog_internal.h +#include access/transam.h +#include catalog/pg_control.h +#include access/xlogreader.h + +/* If (very) verbose debugging is needed: + * #define VERBOSE_DEBUG + */ + +XLogReaderState* +XLogReaderAllocate(void) +{ + XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState)); + int i; + + if (!state) + goto oom; + + memset(state-buf.record, 0, sizeof(XLogRecord)); + state-buf.record_data_size = XLOG_BLCKSZ*8; + state-buf.record_data = + malloc(state-buf.record_data_size); + + if (!state-buf.record_data) + goto oom; + + memset(state-buf.record_data, 0, state-buf.record_data_size); + state-buf.origptr = InvalidXLogRecPtr; + + for (i = 0; i XLR_MAX_BKP_BLOCKS; i++) + { + state-buf.bkp_block_data[i] = + malloc(BLCKSZ); + + if (!state-buf.bkp_block_data[i]) + goto oom; + } + + state-is_record_interesting = NULL; + state-writeout_data = NULL; + state-finished_record = NULL; + state-private_data = NULL; + state-output_buffer_size = 0; + + XLogReaderReset(state); + return state; + +oom: + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg(out of memory), + errdetail(failed while allocating an XLogReader))); + return NULL; +} + +void +XLogReaderFree(XLogReaderState* state) +{ + int i; + + for (i = 0; i XLR_MAX_BKP_BLOCKS; i++) + { + free(state-buf.bkp_block_data[i]); + } + + free(state-buf.record_data); + + free(state); +} + +void +XLogReaderReset(XLogReaderState* state) +{ state-in_record = false; + state-in_record_header = false; + state-do_reassemble_record = false; + state-in_bkp_blocks = 0; + state-in_bkp_block_header = false; + state-in_skip = false; + state-remaining_size = 0; + state-already_written_size = 0; + state-incomplete = false; + state-initialized = false; + state-needs_input = false; + 
state-needs_output = false; + state-stop_at_record_boundary = false; +} + +static inline bool +XLogReaderHasInput(XLogReaderState* state, Size size) +{ + XLogRecPtr tmp = state-curptr; + XLByteAdvance(tmp, size); + if (XLByteLE(state-endptr, tmp)) + return false; + return true; +} + +static inline bool +XLogReaderHasOutput(XLogReaderState* state, Size size){ + /* if we don't do output or have no limits in the output size */ + if (state-writeout_data == NULL || state-output_buffer_size == 0) + return true; + + if (state-already_written_size + size state-output_buffer_size) + return false; + + return true; +} + +static inline bool +XLogReaderHasSpace(XLogReaderState* state, Size size) +{ + if (!XLogReaderHasInput(state, size)) + return false; + + if (!XLogReaderHasOutput(state, size)) + return false; + + return true; +} + +/*
Re: [HACKERS] logical changeset generation v3 - git repository
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

As it's not very wieldy to send around that many/big patches all the time,
until the next major version I will just update the git tree at:

Web: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3
Git: git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

Greetings,

Andres
--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Tue, 2012-11-13 at 15:27 -0500, Robert Haas wrote:
> A small patch that gets committed is better than a big one that doesn't.

Here's a small patch (two, actually, because the TLI one is uninteresting and
noisy). It's based on Simon's patch, but with some significant changes:

* I ripped out all of the handling for a mix of some checksummed and some
  non-checksummed pages. No more control bits or page version stuff.
* I moved the checksum to the pd_tli field, and renamed it pd_checksum.
* vm/fsm_extend were not setting the verification information for some
  reason. I'm not sure why, but since it's now on/off for the entire
  system, they need to do the same thing.
* Added a flag to pg_control called data_checksums. It is set by initdb
  when the -k/--data-checksums option is specified (open for discussion).
* Added a function in xlog.c that is a simple reader of the control file
  flag.
* Got rid of page_checksums GUC.
* Incorporated the page number into the checksum calculation, to detect
  pages that are transposed.

I'll do another pass to make sure I update all of the comments, and try to
self-review it. So, slightly rough in some places.

Regards,
	Jeff Davis

checksums.patch.gz
Description: GNU Zip compressed data

replace-tli-with-checksums.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] lcr - walsender integration
Hi,

The current logical walsender integration looks like the following:

=# INIT_LOGICAL_REPLICATION 'text';
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
 replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------
 id-2           | 3/CACBDF98       | 0xDEADBEEF    | text
(1 row)

=# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
...

So the current protocol is:

INIT_LOGICAL_REPLICATION '$plugin';
returns
* slot
* first consistent point
* snapshot id

START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
streams changes, each wrapped in a 'w' message with (start, end) set to the
same value. The content of the data is completely free-format and only
depends on the output plugin. Feedback is provided from the client via the
normal 'r' messages.

I think that's not a bad start, but we probably can improve it a bit:

INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($value = $key, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';

The option to INIT_LOGICAL_REPLICATION would then get passed to the
'pg_decode_init' output plugin function (i.e. a function of that name would
get dlsym()'ed using the pg infrastructure for that).

Does that look good to you? Any suggestions?

Greetings,

Andres
--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/15/2012 12:46 AM, Tom Lane wrote:
> Agreed, and there's also the question of passing switches etc to the
> program, so the string can't be a bare file name anyway. I proposed pipe
> symbols (|) in the string previously, but if you find that too
> Unix-centric I suppose we could do
>
> COPY TABLE FROM PROGRAM 'command line';
> COPY TABLE TO PROGRAM 'command line';

I'd strongly prefer that from a security standpoint. I intensely dislike the
idea that COPY will change from a command that can at worst expose data to a
command that can execute programs. It means existing security decisions in
applications that use it must be re-evaluated ... and most won't be.

Also, it isn't too hard to check the command string for pipe chars, but
experience has taught over and over with SQL injection, shell metacharacter
exploits, XSS, etc. that magic characters that you must check for are a bad
idea, and it's much better to have separate syntax (like parameterized
statements) rather than magic strings.

Additionally, the pipe design appears to presume the presence of a shell and
the desirability of using it. I don't think either assumption is sensible.
Windows has a shell of sorts (cmd.exe) but its behaviour is different with
regards to quoting, and it can be tricky to produce commands that work under
both a UNIX shell and cmd.exe. More importantly, the shell provides fun
opportunities for unexpected side-effects via metacharacters, leading to
undesired behaviour or even exploits. It's IMO strongly preferable to use an
argument vector and direct execution, so the shell never gets involved.

How about:

COPY ... FROM PROGRAM '/bin/my_program', '$notavariable', '$(rm -rf $HOME)';

or:

COPY ... FROM (PROGRAM '/bin/my_program',
               ARGUMENTS ('$notavariable', '$(rm -rf $HOME)'));

?

Something extensible would be good, as somebody is *inevitably* going to ask
"so how do I set environment variables before I call the command" and "how
do I control which return values are considered success".
--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical changeset generation v3
On 11/14/12 4:27 PM, Andres Freund wrote:
> Hi,
>
> In response to this you will soon find the 14 patches that currently
> implement $subject. I'll go over each one after showing off for a bit:

Lemme be the first to say, wow. Impressive work. Now the debugging starts ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
Craig Ringer <cr...@2ndquadrant.com> writes:
> On 11/15/2012 12:46 AM, Tom Lane wrote:
>> Agreed, and there's also the question of passing switches etc to the
>> program, so the string can't be a bare file name anyway. I proposed pipe
>> symbols (|) in the string previously, but if you find that too
>> Unix-centric I suppose we could do
>> COPY TABLE FROM PROGRAM 'command line';
>> COPY TABLE TO PROGRAM 'command line';

> I'd strongly prefer that from a security standpoint.

That's a reasonable concern.

> Additionally, the pipe design appears to presume the presence of a shell
> and the desirability of using it. I don't think either assumption is
> sensible.

I disagree very very strongly with that. If we prevent use of shell syntax,
we will lose a lot of functionality, for instance

copy ... from program 'foo <somefile'
copy ... from program 'foo | bar'

unless you're imagining that we will reimplement a whole lot of that same
shell syntax for ourselves. (And no, I don't care whether the Windows syntax
is exactly the same or not. The program name/path is already likely to vary
across systems, so it's pointless to suppose that use of the feature would
be 100% portable if only we lobotomized it.)

> More importantly, the shell provides fun opportunities for unexpected
> side-effects via metacharacters, leading to undesired behaviour or even
> exploits.

So? You're already handing the keys to the kingdom to anybody who can
control the contents of that command line, even if it's only to point at the
wrong program. And one man's unexpected side-effect is another man's
essential feature, as in my examples above.

			regards, tom lane
Re: [HACKERS] Enabling Checksums
On Wed, Nov 14, 2012 at 6:24 PM, Jeff Davis pg...@j-davis.com wrote: Hmm... what if we took this a step further and actually stored the checksums in a separate relation fork? That would make it pretty simple to support enabling/disabling checksums for particular relations. It would also allow us to have a wider checksum, like 32 or 64 bits rather than 16. I'm not scoffing at a 16-bit checksum, because even that's enough to catch a very high percentage of errors, but it wouldn't be terrible to be able to support a wider one, either. I don't remember exactly why this idea was sidelined before, but I don't think there were any showstoppers. It does have some desirable properties; most notably the ability to add checksums without a huge effort, so perhaps the idea can be revived. But there are some practical issues, as Tom points out. Another one is that it's harder for external utilities (like pg_basebackup) to verify checksums. And I just had another thought: these pages of checksums would be data pages, with an LSN. But as you clean ordinary data pages, you need to constantly bump the LSN of the very same checksum page (because it represents 1000 ordinary data pages); making it harder to actually clean the checksum page and finish a checkpoint. Is this a practical concern or am I borrowing trouble? Well, I think the invariant we'd need to maintain is as follows: every page for which the checksum fork might be wrong must have an FPI following the redo pointer. So, at the time we advance the redo pointer, we need the checksum fork to be up-to-date for all pages for which a WAL record was written after the old redo pointer except for those for which a WAL record has again been written after the new redo pointer. In other words, the checksum pages we write out don't need to be completely accurate; the checksums for any blocks we know will get clobbered anyway during replay don't really matter. However, reading your comments, I do see one sticking point. 
If we don't update the checksum page until a buffer is written out, which of course makes a lot of sense, then during a checkpoint, we'd have to flush all of the regular pages first and then all the checksum pages afterward. Otherwise, the checksum pages wouldn't be sufficiently up-to-date at the time we write them. There's no way to make that happen just by fiddling with the LSN; rather, we'd need some kind of two-pass algorithm over the buffer pool. That doesn't seem unmanageable, but it's more complicated than what we do now. I'm not sure we'd actually bother setting the LSN on the checksum pages, because the action that prompts an update of a checksum page is the decision to write out a non-checksum page, and that's not a WAL-loggable action, so there's no obvious LSN to apply, and no obvious need to apply one at all. I'm also not quite sure what happens with full_page_writes=off. I don't really see how to make this scheme work at all in that environment. Keeping the checksum in the page seems to dodge quite a few problems in that case ... as long as you assume that 8kB writes really are atomic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doc patch making firm recommendation for setting the value of commit_delay
> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
> ow...@postgresql.org] On Behalf Of Peter Geoghegan
> Sent: 15 November 2012 09:44
> To: PG Hackers
> Subject: [HACKERS] Doc patch making firm recommendation for setting the
> value of commit_delay
>
> Some of you will be aware that I've tried to formulate a good general
> recommendation for setting the value of commit_delay, in light of the
> improvements made in 9.3. I previously asked for help finding a
> methodology for optimising commit_delay [1]. The documentation related to
> commit_delay still says that we don't know what might work well, though I
> don't think that's true any more.
>
> I found the time to do some benchmarks of my own - Greg Smith made
> available a server that he frequently uses for benchmarks. It was
> previously my observation that half of raw single-page sync time as
> reported by pg_test_fsync for your wal_sync_method worked best for
> commit_delay. I went so far as to modify pg_test_fsync to output average
> time per op in microseconds for each operation with commit_delay in mind,
> which is a patch that has already been committed [2]. This general
> approach worked really well on my laptop, which has a slowish fsync of
> about 8 milliseconds (8,000 microseconds), for which a commit_delay
> setting of 4,000 (microseconds) seemed to clearly work best, on both
> insert and tpc-b benchmarks [3].
>
> <snip>
>
> Thoughts?

I think for sure, since the GUC maintained its original name, that the docs
need to be updated to let people know the background behaviour has changed
and may now be far more useful.

I've read through the patch. Only thing I see out of place is a small typo:

!values of <varname>commit_siblings</varname> should be used is such cases,

Should probably read:

!values of <varname>commit_siblings</varname> should be used *in* such cases,

Thanks for doing all the benchmarks too. Good to see such a range of
different hardware tested.
Regards

David Rowley
[HACKERS] [PATCH 03/14] Add simple xlogdump tool
--- src/bin/Makefile| 2 +- src/bin/xlogdump/Makefile | 25 +++ src/bin/xlogdump/xlogdump.c | 468 3 files changed, 494 insertions(+), 1 deletion(-) create mode 100644 src/bin/xlogdump/Makefile create mode 100644 src/bin/xlogdump/xlogdump.c diff --git a/src/bin/Makefile b/src/bin/Makefile index b4dfdba..9992f7a 100644 --- a/src/bin/Makefile +++ b/src/bin/Makefile @@ -14,7 +14,7 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global SUBDIRS = initdb pg_ctl pg_dump \ - psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup + psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump ifeq ($(PORTNAME), win32) SUBDIRS += pgevent diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile new file mode 100644 index 000..d54640a --- /dev/null +++ b/src/bin/xlogdump/Makefile @@ -0,0 +1,25 @@ +#- +# +# Makefile for src/bin/xlogdump +# +# Copyright (c) 1998-2012, PostgreSQL Global Development Group +# +# src/bin/pg_resetxlog/Makefile +# +#- + +PGFILEDESC = xlogdump +PGAPPICON=win32 + +subdir = src/bin/xlogdump +top_builddir = ../../.. +include $(top_builddir)/src/Makefile.global + +OBJS= xlogdump.o \ + $(WIN32RES) + +all: xlogdump + + +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s \012|grep -v /main.o|sed 's/^/..\/..\/..\//') + $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c new file mode 100644 index 000..0f984e4 --- /dev/null +++ b/src/bin/xlogdump/xlogdump.c @@ -0,0 +1,468 @@ +#include postgres.h + +#include unistd.h + +#include access/xlogreader.h +#include access/rmgr.h +#include miscadmin.h +#include storage/ipc.h +#include utils/memutils.h +#include utils/guc.h + +#include getopt_long.h + +/* + * needs to be declared because otherwise its defined in main.c which we cannot + * link from here. 
+ */ +const char *progname = xlogdump; + +typedef struct XLogDumpPrivateData { + TimeLineID timeline; + char* outpath; + char* inpath; +} XLogDumpPrivateData; + +static void +XLogDumpXLogRead(const char *directory, TimeLineID timeline_id, + XLogRecPtr startptr, char *buf, Size count); + +static void +XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, + XLogRecPtr startptr, const char *buf, Size count); + +#define XLogFilePathWrite(path, base, tli, logSegNo) \ + snprintf(path, MAXPGPATH, %s/%08X%08X%08X, base, tli, \ + (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \ + (uint32) ((logSegNo) % XLogSegmentsPerXLogId)) + +static void +XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, + XLogRecPtr startptr, const char *buf, Size count) +{ + const char *p; + XLogRecPtr recptr; + Size nbytes; + + static int sendFile = -1; + static XLogSegNo sendSegNo = 0; + static uint32 sendOff = 0; + + p = buf; + recptr = startptr; + nbytes = count; + + while (nbytes 0) + { + uint32 startoff; + int segbytes; + int writebytes; + + startoff = recptr % XLogSegSize; + + if (sendFile 0 || !XLByteInSeg(recptr, sendSegNo)) + { + char path[MAXPGPATH]; + + /* Switch to another logfile segment */ + if (sendFile = 0) +close(sendFile); + + XLByteToSeg(recptr, sendSegNo); + XLogFilePathWrite(path, directory, timeline_id, sendSegNo); + + sendFile = open(path, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR); + if (sendFile 0) + { +ereport(ERROR, +(errcode_for_file_access(), + errmsg(could not open file \%s\: %m, +path))); + } + sendOff = 0; + } + + /* Need to seek in the file? */ + if (sendOff != startoff) + { + if (lseek(sendFile, (off_t) startoff, SEEK_SET) 0){ +char fname[MAXPGPATH]; +XLogFileName(fname, timeline_id, sendSegNo); + +ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not seek in log segment %s to offset %u: %m, + fname, +startoff))); + } + sendOff = startoff; + } + + /* How many bytes are within this segment? 
*/ + if (nbytes (XLogSegSize - startoff)) + segbytes = XLogSegSize - startoff; + else + segbytes = nbytes; + + writebytes = write(sendFile, p, segbytes); + if (writebytes = 0) + { + char fname[MAXPGPATH]; + XLogFileName(fname, timeline_id, sendSegNo); + + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not write to log segment %s, offset %u, length %lu: %m, + fname, + sendOff, (unsigned long) segbytes))); + } + + /* Update state for read */ + XLByteAdvance(recptr, writebytes); + + sendOff += writebytes; + nbytes -= writebytes; + p += writebytes; + } +} + +/* this should probably be put in a general implementation */ +static void +XLogDumpXLogRead(const
[HACKERS] [PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
This is useful to be able to represent a CommandId that's invalid. There was
no such value before. This decreases the possible number of subtransactions
by one, which seems unproblematic. It's also not a problem for pg_upgrade
because cmin/cmax are never looked at outside the context of their own
transaction (spare timetravel access, but that's new anyway).
---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h                   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 10386da..f28b4c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -745,12 +745,12 @@ CommandCounterIncrement(void)
 	if (currentCommandIdUsed)
 	{
 		currentCommandId += 1;
-		if (currentCommandId == FirstCommandId) /* check for overflow */
+		if (currentCommandId == InvalidCommandId)
 		{
 			currentCommandId -= 1;
 			ereport(ERROR,
 					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-					 errmsg("cannot have more than 2^32-1 commands in a transaction")));
+					 errmsg("cannot have more than 2^32-2 commands in a transaction")));
 		}
 		currentCommandIdUsed = false;
diff --git a/src/include/c.h b/src/include/c.h
index a6c0e6e..e52af3b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -367,6 +367,7 @@ typedef uint32 MultiXactOffset;
 typedef uint32 CommandId;

 #define FirstCommandId	((CommandId) 0)
+#define InvalidCommandId	(~(CommandId)0)

 /*
  * Array indexing support
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 11/15/2012 10:19 AM, Tom Lane wrote:
> I disagree very very strongly with that. If we prevent use of shell
> syntax, we will lose a lot of functionality, for instance
>
> copy ... from program 'foo <somefile'
> copy ... from program 'foo | bar'
>
> unless you're imagining that we will reimplement a whole lot of that same
> shell syntax for ourselves. (And no, I don't care whether the Windows
> syntax is exactly the same or not. The program name/path is already likely
> to vary across systems, so it's pointless to suppose that use of the
> feature would be 100% portable if only we lobotomized it.)

That's reasonable - and it isn't worth making people jump through hoops with
('bash', '-c', '/some/command < infile').

> So? You're already handing the keys to the kingdom to anybody who can
> control the contents of that command line, even if it's only to point at
> the wrong program. And one man's unexpected side-effect is another man's
> essential feature, as in my examples above.

That's true if they're controlling the whole command, not so much if they
just provide a file name. I'm just worried that people will use it without
thinking deeply about the consequences, just like they do with untrusted
client input in SQL injection attacks.

I take your point about wanting more than just the execve()-style
invocation. I'd still like to see a way to invoke the command without having
the shell involved, though; APIs to invoke external programs seem to start
out with a version that launches via the shell, then quickly grow more
controlled argument-vector versions. There's certainly room for a
quick'n'easy COPY ... FROM PROGRAM ('cmd1 | cmd2 | tee /tmp/log').

At this point all I think is really vital is to make copy-with-exec
*syntactically different* to plain COPY, and to leave room for extending the
syntax for environment, separate args vector, etc. when they're called for.
Like VACUUM, where VACUUM VERBOSE ANALYZE became VACUUM (VERBOSE, ANALYZE)
to make room for (BUFFERS), etc.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH 14/14] design document v2.3 and snapshot building design doc v0.2
--- src/backend/replication/logical/DESIGN.txt | 603 + src/backend/replication/logical/Makefile | 6 + .../replication/logical/README.SNAPBUILD.txt | 298 ++ 3 files changed, 907 insertions(+) create mode 100644 src/backend/replication/logical/DESIGN.txt create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt new file mode 100644 index 000..8383602 --- /dev/null +++ b/src/backend/replication/logical/DESIGN.txt @@ -0,0 +1,603 @@ +//-*- mode: adoc -*- += High Level Design for Logical Replication in Postgres = +:copyright: PostgreSQL Global Development Group 2012 +:author: Andres Freund, 2ndQuadrant Ltd. +:email: and...@2ndquadrant.com + +== Introduction == + +This document aims to first explain why we think postgres needs another +replication solution and what that solution needs to offer in our opinion. Then +it sketches out our proposed implementation. + +In contrast to an earlier version of the design document which talked about the +implementation of four parts of replication solutions: + +1. Source data generation +1. Transportation of that data +1. Applying the changes +1. Conflict resolution + +this version only plans to talk about the first part in detail as it is an +independent and complex part usable for a wide range of use cases which we want +to get included into postgres in a first step. + +=== Previous discussions === + +There are two rather large threads discussing several parts of the initial +prototype and proposed architecture: + +- http://archives.postgresql.org/message-id/201206131327.24092.and...@2ndquadrant.com[Logical Replication/BDR prototype and architecture] +- http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL] + +Those discussions led to some fundamental design changes which are presented in this document. 
+ +=== Changes from v1 === +* At least a partial decoding step required/possible on the source system +* No intermediate (schema only) instances required +* DDL handling, without event triggers +* A very simple text conversion is provided for debugging/demo purposes +* Smaller scope + +== Existing approaches to replication in Postgres == + +If any currently used approach to replication can be made to support every +use-case/feature we need, it likely is not a good idea to implement something +different. Currently three basic approaches are in use in/around postgres +today: + +. Trigger based +. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery] +. Statement based + +Statement based replication has obvious and known problems with consistency and +correctness making it hard to use in the general case so we will not further +discuss it here. + +Let's have a look at the advantages/disadvantages of the other approaches: + +=== Trigger based Replication === + +This variant has a multitude of significant advantages: + +* implementable in userspace +* easy to customize +* just about everything can be made configurable +* cross version support +* cross architecture support +* can feed into systems other than postgres +* no overhead from writes to non-replicated tables +* writable standbys +* mature solutions +* multimaster implementations possible and existing + +But also a number of disadvantages, some of them very hard to solve: + +* essentially duplicates the amount of writes (or even more!) +* synchronous replication hard or impossible to implement +* noticeable CPU overhead +** trigger functions +** text conversion of data +* complex parts implemented in several solutions +* not in core + +Especially the higher amount of writes might seem easy to solve at first +glance but a solution not using a normal transactional table for its log/queue +has to solve a lot of problems. 
The major ones are: + +* crash safety, restartability, spilling to disk +* consistency with the commit status of transactions +* only a minimal amount of synchronous work should be done inside individual +transactions + +In our opinion those problems are restricting progress/wider distribution of +this class of solutions. It is our aim though that existing solutions in this +space - most prominently slony and londiste - can benefit from the work we are +doing or planning to do by incorporating at least parts of the changeset +generation infrastructure. + +=== Recovery based Replication === + +This type of solution, being built into postgres and of increasing popularity, +has and will have its use cases and we do not aim to replace but to complement +it. We plan to reuse some of the infrastructure and to make it possible to mix +both modes of replication. + +Advantages: + +* builtin +* built on existing infrastructure from crash recovery +* efficient +** minimal CPU,
Re: [HACKERS] recursive view syntax
On Tue, 2012-11-13 at 23:44 -0500, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: I noticed we don't implement the recursive view syntax, even though it's part of the standard SQL feature set for recursive queries. Here is a patch to add that. Can't you simplify that by using SELECT * FROM name? You mean in the expansion? Maybe, but SELECT * is perhaps something best avoided because of unclear side effects.
Re: [HACKERS] logical changeset generation v3
Looks like cool stuff @-@ I might be interested in looking at that a bit as I think I will hopefully be able to grab some time in the next couple of weeks. Are some of those patches already submitted to a CF? -- Michael Paquier http://michael.otacoo.com
Re: [HACKERS] Materialized views WIP patch
Kevin Grittner wrote: Interesting stuff.

/*
+ * SetRelationIsValid
+ * Set the value of the relation's relisvalid field in pg_class.
+ *
+ * NOTE: caller must be holding an appropriate lock on the relation.
+ * ShareUpdateExclusiveLock is sufficient.
+ *
+ * NOTE: an important side-effect of this operation is that an SI invalidation
+ * message is sent out to all backends --- including me --- causing plans
+ * referencing the relation to be rebuilt with the new list of children.
+ * This must happen even if we find that no change is needed in the pg_class
+ * row.
+ */
+ void
+ SetRelationIsValid(Oid relationId, bool relisvalid)
+ {

It's not clear to me that it's right to do this by doing regular heap updates here instead of heap_inplace_update. Also, I think this might end up causing a lot of pg_class tuple churn (at least for matviews that delete rows at xact end), which would be nice to avoid. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Doc patch, put commas in the right place in pg_restore docs
On Tue, 2012-10-16 at 21:50 -0500, Karl O. Pinc wrote: Simple punctuation change to pg_restore docs. committed
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Wednesday, November 14, 2012 10:19 PM Fujii Masao wrote: On Thu, Nov 15, 2012 at 12:55 AM, Amit Kapila amit.kap...@huawei.com wrote: Now user can use this utility to decide if new-standby has max LSN greater And that situation can occur when new-standby has startpoint LSN greater than new-master? Whether the backup is required has nothing to do with the startpoint. The backup is required when the data page in old-master precedes the last applied LSN in old-standby (i.e., new-master) at the moment of the failover. Without the backup, there is no way to revert the data which is ahead of new-master. Okay. So as Robert and Alvaro suggested, having it as a separate utility rather than having options in pg_resetxlog to print MAX LSN seems to be quite appropriate. I am planning to update the patch to make it a separate utility as pg_computemaxlsn with options same as what I have proposed for pg_resetxlog to print MAX LSN. So considering it a separate utility there can be 2 options: a. have a utility in contrib. b. have a utility in bin similar to pg_resetxlog. What is the best place to have it? With Regards, Amit Kapila.
Re: [HACKERS] Move postgresql_fdw_validator into dblink
Sorry for long absence. On Sat, Oct 20, 2012 at 4:24 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote: IIRC, the reason why postgresql_fdw instead of pgsql_fdw was that no other fdw module has a shortened naming such as ora_fdw for Oracle. However, I doubt whether it is a strong enough reason to force solving the technical difficulty; naming conflicts with existing user-visible features. Isn't it worth considering going back to the pgsql_fdw_validator naming again? AFAIR, in the discussion about naming of the new FDW, another name postgres_fdw was suggested as well as postgresql_fdw, and I chose the one more familiar to me at that time. I think that only a few people feel that postgres is a shortened name of postgresql. How about using postgres_fdw for PG-FDW? Regards, -- Shigeru HANADA
[HACKERS] add -Wlogical-op to standard compiler options?
I think it might be worth adding -Wlogical-op to the standard warning options (for supported compilers, determined by configure test). `-Wlogical-op' Warn about suspicious uses of logical operators in expressions. This includes using logical operators in contexts where a bit-wise operator is likely to be expected. In addition to what it says there, it appears to warn about illogical combinations of && and ||. I have been using it locally for a while without problems. We have already found a couple of bugs this way.
[HACKERS] feature proposal - triggers by semantics
I have a feature request, which at one level should require little code change, but at another level may require more. Since Postgres 9.3 is going to be doing some significant feature additions for triggers, I'd like to see some more. As they currently exist, triggers always fire based on certain SQL syntax used, rather than on the semantics of what is actually going on. I would like to see a new class of triggers that fire when particular database operations happen regardless of what SQL syntax was used. As a simple example, I'd like to be able to define a trigger like AFTER DELETE ON foo FOR EACH ROW and have that trigger be invoked not only by a DELETE on foo but also by a TRUNCATE on foo. So I would like to do some auditing action when a row of foo is deleted, no matter how it happens. The reason this particular example is important is that TRUNCATE is documented as a data-manipulation action semantically equivalent to an unqualified DELETE in its effects, primarily. As such, I would expect the same triggers to fire as would for an unqualified DELETE. The reason I propose it be a new kind of trigger is so that then we also retain the ability to declare triggers that fire on DELETE and not on TRUNCATE. Less important, but also nice at least from the ability to be less verbose, is that said trigger could also run when an UPDATE happens, optionally, since an UPDATE can be considered semantically a DELETE+INSERT. But adding the TRUNCATE support is most important because it simply doesn't exist now, while UPDATE you can get just by adding or update. I suggest that the simplest way to add this feature is to just extend the existing syntax for defining a FOR EACH ROW so that TRUNCATE is also an option, besides INSERT/UPDATE/DELETE. 
In that case, the semantics of the TRUNCATE statement could be altered as follows: Iff TRUNCATE foo is invoked and foo has a TRUNCATE FOR EACH ROW trigger defined on it, then an unqualified DELETE FROM foo will be performed instead with its usual semantics. If such a trigger is not defined on foo, then the old TRUNCATE semantics happen. As such, this case of the feature can be added without breaking anything legacy. So, I'm partly proposing a specific narrow new feature, TRUNCATE FOR EACH ROW, but I'm also proposing the ability to generally define triggers based not on the syntax used but the actual action requested. A tangential feature request is to provide a runtime config option that can cause TRUNCATE to always behave as unqualified DELETE FROM regardless of any triggers, as if it were just a syntactic shorthand. Or alternately/also provide extra syntax to TRUNCATE itself where one can specify which behavior to have, and both options can be given explicitly to override any config option. -- Darren Duncan
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Tuesday, November 13, 2012 10:17 PM Fujii Masao wrote: On Tue, Nov 13, 2012 at 1:23 PM, Amit kapila amit.kap...@huawei.com wrote: On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote: Robert Haas escribió: On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote: I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic? That thought did cross my mind too. We might be able to use this utility to decide whether we need to take a fresh backup from the master onto the standby, to start old master as new standby after failover. When starting new standby after failover, any data page in the standby must not precede the master. Otherwise, the standby cannot catch up with the master consistently. But, the master might write the data page corresponding to the WAL which has not been replicated to the standby yet. So, if failover happens before that WAL has been replicated, the data page in old master would precede new master (i.e., old standby), and in this case the backup is required. OTOH, if maximum LSN in data page in the standby is less than the master, the backup is not required. When new standby will start the replication (RequestXLogStreaming()), it will send the startpoint, so won't in above scenario that startpoint will be ahead of new master (or new master won't have that LSN) and replication will not be established? So now the user may not be able to decide whether he needs to do incremental or full backup from new master, is this the case you are trying to point? Without this utility, it's difficult to calculate the maximum LSN of data page, so basically we needed to take a backup when starting the standby. 
In the future, thanks to this utility, we can calculate the maximum LSN, and can skip a backup if that LSN is less than the master (i.e., last applied LSN, IOW, timeline switch LSN). With Regards, Amit Kapila.
[HACKERS] Why does delete from table where not exists (select 1 from ... LIMIT 1) perform badly?
Hi, I've read about the reason for this before, but cannot find a reference to it now. How come the planner treats the delete from table where not exists(select 1 from table2 where ... LIMIT 1) so differently, and usually badly, when the LIMIT 1 is there. In older versions of postgresql, I remember that the effect was the opposite: a limit 1 would actually perform substantially better. Hence we have old code (and old habits), where the LIMIT 1 is still used. Shouldn't the planner really understand that the intention is the same in these two queries?

-- bad:
DELETE FROM iup_locked_gradings ilg
WHERE NOT EXISTS (
    SELECT 1 FROM iup_locked_subject ils
    WHERE ils.locked_gradings_id = ilg.locked_gradings_id LIMIT 1
);

-- good:
DELETE FROM iup_locked_gradings ilg
WHERE NOT EXISTS (
    SELECT 1 FROM iup_locked_subject ils
    WHERE ils.locked_gradings_id = ilg.locked_gradings_id
);

pp=# begin; explain DELETE FROM iup_locked_gradings ilg WHERE NOT EXISTS (SELECT 1 FROM iup_locked_subject ils WHERE ils.locked_gradings_id = ilg.locked_gradings_id LIMIT 1);
BEGIN
                                        QUERY PLAN
---
 Delete  (cost=0.00..523542963.48 rows=291737 width=6)
   ->  Seq Scan on iup_locked_gradings ilg  (cost=0.00..523542963.48 rows=291737 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Limit  (cost=0.00..897.27 rows=1 width=0)
                 ->  Seq Scan on iup_locked_subject ils  (cost=0.00..18842.76 rows=21 width=0)
                       Filter: (locked_gradings_id = $0)
(7 rows)

pp=# begin; explain DELETE FROM iup_locked_gradings ilg WHERE NOT EXISTS (SELECT 1 FROM iup_locked_subject ils WHERE ils.locked_gradings_id = ilg.locked_gradings_id );
BEGIN
                                        QUERY PLAN
---
 Delete  (cost=31705.39..47934.47 rows=553737 width=12)
   ->  Hash Anti Join  (cost=31705.39..47934.47 rows=553737 width=12)
         Hash Cond: (ilg.locked_gradings_id = ils.locked_gradings_id)
         ->  Seq Scan on iup_locked_gradings ilg  (cost=0.00..6677.44 rows=583474 width=10)
         ->  Hash  (cost=15776.83..15776.83 rows=1226373 width=10)
               ->  Seq Scan on iup_locked_subject ils  (cost=0.00..15776.83 rows=1226373 width=10)
(6 rows)

pp=#

chalmers=# \d iup_locked_gradings
                                Table "public.iup_locked_gradings"
        Column        |  Type   | Modifiers
----------------------+---------+-----------------------------------------------------------------
 locked_gradings_id   | integer | not null default nextval('iup_locked_gradings_locked_gradings_id_seq'::regclass)
 type                 | integer |
 description          | text    |
 name                 | text    |
 original_gradings_id | integer |
Indexes:
    "iup_locked_gradings_pkey" PRIMARY KEY, btree (locked_gradings_id), tablespace "opt"
Referenced by:
    TABLE "iup_locked_subject" CONSTRAINT "iup_locked_subject_locked_gradings_id_fkey" FOREIGN KEY (locked_gradings_id) REFERENCES iup_locked_gradings(locked_gradings_id) ON UPDATE CASCADE ON DELETE SET NULL
Tablespace: "opt"

chalmers=# \d iup_locked_subject
                                Table "public.iup_locked_subject"
       Column        |  Type   | Modifiers
---------------------+---------+-----------------------------------------------------------------
 locked_subject_id   | integer | not null default nextval('iup_locked_subject_locked_subject_id_seq'::regclass)
 name                | text    | not null
 link_url            | text    |
 description         | text    |
 use_measures        | boolean | not null default true
 locked_gradings_id  | integer |
 original_subject_id | integer |
 use_fail_warning    | boolean | not null default false
Indexes:
    "iup_locked_subject_pkey" PRIMARY KEY, btree (locked_subject_id), tablespace "opt"
Foreign-key constraints:
    "iup_locked_subject_locked_gradings_id_fkey" FOREIGN KEY (locked_gradings_id) REFERENCES iup_locked_gradings(locked_gradings_id) ON UPDATE CASCADE ON DELETE SET NULL
Referenced by:
Tablespace: "opt"

signature.asc Description: OpenPGP digital signature
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
I wrote: From: Tom Lane [mailto:t...@sss.pgh.pa.us] I wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: I have a question. I think it would be also better to extend the syntax for the SQL COPY command in the same way, ie, COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format 'csv' Yeah, sure --- that case is already superuser-only, so why not give it the option of being a popen instead of just fopen. BTW, one thought that comes to mind is that such an operation is extremely likely to fail under environments such as SELinux. That's not necessarily a reason not to do it, but we should be wary of promising that it will work everywhere. Probably a documentation note about this would be enough. OK I'll revise the patch. I've revised the patch. In this version a user can specify hooks for pre- and post-processor executables for COPY and \copy in the following way: $ echo '/bin/gunzip -c $1' > decompress.sh $ chmod +x decompress.sh In the case of the COPY command, postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz' WITH (format 'csv'); Also, in the case of the \copy instruction, postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz' with (format 'csv') As shown in the example above, I've assumed that the syntax for this option for e.g., the COPY command is: COPY table_name FROM 'progname filename' WITH ... COPY table_name TO 'progname filename' WITH ... Here, progname for COPY IN is the user-supplied program that takes filename as its argument and that writes on standard output. Also, progname for COPY OUT is the user-supplied program that reads standard input and writes to filename taken as its argument. This makes simple the identification and verification of progname and filename. Todo: * Documentation including a documentation note about the limitation for environments such as SELinux mentioned by Tom. * More tests Any comments and suggestions are welcome. 
Thanks, Best regards, Etsuro Fujita copy-popen-20121114.patch Description: Binary data
Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
On 13 September 2012 10:13, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I'd like to add the following options to the SQL COPY command and the psql \copy instruction: * PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data from an input file is preprocessed by the program before the data is loaded into a postgres table. * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data from a postgres table is postprocessed by the program before the data is stored in an output file. These options can be specified only when an input or output file is specified. These options allow to move data between postgres tables and e.g., compressed files or files on a distributed file system such as Hadoop HDFS. These options look pretty strange to me and I'm not sure they are a good idea. If we want to read other/complex data, we have Foreign Data Wrappers. What I think we need is COPY FROM (SELECT). COPY (query) TO already exists, so this is just the same thing in the other direction. Once we have a SELECT statement in both directions we can add any user defined transforms we wish implemented as database functions. At present we only support INSERT SELECT ... FROM FDW which means all the optimisations we've put into COPY are useless with FDWs. So we need a way to speed up loads from other data sources. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] [PATCH] binary heap implementation
Hi. There are two or three places in the Postgres source that implement heap sort (e.g. tuplesort.c, nodeMergeAppend.c), and it's also needed by the BDR code. It seemed reasonable to factor out the functionality. I've attached a patch (binaryheap.diff) that contains a straightforward implementation of a binary heap (originally written by Andres, with a few bugfixes and cleanups by me). There doesn't seem to be any place to put unit tests for code like this, so at Alvaro's suggestion, I have attached a small standalone program I used for testing (testbinheap.c) and a Makefile. If you copy them both to src/backend/lib and run make -f Makefile.binheap, it'll build the test program. It's nothing exciting, just exercises the functions in various ways. I've also attached a patch (nodeMergeAppend.diff) that converts the code in nodeMergeAppend.c to use binaryheap. It passes make check, and also the following test (which is planned as a merge append): CREATE OR REPLACE VIEW gen AS SELECT * FROM generate_series(1, 10) g(i) ORDER BY g.i OFFSET 0; SELECT * FROM ( SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL SELECT * FROM gen UNION ALL SELECT * FROM gen) s ORDER BY i OFFSET 100; Initially there was a slight performance degradation with my patch, but I sped things up and now my code is at least on par with, and maybe slightly faster than, the old code. On my laptop, both consistently take ~2.4s on average to execute the above SELECT. Comments? Suggestions? -- Abhijit

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 98ce3d7..327a1bc 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global

-OBJS = ilist.o stringinfo.o
+OBJS = ilist.o binaryheap.o stringinfo.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
new file mode 100644
index 000..aebf08f
--- /dev/null
+++ b/src/include/lib/binaryheap.h
@@ -0,0 +1,138 @@
+/*
+ * binaryheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/lib/binaryheap.h
+ */
+
+#ifndef BINARYHEAP_H
+#define BINARYHEAP_H
+
+/*
+ * This structure represents a single node in a binaryheap, and just
+ * holds two pointers. The heap management code doesn't care what is
+ * stored in a node (in particular, the key or value may be NULL),
+ * only that the comparator function can compare any two nodes.
+ */
+
+typedef struct binaryheap_node
+{
+	void *key;
+	void *value;
+} binaryheap_node;
+
+/*
+ * For a max-heap, the comparator must return:
+ * -1 iff a < b
+ *  0 iff a == b
+ * +1 iff a > b
+ * For a min-heap, the conditions are reversed.
+ */
+typedef int (*binaryheap_comparator)(binaryheap_node *a, binaryheap_node *b);
+
+/*
+ * binaryheap
+ *
+ * size		how many nodes are currently in nodes
+ * space	how many nodes can be stored in nodes
+ * comparator	comparator to define the heap property
+ * nodes	the first of a list of space nodes
+ */
+
+typedef struct binaryheap
+{
+	size_t size;
+	size_t space;
+	binaryheap_comparator compare;
+	binaryheap_node nodes[1];
+} binaryheap;
+
+/*
+ * binaryheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap that has the capacity to
+ * store the given number of nodes, with the heap property defined by
+ * the given comparator function.
+ */
+
+binaryheap *
+binaryheap_allocate(size_t capacity, binaryheap_comparator compare);
+
+/*
+ * binaryheap_free
+ *
+ * Releases memory used by the given binaryheap.
+ */
+
+void
+binaryheap_free(binaryheap *heap);
+
+/*
+ * binaryheap_add_unordered
+ *
+ * Adds the given key and value to the end of the heap's list of nodes
+ * in O(1) without preserving the heap property. This is a convenience
+ * to add elements quickly to a new heap. To obtain a valid heap, one
+ * must call binaryheap_build() afterwards.
+ */
+
+void
+binaryheap_add_unordered(binaryheap *heap, void *key, void *value);
+
+/*
+ * binaryheap_build
+ *
+ * Assembles a valid heap in O(n) from the nodes added by
+ * binaryheap_add_unordered(). Not needed otherwise.
+ */
+
+void
+binaryheap_build(binaryheap *heap);
+
+/*
+ * binaryheap_add
+ *
+ * Adds the given key and value to the heap in O(log n), while
+ * preserving the heap property.
+ */
+
+void
+binaryheap_add(binaryheap *heap, void *key, void *value);
+
+/*
+ * binaryheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. Returns NULL if the heap is empty.
+ * Always O(1).
+ */
+
+binaryheap_node *
+binaryheap_first(binaryheap *heap);
+
+/*
+ * binaryheap_remove_first
+ *
+ * Removes the first (root, topmost) node in the heap and returns a
+ * pointer to it after rebalancing the heap.