Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
On Mon, Jan 23, 2012 at 7:13 PM, Matthew Draper matt...@trebex.net wrote: On 19/01/12 20:28, Hitoshi Harada wrote: (Now it occurred to me that forgetting the #include parse_func.h might hit this breakage..., so I'll fix it here and continue to test, but if you'll fix it yourself, let me know) I fixed it here and it now works in my environment. Well spotted; that's exactly what I'd done. :/ The regression tests pass and the feature seems to work as intended, but it seems to me that it needs more test cases and documentation. For the tests, I believe we at least need the ambiguous case given upthread, so that we can make sure compatibility is kept. For the documentation, it should describe the name resolution rule, as stated in the patch comment. Attached are a new pair of patches, fixing the missing include (and the other warning), plus addressing the above. I'm still not sure whether to just revise (almost) all the SQL function examples to use parameter names, and declare them the right choice; as it's currently written, named parameters still seem rather second-class. Agreed. The patch seems OK, except for an example I've just found. db1=# create function t(a int, t t) returns int as $$ select t.a $$ language sql; CREATE FUNCTION db1=# select t(0, row(1, 2)::t); t --- 1 (1 row) Should we throw an error for such ambiguity? Or did you make it happen intentionally? If the latter, we should also mention the rule in the manual. Other than that, the patch looks good to me. Thanks, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
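For readers unfamiliar with the feature under discussion, a minimal sketch of what the patch allows (the function and parameter names below are invented, not taken from the patch): a SQL-language function body can refer to its parameters by name instead of by $1/$2.

CREATE FUNCTION order_total(subtotal numeric, tax_rate numeric)
RETURNS numeric AS $$
    -- previously this had to be written as: SELECT $1 * (1 + $2)
    SELECT subtotal * (1 + tax_rate);
$$ LANGUAGE sql;

SELECT order_total(100, 0.25);   -- 125.00

When a name could refer both to a parameter and to something else (as in the t.a example above), the name resolution rule that the documentation needs to spell out decides which reference wins.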
Re: [HACKERS] Group commit, revised
I've been thinking, what exactly is the important part of this group commit patch that gives the benefit? Keeping the queue sorted isn't all that important - XLogFlush() requests for commits will come in almost the correct order anyway. I also don't much like the division of labour between groupcommit.c and xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back DoXLogFlush(), which is a bit hard to follow. (that's in my latest version; the original patch had similar division but it also cut across processes, with the wal writer actually doing the flush) It occurs to me that the WALWriteLock already provides much of the same machinery we're trying to build here. It provides FIFO-style queuing, with the capability to wake up the next process or processes in the queue. Perhaps all we need is some new primitive to LWLock, to make it do exactly what we need. Attached is a patch to do that. It adds a new mode to LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it is acquired and the function returns immediately. However, unlike normal LWLockConditionalAcquire(), if the lock is not free it goes to sleep until it is released. But unlike normal LWLockAcquire(), when the lock is released, the function returns *without* acquiring the lock. I modified XLogFlush() to use that new function for WALWriteLock. It tries to get WALWriteLock, but if it's not immediately free, it waits until it is released. Then, before trying to acquire the lock again, it rechecks LogwrtResult to see if the record was already flushed by the process that released the lock. This is a much smaller patch than the group commit patch, and in my pgbench-tools tests on my laptop, it has the same effect on performance. The downside is that it touches lwlock.c, which is a change at a lower level. Robert's flexlocks patch probably would've been useful here. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ce659ec..d726a98 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2066,23 +2066,42 @@ XLogFlush(XLogRecPtr record) /* initialize to given target; may increase below */ WriteRqstPtr = record; - /* read LogwrtResult and update local state */ + /* + * Now wait until we get the write lock, or someone else does the + * flush for us. + */ + for (;;) { /* use volatile pointer to prevent code rearrangement */ volatile XLogCtlData *xlogctl = XLogCtl; + /* read LogwrtResult and update local state */ SpinLockAcquire(&xlogctl->info_lck); if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write)) WriteRqstPtr = xlogctl->LogwrtRqst.Write; LogwrtResult = xlogctl->LogwrtResult; SpinLockRelease(&xlogctl->info_lck); - } - /* done already? */ - if (!XLByteLE(record, LogwrtResult.Flush)) - { - /* now wait for the write lock */ - LWLockAcquire(WALWriteLock, LW_EXCLUSIVE); + /* done already? */ + if (XLByteLE(record, LogwrtResult.Flush)) + break; + + /* + * Try to get the write lock. If we can't get it immediately, wait + * until it's released, but before actually acquiring it, loop back + * to check if the backend holding the lock did the flush for us + * already. This helps to maintain a good rate of group committing + * when the system is bottlenecked by the speed of fsyncing.
+ */ + if (!LWLockConditionalAcquire(WALWriteLock, LW_EXCLUSIVE_BUT_WAIT)) + { + /* + * Didn't get the lock straight away, but we might be done + * already + */ + continue; + } + /* Got the lock */ LogwrtResult = XLogCtl->Write.LogwrtResult; if (!XLByteLE(record, LogwrtResult.Flush)) { @@ -2111,6 +2130,8 @@ XLogFlush(XLogRecPtr record) XLogWrite(WriteRqst, false, false); } LWLockRelease(WALWriteLock); + /* done */ + break; } END_CRIT_SECTION(); diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index cc41568..488f573 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -430,6 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode) elog(PANIC, "cannot wait without a PGPROC structure"); proc->lwWaiting = true; + proc->lwWaitOnly = false; proc->lwExclusive = (mode == LW_EXCLUSIVE); proc->lwWaitLink = NULL; if (lock->head == NULL) @@ -496,7 +497,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode) /* * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode * - * If the lock is not available, return FALSE with no side-effects. + * If the lock is not available, return FALSE with no side-effects. In + * LW_EXCLUSIVE_BUT_WAIT mode, if the lock is not available, waits until it + * is available, but then returns false without acquiring it. * * If successful, cancel/die interrupts are held off until lock release. */ @@ -504,7 +507,9 @@ bool LWLockConditionalAcquire(LWLockId lockid,
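To make the semantics of the proposed LW_EXCLUSIVE_BUT_WAIT mode concrete, here is a toy, self-contained illustration in plain pthreads (not PostgreSQL's LWLock code; all names are invented): try to take the lock, and if it is busy, sleep until the holder releases it and then return false without taking it.

#include <pthread.h>
#include <stdbool.h>

typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  released;
    bool            held;
} demo_lock;

static demo_lock wal_write_lock = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false};

/*
 * Take the lock if it is free; otherwise sleep until the holder releases
 * it and return false WITHOUT taking it, so the caller can recheck whether
 * its work was already done by the previous holder.
 */
static bool
acquire_or_wait(demo_lock *lk)
{
    bool got_it;

    pthread_mutex_lock(&lk->mutex);
    if (!lk->held)
    {
        lk->held = true;          /* free: acquire it ourselves */
        got_it = true;
    }
    else
    {
        while (lk->held)          /* busy: wait for release only */
            pthread_cond_wait(&lk->released, &lk->mutex);
        got_it = false;
    }
    pthread_mutex_unlock(&lk->mutex);
    return got_it;
}

static void
release(demo_lock *lk)
{
    pthread_mutex_lock(&lk->mutex);
    lk->held = false;
    pthread_cond_broadcast(&lk->released);
    pthread_mutex_unlock(&lk->mutex);
}

XLogFlush() then becomes a loop: call acquire_or_wait(); on false, recheck whether the record was already flushed by whoever held the lock; on true, perform the flush itself and release.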
Re: [HACKERS] Online base backup from the hot-standby
On Tue, Jan 24, 2012 at 7:54 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao masao.fu...@gmail.com wrote: I'll proceed to commit for this now. Thanks a lot! Can I just check a few things? Sure! You say /* + * Update full_page_writes in shared memory and write an + * XLOG_FPW_CHANGE record before resource manager writes cleanup + * WAL records or checkpoint record is written. + */ why does it need to be before the cleanup and checkpoint? Because the cleanup and checkpoint need to see FPW in shared memory. If FPW in shared memory is not updated there, the cleanup and (end-of-recovery) checkpoint always use an initial value (= false) of FPW in shared memory. You say /* + * Currently only non-exclusive backup can be taken during recovery. + */ why? At first I proposed to allow exclusive backup to be taken during recovery. But Heikki disagreed with the proposal because he thought that the exclusive backup procedure which I proposed was too fragile. No one could come up with any good user-friendly easy-to-implement procedure. So we decided to allow only non-exclusive backup to be taken during recovery. In non-exclusive backup, the complicated procedure is performed by pg_basebackup, so a user doesn't need to care about that. You mention in the docs The backup history file is not created in the database cluster backed up. but we need to explain the bad effect, if any. Users cannot find out various information (e.g., which WAL files are required for backups, backup starting/ending time, etc) about backups which have been taken so far. If they need such information, they need to record that manually. Users cannot pass the backup history file to pg_archivecleanup, which might make the usage of pg_archivecleanup more difficult. After a little thought, pg_basebackup would be able to create the backup history file in the backup, though it cannot be archived. Should we implement that feature to alleviate the bad effect? You say If the standby is promoted to the master during online backup, the backup fails. but no explanation of why? I could work those things out, but I don't want to have to, plus we may disagree if I did. If the backup succeeds in that case, when we start an archive recovery from that backup, the recovery needs to cross between two timelines. This means that we need to set recovery_target_timeline before starting recovery. Whether recovery_target_timeline needs to be set or not depends on whether the standby was promoted while the backup was being taken. Leaving such a decision to a user seems fragile. pg_basebackup -x ensures that all required files are included in the backup and we can start recovery without restoring any file from the archive. But if the standby is promoted during the backup, the timeline history file would become an essential file for recovery, but it's not included in the backup. There are some good explanations in comments of other things, just not everywhere needed. What happens if we shut down the WALwriter and then issue SIGHUP? SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about the case where smart shutdown kills walwriter but some backends are still running? Currently SIGHUP affects full_page_writes and running backends use the changed new value of full_page_writes. But in the patch, SIGHUP doesn't affect...
To address the problem, we should either postpone the shutdown of walwriter until all backends have gone away, or leave the update of full_page_writes to checkpointer process instead of walwriter. Thought? Are we sure we want to make the change of file format mandatory? That means earlier versions of clients such as pg_basebackup will fail against this version. Really? Unless I'm missing something, pg_basebackup doesn't care about the file format of backup_label. So I don't think that earlier version of pg_basebackup fails. There are no docs to explain the new feature is available in the main docs, or to explain the restrictions. I expect you will add that later after commit. Okay. Will do. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Online base backup from the hot-standby
On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote: What happens if we shutdown the WALwriter and then issue SIGHUP? SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about the case where smart shutdown kills walwriter but some backends are still running? Currently SIGHUP affects full_page_writes and running backends use the changed new value of full_page_writes. But in the patch, SIGHUP doesn't affect... To address the problem, we should either postpone the shutdown of walwriter until all backends have gone away, or leave the update of full_page_writes to checkpointer process instead of walwriter. Thought? checkpointer seems the correct place to me -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PgNext: CFP
What about travel/accommodation support? On Tue, 24 Jan 2012, Joshua D. Drake wrote: Hello, The call for papers for PgNext (the old PgWest/PgEast) is now open: January 19th: Talk submission opens April 15th: Talk submission closes April 30th: Speaker notification Submit: https://www.postgresqlconference.org/talk_types Sincerely, JD Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On tis, 2012-01-24 at 20:13 -0500, Tom Lane wrote: Yeah. In both cases, the (proposed) new output format is self-identifying *to clients that know what to look for*. Unfortunately it would only be the most anally-written pre-existing client code that would be likely to spit up on the unexpected variations. What's much more likely to happen, and did happen in the bytea case, is silent data corruption. The problem in the bytea case is that the client libraries are written to ignore encoding errors. No amount of protocol versioning will help you in that case. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote: Furthermore, while we haven't settled the question of exactly what a good negotiation facility would look like, we seem to agree that a GUC isn't it. I think that means this isn't going to happen for 9.2, so we should mark this patch Returned with Feedback and return to this topic for 9.3. Simply extending the text/bin flags should be a quite uncontroversial first step. How to express the capability in the startup packet, I leave to others to decide. But my proposal would be the following: bit 0 : text/bin bit 1..15 : format version number, maps to best formats in some Postgres version. It does not solve the resultset problem, where I'd like to say "gimme well-known types in optimal representation, others in text". I don't know the perfect solution for that, but I suspect the biggest danger here is the urge to go to maximal complexity immediately. So perhaps a good idea would be to simply give one additional bit (0x8000?) in the result flag to say that only well-known types should be optimized. That should cover 95% of use-cases, and we can design a more flexible packet format when we know more about actual needs. libpq suggestions: PQsetformatcodes(bool) - only if it's called with TRUE does it start interpreting text/bin codes as non-bools. IOW, we will be compatible with old code using -1 as TRUE. protocol suggestions: On startup the server sends the highest supported text/bin codes, and gives an error if it finds a higher code than supported. Poolers/proxies with different server versions in the pool will simply give the lowest common code out. Small Q&A, to put obvious aspects into writing -- * Does that mean we need to keep old formats around infinitely? Yes. On-wire formats have *much* higher visibility than on-disk formats. Also, except for some basic types, they are not parsed in adapters but in client code. Libpq offers the least help in that respect. Basically - changing on-wire formatting is a big deal, don't do it willy-nilly. * Does that mean we cannot turn on new formats automatically? Yes. Should be obvious. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
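To make the proposed encoding concrete, here is a small standalone sketch; the macro and function names are invented for illustration, and note that a 0x8000 "well-known types only" flag would have to come out of the 15-bit version field, leaving bits 1..14 for the version number.

#include <stdint.h>
#include <stdbool.h>

#define FMT_BINARY          0x0001   /* bit 0: 0 = text, 1 = binary */
#define FMT_VERSION_MASK    0x7FFE   /* bits 1..14: format version number */
#define FMT_WELL_KNOWN_ONLY 0x8000   /* optimize only well-known types */

/* Build a 16-bit result-format code along the lines of the proposal. */
static inline uint16_t
make_format_code(bool binary, unsigned version, bool well_known_only)
{
    uint16_t code = 0;

    if (binary)
        code |= FMT_BINARY;
    code |= (uint16_t) ((version << 1) & FMT_VERSION_MASK);
    if (well_known_only)
        code |= FMT_WELL_KNOWN_ONLY;
    return code;
}

static inline unsigned
format_version(uint16_t code)
{
    return (code & FMT_VERSION_MASK) >> 1;
}

A format code of 0 then keeps today's meaning ("text, baseline formats"), which is what keeps old clients working unchanged.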
[HACKERS] Configuring Postgres to Add A New Source File
Hi, I'm doing some development on the storage manager module of Postgres. I have already added a few source files to the smgr folder, and I was able to have the make system include them by adding their names to the OBJS list in the Makefile inside the smgr folder. (i.e. When I add A.c, I would add A.o to the OBJS list). That was working fine. Now I'm trying to add a new file hdfs_test.c to the project. The problem with this file is that it requires some extra directives in its compilation command (-I and -L directives). Using gcc, the file is compiled with the command: gcc hdfs_test.c -I/HDFS_HOME/hdfs/src/c++/libhdfs -I/usr/lib/jvm/default-java/include -L/HDFS_HOME/hdfs/src/c++/libhdfs -L/HDFS_HOME/build/c++/Linux-i386-32/lib -L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs -o hdfs_test I was told that in order to add the extra directives, I need to modify $LDFLAGS in configure.in and $LIBS in Makefile. I read about autoconf and checked configure.in of Postgres but still couldn't work out exactly what I should be doing. Thanks
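In case it helps, a rough, untested sketch of the usual approach (the paths are just the ones from the gcc command above and would need adjusting): add the object to the smgr Makefile and put the extra -I paths in CPPFLAGS for that directory, while the -L/-l options belong on the link line of the postgres executable, which is easiest to arrange by passing them to configure.

# src/backend/storage/smgr/Makefile (sketch)
OBJS = md.o smgr.o smgrtype.o hdfs_test.o

# extra include paths just for this directory
override CPPFLAGS += -I/HDFS_HOME/hdfs/src/c++/libhdfs \
                     -I/usr/lib/jvm/default-java/include

# at the top level, so the libraries end up on the backend link line:
# ./configure LDFLAGS="-L/HDFS_HOME/hdfs/src/c++/libhdfs \
#                      -L/HDFS_HOME/build/c++/Linux-i386-32/lib \
#                      -L/usr/lib/jvm/default-java/jre/lib/i386/server" \
#             LIBS="-lhdfs -ljvm"

Passing LDFLAGS and LIBS to configure avoids editing configure.in at all; editing configure.in is only needed if you want the HDFS paths probed automatically.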
Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace
2012/1/24 Robert Haas robertmh...@gmail.com: On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires jul...@gmail.com wrote: 2011/12/15 Alvaro Herrera alvhe...@commandprompt.com: Uhm, surely you could compare the original toast tablespace to the heap tablespace, and if they differ, handle appropriately when creating the new toast table? Just pass down the toast tablespace into AlterTableCreateToastTable, instead of having it assume that rel->rd_rel->relnamespace is sufficient. This should be done in all cases where a toast tablespace is created, which shouldn't be more than a handful of them. Thank you, that way seems right. Now, before each creation of a TOAST table with AlterTableCreateToastTable(), I distinguish whether it will create a new one or recreate an existing one. Thus, in create_toast_table() when toastTableSpace is equal to InvalidOid, we are able: - to fall back to the main table's tablespace in the case of new TOAST table creation - to keep its previous tablespace in the case of recreation. Here's a new version rebased against HEAD. To ask more directly the question that's come up a few times upthread, why do *you* think this is useful? What motivated you to want this behavior, and/or how do you think it could benefit other PostgreSQL users? Sorry, I didn't realize this question was addressed to me. I just picked up this item from the TODO list, thinking that it could be useful. My motivation was to learn more about PostgreSQL development and to work on a concrete case. Now, I'm not sure anymore that this is useful. -- JT -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace
On Wed, Jan 25, 2012 at 7:13 AM, Julien Tachoires jul...@gmail.com wrote: 2012/1/24 Robert Haas robertmh...@gmail.com: On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires jul...@gmail.com wrote: 2011/12/15 Alvaro Herrera alvhe...@commandprompt.com: Uhm, surely you could compare the original toast tablespace to the heap tablespace, and if they differ, handle appropriately when creating the new toast table? Just pass down the toast tablespace into AlterTableCreateToastTable, instead of having it assume that rel->rd_rel->relnamespace is sufficient. This should be done in all cases where a toast tablespace is created, which shouldn't be more than a handful of them. Thank you, that way seems right. Now, before each creation of a TOAST table with AlterTableCreateToastTable(), I distinguish whether it will create a new one or recreate an existing one. Thus, in create_toast_table() when toastTableSpace is equal to InvalidOid, we are able: - to fall back to the main table's tablespace in the case of new TOAST table creation - to keep its previous tablespace in the case of recreation. Here's a new version rebased against HEAD. To ask more directly the question that's come up a few times upthread, why do *you* think this is useful? What motivated you to want this behavior, and/or how do you think it could benefit other PostgreSQL users? Sorry, I didn't realize this question was addressed to me. I just picked up this item from the TODO list, thinking that it could be useful. My motivation was to learn more about PostgreSQL development and to work on a concrete case. Now, I'm not sure anymore that this is useful. OK. In that case, I don't think it makes sense to add this right now. I share the feeling that it could possibly be useful under some set of circumstances, but it doesn't seem prudent to add features because there might hypothetically be a use case. I suggest that we mark this patch Returned with Feedback and add links to your latest version of this patch and one of the emails expressing concerns about the utility of this patch to the Todo list. If we later have a clearer idea in mind why this might be useful, we can add it then - and document the use case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Client Messages
On Tue, Jan 24, 2012 at 7:38 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 23.01.2012 22:52, Jim Mlodgenski wrote: On Wed, Jan 18, 2012 at 9:19 AM, Jim Mlodgenskijimm...@gmail.com wrote: On Wed, Jan 18, 2012 at 3:08 AM, Heikki Linnakangas I don't think that's a problem, it's just a free-form message to display. But it also doesn't seem very useful to have it PGC_USERSET: if it's only displayed at connect time, there's no point in changing it after connecting. Should we make it PGC_BACKEND? In hindsight, making it PGC_BACKEND makes it much less useful, because then you can't set it on a per-database basis with ALTER DATABASE foo SET ... So I made it PGC_USERSET again. Here is the revised patch based on the feedback. Thanks! I renamed the GUC to welcome_message, to avoid confusion with client_min_messages. I also moved it to Connection Settings category. Although it's technically settable within a session, the purpose is to display a message when connecting, so Connection Settings feels more fitting. There's one little problem remaining with this, which is what to do if the message is in a different encoding than used by the client? That's not a new problem, we have the same problem with any string GUC, if you do SHOW setting. We restricted application_name to ASCII characters, which is an obvious way to avoid the problem, but I think it would be a shame to restrict this to ASCII-only. Isn't that an issue for the administrator understanding his audience. Maybe some additional documentation explaining the encoding issue? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs si...@2ndquadrant.com wrote: I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion. I definitely agree that those two have way more promise than anything else on the table. However, now that I understand how badly we're getting screwed by checkpoints, I'm curious to do some more investigation of what's going on there. I can't help thinking that in these particular cases the full page writes may be a bigger issue than the checkpoint itself. If that turns out to be the case it's not likely to be something we're able to address for 9.2, but I'd like to at least characterize it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.
On Wed, Jan 25, 2012 at 1:23 AM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova ja...@2ndquadrant.com wrote: On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote: Add new replication mode synchronous_commit = 'write'. Replication occurs only to memory on standby, not to disk, so provides additional performance if user wishes to reduce durability level slightly. Adds concept of multiple independent sync rep queues. Fujii Masao and Simon Riggs I guess you should add the new value to postgresql.conf too. Yes, I forgot to do that. Patch attached. I think that this would be a lot more clear if we described this as synchronous_commit = remote_write rather than just synchronous_commit = write. Actually, the internal constant is named that way already, but it's not exposed as such to the user. There's a logical hierarchy here: fully async commit (off); local flush only (local); local flush + remote write (currently 'write'); local flush + remote flush (currently 'on'); local flush + remote apply. All of the levels except for off involve waiting for local flush; the higher levels also involve waiting for something on the remote side. But the name of the variable is just synchronous_commit, so I think that if the word remote doesn't appear anywhere in the user-visible parameter name, it's not as clear as it could be. In addition to renaming write to remote_write, I think we might also want to add remote_flush as an alias for on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
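To make the hierarchy concrete: under the proposed naming, a client could pick its durability level per session or per transaction, for example:

SET synchronous_commit = off;           -- fully asynchronous commit
SET synchronous_commit = local;         -- wait for local flush only
SET synchronous_commit = remote_write;  -- local flush + standby write (proposed name)
SET synchronous_commit = on;            -- local flush + standby flush

(remote_write is the spelling proposed above; in the commit being discussed the user-visible value is still just 'write'.)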
Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.
On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas robertmh...@gmail.com wrote: I think that this would be a lot more clear if we described this as synchronous_commit = remote_write rather than just synchronous_commit = write. Actually, the internal constant is named that way already, but it's not exposed as such to the user. That's something to discuss at the end of the CF when people are less busy and we get more input. It's an easy change whatever we decide to do. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012: Thanks. I've attached a new version fixing this problem. Looks good to me. Can you please clarify this bit? * Since we elsewhere require that all collations share the same * notion of equality, don't compare collation here. Since I'm not familiar with this code, it's difficult for me to figure out where this "elsewhere" is, and what "same notion of equality" means. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
Marko Kreen mark...@gmail.com writes: On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote: Furthermore, while we haven't settled the question of exactly what a good negotiation facility would look like, we seem to agree that a GUC isn't it. I think that means this isn't going to happen for 9.2, so we should mark this patch Returned with Feedback and return to this topic for 9.3. Simply extending the text/bin flags should be quite uncontroversial first step. How to express the capability in startup packet, I leave to others to decide. But my proposal would be following: bit 0 : text/bin bit 1..15 : format version number, maps to best formats in some Postgres version. It does not solve the resultset problem, where I'd like to say gimme well-known types in optimal representation, others in text. I don't know the perfect solution for that, but I suspect the biggest danger here is the urge to go to maximal complexity immediately. So perhaps the good idea would simply give one additional bit (0x8000?) in result flag to say that only well-known types should be optimized. That should cover 95% of use-cases, and we can design more flexible packet format when we know more about actual needs. Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote: Marko Kreen mark...@gmail.com writes: On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote: Furthermore, while we haven't settled the question of exactly what a good negotiation facility would look like, we seem to agree that a GUC isn't it. I think that means this isn't going to happen for 9.2, so we should mark this patch Returned with Feedback and return to this topic for 9.3. Simply extending the text/bin flags should be a quite uncontroversial first step. How to express the capability in the startup packet, I leave to others to decide. But my proposal would be the following: bit 0 : text/bin bit 1..15 : format version number, maps to best formats in some Postgres version. It does not solve the resultset problem, where I'd like to say "gimme well-known types in optimal representation, others in text". I don't know the perfect solution for that, but I suspect the biggest danger here is the urge to go to maximal complexity immediately. So perhaps a good idea would be to simply give one additional bit (0x8000?) in the result flag to say that only well-known types should be optimized. That should cover 95% of use-cases, and we can design a more flexible packet format when we know more about actual needs. Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? It sets that bit *and* version code - which means that it is up-to-date with all well-known type formats in that version. The key here is to sanely define the well-known types and document them, so clients can be up to date with them. Variants: - All built-in and contrib types in some Postgres version - All built-in types in some Postgres version - Most common types (text, numeric, bytes, int, float, bool, ..) Also, as we have only one bit, the set of types cannot be extended. (Unless we provide more bits for that, but that may get too confusing?) Basically, I see 2 scenarios here: 1) Client knows the result types and can set the text/bin/version code safely, without further restrictions. 2) There is a generic framework that does not know query contents but can be expected to track Postgres versions closely. Such a framework cannot say "binary" for results safely, but *could* do it for some well-defined subset of types. Of course, it may be that 2) is not worth supporting, as frameworks can throw errors on their own if they find a format that they cannot parse. Then the user needs to either register their own parser, or simply turn off optimized formats to get the plain-text values. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_trigger_depth() v3 (was: TG_DEPTH)
Committed, with OID change. Thanks. I tested it with plphp just for the heck of it and it worked wonderfully. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?
hubert depesz lubaczewski dep...@depesz.com writes: anyway - the point is that in \df date_part(, timestamp) says it's immutable, while it is not. Hmm, you're right. I thought we'd fixed that way back when, but obviously not. Or maybe the current behavior of the epoch case postdates that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote: Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? It sets that bit *and* version code - which means that it is up-to-date with all well-known type formats in that version. Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. Basically, I see 2 scenarios here: 1) Client knows the result types and can set the text/bin/version code safely, without further restrictions. 2) There is generic framework, that does not know query contents but can be expected to track Postgres versions closely. Such framework cannot say binary for results safely, but *could* do it for some well-defined subset of types. The hole in approach (2) is that it supposes that the client side knows the specific datatypes in a query result in advance. While this is sometimes workable for application-level code that knows what query it's issuing, it's really entirely untenable for a framework or library. The only way that a framework can deal with arbitrary queries is to introduce an extra round trip (Describe step) to see what datatypes the query will produce so it can decide what format codes to issue ... and that will pretty much eat up any time savings you might get from a more efficient representation. You really want to do the negotiation once, at connection setup, and then be able to process queries without client-side prechecking of what data types will be sent back. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks
On Sun, Jan 22, 2012 at 9:54 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I tried an implementation based on the idea of calling the object-access-hook with a flag that informs extensions of the context of objects being removed. Because I had missed that DROP_RESTRICT and DROP_CASCADE are an enum type, this patch adds performInternalDeletion() instead of an OR-masked DROP_INTERNAL. Its only difference from performDeletion() is that a flag (OAT_DROP_FLAGS_INTERNAL) is delivered to the extension module. I replaced several performDeletion() calls that clean up objects for internal reasons with performInternalDeletion(). How about this approach? I generally agree with this line of attack, but I think you've failed to find all the cases where a drop should be considered internal, and I'd rather add a new parameter to an existing function than define a new one that someone might accidentally fail to use in some place where it's needed. Here's a cut-down patch that *just* adds a PERFORM_DELETE_INTERNAL flag, plus some related comment additions. If this looks reasonable to you, I'll commit it and then we can work out the remaining details. Since sepgsql doesn't seem to need the DropBehavior, I'm inclined to say we shouldn't go to any extra work to pass it just now. We can always add that later if some other client needs it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company perform-deletion-internal.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote: Early yesterday morning, I was able to use Nate Boley's test machine do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here: http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data? By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers, In my hands, the active part of data at scale of 300 fits very comfortably into 8GB shared_buffers. Indeed, until you have run a very long time so that pgbench_history gets large, everything fits in 8GB. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote: Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? It sets that bit *and* version code - which means that it is up-to-date with all well-known type formats in that version. Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. Basically, I see 2 scenarios here: 1) Client knows the result types and can set the text/bin/version code safely, without further restrictions. 2) There is generic framework, that does not know query contents but can be expected to track Postgres versions closely. Such framework cannot say binary for results safely, but *could* do it for some well-defined subset of types. The hole in approach (2) is that it supposes that the client side knows the specific datatypes in a query result in advance. While this is sometimes workable for application-level code that knows what query it's issuing, it's really entirely untenable for a framework or library. The only way that a framework can deal with arbitrary queries is to introduce an extra round trip (Describe step) to see what datatypes the query will produce so it can decide what format codes to issue ... and that will pretty much eat up any time savings you might get from a more efficient representation. You really want to do the negotiation once, at connection setup, and then be able to process queries without client-side prechecking of what data types will be sent back. What might work is for clients to advertise a list of capability strings, like compact_array_format, at connection startup time. The server can then adjust its behavior based on that list. But the problem with that is that as we make changes to the wire protocol, the list of capabilities clients need to advertise could get pretty long in a hurry. A simpler alternative is to have the client send a server version along with the initial connection attempt and have the server do its best not to use any features that weren't present in that server version - but that seems to leave user-defined types out in the cold. I reiterate my previous view that we don't have time to engineer a good solution to this problem right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote: Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote: Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? It sets that bit *and* version code - which means that it is up-to-date with all well-known type formats in that version. Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. The other negotiation is done via Postgres release notes... I specifically want to avoid any sort of per-connection negotiation, except the max format version supported, because it will mess up multiplexed usage of a single connection. Then they need to either disable advanced formats completely, or still do it per-query somehow (via GUCs?), which is a mess. Also I don't see any market for flexible negotiations, instead I see that people want 2 things: - Updated formats are easily available - Old apps not to break I might be mistaken here - if so, please correct me - but currently I'm designing for simplicity. Basically, I see 2 scenarios here: 1) Client knows the result types and can set the text/bin/version code safely, without further restrictions. 2) There is a generic framework that does not know query contents but can be expected to track Postgres versions closely. Such a framework cannot say "binary" for results safely, but *could* do it for some well-defined subset of types. The hole in approach (2) is that it supposes that the client side knows the specific datatypes in a query result in advance. While this is sometimes workable for application-level code that knows what query it's issuing, it's really entirely untenable for a framework or library. No, the list of well-known types is documented and fixed. The bit is specifically for frameworks, so that they can say "I support all well-known types in Postgres version X.Y". Note I said that the list cannot be extended but that is wrong. When this bit and version code are taken together, it clearly defines the list as the one in version X.Y. So considering that the client should not send any higher version than the server supports, the server always knows what list the client refers to. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: After that I started doing some performance testing, and the initial news was bad: the planner was about 3x slower than 9.1 on a moderately large join problem. I've spent the past several days hacking away at that, and have got it down to about 35% slower by dint of the following changes: * micro-optimization of add_path(), in particular avoiding duplicate calculations during cost comparisons. * introducing a two-step mechanism for testing whether proposed join paths are interesting. The patch first calculates a quick and dirty lower bound on the cost of the join path (mostly, from the costs of its input paths) and looks through the joinrel's pathlist to see if there is already a path that is clearly better. If not, it proceeds with the full cost calculation and add_path insertion. In my testing the precheck is able to eliminate 80% or so of the proposed paths, more than repaying the extra pathlist search. Now this is of course cheating with both hands, in that either of these optimization techniques could have been applied before ... but they weren't. I think that 35% slower on large join problems is probably acceptable, given that this is investigating a larger number of possible solutions and hopefully often finding a better plan. I think I can knock some more off of that by refactoring the costsize.c APIs, anyway. Right now the first-pass and second-pass cost calculations are independent and so there's some repeated work, which I'd like to eliminate both for speed reasons and to get rid of duplicate code that'd have to be kept in sync if it's left as-is. If there are not objections, I'd like to go ahead and commit this after one more round of polishing. There's still a lot left to do, but what it mainly needs now is some testing on real-world planning problems, and I don't think it's going to get that until it's been merged in. I have to admit that I have a bad feeling about this. It strikes me that there is no way we're not going to get complaints about a 35% slowdown on planning a large join problem. It is true that some people will have the benefit of finding a faster plan - but I think many people won't. We're really facing the same trade-off here that we do with many other things people have asked for the query planner to do over the years: is it worth slowing down everyone, on every query, for an optimization that will apply rarely but provide huge benefits when it does? Of course, that's fundamentally a judgement call. But I can't help observing that the number of requests we've had for the planner to deduce implied inequalities is far larger than the number of people who have complained about the problem that this fixes. Now, maybe the speed penalty for deducing implied inequalities would be even larger than 35%: I don't know. But we've sweat blood to avoid much smaller regressions on much less important cases. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote: Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. The other negotiation is done via Postgres release notes... That is really not going to work if the requirement is to not break old apps. They haven't read the release notes. I specifically want to avoid any sort of per-connection negotation, except the max format version supported, because it will mess up multiplexed usage of single connection. Then they need to either disabled advanced formats completely, or still do it per-query somehow (via GUCs?) which is mess. Hmm, that adds yet another level of not-obvious-how-to-meet requirement. I tend to concur with Robert that we are not close to a solution. No, the list of well-known types is documented and fixed. The bit is specifically for frameworks, so that they can say I support all well-known types in Postgres version X.Y. So in other words, if we have a client that contains a framework that knows about version N, and we connect it up to a server that speaks version N+1, it suddenly loses the ability to use any version-N optimizations? That does not meet my idea of not breaking old apps. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Online base backup from the hot-standby
On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote: What happens if we shutdown the WALwriter and then issue SIGHUP? SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about the case where smart shutdown kills walwriter but some backends are still running? Currently SIGHUP affects full_page_writes and running backends use the changed new value of full_page_writes. But in the patch, SIGHUP doesn't affect... To address the problem, we should either postpone the shutdown of walwriter until all backends have gone away, or leave the update of full_page_writes to checkpointer process instead of walwriter. Thought? checkpointer seems the correct place to me Done. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
On Sun, Jan 22, 2012 at 5:12 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: 2012/1/21 Robert Haas robertmh...@gmail.com: On Sat, Jan 21, 2012 at 3:59 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I marked the default leakproof functions according to the criterion that they do not leak the contents of their arguments. Indeed, timestamp_ne_timestamptz() has a code path that raises a "timestamp out of range" error. Is it a good idea to avoid marking these functions leakproof as well? I think that anything which looks at the data and uses that as a basis for whether or not to throw an error is non-leakproof. Even if it doesn't directly leak an arbitrary value, I think that leaking even some information about what the value is is no good. Otherwise, you might imagine that we would allow /(int, int), because it only leaks in the second_arg = 0 case. And you might imagine we'd allow -(int, int) because it only leaks in the case where an overflow occurs. But of course the combination of the two allows writing something of the form 1/(a-constant) and getting it pushed down, and now you have the ability to probe for an arbitrary value. So I think it's just no good to allow any leaking at all: otherwise it'll be unclear how safe it really is, especially when combinations of different functions or operators are involved. OK. I checked the list of the default leakproof functions. Functions that contain translation between date and timestamp(tz) can raise an error depending on the supplied arguments. Thus, I removed the leakproof marking from them. In addition, varstr_cmp() contains translation from UTF-8 to UTF-16 on the win32 platform; that may raise an error if the string contains a character that cannot be translated. Although I'm not sure in which cases translation is impossible, it seems to fall under the principle that leaking any kind of information is no good. Thus, related operator functions of bpchar and text got unmarked. (Note that bpchareq, bpcharne, texteq and textne don't use it.) Can you rebase this? It seems that the pg_proc.h and select_views{,_1}.out hunks no longer apply cleanly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
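To illustrate the probing attack Robert describes (the view and column names below are invented): if error-throwing operators were pushed down below a security view's filter, a user could learn hidden values just from the error.

-- hypothetical view that is supposed to hide some rows from the current user
SELECT * FROM protected_orders WHERE 1 / (customer_id - 42) > 0;
-- ERROR:  division by zero
-- If this qual runs before the view's filter, the error fires even for rows
-- the user may not see, revealing that some hidden row has customer_id = 42.

Marking only genuinely error-free, non-inspecting functions as leakproof is what keeps such quals from being pushed below the security barrier.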
Re: [HACKERS] WIP patch for parameterized inner paths
Robert Haas robertmh...@gmail.com writes: I have to admit that I have a bad feeling about this. It strikes me that there is no way we're not going to get complaints about a 35% slowdown on planning a large join problem. I have to admit that I'm worried about that too. However, I hope to continue whittling away at that number. It's also worth pointing out that the number is based on just a couple of test cases that are not very real-world, in that I'm testing against empty tables with no statistics. I think that this is a worst case, because it results in a lot of paths with basically the same costs, making them hard to prune; but I can't do much better, because the examples I've got were supplied without test data. Also, you're assuming that the changes have no upside whatsoever, which I fondly hope is not the case. Large join problems tend not to execute instantaneously --- so nobody is going to complain if the planner takes awhile longer but the resulting plan is enough better to buy that back. In my test cases, the planner *is* finding better plans, or at least ones with noticeably lower estimated costs. It's hard to gauge how much that translates to in real-world savings, since I don't have real data loaded up. I also think, though I've not tried to measure, that I've made planning cheaper for very simple queries by eliminating some overhead in those cases. Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. We're not really talking about are we going to accept or reject a specific feature. We're talking about whether we're going to decide that the last two years worth of planner development were headed in the wrong direction and we're now going to reject that and try to think of some entirely new concept. This isn't an isolated patch, it's the necessary next step in a multi-year development plan. The fact that it's a bit slower at the moment just means there's still work to do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote: I specifically want to avoid any sort of per-connection negotiation, except the max format version supported, because it will mess up multiplexed usage of a single connection. Then they need to either disable advanced formats completely, or still do it per-query somehow (via GUCs?), which is a mess. Being able to explicitly pick a format version other than the one the application was specifically written against adds a lot of complexity and needs to be justified. Maybe you're trying to translate data between two differently versioned servers? I'm trying to understand the motive behind your wanting finer-grained control over picking the format version... merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 12:58:15PM -0500, Tom Lane wrote: Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote: Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. The other negotiation is done via Postgres release notes... That is really not going to work if the requirement is to not break old apps. They haven't read the release notes. Yes, but they also keep requesting the old formats so everything is fine? Note that formats are under the full control of the client; the server has no way to send newer formats to a client that has not requested them. I specifically want to avoid any sort of per-connection negotiation, except the max format version supported, because it will mess up multiplexed usage of a single connection. Then they need to either disable advanced formats completely, or still do it per-query somehow (via GUCs?), which is a mess. Hmm, that adds yet another level of not-obvious-how-to-meet requirement. I tend to concur with Robert that we are not close to a solution. Well, my simple scheme seems to work fine with such a requirement. [My scheme - the client-supplied 16-bit type code is the only thing that decides the format.] No, the list of well-known types is documented and fixed. The bit is specifically for frameworks, so that they can say "I support all well-known types in Postgres version X.Y". So in other words, if we have a client that contains a framework that knows about version N, and we connect it up to a server that speaks version N+1, it suddenly loses the ability to use any version-N optimizations? That does not meet my idea of not breaking old apps. That is up to the Postgres maintainers to decide - whether they want to phase out some type from the list. But my main point was that it's OK to add types to the list. I missed that aspect in my previous mail. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote: On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote: I specifically want to avoid any sort of per-connection negotation, except the max format version supported, because it will mess up multiplexed usage of single connection. Then they need to either disabled advanced formats completely, or still do it per-query somehow (via GUCs?) which is mess. Being able to explicitly pick format version other than the one the application was specifically written against adds a lot of complexity and needs to be justified. Maybe you're trying to translate data between two differently versioned servers? I'm trying to understand the motive behind your wanting finer grained control of picking format version... You mean if client has written with version N formats, but connects to server with version N-1 formats? True, simply not supporting such case simplifies client-side API. But note that it does not change anything on protocol level, it's purely client-API specific. It may well be that some higher-level APIs (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level API-s (raw libpq), it may be optional whether the client wants to support such usage or not. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen mark...@gmail.com wrote: On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote: On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote: I specifically want to avoid any sort of per-connection negotation, except the max format version supported, because it will mess up multiplexed usage of single connection. Then they need to either disabled advanced formats completely, or still do it per-query somehow (via GUCs?) which is mess. Being able to explicitly pick format version other than the one the application was specifically written against adds a lot of complexity and needs to be justified. Maybe you're trying to translate data between two differently versioned servers? I'm trying to understand the motive behind your wanting finer grained control of picking format version... You mean if client has written with version N formats, but connects to server with version N-1 formats? True, simply not supporting such case simplifies client-side API. But note that it does not change anything on protocol level, it's purely client-API specific. It may well be that some higher-level APIs (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level API-s (raw libpq), it may be optional whether the client wants to support such usage or not. well, I see the following cases: 1) Vserver Vapplication: server downgrades wire formats to applications version 2) Vapplication Vlibpq Vserver: since the application is reading/writing formats the server can't understand, an error should be raised if they are used in either direction 3) Vlibpq = VApplication Vserver: same as above, but libpq can 'upconvert' low version wire format to application's wire format or error otherwise. By far, the most common cause of problems (both in terms of severity and frequency) is case #1. #3 allows a 'compatibility mode' via libpq, but that comes at significant cost of complexity since libpq needs to be able to translate wire formats up (but not down). #2/3 is a less common problem though as it's more likely the application can be adjusted to get up to speed: so to keep things simple we can maybe just error out in those scenarios. In the database, we need to maintain outdated send/recv functions basically forever and as much as possible try and translate old wire format data to and from newer backend structures (maybe in very specific cases that will be impossible such that the application is SOL, but that should be rare). All send/recv functions, including user created ones need to be stamped with a version token (database version?). With the versions of the application, libpq, and all server functions, we can determine all wire formats as long as we assume the application's targeted database version represents all the wire formats it was using. My good ideas stop there: the exact mechanics of how the usable set of functions are determined, how exactly the adjusted type look ups will work, etc. would all have to be sorted out. Most of the nastier parts though (protocol changes notwithstanding) are not in libpq, but the server. There's just no quick fix on the client side I can see. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Jan 25, 2012, at 10:24 AM, Tom Lane wrote: Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. Why not create a branch? IIRC the build farm can be configured to run branches. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_bulkload ON_DUPLICATE_MERGE
PG Gurus, I have a table like this: CREATE TABLE filemods ( guid BIGINT NOT NULL UNIQUE, filepath_guid BIGINT NOT NULL, createtimeTIMESTAMP WITH TIME ZONE DEFAULT NULL, writetime TIMESTAMP WITH TIME ZONE DEFAULT NULL, deletetimeTIMESTAMP WITH TIME ZONE DEFAULT NULL, ); One event might have (1, '2012-01-25 11:00:00', NULL, NULL) and another event might have (1, NULL, '2012-01-25 11:05:00', NULL) and the end result should be: (1, '2012-01-25 11:00:00', '2012-01-25 11:05:00', NULL). I'm trying to modify pg_bulkload to merge two rows together. The changes I have made seem to be working, although I would like input on what I am doing that is unsafe or terribly wrong. You can be brutal. I've seen incredible write speed with using pg_bulkload. If I can just get it to consolidate our rows based on the unique key it will remove a lot of complexity in our software. Also, I'm not entirely sure this mailing list is the correct one, but with all the internals you all know, I'm hoping you can help point out nasty flaws in my algorithm. This is the first legitimate attempt I have made at modifying PG source, so I'm not real familiar with the proper way of loading pages and tuples and updating heaps and all that pg core stuff. Here's the modifications to pg_btree.c (from pg_bulkload HEAD): http://pastebin.com/U23CapvR I also attached the patch. Thank you!! Ben -- Benjamin Johnson http://getcarbonblack.com/ | @getcarbonblack cell: 312.933.3612 diff -r 3f065ec72ab8 pgbulkload/lib/pg_btree.c --- a/pgbulkload/lib/pg_btree.c Fri Jan 20 09:26:20 2012 -0600 +++ b/pgbulkload/lib/pg_btree.c Wed Jan 25 13:37:43 2012 -0600 @@ -398,6 +398,8 @@ BTReaderTerm(reader); } +void merge_tuples(Relation heap, IndexTuple itup_dst, IndexTuple itup_src); + /* * _bt_mergeload - Merge two streams of index tuples into new index files. */ @@ -462,7 +464,6 @@ } else { - // TODO -- BSJ if (on_duplicate == ON_DUPLICATE_KEEP_NEW) { self-dup_old++; @@ -470,7 +471,21 @@ RelationGetRelationName(wstate-index)); itup2 = BTReaderGetNextItem(btspool2); } - else + else if (on_duplicate == ON_DUPLICATE_MERGE) + { + self-dup_old++; + + // merge from itup into itup2 where itup's col[i] is not null + // but itup2's col[i] IS null + merge_tuples(heapRel, itup2, itup); + + ItemPointerCopy(t_tid2, itup2-t_tid); + self-dup_new++; + remove_duplicate(self, heapRel, itup, + RelationGetRelationName(wstate-index)); + itup = BTSpoolGetNextItem(btspool, itup, should_free); + } + else { ItemPointerCopy(t_tid2, itup2-t_tid); self-dup_new++; @@ -950,6 +965,113 @@ self-dup_old + self-dup_new, relname); } +// returns Buffer after locking it (BUFFER_LOCK_SHARE then BUFFER_LOCK_UNLOCK) +Buffer load_buffer(Relation heap, IndexTuple itup, HeapTupleData *tuple /*OUT */, ItemId *itemid /*OUT */ ) +{ + BlockNumber blknum; + BlockNumber offnum; + Buffer buffer; + Pagepage; + + blknum = ItemPointerGetBlockNumber(itup-t_tid); + offnum = ItemPointerGetOffsetNumber(itup-t_tid); + buffer = ReadBuffer(heap, blknum); + + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + *itemid = PageGetItemId(page, offnum); + tuple-t_data = ItemIdIsNormal(*itemid) + ? (HeapTupleHeader) PageGetItem(page, *itemid) + : NULL; + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + return buffer; +} + +void load_tuple(Relation heap, + HeapTuple tuple, + IndexTuple itup, + ItemId itemid, +
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Wed, Jan 25, 2012 at 9:09 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote: Early yesterday morning, I was able to use Nate Boley's test machine do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here: http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data? I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics. I think the checkpointing issues that Greg is exploring are important, and I'm pretty sure that that is what is limiting your TPS in this case. However, I don't think we can make much progress on that front using a machine whose IO subsystem is largely unknown, and not tweakable. So I think the best use of this machine would be to explore the purely CPU based scaling issues, like freelist, CLOG, and XLogInsert. To do that, I'd set the scale factor small enough so that entire data set is willing to be cached dirty by the kernel, based on the kernel parameters: egrep . /proc/sys/vm/dirty_* Then set shared_buffers to be less than the needs for that scale factor, so freelist gets exercised. And neutralize checkpoints, by setting fsync=off, so they don't generate physical IO that we can't control for given the current constraints on the machine set up. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote: Early yesterday morning, I was able to use Nate Boley's test machine do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here: http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data? I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics. By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers, In my hands, the active part of data at scale of 300 fits very comfortably into 8GB shared_buffers. Indeed, until you have run a very long time so that pgbench_history gets large, everything fits in 8GB. Hmm, my mistake. Maybe I need to crank up the scale factor some more. This is why good benchmarking is hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
David E. Wheeler da...@justatheory.com writes: On Jan 25, 2012, at 10:24 AM, Tom Lane wrote: Anyway, I'd be willing to hold off committing if someone were to volunteer to test an unintegrated copy of the patch against some moderately complicated application. But it's a sufficiently large patch that I don't really care to sit on it and try to maintain it outside the tree for a long time. Why not create a branch? IIRC the build farm can be configured to run branches. I already know what the patch does against the regression tests. Buildfarm testing is not of interest here. What would be of help is, say, Kevin volunteering to load up his Circuit Courts software and data into a git-head server and see how performance looks with and without the patch. Distribution of the code doesn't strike me as being the bottleneck. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Jan 25, 2012, at 12:19 PM, Tom Lane wrote: Why not create a branch? IIRC the build farm can be configured to run branches. I already know what the patch does against the regression tests. Buildfarm testing is not of interest here. What would be of help is, say, Kevin volunteering to load up his Circuit Courts software and data into a git-head server and see how performance looks with and without the patch. Distribution of the code doesn't strike me as being the bottleneck. Yeah, but it would be easier to keep a branch up-to-date via `git rebase` than to maintain the patch, I would think. And if it’s a remote branch, then simpler distribution is a bonus. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 01:43:03PM -0600, Merlin Moncure wrote: On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen mark...@gmail.com wrote: On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote: On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote: I specifically want to avoid any sort of per-connection negotation, except the max format version supported, because it will mess up multiplexed usage of single connection. Then they need to either disabled advanced formats completely, or still do it per-query somehow (via GUCs?) which is mess. Being able to explicitly pick format version other than the one the application was specifically written against adds a lot of complexity and needs to be justified. Maybe you're trying to translate data between two differently versioned servers? I'm trying to understand the motive behind your wanting finer grained control of picking format version... You mean if client has written with version N formats, but connects to server with version N-1 formats? True, simply not supporting such case simplifies client-side API. But note that it does not change anything on protocol level, it's purely client-API specific. It may well be that some higher-level APIs (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level API-s (raw libpq), it may be optional whether the client wants to support such usage or not. well, I see the following cases: 1) Vserver Vapplication: server downgrades wire formats to applications version 2) Vapplication Vlibpq Vserver: since the application is reading/writing formats the server can't understand, an error should be raised if they are used in either direction 3) Vlibpq = VApplication Vserver: same as above, but libpq can 'upconvert' low version wire format to application's wire format or error otherwise. I don't see why you special-case libpq here. There is no reason libpq cannot pass older/newer formats through. Only thing that matters it parser/formatter version. If that is done in libpq, then app version does not matter. If it's done in app, then libpq version does not matter. By far, the most common cause of problems (both in terms of severity and frequency) is case #1. #3 allows a 'compatibility mode' via libpq, but that comes at significant cost of complexity since libpq needs to be able to translate wire formats up (but not down). #2/3 is a less common problem though as it's more likely the application can be adjusted to get up to speed: so to keep things simple we can maybe just error out in those scenarios. I don't like the idea of conversion. Instead either client writes values through API that picks format based on server version, or it writes them for specific version only. In latter case it cannot work with older server. Unless the fixed version is the baseline. In the database, we need to maintain outdated send/recv functions basically forever and as much as possible try and translate old wire format data to and from newer backend structures (maybe in very specific cases that will be impossible such that the application is SOL, but that should be rare). All send/recv functions, including user created ones need to be stamped with a version token (database version?). With the versions of the application, libpq, and all server functions, we can determine all wire formats as long as we assume the application's targeted database version represents all the wire formats it was using. 
My good ideas stop there: the exact mechanics of how the usable set of functions are determined, how exactly the adjusted type look ups will work, etc. would all have to be sorted out. Most of the nastier parts though (protocol changes notwithstanding) are not in libpq, but the server. There's just no quick fix on the client side I can see. It does not need to be complex - just bring the version number to i/o function and let it decide whether it cares about it or not. Most functions will not.. Only those that we want to change in compatible manner need to look at it. But I don't see that there is danger of having regular changes in wire formats. So most of the functions will ignore the versioning. Including the ones that don't care about compatibility. But seriously - on-wire compatibility is good thing, do not fear it... -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
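As a rough sketch of that idea (every name below that does not appear in the thread — the version accessor and the v2/baseline helpers — is invented for illustration and is not an existing PostgreSQL API), a binary output function that only consults the version when it actually wants to emit a newer encoding might look like:

/* Hypothetical sketch of a version-aware type output (send) function.
 * pq_negotiated_format_version(), array_send_v2() and array_send_baseline()
 * are invented names for illustration; they do not exist in PostgreSQL. */
Datum
array_send(PG_FUNCTION_ARGS)
{
    int     fmtver = pq_negotiated_format_version();   /* assumed accessor */

    /* Functions that never change their wire format simply ignore fmtver. */
    if (fmtver >= 2)
        return array_send_v2(fcinfo);       /* e.g. compact fixed-size-element encoding */

    return array_send_baseline(fcinfo);     /* the existing on-wire format, unchanged */
}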
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: + for (i = 0; i < old_natts && ret; i++) + ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) && + irel->rd_att->attrs[i]->atttypid == typeObjectId[i]); ...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
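For illustration, the restructured loop described in that message might look like this (a sketch reconstructed from the wording above, not the code as committed):

/* Sketch of the suggested structure: skip the loop entirely once ret is
 * false, and break out as soon as one attribute fails the check. */
if (ret)
{
    for (i = 0; i < old_natts; i++)
    {
        if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
            irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
        {
            ret = false;
            break;
        }
    }
}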
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen mark...@gmail.com wrote: well, I see the following cases: 1) Vserver Vapplication: server downgrades wire formats to applications version 2) Vapplication Vlibpq Vserver: since the application is reading/writing formats the server can't understand, an error should be raised if they are used in either direction 3) Vlibpq = VApplication Vserver: same as above, but libpq can 'upconvert' low version wire format to application's wire format or error otherwise. I don't see why you special-case libpq here. There is no reason libpq cannot pass older/newer formats through. Only thing that matters it parser/formatter version. If that is done in libpq, then app version does not matter. If it's done in app, then libpq version does not matter. Only because if the app is targeting wire format N, but the server can only handle N-1, libpq has the opportunity to fix it up. That's could be just over thinking it though. By far, the most common cause of problems (both in terms of severity and frequency) is case #1. #3 allows a 'compatibility mode' via libpq, but that comes at significant cost of complexity since libpq needs to be able to translate wire formats up (but not down). #2/3 is a less common problem though as it's more likely the application can be adjusted to get up to speed: so to keep things simple we can maybe just error out in those scenarios. I don't like the idea of conversion. Instead either client writes values through API that picks format based on server version, or it writes them for specific version only. In latter case it cannot work with older server. Unless the fixed version is the baseline. ok. another point about that: libpq isn't really part of the solution anyways since there are other popular fully native protocol consumers, including (and especially) jdbc, but also python, node.js etc etc. that's why I was earlier insisting on a protocol bump, so that we could in the new protocol force application version to be advertised. v3 would remain caveat emptor for wire formats but v4 would not. In the database, we need to maintain outdated send/recv functions basically forever and as much as possible try and translate old wire format data to and from newer backend structures (maybe in very specific cases that will be impossible such that the application is SOL, but that should be rare). All send/recv functions, including user created ones need to be stamped with a version token (database version?). With the versions of the application, libpq, and all server functions, we can determine all wire formats as long as we assume the application's targeted database version represents all the wire formats it was using. My good ideas stop there: the exact mechanics of how the usable set of functions are determined, how exactly the adjusted type look ups will work, etc. would all have to be sorted out. Most of the nastier parts though (protocol changes notwithstanding) are not in libpq, but the server. There's just no quick fix on the client side I can see. It does not need to be complex - just bring the version number to i/o function and let it decide whether it cares about it or not. Most functions will not.. Only those that we want to change in compatible manner need to look at it. well, maybe instead of passing version number around, the server installs the proper compatibility send/recv functions just once on session start up so your code isn't littered with stuff like if(version n) do this; else do this;? 
But seriously - on-wire compatibility is good thing, do not fear it... sure -- but for postgres I just don't think it's realistic, especially for the binary wire formats. a json based data payload could give it to you (and I'm only half kidding) :-). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: + for (i = 0; i old_natts ret; i++) + ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i + irel-rd_att-attrs[i]-atttypid == typeObjectId[i]); ...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I find that code way too clever. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics. The computer is here: http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm $PGDATA is on a 5 disk SATA software raid 5. Is there anything else that would be useful to know? ( Sorry, this isn't a subject that I'm very familiar with ) -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Wed, Jan 25, 2012 at 02:50:09PM -0600, Merlin Moncure wrote: On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen mark...@gmail.com wrote: well, I see the following cases: 1) Vserver Vapplication: server downgrades wire formats to applications version 2) Vapplication Vlibpq Vserver: since the application is reading/writing formats the server can't understand, an error should be raised if they are used in either direction 3) Vlibpq = VApplication Vserver: same as above, but libpq can 'upconvert' low version wire format to application's wire format or error otherwise. I don't see why you special-case libpq here. There is no reason libpq cannot pass older/newer formats through. Only thing that matters it parser/formatter version. If that is done in libpq, then app version does not matter. If it's done in app, then libpq version does not matter. Only because if the app is targeting wire format N, but the server can only handle N-1, libpq has the opportunity to fix it up. That's could be just over thinking it though. I think it's over thinking. The value should be formatted/parsed just once. Server side must support processing different versions. Whether client side supports downgrading, it's up to client-side programmers. If you want to write compatible client, you have a choice of using proper wrapper API, or simply writing baseline formatting, ignoring format changes in new versions. Both are valid approaches and I think we should keep it that way. By far, the most common cause of problems (both in terms of severity and frequency) is case #1. #3 allows a 'compatibility mode' via libpq, but that comes at significant cost of complexity since libpq needs to be able to translate wire formats up (but not down). #2/3 is a less common problem though as it's more likely the application can be adjusted to get up to speed: so to keep things simple we can maybe just error out in those scenarios. I don't like the idea of conversion. Instead either client writes values through API that picks format based on server version, or it writes them for specific version only. In latter case it cannot work with older server. Unless the fixed version is the baseline. ok. another point about that: libpq isn't really part of the solution anyways since there are other popular fully native protocol consumers, including (and especially) jdbc, but also python, node.js etc etc. that's why I was earlier insisting on a protocol bump, so that we could in the new protocol force application version to be advertised. v3 would remain caveat emptor for wire formats but v4 would not. We can bump major/minor anyway to inform clients about new functionality. I don't particularly care about that. What I'm interested in is what the actual type negotation looks like. It might be possible we could get away without bumpping anything. But I have not thought about that angle too deeply yet. In the database, we need to maintain outdated send/recv functions basically forever and as much as possible try and translate old wire format data to and from newer backend structures (maybe in very specific cases that will be impossible such that the application is SOL, but that should be rare). All send/recv functions, including user created ones need to be stamped with a version token (database version?). With the versions of the application, libpq, and all server functions, we can determine all wire formats as long as we assume the application's targeted database version represents all the wire formats it was using. 
My good ideas stop there: the exact mechanics of how the usable set of functions are determined, how exactly the adjusted type look ups will work, etc. would all have to be sorted out. Most of the nastier parts though (protocol changes notwithstanding) are not in libpq, but the server. There's just no quick fix on the client side I can see. It does not need to be complex - just bring the version number to i/o function and let it decide whether it cares about it or not. Most functions will not.. Only those that we want to change in compatible manner need to look at it. well, maybe instead of passing version number around, the server installs the proper compatibility send/recv functions just once on session start up so your code isn't littered with stuff like if(version n) do this; else do this;? Seems confusing. Note that type i/o functions are user-callable. How should they act then? Also note that if()s are needed only for types that want to change their on-wire formatting. Considering the mess incompatible on-wire format change can cause, it's good price to pay. But seriously - on-wire compatibility is good thing, do not fear it... sure -- but for postgres I just don't think it's realistic, especially for the binary wire formats. a json based data payload could give it to you (and I'm only half
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
Excerpts from Noah Misch's message of sáb ene 14 12:40:02 -0300 2012: It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting. Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all backslash commands, does not route through SendQuery(). Looking into this turned up several other weaknesses in psql's handling of COPY. Interesting. Committed, thanks. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
On Wed, Jan 25, 2012 at 1:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Also, you're assuming that the changes have no upside whatsoever, which I fondly hope is not the case. Large join problems tend not to execute instantaneously --- so nobody is going to complain if the planner takes awhile longer but the resulting plan is enough better to buy that back. In my test cases, the planner *is* finding better plans, or at least ones with noticeably lower estimated costs. It's hard to gauge how much that translates to in real-world savings, since I don't have real data loaded up. I also think, though I've not tried to measure, that I've made planning cheaper for very simple queries by eliminating some overhead in those cases. I had a 34-table join on one of the last applications I maintained that planned and executed in less than 2 seconds. That was pushing it, but I had many joins in the 10-20 table range that planned and executed in 100-200 ms. I agree that if you are dealing with a terabyte table - or even a gigabyte table - then the growth of planning time will probably not bother anyone even if you fail to find a better plan, and will certainly make the user very happy if you do. But on tables with only a megabyte of data, it's not nearly so clear-cut. In an ideal world, I'd like the amount of effort we spend planning to be somehow tied to the savings we can expect to get, and deploy optimizations like this only in cases where we have a reasonable expectation of that effort being repaid. AIUI, this is mostly going to benefit cases like small LJ (big1 IJ big2) and, of course, those cases aren't going to arise if your query only involves small tables, or even if you have something like big IJ small1 IJ small2 IJ small3 IJ small4 LJ small5 LJ small6 IJ small7, which is a reasonably common pattern for me. Now, if you come back and say, ah, well, those cases aren't the ones that are going to be harmed by this, then maybe we should have a more detailed conversation about where the mines are. Or maybe it is helping in more cases than I'm thinking about at the moment. To be clear, I'd love to have this feature. But if there is a choice between reducing planning time significantly for everyone and NOT getting this feature, and increasing planning time significantly for everyone and getting this feature, I think we will make more people happy by doing the first one. We're not really talking about are we going to accept or reject a specific feature. We're talking about whether we're going to decide that the last two years worth of planner development were headed in the wrong direction and we're now going to reject that and try to think of some entirely new concept. This isn't an isolated patch, it's the necessary next step in a multi-year development plan. The fact that it's a bit slower at the moment just means there's still work to do. I'm not proposing that you should never commit this. I'm proposing that any commit by anyone that introduces a 35% performance regression is unwise, and doubly so at the end of the release cycle. I have every confidence that you can improve the code further over time, but the middle of the last CommitFest is not a great time to commit code that, by your own admission, needs a considerable amount of additional work. Sure, there are some things that we're not going to find out until the code goes into production, but it seems to me that you've already uncovered a fairly major performance problem that is only partially fixed. 
Once this is committed, it's not coming back out, so we're either going to have to figure out how to fix it before we release, or release with a regression in certain cases. If you got it down to 10% I don't think I'd be worried, but a 35% regression that we don't know how to fix seems like a lot. On another note, nobody besides you has looked at the code yet, AFAIK... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: + for (i = 0; i old_natts ret; i++) + ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i + irel-rd_att-attrs[i]-atttypid == typeObjectId[i]); ...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I find that code way too clever. The way he wrote it, or the way I proposed to write it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
Excerpts from Robert Haas's message of mié ene 25 19:05:44 -0300 2012: On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: + for (i = 0; i old_natts ret; i++) + ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i + irel-rd_att-attrs[i]-atttypid == typeObjectId[i]); ...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I find that code way too clever. The way he wrote it, or the way I proposed to write it? The code as committed. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.2] Fix Leaky View Problem
On Sun, Jan 22, 2012 at 5:57 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: This passes installcheck initially. Then upon second invocation of installcheck, it fails. It creates the role alice, and doesn't clean it up. On next invocation alice already exists and cases a failure in test select_views. Thanks for your pointing out. The attached patch adds cleaning-up part of object being defined within this test; includes user alice. Urp. I failed to notice this patch and committed a different fix for the problem pointed out by Jeff. I'm inclined to think it's OK to leave the non-shared objects behind - among other things, if people are testing pg_upgrade using the regression database, this will help to ensure that pg_dump is handling security_barrier views correctly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column xmin
Hi hackers, the attached patch fixes the problem I explained in pgsql-bugs (forwarded). It's a trivial problem, so no initial patch design was required. Hope to see my patch applied. Thanks for your work. Regards. -- Forwarded message -- From: Giuseppe Sucameli brush.ty...@gmail.com Date: Sun, Jan 22, 2012 at 2:22 PM Subject: Different error messages executing CREATE TABLE or ALTER TABLE to create a column xmin To: pgsql-b...@postgresql.org Hi all, trying to create a table with a column xmin I get the following error message: test= create table lx (xmin int); ERROR: column name xmin conflicts with a system column name Instead I get a different (and less understandable) error message if I try to add a column named xmin to an existent table: test= create table lx (i int); CREATE TABLE test= alter table lx add xmin int; ERROR: column xmin of relation lx already exists. The same problem occurs using xmax as column name. I'm on Ubuntu 11.04. Tried on both PostgreSQL 8.4.10 and 9.1.2 Regards. -- Giuseppe Sucameli -- Giuseppe Sucameli postgresql_syscol_message.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PATCH: pg_basebackup (missing exit on error)
While testing a backup script, I noticed that pg_basebackup exits with 0, even if it had errors while writing the backup to disk when the backup file is to be sent to stdout. The attached patch should fix this problem (and some special cases when the last write fails). Regards, Thomas diff -uNr postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c --- postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c 2011-12-01 22:47:20.0 +0100 +++ postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c 2012-01-23 13:14:16.0 +0100 @@ -410,6 +410,7 @@ { fprintf(stderr, _(%s: could not write to compressed file \%s\: %s\n), progname, filename, get_gz_error(ztarfile)); + disconnect_and_exit(1); } } else @@ -428,7 +429,14 @@ #ifdef HAVE_LIBZ if (ztarfile) gzclose(ztarfile); +else #endif + if (fflush (tarfile) != 0) + { + fprintf(stderr, _(%s: error flushing stdout: %s\n), +strerror (errno)); + disconnect_and_exit(1); + } } else { @@ -437,7 +445,14 @@ gzclose(ztarfile); #endif if (tarfile != NULL) - fclose(tarfile); +{ + if (fclose(tarfile) != 0) + { + fprintf(stderr, _(%s: error closing file \%s\: %s\n), +progname, filename, strerror (errno)); + disconnect_and_exit(1); + } +} } break; @@ -456,6 +471,7 @@ { fprintf(stderr, _(%s: could not write to compressed file \%s\: %s\n), progname, filename, get_gz_error(ztarfile)); +disconnect_and_exit(1); } } else -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Psql names completion.
Hello. It's probably not the right place to write, but I guess you are there to take care of it. When I was creating a trigger with command like: create trigger asdf before update on beginninOfTheNameOfMyTable... I hit tab and I got: create trigger asdf before update on SET That was obviously not what I expected to get. Regards. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum rate limit in KBps
On 01/23/2012 11:16 PM, Robert Treat wrote: I keep thinking Greg has mistaken happiness with the MB based info in the vacuum patch as being happy without the output format, when really it is all about increased visibility. I haven't taken that as anything but evidence I'm at least moving in the right direction. I'm relatively systematic about how I approach these things nowadays: figure out what isn't being logged/accounted for yet, add visibility to that thing, iterate on improving its behavior as measured by that, then make the best sorts of behavior changes automatic. This suggested feature change, moving around what I see as the worst part of the tuning knob UI, is an initial attempt along step 3 in that path. There's still plenty of ongoing work adding more visibility too. To more quickly summarize the point I was trying to make, providing a meter that trivially shows you good vs. bad is the almost same problem as making it fully automatic. If you can measure exactly when something needs to happen and how badly screwed up it is, that's the hard part of knowing when to just go fix it. Right now, the autovacuum measure is based on the percent of change to something. I don't think any improved vacuum triggering will go somewhere useful unless it starts with a better measure of how real-world bloat accumulates, which you cannot extract from the data collected yet. The free space measurement thread and the ideas that have mainly been bouncing between Jaime and Noah recently on this subject are directly addressing that, the part that I've found to be the single most useful additional thing to monitor. It may not be obvious that's providing information that can be consumed by autovacuum, but that was my intention for how these pieces will ultimately fit together. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] show Heap Fetches in EXPLAIN for index-only scans
On Sat, Jan 21, 2012 at 9:50 PM, Jeff Janes jeff.ja...@gmail.com wrote: A review: [ review ] Thanks. Committed with hopefully-appropriate revisions. As a side-note, I noticed that I needed to run vacuum twice in a row to get the Heap Fetches to drop to zero. I vaguely recall that only one vacuum was needed when ios first went in (and I had instrumented it to elog heap-fetches). Does anyone know if this the expected consequence of one of the recent changes we made to vacuum? No, that's not expected. The change we made to vacuum was to skip pages that are busy - but it shouldn't be randomly skipping pages for no reason. I can reproduce what you're observing, though: [rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004' TLI: 0x0001 Prune XID: 0x Flags: 0x0005 (HAS_FREE_LINES) TLI: 0x0001 Prune XID: 0x Flags: 0x0005 (HAS_FREE_LINES) After updating a row in the table and checkpointing, the page the rows was on is marked full and the page that gets the new version becomes not-all-visible: [rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004' TLI: 0x0001 Prune XID: 0x03fb Flags: 0x0003 (HAS_FREE_LINES|PAGE_FULL) TLI: 0x0001 Prune XID: 0x Flags: 0x0001 (HAS_FREE_LINES) Now I vacuum the relation and checkpoint, and the page the *new* relation is on becomes all-visible: [rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004' TLI: 0x0001 Prune XID: 0x Flags: 0x0001 (HAS_FREE_LINES) TLI: 0x0001 Prune XID: 0x Flags: 0x0005 (HAS_FREE_LINES) Now I vacuum it again and checkpoint, and now the old page also becomes all-visible: [rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004' TLI: 0x0001 Prune XID: 0x Flags: 0x0005 (HAS_FREE_LINES) TLI: 0x0001 Prune XID: 0x Flags: 0x0005 (HAS_FREE_LINES) But it seems to me that this is expected (if non-optimal) behavior. Only the first pass of vacuum knows how to mark pages all-visible. After the update, the first pass of the first vacuum sees a dead tuple on the old page and truncates it to a dead line pointer. When it comes to the new page, it observes that the page is now all-visible and marks it so. It then does index vacuuming and returns to the first page, marking the dead line pointer unused. But during this second visit to the old page, there's no possibility of marking the page as all-visible, because the code doesn't know how to do that. The next vacuum's first pass, however, can do so, because there are no longer any dead tuples on the page. We could fix this by modifying lazy_vacuum_page(): since we have to dirty the buffer anyway, we could recheck whether all the remaining tuples on the page are now all-visible, and if so set the visibility map bit. This is probably desirable even apart from index-only scans, because it will frequently save the next vacuum the cost of reading, dirtying, and writing extra pages. There will be some incremental CPU cost, but that seems likely to be more than repaid by the I/O savings. Thoughts? Should we do this at all? If so, should we squeeze it into 9.2 or leave it for 9.3? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
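A minimal sketch of that suggestion, assuming a helper that re-verifies visibility of the remaining tuples (the helper name and the exact visibilitymap_set() argument list are assumptions for illustration, not the real implementation):

/* Hypothetical addition near the end of lazy_vacuum_page(), once the dead
 * line pointers have been marked unused and the buffer is already dirty.
 * lazy_page_now_all_visible() is an invented helper, and visibilitymap_set()
 * is shown with an assumed signature. */
if (lazy_page_now_all_visible(onerel, buffer, OldestXmin))
{
    PageSetAllVisible(page);        /* hint on the heap page itself */
    visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, &vmbuffer);   /* assumed signature */
}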
Re: [HACKERS] Group commit, revised
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I've been thinking, what exactly is the important part of this group commit patch that gives the benefit? Keeping the queue sorted isn't all that important - XLogFlush() requests for commits will come in almost the correct order anyway. I also don't much like the division of labour between groupcommit.c and xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back DoXLogFlush(), which is a bit hard to follow. (that's in my latest version; the original patch had similar division but it also cut across processes, with the wal writer actually doing the flush) It occurs to me that the WALWriteLock already provides much of the same machinery we're trying to build here. It provides FIFO-style queuing, with the capability to wake up the next process or processes in the queue. Perhaps all we need is some new primitive to LWLock, to make it do exactly what we need. Attached is a patch to do that. It adds a new mode to LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it is acquired and the function returns immediately. However, unlike normal LWLockConditionalAcquire(), if the lock is not free it goes to sleep until it is released. But unlike normal LWLockAcquire(), when the lock is released, the function returns *without* acquiring the lock. I modified XLogFlush() to use that new function for WALWriteLock. It tries to get WALWriteLock, but if it's not immediately free, it waits until it is released. Then, before trying to acquire the lock again, it rechecks LogwrtResult to see if the record was already flushed by the process that released the lock. This is a much smaller patch than the group commit patch, and in my pgbench-tools tests on my laptop, it has the same effect on performance. The downside is that it touches lwlock.c, which is a change at a lower level. Robert's flexlocks patch probably would've been useful here. I think you should break this off into a new function, LWLockWaitUntilFree(), rather than treating it as a new LWLockMode. Also, instead of adding lwWaitOnly, I would suggest that we generalize lwWaiting and lwExclusive into a field lwWaitRequest, which can be set to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc. That way we don't have to add another boolean every time someone invents a new type of wait - not that that should hopefully happen very often. A side benefit of this is that you can simplify the test in LWLockRelease(): keep releasing waiters until you come to an exclusive waiter, then stop. This possibly saves one shared memory fetch and subsequent test instruction, which might not be trivial - all of this code is extremely hot. We probably want to benchmark this carefully to make sure that it doesn't cause a performance regression even just from this: + if (!proc-lwWaitOnly) + lock-releaseOK = false; I know it sounds crazy, but I'm not 100% sure that that additional test there is cheap enough not to matter. Assuming it is, though, I kind of like the concept: I think there must be other places where these semantics are useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
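To make that suggestion concrete, here is a sketch of the generalized wait-request field (all names and numeric values are invented for illustration; this is not existing lwlock.c code):

/* Hypothetical replacement for the lwWaiting/lwExclusive booleans in PGPROC. */
#define LW_WAITREQ_NONE        0    /* not waiting on any LWLock */
#define LW_WAITREQ_EXCLUSIVE   1    /* wants the lock in exclusive mode */
#define LW_WAITREQ_SHARED      2    /* wants the lock in shared mode */
#define LW_WAITREQ_WAIT_FREE   3    /* wants a wakeup on release, not the lock */

/* Simplified wakeup loop in LWLockRelease(): wake queued processes until a
 * later exclusive waiter is reached.  (releaseOK bookkeeping and the rest of
 * the queue manipulation are omitted from this sketch.) */
for (proc = head; proc != NULL; proc = next)
{
    next = proc->lwWaitLink;
    if (proc->lwWaitRequest == LW_WAITREQ_EXCLUSIVE && proc != head)
        break;                          /* stop in front of a later exclusive waiter */
    proc->lwWaitRequest = LW_WAITREQ_NONE;
    PGSemaphoreUnlock(&proc->sem);      /* wake the process */
}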
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Sat, Jan 21, 2012 at 10:11 AM, Simon Riggs si...@2ndquadrant.com wrote: Your caution is wise. All users of an index have already checked whether the index is usable at plan time, so although there is much code that assumes they can look at the index itself, that is not executed until after the correct checks. I'll look at VACUUM and other utilities, so thanks for that. I don't see much point in having the higher level lock, except perhaps as a test this code works. I thought of another way this can cause a problem: suppose that while we're dropping the relation with only ShareUpdateExclusiveLock, we get as far as calling DropRelFileNodeBuffers. Just after we finish, some other process that holds AccessShareLock or RowShareLock or RowExclusiveLock reads and perhaps even dirties a page in the relation. Now we've got pages in the buffer pool that might even be dirty, but the backing file is truncated or perhaps removed altogether. If the page is dirty, I think the background writer will eventually choke trying to write out the page. If the page is not dirty, I'm less certain whether anything is going to explode violently, but it seems mighty sketchy to have pages in the buffer pool with a buffer tag that don't exist any more. As the comment says: * It is the responsibility of higher-level code to ensure that the * deletion or truncation does not lose any data that could be needed * later. It is also the responsibility of higher-level code to ensure * that no other process could be trying to load more pages of the * relation into buffers. ...and of course, the intended mechanism is an AccessExclusiveLock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Measuring relation free space
On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote: On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch n...@leadboat.com wrote: If someone feels like doing it, +1 for making pgstattuple() count non-leaf free space. actually i agreed that non-leaf pages are irrelevant... i just confirmed that in a production system with 300GB none of the indexes in an 84M rows table nor in a heavily updated one has more than 1 root page, all the rest are deleted, half_dead or leaf. so the posibility of bloat coming from non-leaf pages seems very odd FWIW, the number to look at is internal_pages from pgstatindex(): [local] test=# create table t4 (c) as select * from generate_series(1,100); SELECT 100 [local] test=# alter table t4 add primary key(c); NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index t4_pkey for table t4 ALTER TABLE [local] test=# select * from pgstatindex('t4_pkey'); -[ RECORD 1 ]--+- version| 2 tree_level | 2 index_size | 22478848 root_block_no | 290 internal_pages | 10 leaf_pages | 2733 empty_pages| 0 deleted_pages | 0 avg_leaf_density | 90.06 leaf_fragmentation | 0 So, 0.4% of this index. They appear in proportion to the logarithm of the total index size. I agree that bloat centered on them is unlikely. Counting them would be justified, but that is a question of formal accuracy rather than practical importance. but the possibility of bloat coming from the meta page doesn't exist, AFAIUI at least we need the most accurate value about usable free space, because the idea is to add a sampler mode to the function so we don't scan the whole relation. that's why we still need the function. I doubt we'd add this function solely on the basis that a future improvement will make it useful. For the patch to go in now, it needs to be useful now. (This is not a universal principle, but it mostly holds for low-complexity patches like this one.) All my comments below would also apply to such a broader patch. btw... pgstattuple also has the problem that it's not using a ring buffer attached are two patches: - v5: is the same original patch but only track space in leaf, deleted and half_dead pages - v5.1: adds the same for all kind of indexes (problem is that this is inconsistent with the fact that pageinspect only manages btree indexes for everything else) Let's take a step back. Again, what you're proposing is essentially a faster implementation of SELECT free_percent FROM pgstattuple(rel). If this code belongs in core at all, it belongs in the pgstattuple module. Share as much infrastructure as is reasonable between the user-visible functions of that module. For example, I'm suspecting that the pgstat_index() call tree should be shared, with pgstat_index_page() checking a flag to decide whether to gather per-tuple stats. Next, compare the bits of code that differ between pgstattuple() and relation_free_space(), convincing yourself that the differences are justified. Each difference will yield one of the following conclusions: 1. Your code contains an innovation that would apply to both functions. Where not too difficult, merge these improvements into pgstattuple(). In order for a demonstration of your new code's better performance to be interesting, we must fix the same low-hanging fruit in its competitor. One example is the use of the bulk read strategy. Another is the support for SP-GiST. 2. Your code is missing an essential behavior of pgstattuple(). Add it to your code. One example is the presence of CHECK_FOR_INTERRUPTS() calls. 3. 
Your code behaves differently from pgstattuple() due to a fundamental difference in their tasks. These are the only functional differences that ought to remain in your finished patch; please point them out in comments. For example, pgstat_heap() visits every tuple in the heap. You'll have no reason to do that; pgstattuple() only needs it to calculate statistics other than free_percent. In particular, I call your attention to the fact that pgstattuple() takes shared buffer content locks before examining pages. Your proposed patch does not do so. I do not know with certainty whether that falls under #1 or #2. The broad convention is to take such locks, because we elsewhere want an exact answer. These functions are already inexact; they make no effort to observe a consistent snapshot of the table. If you convince yourself that the error arising from not locking buffers is reasonably bounded, we can lose the locks (in both functions -- conclusion #1). Comments would then be in order. With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise,
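To make the shared-infrastructure suggestion concrete, here is a rough sketch of a per-page helper along the lines of pgstat_index_page(), once it uses a bulk-read strategy, takes the same share content lock pgstattuple() takes, and counts only leaf, deleted and half-dead pages. This is not code from any submitted patch; the IndexFreeStats struct and the function name are made up for illustration.

	#include "postgres.h"
	#include "access/nbtree.h"
	#include "miscadmin.h"
	#include "storage/bufmgr.h"
	#include "utils/rel.h"

	/* Hypothetical accumulator; not part of the pgstattuple module */
	typedef struct IndexFreeStats
	{
		uint64		free_bytes;
		uint64		total_bytes;
	} IndexFreeStats;

	static void
	index_page_free_space(Relation rel, BlockNumber blkno,
						  BufferAccessStrategy bstrategy,
						  IndexFreeStats *stats)
	{
		Buffer		buf;
		Page		page;

		CHECK_FOR_INTERRUPTS();		/* behavior pgstattuple() already has */

		/* A bulk-read ring keeps a big scan from evicting the whole buffer cache */
		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
		/* Match pgstattuple(): examine the page under a share content lock */
		LockBuffer(buf, BUFFER_LOCK_SHARE);
		page = BufferGetPage(buf);

		if (!PageIsNew(page))
		{
			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

			/* Count only leaf, deleted and half-dead pages, as in the v5 patch */
			if (P_ISLEAF(opaque) || P_ISDELETED(opaque) || P_ISHALFDEAD(opaque))
				stats->free_bytes += PageGetExactFreeSpace(page);
		}
		stats->total_bytes += BLCKSZ;

		UnlockReleaseBuffer(buf);
	}

A caller would obtain the strategy once with GetAccessStrategy(BAS_BULKREAD), skip block 0 (the btree metapage), and loop up to RelationGetNumberOfBlocks(rel), so the whole scan stays within a small ring of buffers rather than flushing shared_buffers.
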
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
On Wed, Jan 25, 2012 at 03:32:49PM -0500, Robert Haas wrote: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style: Thanks.

+	for (i = 0; i < old_natts && ret; i++)
+		ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+			   irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I value the savings in vertical space more than the lost idiomaticness. This decision is 90+% subjective, so I cannot blame you for concluding otherwise. I do know the feeling of looking at PostgreSQL source code and wishing the author had not attempted to conserve every line. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
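For comparison, the alternative shape Robert describes would read roughly like this. It is a sketch only, assuming the same surrounding variables as the committed loop, and should behave identically:

	if (ret)
	{
		for (i = 0; i < old_natts; i++)
		{
			/* Stop at the first column that is polymorphic or has changed type */
			if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
				irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
			{
				ret = false;
				break;
			}
		}
	}

Either form leaves ret true only when every indexed column keeps a non-polymorphic opclass input type and an unchanged attribute type; the difference is purely vertical space versus the more idiomatic early-exit style.
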
Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE
On Wed, Jan 25, 2012 at 11:17:27AM -0300, Alvaro Herrera wrote: Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012: Thanks. I've attached a new version fixing this problem. Looks good to me. Can you please clarify this bit?

	 * Since we elsewhere require that all collations share the same
	 * notion of equality, don't compare collation here.

Since I'm not familiar with this code, it's difficult for me to figure out where this "elsewhere" is, and what "same notion of equality" means. Good questions. See the comment in front of ri_GenerateQualCollation(), the comment in process_ordered_aggregate_single() at nodeAgg.c:605, the larger comment in add_sort_column(), and the two XXX comments in relation_has_unique_index_for(). We should probably document that assumption in xindex.sgml to keep type implementors apprised. "Same notion of equality" just means that "a COLLATE x = b COLLATE y" has the same value for every x, y. In any event, the patch needed a rebase, so I've attached it rebased and with that comment edited to reference ri_GenerateQualCollation(), that being the most-relevant source for the assumption in question. Thanks for reviewing, nm

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cb8ac67..ba20950 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 276,281 **** static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 ----
  					int numattrs, int16 *attnums, Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname, Relation rel, Relation pkrel,
***************
*** 357,362 **** static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd, List **wqueue, LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid, Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
***************
*** 5591,5596 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5593,5600 ----
  				numpks;
  	Oid			indexOid;
  	Oid			constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);
  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
***************
*** 5707,5712 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5711,5723 ----
  				(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  				 errmsg("number of referencing and referenced columns for foreign key disagree")));
+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+
  	for (i = 0; i < numpks; i++)
  	{
  		Oid			pktype = pktypoid[i];
***************
*** 5721,5726 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5732,5738 ----
  		Oid			ppeqop;
  		Oid			ffeqop;
  		int16		eqstrategy;
+ 		Oid			pfeqop_right;
  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
***************
*** 5763,5772 **** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped, eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped, eqstrategy);
  		else
! 			ffeqop = InvalidOid;	/* keep compiler quiet */
  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
--- 5775,5791
[HACKERS] cursors FOR UPDATE don't return most recent row
This is my test case (all in one session):

CREATE TABLE foo (
    key int PRIMARY KEY,
    value int
);
INSERT INTO foo VALUES (1, 1);

BEGIN;
DECLARE foo CURSOR FOR SELECT * FROM foo FOR UPDATE;
UPDATE foo SET value = 2 WHERE key = 1;
UPDATE foo SET value = 3 WHERE key = 1;
FETCH 1 FROM foo;
COMMIT;

I expected the FETCH to return one row, with the latest data, i.e. (1, 3), but instead it's returning empty. If instead I run both UPDATEs in another session, then I do get

alvherre=# FETCH 1 FROM foo;
 key | value
-----+-------
   1 |     3
(1 fila)

Is this intended? -- Álvaro Herrera alvhe...@alvh.no-ip.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL Restore process during recovery
On Tue, Jan 24, 2012 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote: Cleaned up the points noted, new patch attached in case you wish to review further. Still has bug, so still with me to fix. Thanks! Will review further. v3 patch contains lots of unrelated code changes like the following.

- <structfield>pid</structfield> column of the
+ <structfield>procpid</structfield> column of the

You seem to have failed to extract the patch from your repository correctly. So I used the v2 patch for the review. Sorry if I comment on things which you've already fixed in the v3 patch. Here are the comments. They are almost not serious problems.

+/*
+ * GUC parameters
+ */
+int	WalRestoreDelay = 1;

Did you forget to change guc.c to define wal_restore_delay as a GUC parameter? Or is it just that the source code comment is incorrect?

+	elog(FATAL, "recovery_restore_command is too long");

Typo: s/recovery_restore_command/restore_command

+	InitLatch(&walrstr->WALRestoreLatch); /* initialize latch used in main loop */

That latch is a shared one. OwnLatch() should be called instead of InitLatch()? If yes, it's better to call DisownLatch() when walrestore exits. Though skipping DisownLatch() would be harmless because the latch is never owned by a new process after walrestore exits.

+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile WalRestoreData *walrstr = WalRstr;
+
+		nextFileTli = walrstr->nextFileTli;

The declaration of walrstr is not required here because it's already done at the head of WalRestoreNextFile().

+	if (restoredFromArchive)
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile WalRestoreData *walrstr = WalRstr;

Same as above.

+#define TMPRECOVERYXLOG	"RECOVERYXLOG"

ISTM that it's better to move this definition to an include file and we should use it in all the places where the fixed value RECOVERYXLOG is still used. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
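For context, the shared-latch lifecycle the review is pointing at normally looks roughly like the sketch below. This is illustrative only: WalRestoreData, WalRstr and the walrestore loop belong to the patch under review, and their layout here is assumed rather than copied from it; only the latch calls are the real APIs.

	#include "postgres.h"
	#include "storage/latch.h"

	/* Assumed subset of the patch's shared state */
	typedef struct WalRestoreData
	{
		Latch		WALRestoreLatch;	/* lives in shared memory */
	} WalRestoreData;

	extern WalRestoreData *WalRstr;
	extern volatile bool shutdown_requested;	/* assumed signal flag */

	void
	WalRestoreShmemInit(void)
	{
		/* Run once, when the shared memory segment is created */
		InitSharedLatch(&WalRstr->WALRestoreLatch);
	}

	void
	WalRestoreMain(void)
	{
		/* The process that waits on a shared latch must own it first */
		OwnLatch(&WalRstr->WALRestoreLatch);

		while (!shutdown_requested)
		{
			(void) WaitLatch(&WalRstr->WALRestoreLatch,
							 WL_LATCH_SET | WL_TIMEOUT, 1000L);
			ResetLatch(&WalRstr->WALRestoreLatch);
			/* ... fetch the next segment with restore_command here ... */
		}

		/* Not strictly required, but tidy: give the latch back on exit */
		DisownLatch(&WalRstr->WALRestoreLatch);
	}

InitLatch(), by contrast, initializes a backend-private latch owned by the calling process, which is why OwnLatch() is the suggestion for a latch that lives in shared memory.
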
Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012: On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: New version that repairs a defective test case. Committed. I don't find this to be particularly good style:

+	for (i = 0; i < old_natts && ret; i++)
+		ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+			   irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so I just committed it as-is for now. I would have surrounded the loop with an if (ret) block and written the body of the loop as if (condition) { ret = false; break; }. I find that code way too clever. Not only is that code spectacularly unreadable, but has nobody noticed that this commit broke the buildfarm? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding shutdown checkpoint at failover
On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote: When I say skip the shutdown checkpoint, I mean remove it from the critical path of required actions at the end of recovery. We can still have a normal checkpoint kicked off at that time, but that no longer needs to be on the critical path. Any problems foreseen? If not, looks like a quick patch. Patch attached for discussion/review. This feature is what I want, and very helpful to shorten the failover time in streaming replication. Here are the review comments. Though I've not checked enough whether this feature works fine in all recovery patterns yet. LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery(). LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery(). The XLOG_END_OF_RECOVERY record is written to the WAL file with the newly assigned timeline ID. But it must be written to the WAL file with the old one. Otherwise, when re-entering a recovery after failover, we cannot find the XLOG_END_OF_RECOVERY record at all. Before the XLOG_END_OF_RECOVERY record is written, RmgrTable[rmid].rm_cleanup() might write WAL records. They also should be written to the WAL file with the old timeline ID. When a recovery target is specified, we cannot write new WAL to the file with the old timeline, because that would mean overwriting valid WAL records in it with new WAL. So when a recovery target is specified, ISTM that we cannot skip the end-of-recovery checkpoint. Or we might need to save all information about timelines in the database cluster instead of writing the XLOG_END_OF_RECOVERY record, and use it when re-entering a recovery. LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise, what if the server crashes after the new timeline history file is created and recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record has been flushed to disk? During recovery, when we replay the XLOG_END_OF_RECOVERY record, we should close the currently-opened WAL file and read the WAL file with the timeline which the XLOG_END_OF_RECOVERY record indicates. Otherwise, when re-entering recovery with the old timeline, we cannot reach the new timeline. OK, some bad things there, thanks for the insightful comments. I think you're right that we can't skip the checkpoint if xlog_cleanup writes WAL records, since that implies at least one and maybe more blocks have changed and need to be flushed. That can be improved upon, but not now in 9.2. Cleanup WAL is written in either the old or the new timeline, depending upon whether we increment it. So we don't need to change anything there, IMHO. The big problem is how we handle crash recovery after we startup without a checkpoint. No quick fixes there. So let me rethink this: The idea was that we can skip the checkpoint if we promote to normal running during streaming replication. WALReceiver has been writing to WAL files, so it can write more data without all of the problems noted. Rather than write the XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from the WALreceiver** as a dummy record by direct injection into the WAL stream. So the Startup process sees a WAL record that looks like it was written by the primary saying "promote yourself", although it was actually written locally by WALreceiver when requested to shut down.
That doesn't damage anything because we know we've received all the WAL there is. Most importantly we don't need to change any of the logic in a way that endangers the other code paths at end of recovery. Writing the record in that way means we would need to calculate the new tli slightly earlier, so we can input the correct value into the record. That also solves the problem of how to get additional standbys to follow the new master. The XLOG_END_OF_RECOVERY record is simply the contents of the newly written tli history file. If we skip the checkpoint and then crash before the next checkpoint we just change timeline when we see XLOG_END_OF_RECOVERY. When we replay the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli file and then switch to it. So this solves 2 problems: having other standbys follow us when they don't have archiving, and avoids the checkpoint. Let me know what you think. Looks good to me. One thing I would like to ask is that why you think walreceiver is more appropriate for writing XLOG_END_OF_RECOVERY record than startup process. I was thinking the opposite, because if we do so, we might be able to skip the end-of-recovery checkpoint even in file-based log-shipping case. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list
Re: [HACKERS] Online base backup from the hot-standby
On Thu, Jan 26, 2012 at 3:07 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote: What happens if we shutdown the WALwriter and then issue SIGHUP? SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about the case where smart shutdown kills walwriter but some backends are still running? Currently SIGHUP affects full_page_writes and running backends use the changed new value of full_page_writes. But in the patch, SIGHUP doesn't affect... To address the problem, we should either postpone the shutdown of walwriter until all backends have gone away, or leave the update of full_page_writes to checkpointer process instead of walwriter. Thought? checkpointer seems the correct place to me Done. Thanks a lot!! I proposed another small patch which fixes the issue about an error message of pg_basebackup, in this upthread. If it's reasonable, could you commit it? http://archives.postgresql.org/message-id/CAHGQGwENjSDN=f_vdpwvq53qru0cu9+wzkbvwnaexmawj-y...@mail.gmail.com Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.
On Wed, Jan 25, 2012 at 11:12 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas robertmh...@gmail.com wrote: I think that this would be a lot more clear if we described this as synchronous_commit = remote_write rather than just synchronous_commit = write. Actually, the internal constant is named that way already, but it's not exposed as such to the user. That's something to discuss at the end of the CF when people are less busy and we get more input. It's an easy change whatever we decide to do. Added this to 9.2 Open Items. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pause at end of recovery
On Wed, Dec 28, 2011 at 7:27 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Dec 22, 2011 at 6:16 AM, Simon Riggs si...@2ndquadrant.com wrote: I can see a reason to do this now. I've written patch and will commit on Friday. Nudge me if I don't. It's hard to write this so it works in all cases and doesn't work in the right cases also. Basically, we can't get in the way of crash recovery, so the only way we can currently tell a crash recovery from an archive recovery is the presence of restore_command. If you don't have that and you haven't set a recovery target, it won't pause and there's nothing I can do, AFAICS. Please test this and review before commit. What if wrong recovery target is specified and an archive recovery reaches end of WAL files unexpectedly? Even in this case, we want to pause recovery at the end? Otherwise, we'll lose chance to correct the recovery target and retry archive recovery. One idea; starting archive recovery with standby_mode=on meets your needs? When archive recovery reaches end of WAL files, regardless of whether recovery target is specified or not, recovery pauses at the end. If hot_standby is enabled, you can check the contents and if it's OK you can finish recovery by pg_ctl promote. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On 01/25/2012 06:40 PM, Tom Lane wrote: Marko Kreen mark...@gmail.com writes: On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote: Huh? How can that work? If we decide to change the representation of some other well known type, say numeric, how do we decide whether a client setting that bit is expecting that change or not? It sets that bit *and* version code - which means that it is up-to-date with all well-known type formats in that version. Then why bother with the bit in the format code? If you've already done some other negotiation to establish what datatype formats you will accept, this doesn't seem to be adding any value. Basically, I see 2 scenarios here: 1) Client knows the result types and can set the text/bin/version code safely, without further restrictions. 2) There is a generic framework that does not know query contents but can be expected to track Postgres versions closely. Such a framework cannot say binary for results safely, but *could* do it for some well-defined subset of types. The hole in approach (2) is that it supposes that the client side knows the specific datatypes in a query result in advance. While this is sometimes workable for application-level code that knows what query it's issuing, it's really entirely untenable for a framework or library. The only way that a framework can deal with arbitrary queries is to introduce an extra round trip (Describe step) to see what datatypes the query will produce so it can decide what format codes to issue ... and that will pretty much eat up any time savings you might get from a more efficient representation. This is pretty much what the jdbc driver already does, since it does not have 100% coverage of even current binary formats. The first time you execute a query it requests text encoding, but caches the Describe results. Next time it sets the binary bits on all return columns that it knows how to decode. You really want to do the negotiation once, at connection setup, and then be able to process queries without client-side prechecking of what data types will be sent back. I think my original minor_version patch tried to do that. It introduced a per-connection setting for version. The server GUC_REPORTed the maximum supported minor_version but defaulted to the baseline wire format. The jdbc client could bump the minor_version to a supported higher value (error if the value is larger than what the server advertised). A way was provided for the application using the jdbc driver to override the requested minor_version in the rare event that something broke (rare, because the jdbc driver generally does not expose the wire-encoding to applications). Now if pgbouncer and other pooling solutions would reset the minor_version to 0 then it should work. Scenarios where the other end is too old to know about the minor_version:
Vserver > Vlibpq = client does nothing -> use baseline version
Vlibpq > Vserver = no supported_minor_version in GUC_REPORT -> use baseline
Normal 9.2+ scenarios:
Vserver > Vlibpq = libpq sets minor_version to the largest that it supports -> libpq requested version used
Vlibpq > Vserver = libpq notices that the server's supported value is lower than its own, so it sets minor_version to the server's supported value -> server version used
For a perl driver that exposes the wire format to the application by default, I can envision that the driver needs to add a new API that applications need to use to explicitly bump the minor_version up instead of defaulting to the largest supported by the driver as in jdbc/libpq.
The reason why I proposed an incrementing minor_version instead of bit flags of new encodings was that it takes less space and is easier to document and understand, so that exposing it to applications is possible. But how to handle postgres extensions that change their wire-format? Maybe we do need to have oid:minor_version,oid:ver,oid_ver as the negotiated version variable? -Mikko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
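To make the client side of that negotiation concrete, a rough sketch is below. It is hedged: the supported_minor_version report and the minor_version setting are part of this proposal, not of any released server, so both names are assumptions here; only the libpq calls themselves are real.

	#include <stdio.h>
	#include <stdlib.h>
	#include <libpq-fe.h>

	#define CLIENT_MAX_MINOR_VERSION 1	/* highest wire format this client knows */

	static void
	negotiate_minor_version(PGconn *conn)
	{
		/* GUC_REPORTed parameters arrive as ParameterStatus and are cached here */
		const char *reported = PQparameterStatus(conn, "supported_minor_version");
		int			server_max = reported ? atoi(reported) : 0;
		int			use = server_max < CLIENT_MAX_MINOR_VERSION
							? server_max : CLIENT_MAX_MINOR_VERSION;
		char		sql[64];
		PGresult   *res;

		if (use <= 0)
			return;				/* old server: stay on the baseline format */

		snprintf(sql, sizeof(sql), "SET minor_version = %d", use);
		res = PQexec(conn, sql);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "could not set minor_version: %s", PQerrorMessage(conn));
		PQclear(res);
	}

A pooler such as pgbouncer would reset minor_version to 0 when handing the connection to a new client, which is what makes the per-connection GUC approach composable with connection pooling.
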