Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 9:34), Simon Riggs wrote: AFAICS, all that has happened is that people have given their opinions and we've got almost the identical patch, with a rush-rush comment to commit even though we've waited months. If you submit a patch, then you need to listen to feedback and be clear about what you will do next; if you don't, people will learn to ignore you and nobody wants that. I believe I already replied that it would be too heavy. If we implemented a histogram in pg_stat_statements, we would have to store the histogram data in double precision arrays, and on every statement we would have to search those arrays to find where the response time falls. That costs a lot of memory, and the update would be heavier than any benefit we would get from it. So I added only stddev, to keep pg_stat_statements as fast as the current version. Several people agreed with that, as you say. On 21 January 2014 21:19, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 21, 2014 at 11:48 AM, Simon Riggs si...@2ndquadrant.com wrote: I agree with people saying that stddev is better than nothing at all, so I am inclined to commit this, in spite of the above. I could live with stddev. But we really ought to be investing in making pg_stat_statements work well with third-party tools. I am very wary of enlarging the counters structure, because it is protected by a spinlock. There has been no attempt to quantify that cost, nor has anyone even theorized that it is not likely to be appreciable. OK, Kondo, please demonstrate benchmarks that show we have less than 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, I am testing with DBT-2 now. However, the benchmark's error margin may itself be higher than 1%, so I will show you the detailed HTML results. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
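For context on why stddev is so much cheaper than a histogram here: a running standard deviation needs only three counters per entry, updated in constant time per statement with Welford's method, with no per-bucket array to search. A sketch in Python (illustrative names, not the patch's actual code):

```python
# Welford's online algorithm: track mean and variance of execution
# times in O(1) per statement, using only three counters -- no
# per-entry bucket array, unlike a histogram.
class RunningStats:
    def __init__(self):
        self.n = 0          # number of executions seen
        self.mean = 0.0     # running mean of execution time
        self.m2 = 0.0       # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # population standard deviation; 0.0 until two samples exist
        if self.n < 2:
            return 0.0
        return (self.m2 / self.n) ** 0.5
```

This is the kind of accumulation that fits inside a spinlock-protected counters struct, which is exactly why the histogram alternative was rejected.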
Re: [HACKERS] Trigger information for auto_explain.
Hello, I came back with a doc patch and a revised 0002 patch. I think documentation is the only thing that stops this patch from being committable... can you add it? Agreed. I have pushed patch 0001 for now. Thank you, I'll add it soon. While writing the doc I found that the default setting for log_triggers was true in the last patch, but false is better regarding backward compatibility. The attached 0002 patch has been changed there. - 0002_auto_explain_triggers_v2_20140122.patch: default value for log_triggers changed from 'true' to 'false', as the added documentation says. - 0003_auto_explain_triggers_doc_v1_20140122.patch: documentation. regards, -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index c9b8192..53f38cb 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -124,6 +124,24 @@ LOAD 'auto_explain';
   <varlistentry>
    <term>
+     <varname>auto_explain.log_triggers</varname> (<type>boolean</type>)
+    </term>
+    <indexterm>
+     <primary><varname>auto_explain.log_triggers</varname> configuration parameter</primary>
+    </indexterm>
+    <listitem>
+     <para>
+      <varname>auto_explain.log_triggers</varname> causes additional trigger
+      statistics output to be printed when an execution plan is logged. This
+      parameter is off by default. Only superusers can change this
+      setting. This parameter has no effect
+      unless the <varname>auto_explain.log_analyze</varname> parameter is set.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
     <varname>auto_explain.log_format</varname> (<type>enum</type>)
    </term>
    <indexterm>

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index af68479..2354327 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -25,6 +25,7 @@
 static int  auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = false;
 static int  auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
@@ -113,6 +114,17 @@ _PG_init(void)
                             NULL,
                             NULL);
 
+   DefineCustomBoolVariable("auto_explain.log_triggers",
+                            "Collect trigger stats, available when log_analyze.",
+                            NULL,
+                            &auto_explain_log_triggers,
+                            false,
+                            PGC_SUSET,
+                            0,
+                            NULL,
+                            NULL,
+                            NULL);
+
    DefineCustomEnumVariable("auto_explain.log_format",
                             "EXPLAIN format to be used for plan logging.",
                             NULL,
@@ -295,6 +307,8 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
            ExplainBeginOutput(&es);
            ExplainQueryText(&es, queryDesc);
            ExplainPrintPlan(&es, queryDesc);
+           if (es.analyze && auto_explain_log_triggers)
+               ExplainPrintTriggers(&es, queryDesc);
            ExplainEndOutput(&es);
 
            /* Remove last line break */
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 09:25 AM, Alexander Korotkov wrote: On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I don't see why we can't just insert duplicate TIDs into the uncompressed area and eliminate them during re-encoding. It would mean somewhat more work during updates, more work during scans, and more WAL records. But is that really significant for real-life workloads?
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In-memory each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (though we haven't found such an encoding yet). Is that right? The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).
One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add or remove posting lists on the page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing. - Heikki
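For readers unfamiliar with the encoding being debated above: each item pointer in a posting list is stored as the delta from its predecessor, packed into 7-bit groups with a continuation bit. A minimal Python sketch of the idea (the real code in ginpostinglists.c packs 48-bit ItemPointers, not plain integers):

```python
def varbyte_encode(items):
    """Encode a sorted list of non-negative integers as deltas,
    each delta split into 7-bit groups, high bit = continuation."""
    out = bytearray()
    prev = 0
    for item in items:
        delta = item - prev
        prev = item
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # more bytes follow for this delta
            else:
                out.append(byte)         # last byte of this delta
                break
    return bytes(out)

def varbyte_decode(data):
    """Inverse of varbyte_encode: rebuild the absolute item values."""
    items, prev, delta, shift = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation: accumulate next 7 bits
        else:
            prev += delta       # delta complete: emit absolute value
            items.append(prev)
            delta, shift = 0, 0
    return items
```

Dense TID lists have tiny deltas, so most items cost a single byte, which is where the on-disk size reduction discussed in this thread comes from.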
Re: [HACKERS] Proposal: variant of regclass
On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Regards, Marti
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: Actually, I think the whole thing is rather badly designed, as warm cache or no you're still proposing to do thousands of kernel calls while holding a potentially contended LWLock. I suggest what we do is (1) read in the whole file, (2) acquire the lock, (3) re-read the whole file in the same buffer, (4) work from the buffer. fseeko() and fread() are part of the standard library, not any kernel. I don't believe that what I've done in qtext_fetch() implies thousands of kernel calls, or thousands of context switches. I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. It seems that glibc is actually content to always make lseek() and read() syscalls in the event of an fseek(), even though it does maintain its own buffer and could in principle have a lot more gumption about that [1]. I agree that we should copy everything into a buffer in one go. I think that making the LWLocking duration as brief as possible isn't critically important. Automated tools will only be interested in new query texts (implying possible I/O while share locking) as they observe new entries. But new entries are the only thing worth considering that may potentially block on that shared lock (although, not as they write out their query texts, which almost always just requires a shared lock). New entries ought to be a very rare occurrence compared to the execution of queries - certainly, on the busy production systems I use pg_stat_statements on, it may be weeks before a new query is observed (uh, with the possible exception of *my* ad-hoc pg_stat_statements queries). On my laptop, explain analyze select * from pg_stat_statements indicates a total runtime of ~9ms with 5,000 regression test entries, even with the (still modest) overhead of doing all those read()/lseek() calls.
So we're talking about a small impact on a new entry, that will only be affected once, that was always going to have to write out its query text to file anyway. On top of all this, in general it's very improbable that any given new entry will be affected at all, because two unlikely things need to happen at the same time for that to be possible. The most important aspect of keeping the overhead of pg_stat_statements low is that there not be too much cache pressure. Obviously the huge reduction in its shared memory footprint that this patch allows helps with that. [1] http://www.pixelbeat.org/programming/stdio_buffering/ -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
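The buffer-once approach agreed on above can be pictured like this: read the whole external query-text file into memory in one go, then serve each entry's text by (offset, length) slicing, with no per-entry seeks while the lock is held. This is only a sketch of the idea; the file layout and names are illustrative, not pg_stat_statements' actual format:

```python
def load_query_texts(path, entries):
    """entries: list of (offset, length) pairs, as a hash-table entry
    might store them. One bulk read() replaces an lseek()+read() pair
    per entry, so no I/O happens while iterating the entries."""
    with open(path, "rb") as f:
        buf = f.read()                    # single bulk read
    texts = []
    for off, length in entries:
        if off + length > len(buf):       # stale entry vs. truncated file
            texts.append(None)
        else:
            texts.append(buf[off:off + length].decode())
    return texts
```

The stale-entry check matters because, as in the real patch, the file can be garbage-collected out from under an old offset.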
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: BTW shouldn't there be an fflush() in qtext_store? I don't think so, no. Whenever qtext_store() callers don't fflush() before releasing their exclusive lock, they FreeFile() before doing so instead. In the case of pgss_shmem_startup() there is no lock held anyway, because it doesn't matter, for the same reason it doesn't matter that we're modifying shmem structures in a way that ordinarily necessitates an exclusive lock. Why fflush() every qtext_store() call if there is expected to be more than one, as is the case for several callers? Having said all that, it occurs to me now that I could have been smarter about where I fflush(). I'm pretty sure pgss_store() should have two fflush() calls. The first can occur with just a shared LWLock held, before the lock strength promotion interim (you probably noticed that I no longer generate a normalized query text in that interim, for reasons that are obvious). The second fflush() may occur only in the very unlikely event of a garbage collection necessitating a new qtext_store() call with an exclusive lock held, a few lines after the first fflush(). No need to fflush() with the exclusive lock the vast majority of the time. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? Originally the proposal checked for errors such as syntax errors in addition to missing objects, so I think raiseError was more appropriate at that time. Now the functions only check for missing objects, so renaming it to missing_ok would be more appropriate. I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. Leave the discussion to Yugo.. if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Not sure. There's already at least one counterexample: pg_my_temp_schema() oid OID of session's temporary schema, or 0 if none Best regards, -- Tatsuo Ishii SRA OSS, Inc.
Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
On 22 January 2014 04:42, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 31st December 2013, Christian Kruse Wrote: Hi there, I created two patches improving the log messages generated by log_lock_waits. The first patch shows the process IDs holding a lock we try to acquire (show_pids_in_lock_log.patch), sample output (log_lock_waits=on required): session 1: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2: BEGIN; LOCK TABLE foo IN SHARE MODE; session 3: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; Output w/o patch: LOG: process 13777 still waiting for ExclusiveLock on relation 16385 of database 16384 after 1000.030 ms Output with patch: LOG: process 13777 still waiting for ExclusiveLock on relation 16385 of database 16384 after 1000.030 ms CONTEXT: processes owning lock: 13775, 13776 I am reviewing this patch. The idea seems reasonable. Following are my first-level observations: 1. I found an issue in the following scenario: session 1 with process id X: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2 with process id Y: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; session 3 with process id Z: BEGIN; LOCK TABLE foo IN SHARE MODE; On execution of LOCK in session-3, the log will display: processes owning lock: X, Y But actually Y has not yet acquired the lock; it is still waiting, with higher priority. That may mislead the user. Maybe we should change the message to convey both: which processes own the lock and which are already in the queue. Perhaps this? CONTEXT: lock owner request queue XXX, XXX, XXX, etc 2. Can we give a better name to the new variable 'buf1'? 3. Do we need to take performance readings to see if there is any impact? Don't think so. Diagnosing problems will help performance, not hinder it. 4. Do we require documentation? Don't think so.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
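Rajeev's point, that a queued-but-not-granted process must not be listed as an owner, amounts to building the CONTEXT line from two separate sets. A hypothetical sketch of the message formatting (names invented for illustration, not the patch's code):

```python
def lock_context(owner_pids, waiting_pids):
    """Build a log CONTEXT line naming both the lock owners and the
    already-queued waiters, so a waiter with higher priority is never
    reported as if it already owned the lock."""
    parts = ["processes owning lock: " + ", ".join(map(str, owner_pids))]
    if waiting_pids:
        parts.append("wait queue: " + ", ".join(map(str, waiting_pids)))
    return "; ".join(parts)
```

In the session-1/2/3 scenario above, X would appear as the owner and Y and Z in the wait queue, which removes the ambiguity Rajeev describes.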
Re: [HACKERS] Proposal: variant of regclass
On Wed, 22 Jan 2014 20:04:12 +0900 (JST) Tatsuo Ishii is...@postgresql.org wrote: On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? Originally the proposal checked for errors such as syntax errors in addition to missing objects, so I think raiseError was more appropriate at that time. Now the functions only check for missing objects, so renaming it to missing_ok would be more appropriate. I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. Leave the discussion to Yugo.. parseTypeString() is called by some other functions, and I avoided affecting them by changing its definition, since it should raise errors in most cases. The same reasoning applies to the other *MissingOk functions in parse_type.c. Would it be better to change the definitions of these functions and update all their callers? if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL.
On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Not sure. There's already at least one counterexample: pg_my_temp_schema() oid OID of session's temporary schema, or 0 if none Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Yugo Nagata nag...@sraoss.co.jp
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:23, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:30, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills, WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
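The control flow of step 2 above can be sketched as a toy: a fixed-capacity log whose writer, on hitting the limit, runs a checkpoint callback (which, per step 1, must itself write no WAL) and then recycles space, instead of PANICking. All names here are invented for illustration:

```python
class BoundedLog:
    """Toy analogue of a bounded WAL: appenders never fail on a full
    log; they trigger a checkpoint and recycle space instead."""

    def __init__(self, capacity, checkpoint):
        self.capacity = capacity      # max bytes of log we may keep
        self.records = []             # records not yet checkpointed
        self.used = 0
        self.checkpoint = checkpoint  # flushes data; must not append here

    def append(self, record):
        if self.used + len(record) > self.capacity:
            # Analogue of: signal the checkpointer, wait for completion,
            # then clean out old segments -- rather than PANIC.
            self.checkpoint(self.records)
            self.records.clear()
            self.used = 0
            if len(record) > self.capacity:
                raise IOError("single record exceeds log capacity")
        self.records.append(record)
        self.used += len(record)
```

The final guard also illustrates why the thread worries about very large records (e.g. huge commit records): no amount of checkpointing helps if one record alone exceeds the space available.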
GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
On 01/22/2014 03:39 AM, Tomas Vondra wrote: What annoys me a bit is the huge size difference between the index updated incrementally (by a sequence of INSERT commands), and the index rebuilt from scratch using VACUUM FULL. It's a bit better with the patch (2288 vs. 2035 MB), but is there a chance to improve this? Hmm. What seems to be happening is that pending item list pages that the fast update mechanism uses are not getting recycled. When enough list pages are filled up, they are flushed into the main index and the list pages are marked as deleted. But they are not recorded in the FSM, so they won't be recycled until the index is vacuumed. Almost all of the difference can be attributed to deleted pages left behind like that. So this isn't actually related to the packed postinglists patch at all. It just makes the bloat more obvious, because it makes the actual size of the index, excluding deleted pages, smaller. But it can be observed on git master as well: I created a simple test table and index like this: create table foo (intarr int[]); create index i_foo on foo using gin(intarr) with (fastupdate=on); I filled the table like this: insert into foo select array[-1] from generate_series(1, 1000) g;

postgres=# \d+
                    List of relations
 Schema | Name | Type  | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
 public | foo  | table | heikki | 575 MB |
(1 row)

postgres=# \di+
                        List of relations
 Schema | Name  | Type  | Owner  | Table |  Size  | Description
--------+-------+-------+--------+-------+--------+-------------
 public | i_foo | index | heikki | foo   | 251 MB |
(1 row)

I wrote a little utility that scans all pages in a gin index, and prints out the flags indicating what kind of a page it is. The distribution looks like this:

    19 DATA
  7420 DATA LEAF
 24701 DELETED
     1 LEAF
     1 META

I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage, after the list pages have been marked as deleted, but unfortunately that didn't help.
The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. The upper levels are not updated until the FSM is vacuumed, so the pages are still not visible to GetFreeIndexPage calls until the next vacuum. The simplest fix would be to add a call to IndexFreeSpaceMapVacuum after flushing the pending list, per attached patch. I'm slightly worried about the performance impact of the IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update the upper levels of the FSM in a retail fashion instead. - Heikki

*** a/src/backend/access/gin/ginfast.c
--- b/src/backend/access/gin/ginfast.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/gin_private.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
+ #include "storage/indexfsm.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"

***************
*** 434,440 **** ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
--- 435,453 ----
  	END_CRIT_SECTION();
  
  	if (needCleanup)
+ 	{
  		ginInsertCleanup(ginstate, false, NULL);
+ 
+ 		/*
+ 		 * Vacuum the FSM to make the deleted list pages available for re-use.
+ 		 *
+ 		 * ginInsertCleanup marked them as free in the FSM by calling
+ 		 * RecordFreeIndexPage, but that only updates the bottom FSM level.
+ 		 * Since GetFreeIndexPage scans from the top, they won't be visible
+ 		 * for reuse until the upper levels have been updated.
+ 		 */
+ 		IndexFreeSpaceMapVacuum(index);
+ 	}
  }

***************
*** 599,608 **** shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
  		}
  	}
  
  	for (i = 0; i < data.ndeleted; i++)
  		UnlockReleaseBuffer(buffers[i]);
! 
! 	END_CRIT_SECTION();
  } while (blknoToDelete != newHead);
  
  	return false;
--- 612,625 ----
  		}
  	}
  
+ 	END_CRIT_SECTION();
+ 
  	for (i = 0; i < data.ndeleted; i++)
+ 	{
+ 		BlockNumber blkno = BufferGetBlockNumber(buffers[i]);
  		UnlockReleaseBuffer(buffers[i]);
! 		RecordFreeIndexPage(index, blkno);
! 	}
  } while (blknoToDelete != newHead);
  
  	return false;
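The FSM behaviour Heikki describes, that RecordFreeIndexPage touches only the leaf level while searches descend from the root, can be modeled with a toy two-level map. The class and names are purely illustrative, not PostgreSQL's FSM code:

```python
class TinyFSM:
    """Two-level free space map: leaves hold per-page free space,
    each root slot holds the max over a group of leaves. Searches
    descend from the root, so a stale root hides fresh leaf updates."""
    GROUP = 4

    def __init__(self, npages):
        self.leaf = [0] * npages
        self.root = [0] * ((npages + self.GROUP - 1) // self.GROUP)

    def record_free(self, page, amount):
        # bottom level only, like RecordFreeIndexPage
        self.leaf[page] = amount

    def vacuum(self):
        # re-aggregate upper level, like IndexFreeSpaceMapVacuum
        for g in range(len(self.root)):
            chunk = self.leaf[g * self.GROUP:(g + 1) * self.GROUP]
            self.root[g] = max(chunk)

    def search(self, need):
        # top-down descent, like GetFreeIndexPage
        for g, avail in enumerate(self.root):
            if avail >= need:
                lo, hi = g * self.GROUP, min((g + 1) * self.GROUP, len(self.leaf))
                for p in range(lo, hi):
                    if self.leaf[p] >= need:
                        return p
        return None
```

A freed page recorded via record_free stays invisible to search until vacuum refreshes the root, which is exactly why the deleted GIN list pages were not reused until the next index vacuum.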
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 09:25 AM, Alexander Korotkov wrote: On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page is modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I didn't get why insertion of duplicate TIDs to uncompressed area and eliminate them of re-encoding? It would be somewhat more work during updates, more work during scan, more WAL records. But is it really significant for real-life workloads? 
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (but we haven't found that something yet). Is that right? The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment). 
One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs to. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing. We already spent a lot of time with compression. Now we need to figure out the result we want see. I spent quite
Re: [HACKERS] improve the help message about psql -F
OK, I will do it. Thanks. Jov blog: http://amutu.com/blog 2014/1/21 Marti Raudsepp ma...@juffo.org On Mon, Jan 20, 2014 at 2:04 PM, Jov am...@amutu.com wrote: reasonable, I removed the set, patch attached. Hi Jov, A new commitfest was just opened, due on 2014-06. Please add your patch here: https://commitfest.postgresql.org/action/commitfest_view?id=22 (You'll need a community account if you don't already have one) Sometimes simple fixes like yours are merged outside a CommitFest, but adding it there makes sure it won't get lost. Regards, Marti
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Tue, Jan 21, 2014 at 11:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jan 22, 2014 at 9:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 6:57 PM, MauMau maumau...@gmail.com wrote: To follow this, we have the line as: #event_source = 'PostgreSQL 9.4' But this requires us to change this line for each major release. That's a maintenance headache. What I had in mind was to change it during initdb, we are already doing it for some other parameter (unix_socket_directories), What happens when somebody copies their postgresql.conf from an older version? That's hardly uncommon, even though it might be considered bad practice. I don't think it's a good idea to try to insert a version identifier this way. But ... what's the point of including the PG version in this string anyhow? If you're trying to make the string unique across different installations on the same machine, it's surely insufficient, and if that's not the point then what is? Well, certainly it cannot handle all the different scenarios (particularly same-version installations), but the original report for this case was for different versions of the server. I think the chances of having different versions of the server are much higher, and that case will be handled. I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, error range of benchmark might be 1% higher. So I show you detail HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
Heikki Linnakangas wrote: I wrote a little utility that scans all pages in a gin index, and prints out the flags indicating what kind of a page it is. The distribution looks like this:

   19 DATA
 7420 DATA LEAF
24701 DELETED
    1 LEAF
    1 META

Hah. I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage, after the list pages have been marked as deleted, but unfortunately that didn't help. The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. Interesting. I think the idea of having an option for RecordFreeIndexPage to update upper levels makes sense (no need to force it for other users.) Some time ago I proposed an index-only cleanup for vacuum. That would help GIN get this kind of treatment (vacuuming its FSM and processing the pending list) separately from vacuuming the index. It's probably too late for 9.4 though. One other thing worth considering in this area is that making the pending list size depend on work_mem appears to have been a really bad idea. I know one case where the server is really large and seems to run mostly OLAP type stuff with occasional updates, so they globally set work_mem=2GB; they have GIN indexes for text search, and the result is horrible performance 90% of the time, then a vacuum cleans the pending list and it is blazing fast until the pending list starts getting big again. Now you can argue that setting work_mem to that value is a bad idea, but as it turns out, in this case other than the GIN pending list it seems to work fine. Not related to the patch at hand, but I thought I would put it out for consideration, 'cause I'm not gonna start a new thread about it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Tom Lane t...@sss.pgh.pa.us wrote: Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER SYSTEM SET typos and fix for temporary file name management
On Tue, Jan 21, 2014 at 7:02 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jan 22, 2014 at 5:29 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I agree with Michael that having pg_basebackup be aware of the .temp suffix is ugly; for instance if we were to fix it to .tmp in ALTER SYSTEM but forgot to change pg_basebackup, the check would be immediately broken. But on the other hand I'm not sure why it's such a problem for pg_basebackup that it needs to be checking in the first place -- how about we rip that out? Coupled with the addition of a pgsql_tmp prefix to the temp configuration file name would work, yes we could remove this check part. Perhaps the temp file needs to be in pgsql_tmp? (Do we need to worry about cross-filesystem links if we do? I mean, do we support mounting pgsql_tmp separately?) Using pgsql_tmp looks a bit weird as well, as this prefix is used only for temporary files doing database-related operations, and ALTER SYSTEM is not one. Doing that this would need a mention in the docs at least: http://www.postgresql.org/docs/devel/static/storage-file-layout.html Thoughts? I think moving the temp file to a different directory than the permanent file is a non-starter. As Alvaro says, that could be on a different file system, and then attempting to rename() the file into place would fail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dynamic Shared Memory stuff
On Tue, Jan 21, 2014 at 2:58 PM, Noah Misch n...@leadboat.com wrote: What do people prefer? I recommend performing cleanup on the control segment named in PGShmemHeader just before shmdt() in PGSharedMemoryCreate(). No new ERROR or WARNING sites are necessary. Have dsm_postmaster_startup() continue to perform a cleanup on the control segment named in the state file. I think I'm on board with the first two sentences of this, but after Fujii Masao's email yesterday, I can't help thinking that what you propose the third sentence is a bad idea. He cloned a master to create a standby server on the same machine, and the standby startup ate the master's dynamic shared memory. We could teach pg_basebackup not to copy the state file, but that wouldn't help people who take base backups using the file system copy method, which is a lot of people. 5. Give up on this approach. We could keep what we have now, or make the DSM control segment land at a well-known address as we do for the main segment. How would having the DSM control segment at a well-known address affect the problem at hand? Did you mean a well-known dsm_handle? Yeah. So the idea is that we'd always use a dsm_handle of 100 + (100 * port) or something like that, and then search forward until we find a dsm_handle that works. This is basically the same algorithm we're using today for the main shared memory segment, but with a large additive constant so that they don't collide with each other. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 21, 2014 at 11:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jan 22, 2014 at 9:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 6:57 PM, MauMau maumau...@gmail.com wrote: To follow this, we have the line as: #event_source = 'PostgreSQL 9.4' But this requires us to change this line for each major release. That's a maintenance headache. What I had in mind was to change it during initdb, we are already doing it for some other parameter (unix_socket_directories), What happens when somebody copies their postgresql.conf from an older version? That's hardly uncommon, even though it might be considered bad practice. I don't think it's a good idea to try to insert a version identifier this way. But ... what's the point of including the PG version in this string anyhow? If you're trying to make the string unique across different installations on the same machine, it's surely insufficient, and if that's not the point then what is? Well, certainly it cannot handle all different scenario's (particularly same version installations), but the original report for this case was for different versions of server. I think chances of having different versions of server are much more, which will be handled by this case. I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for register of event source which might be done before start of server. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] The problems of PQhost()
Hi, I reported in other thread that PQhost() has three problems. http://www.postgresql.org/message-id/cahgqgwe77akyabywde5+8bjldopthp7cnswao_syedjogyv...@mail.gmail.com (1) PQhost() can return Unix-domain socket directory path even in the platform that doesn't support Unix-domain socket. (2) In the platform that doesn't support Unix-domain socket, when neither host nor hostaddr are specified, the default host 'localhost' is used to connect to the server and PQhost() must return that, but it doesn't. (3) PQhost() cannot return the hostaddr. As the result of these problems, you can see the incorrect result of \conninfo, for example. $ psql -d hostaddr=127.0.0.1 =# \conninfo You are connected to database postgres as user postgres via socket in /tmp at port 5432. Obviously /tmp should be 127.0.0.1. Attached patch fixes these problems. We can fix the problem (3) by changing PQhost() so that it also returns the hostaddr. But this change might break the existing application using PQhost(). So, I added new libpq function PQhostaddr() which returns the hostaddr, and changed \conninfo so that it reports correct connection information by using both PQhost() and PQhostaddr(). These problems exist in v9.3 or before. Basically we should backport the patch into those versions. But obviously it's too late to add new libpq function PQhostaddr() into those versions. Then, I'm thinking to backport only the change for (1) and (2) because we can fix them without adding new libpq function. BTW, I think that the patch also fixes the problem of \conninfo that MauMau reported in other thread. 
http://www.postgresql.org/message-id/8913B51B3B534F878D54B89E1EB964C6@maumau Regards, -- Fujii Masao

*** a/doc/src/sgml/libpq.sgml
--- b/doc/src/sgml/libpq.sgml
***************
*** 1464,1469 ****
--- 1464,1487 ----
  char *PQhost(const PGconn *conn);
       </listitem>
      </varlistentry>

+     <varlistentry id="libpq-pqhostaddr">
+      <term>
+       <function>PQhostaddr</function>
+       <indexterm>
+        <primary>PQhostaddr</primary>
+       </indexterm>
+      </term>
+
+      <listitem>
+       <para>
+        Returns the server numeric IP address or host name of the connection.
+ <synopsis>
+ char *PQhostaddr(const PGconn *conn);
+ </synopsis>
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="libpq-pqport">
       <term>
        <function>PQport</function>

*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
***************
*** 300,306 **** exec_command(const char *cmd,
  	else if (strcmp(cmd, "conninfo") == 0)
  	{
  		char	   *db = PQdb(pset.db);
! 		char	   *host = PQhost(pset.db);

  		if (db == NULL)
  			printf(_("You are currently not connected to a database.\n"));
--- 300,306 ----
  	else if (strcmp(cmd, "conninfo") == 0)
  	{
  		char	   *db = PQdb(pset.db);
! 		char	   *host = (PQhostaddr(pset.db) != NULL) ? PQhostaddr(pset.db) : PQhost(pset.db);

  		if (db == NULL)
  			printf(_("You are currently not connected to a database.\n"));

*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** 165,167 ****
--- 165,168 ----
  lo_lseek64                162
  lo_tell64                 163
  lo_truncate64             164
  PQconninfo                165
+ PQhostaddr                166

*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 5188,5194 ****
  PQhost(const PGconn *conn)
  {
  	if (!conn)
  		return NULL;
! 	return conn->pghost ? conn->pghost : conn->pgunixsocket;
  }

  char *
--- 5188,5211 ----
  {
  	if (!conn)
  		return NULL;
! 	if (conn->pghost != NULL && conn->pghost[0] != '\0')
! 		return conn->pghost;
! 	else
! 	{
! #ifdef HAVE_UNIX_SOCKETS
! 		return conn->pgunixsocket;
! #else
! 		return DefaultHost;
! #endif
! 	}
! }
!
! char *
! PQhostaddr(const PGconn *conn)
! {
! 	if (!conn)
! 		return NULL;
! 	return conn->pghostaddr;
  }

  char *

*** a/src/interfaces/libpq/libpq-fe.h
--- b/src/interfaces/libpq/libpq-fe.h
***************
*** 301,306 **** extern char *PQdb(const PGconn *conn);
--- 301,307 ----
  extern char *PQuser(const PGconn *conn);
  extern char *PQpass(const PGconn *conn);
  extern char *PQhost(const PGconn *conn);
+ extern char *PQhostaddr(const PGconn *conn);
  extern char *PQport(const PGconn *conn);
  extern char *PQtty(const PGconn *conn);
  extern char *PQoptions(const PGconn *conn);
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere in between (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. Just keeping the slot in memory won't help us in any way - we'll continue to reserve resources while the slot is half-gone. I don't think trying to handle errors we don't understand and don't routinely expect actually improves robustness. It just leads to harder-to-diagnose errors. It's not like the cases here are likely to be caused by anything but severe admin failure like removing the write permissions of the postgres directory while the server is running. Or do you see more valid causes? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
Hello 2014/1/22 Dean Rasheed dean.a.rash...@gmail.com On 21 January 2014 22:28, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I have been mulling over this patch, and I can't seem to come to terms with it. I first started making it look nicer here and there, thinking it was all mostly okay, but eventually arrived at the idea that it seems wrong to do what this does: basically, get_object_address() tries to obtain an object address, and if that fails, return InvalidOid; then, in RemoveObjects, we try rather hard to figure out why that failed, and construct an error message. It seems to me that it would make more sense to have get_object_address somehow return a code indicating what failed; then we don't have to go all over through the parser code once more. Perhaps, for example, when missing_ok is given as true to get_object_address it also needs to get a pointer to ObjectType and a string; if some object does not exist then fill the ObjectType with the failing object and the string with the failing name. Then RemoveObjects can construct a string more easily. Not sure how workable this exact idea is; maybe there is a better way. Yeah, when I initially started reviewing this patch I had a very similar thought. But when I looked more deeply at the code beneath get_object_address, I started to doubt whether it could be done without rather extensive changes all over the place. Also get_object_address is itself called from a lot of places (not necessarily all in our code) and all the other places (in our code, at least) pass missing_ok=false. So it seemed rather ugly to change its signature and force a matching change in all those other places, which actually don't care about missing objects. Perhaps the answer would be to have a separate get_object_address_if_exists function, and remove the missing_ok flag from get_object_address, but that all felt like a much larger patch. 
In the end, I felt that Pavel's approach wasn't adding that much new code, and it's all localised in the one place that does actually tolerate missing objects. I thought about it too, but I didn't continue - the reasons were named by Dean - and RemoveObjects is not difficult code - a lot of it is mechanical - and it is not on a critical path. Regards Pavel I admit though, that I didn't explore the other approach very deeply, so perhaps it might fall out more neatly than I feared. Regards, Dean
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere inbetween (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. This doesn't make sense to me. unlink/rmdir are atomic operations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Dynamic Shared Memory stuff
On Wed, Jan 22, 2014 at 09:32:09AM -0500, Robert Haas wrote: On Tue, Jan 21, 2014 at 2:58 PM, Noah Misch n...@leadboat.com wrote: What do people prefer? I recommend performing cleanup on the control segment named in PGShmemHeader just before shmdt() in PGSharedMemoryCreate(). No new ERROR or WARNING sites are necessary. Have dsm_postmaster_startup() continue to perform a cleanup on the control segment named in the state file. I think I'm on board with the first two sentences of this, but after Fujii Masao's email yesterday, I can't help thinking that what you propose the third sentence is a bad idea. He cloned a master to create a standby server on the same machine, and the standby startup ate the master's dynamic shared memory. We could teach pg_basebackup not to copy the state file, but that wouldn't help people who take base backups using the file system copy method, which is a lot of people. I agree we should not rely on folks learning to omit the state file from base backups. Abandoning the state file is one way to resolve that, and the reasons I outlined for preferring to keep it were not overriding concerns. We could instead store a postmaster PID in dsm_control_header and only clean if that PID is dead. We could make DSM startup aware of whether we're using a backup label, but that would be awkward thanks to StartupXLOG() happening a good bit later. Yeah, abandoning the state file is looking attractive. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 8:25 PM, Robert Haas robertmh...@gmail.com wrote: While reviewing, I noted that the skipping missing configuration file message in ParseConfigFile() uses an elevel of LOG, while the other messages in the same file use elevel. I'm thinking that's a bug. It might not be right for all cases, but I think this is saving us in few cases when the caller (directly or indirectly) has sent elevel as ERROR or FATAL, in such cases we just want to LOG it, if strict is not true. Now it might not be appropriate if the caller has sent DEBUG2 and we use LOG, but to fix it (if this sounds an issue), some more analysis is required, so let's keep it as it is for now. The problem with this coding is that at SIGHUP, *all* the postgres processes will bleat about a missing file. To avoid that kind of log spam, the usual practice is to code the elog level as something like IsUnderPostmaster ? DEBUG2 : LOG; see the elevel setting in ProcessConfigFile() for precedent. In fact, I'm pretty sure that the elevel passed to ParseConfigFile should always have come from that logic. In any case, your argument is bogus because we could already have thrown an error at the given elevel further up in ParseConfigFile, or later in ParseConfigFp. None of this code is meant to receive an ERROR-or-higher elevel. So yes, this is broken and needs to be changed as Robert says. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: option --if-exists for pg_dump
Tomorrow I'll send updated version Regards Pavel 2014/1/21 Jeevan Chalke jeevan.cha...@enterprisedb.com Hi Pavel, Consider following test scenario: mydb=# \d emp Table public.emp Column | Type | Modifiers +-+--- empno | integer | not null deptno | integer | ename | text| Indexes: emp_pkey PRIMARY KEY, btree (empno) Foreign-key constraints: emp_deptno_fkey FOREIGN KEY (deptno) REFERENCES dept(deptno) mydb=# \d dept Table public.dept Column | Type | Modifiers +-+--- deptno | integer | not null dname | text| Indexes: dept_pkey PRIMARY KEY, btree (deptno) Referenced by: TABLE emp CONSTRAINT emp_deptno_fkey FOREIGN KEY (deptno) REFERENCES dept(deptno) mydb=# \q jeevan@ubuntu:~/pg_master$ ./install/bin/pg_dump -d mydb --if-exists -c mydb_ic.dmp I see following lines in dump which looks certainly wrong: === DROP FK CONSTRAINT IF EXISTS ublic.emp DROP CONSTRAINT emp_deptno_fkey; DROP CONSTRAINT IF EXISTS Y public.emp DROP CONSTRAINT emp_pkey; DROP CONSTRAINT IF EXISTS Y public.dept DROP CONSTRAINT dept_pkey; When try to restore, as expected it is throwing an error: === psql:mydb_ic.dmp:14: ERROR: syntax error at or near FK LINE 1: DROP FK CONSTRAINT IF EXISTS ublic.emp DROP CONSTRAINT emp_d... ^ psql:mydb_ic.dmp:15: ERROR: syntax error at or near CONSTRAINT LINE 1: DROP CONSTRAINT IF EXISTS Y public.emp DROP CONSTRAINT emp_p... ^ psql:mydb_ic.dmp:16: ERROR: syntax error at or near CONSTRAINT LINE 1: DROP CONSTRAINT IF EXISTS Y public.dept DROP CONSTRAINT dept... ^ Note: === Commands which are in form of ALTER TABLE ... DROP are failing. You need to test each and every object with DROP .. IF EXISTS command. Better write small test-case with all objects included. 
Following logic has flaw:
===
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 7fc0288..0677712 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -413,8 +413,30 @@ RestoreArchive(Archive *AHX)
 				/* Select owner and schema as necessary */
 				_becomeOwner(AH, te);
 				_selectOutputSchema(AH, te->namespace);
-				/* Drop it */
-				ahprintf(AH, "%s", te->dropStmt);
+
+				if (*te->dropStmt != '\0')
+				{
+					/* Inject IF EXISTS clause when it is required. */
+					if (ropt->if_exists)
+					{
+						char		buffer[40];
+						size_t		l;
+
+						/* But do it only, when it is not there yet. */
+						snprintf(buffer, sizeof(buffer), "DROP %s IF EXISTS",
+								 te->desc);
+						l = strlen(buffer);
+
+						if (strncmp(te->dropStmt, buffer, l) != 0)
+						{
+							ahprintf(AH, "DROP %s IF EXISTS %s",
+									 te->desc,
+									 te->dropStmt + l);
+						}
+						else
+							ahprintf(AH, "%s", te->dropStmt);
+					}
+				}
 			}
 		}

Also:
===
1. This is still out of sync.

@@ -348,6 +350,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(pgdumpopts, " --binary-upgrade");
 	if (column_inserts)
 		appendPQExpBufferStr(pgdumpopts, " --column-inserts");
+	if (if_exists)
+		appendPQExpBufferStr(pgdumpopts, " --if-exists");
 	if (disable_dollar_quoting)
 		appendPQExpBufferStr(pgdumpopts, " --disable-dollar-quoting");
 	if (disable_triggers)

2. Spell check required:

+	/* skip first n chars, and create a modifieble copy */
modifieble = modifiable

+	/* DROP IF EXISTS pattern is not appliable on dropStmt */
appliable = applicable

3.
+	/*
+	 * Object description is based on dropStmt statement. But
+	 * a drop statements can be enhanced about IF EXISTS clause.
+	 * We have to increase a offset in this case, IF EXISTS
+	 * should not be included on object description.
+	 */

Looks like you need to re-phrase these comments line. Something like:

/*
 * Object description is based on dropStmt statement which may have
 * IF EXISTS clause. Thus we need to update an offset such that it
 * won't be included in the object description.
 */

Or as per your choice. 
Need to have careful thought on a bug mentioned above. Thanks On Fri, Jan 17, 2014 at 6:23 PM, Pavel Stehule pavel.steh...@gmail.comwrote: Hello 2014/1/16 Jeevan Chalke jeevan.cha...@enterprisedb.com Hi Pavel, I have
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for register of event source which might be done before start of server. So? Anything which can know the value of a GUC parameter can certainly know the selected port number. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything = TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote inbetween replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. 
Which I don't think exists today, since we either read the commit record in its entirety or not at all. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Funny representation in pg_stat_statements.query.
Kyotaro HORIGUCHI wrote: Hello, Since this becomes more than a simple bug fix, I think it is too late for 9.4 now. I'll work on this in the longer term. OK. I will get around to it someday, but if you do it first, that's fine. Nevertheless of my words, the drive is getting diminished after the issue was settled:-p Anyway I'll keep this in my mind. This is a common theme with patch submitters: once you accept a good enough workaround, their motivation to write a real fix diminishes considerably, and as a result kluges live on. -- Álvaro Herrera   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On 2014-01-22 10:14:27 -0500, Robert Haas wrote: On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere in between (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. This doesn't make sense to me. unlink/rmdir are atomic operations. Yes, individual operations should be, but you cannot be sure whether a rename()/unlink() will survive a crash until the directory is fsync()ed. So, what is one going to do if the unlink succeeded, but the fsync didn't? Deletion currently works like:

if (rename(path, tmppath) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not rename \"%s\" to \"%s\": %m",
					path, tmppath)));

/* make sure no partial state is visible after a crash */
fsync_fname(tmppath, false);
fsync_fname("pg_replslot", true);

if (!rmtree(tmppath, true))
{
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not remove directory \"%s\": %m", tmppath)));
}

If we fail between the rename() and the fsync_fname() we don't really know which state we are in.
We'd also have to add code to handle incomplete slot directories, which currently only exists for startup, to other places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara j...@suse.cz wrote: But that still doesn't work out very well, because now the guy who does the write() has to wait for it to finish before he can do anything else. That's not always what we want, because WAL gets written out from our internal buffers for multiple different reasons. Well, you can always use AIO (io_submit) to submit direct IO without waiting for it to finish. But then you might need to track the outstanding IO so that you can watch with io_getevents() when it is finished. Yeah. That wouldn't work well for us; the process that did the io_submit() would want to move on to other things, and how would it, or any other process, know that the I/O had completed? As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. IMHO, the problem is simpler than that: no single process should be allowed to completely screw over every other process on the system. When the checkpointer process starts calling fsync(), the system begins writing out the data that needs to be fsync()'d so aggressively that service times for I/O requests from other process go through the roof. It's difficult for me to imagine that any application on any I/O scheduler is ever happy with that behavior. We shouldn't need to sprinkle of fsync() calls with special magic juju sauce that says hey, when you do this, could you try to avoid causing the rest of the system to COMPLETELY GRIND TO A HALT?. That should be the *default* behavior, if not the *only* behavior. Now, that is not to say that we're unwilling to sprinkle magic juju sauce if that's what it takes to solve this problem. 
If calling fadvise() or sync_file_range() or some new API that you invent at some point prior to calling fsync() helps the kernel do the right thing, we're willing to do that. Or if you/the Linux community wants to invent a new API fsync_but_do_not_crush_system() and have us call that instead of the regular fsync(), we're willing to do that, too. But I think there's an excellent case to be made, at least as far as checkpoint I/O spikes are concerned, that the API is just fine as it is and Linux's implementation is simply naive. We'd be perfectly happy to wait longer for fsync() to complete in exchange for not starving the rest of the system - and really, who wouldn't? Linux is a multi-user system, and apportioning resources among multiple tasks is a basic function of a multi-user kernel. /rant Anyway, if CFQ or any other Linux I/O scheduler gets an option to lower the priority of the fsyncs, I'm sure somebody here will test it out and see whether it solves this problem. AFAICT, experiments to date have pretty much universally shown CFQ to be worse than not-CFQ and everything else to be more or less equivalent - but if that changes, I'm sure many PostgreSQL DBAs will be more than happy to flip CFQ back on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi, On 22/01/14 12:40, Simon Riggs wrote: 1. I find a issue in following scenario: session 1 with process id X: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2 with process id Y: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; session 3 with process id Z: BEGIN; LOCK TABLE foo IN SHARE MODE; On execution of LOCK in session-3, as part of log it will display as: processes owning lock: X, Y But actually if we see Y has not yet own the lock, it is still waiting with higher priority. It may mislead user. May be we should change message to give all meaning i.e. which process is owning lock and Which process is already in queue. Perhaps this? CONTEXT: lock owner request queue XXX, XXX, XXX, etc Fixed. 2. Can we give a better name to new variable 'buf1'? Fixed. 3. Do we need to take performance reading to see if any impact? Don't think so. Diagnosing problems will help performance, not hinder it I agree. And this code path will only get executed when log_lock_waits = on, which seems to be a debugging method to me. 
Best regards,

-- Christian Kruse   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..588c4ef 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 	{
-		StringInfoData buf;
+		StringInfoData buf,
+					lock_waiters_sbuf,
+					lock_holders_sbuf;
 		const char *modename;
 		long		secs;
 		int			usecs;
 		long		msecs;
+		SHM_QUEUE  *procLocks;
+		PROCLOCK   *proclock;
+		bool		first_holder = true,
+					first_waiter = true;
+		int			lockHoldersNum = 0;
 
 		initStringInfo(&buf);
+		initStringInfo(&lock_waiters_sbuf);
+		initStringInfo(&lock_holders_sbuf);
+
 		DescribeLockTag(&buf, &locallock->tag.lock);
 		modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
 								   lockmode);
@@ -1211,10 +1221,58 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		msecs = secs * 1000 + usecs / 1000;
 		usecs = usecs % 1000;
 
+		LWLockAcquire(partitionLock, LW_SHARED);
+
+		procLocks = &(lock->procLocks);
+		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+											 offsetof(PROCLOCK, lockLink));
+
+		while (proclock)
+		{
+			/*
+			 * we are a waiter if myProc->waitProcLock == proclock; we are
+			 * a holder if it is NULL or something different
+			 */
+			if (proclock->tag.myProc->waitProcLock == proclock)
+			{
+				if (first_waiter)
+				{
+					appendStringInfo(&lock_waiters_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_waiter = false;
+				}
+				else
+					appendStringInfo(&lock_waiters_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+			}
+			else
+			{
+				if (first_holder)
+				{
+					appendStringInfo(&lock_holders_sbuf, "%d",
									 proclock->tag.myProc->pid);
+					first_holder = false;
+				}
+				else
+					appendStringInfo(&lock_holders_sbuf, ", %d",
									 proclock->tag.myProc->pid);
+
+				lockHoldersNum++;
+			}
+
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+												 offsetof(PROCLOCK, lockLink));
+		}
+
+		LWLockRelease(partitionLock);
 
 		if (deadlock_state == DS_SOFT_DEADLOCK)
 			ereport(LOG,
 					(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
+								 lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 		else if (deadlock_state == DS_HARD_DEADLOCK)
 		{
 			/*
@@ -1226,13 +1284,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 */
 			ereport(LOG,
 					(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
+								 lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 		}
 
 		if (myWaitStatus == STATUS_WAITING)
 			ereport(LOG,
 					(errmsg("process %d still waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Christian Kruse wrote: I think this could use some more comments -- for instance at the top of the while loop, explain what its purpose is.

if (deadlock_state == DS_SOFT_DEADLOCK)
	ereport(LOG,
			(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-					MyProcPid, modename, buf.data, msecs, usecs)));
+					MyProcPid, modename, buf.data, msecs, usecs),
+			 (errcontext(ngettext("process owning lock: %s request queue: %s",
+								  "processes owning lock: %s request queue: %s",
+								  lockHoldersNum),
+						 lock_holders_sbuf.data, lock_waiters_sbuf.data))));

This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it:

char *errcxt = NULL;

while ( ... )
{
	...
	errcxt = ngettext("processes owning lock: ..");
}

ereport(LOG,
		(errmsg("blah blah"),
		 errcxt != NULL ? errcontext(errcxt) : 0));

That would avoid the repetition. -- Álvaro Herrera   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WAL replay should fdatasync() segments?
Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. I think the correct way to handle this would be to fsync() segments we read from pg_xlog/ during recovery. Am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
Andres Freund and...@2ndquadrant.com wrote: A rather common and sensible configuration is to have a common configuration file used across servers, which then is overwritten by a per-server or per-cluster config file containing values specific to a server/cluster. Agreed. My preference would be to not generate noise for interim states; just report net changes. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On 2014-01-10 13:11:32 -0500, Robert Haas wrote: OK, I've implemented this: here's what I believe to be a complete patch, based on the previous lwlock-pointers.patch but now handling LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose ends. I think this should be adequate for allowing lwlocks to be stored elsewhere in the main shared memory segment as well as in dynamic shared memory segments. Thoughts? Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? Requiring code that has worked for several versions to sprinkle version-specific ifdefs seems unnecessary here. I see the comments still claim

+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)

I don't think that's true anymore with the addition of the tranche id. The new minimum size seems to be 20 on 32bit platforms. That could be fixed by making the tranche id a uint8, but I don't think that's necessary, so just removing the support for 32 seems sufficient. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no behavior that wasn't there, unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). 
So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay should fdatasync() segments?
On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Alvaro Herrera alvhe...@2ndquadrant.com writes: This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it: Would that not break the compiler's ability to verify the format codes in the string? Not to mention make it harder for people to compare format to arguments, too? However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay should fdatasync() segments?
On 2014-01-23 02:05:48 +0900, Fujii Masao wrote: On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Yea, but if the walreceiver receives data and crashes/disconnects before fsync(), we'll read it from pg_xlog, right? And if we promote, we'll start inserting new records before establishing a new checkpoint. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
On 2014-01-22 12:10:30 -0500, Tom Lane wrote: Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. Wouldn't bother me overly much. But it doesn't seem too hard to avoid it. Simply split the current ProcessConfigFile() into one more phase. Currently we seem to be parsing the config file, then iterate over the list returned from that to set options. If we'd instead have one pass over the items storing the new value in a new config_generic variable, after checking, and then a second pass over all gucs setting those that have changed we should end up at the same complexity since it could be merged with the reset pass. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. I'm not happy about complicating that logic even more. I think the reasonable choices here are to reword that message somehow, or just drop it completely. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We already spent a lot of time with compression. Now we need to figure out the result we want see. I spent quite long time debugging varbyte encoding without segments. Finally, it passed very many tests without any problems. Now, it is just piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because code doesn't look perfect. For sure, it's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we have also subsequent fast-scan which is very needed for hstore and other application. I propose to do final decisions now and concentrate forces on making committable patch with these decisions. And don't change these decisions even if somebody have idea how to improve code readability in 100 times and potential extendability in 1000 times. I propose following decisions: 1) Leave uncompressed area, allow duplicates in it 2) Don't introduce Items on page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. I thought about this but figured it was too much of a misnomer to refer to a pointer as an ID. But, if we're sure we want to go that route, I can go revise the patch along those lines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On 2014-01-22 12:40:34 -0500, Robert Haas wrote: On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. I thought about this but figured it was too much of a misnomer to refer to a pointer as an ID. But, if we're sure we want to go that route, I can go revise the patch along those lines. I personally don't care either way for internal code as long as external code continues to work. There's the argument of making the commit better readable by having less noise and less divergence in the branches and there's your argument of that being less clear. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: hide application_name from other users
On 01/21/2014 05:21 PM, Andres Freund wrote: I think the only realistic thing is a monitoring capability, like we have replication. GRANT/REVOKE doesn't even come close to being able to generically allow to grant permissions of even the moderate complexity pg_stat_get_activity() has. That would work for me, personally. I don't know how it would work for anyone else. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] new json funcs
On 01/21/2014 06:21 PM, Marko Tiikkaja wrote: Hi Andrew, On 1/18/14, 10:05 PM, I wrote: But I'll continue with my review now that this has been sorted out. Sorry about the delay. I think the API for the new functions looks good. They are all welcome additions to the JSON family. The implementation side looks reasonable to me. I'm not sure there's need to duplicate so much code, though. E.g. json_to_recordset is almost identical to json_populate_recordset, and json_to_record has a bit of the same disease. Finally, (as I'm sure you know already), docs are still missing. Marking the patch Waiting on Author for the time being. New patch attached. Main change is I changed json_populate_record/json_to_record to call a common worker function, and likewise with json_populate_recordset/json_to_recordset. We're still finalizing the docs - should be ready in the next day or so. cheers andrew

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 481db16..ef5e125 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -68,6 +68,10 @@ static void array_dim_to_json(StringInfo result, int dim, int ndims, int *dims,
 				  bool use_line_feeds);
 static void array_to_json_internal(Datum array, StringInfo result,
 					   bool use_line_feeds);
+static void datum_to_json(Datum val, bool is_null, StringInfo result,
+			  TYPCATEGORY tcategory, Oid typoutputfunc, bool key_scalar);
+static void add_json(Datum orig_val, bool is_null, StringInfo result,
+		 Oid val_type, bool key_scalar);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -1219,7 +1223,7 @@ extract_mb_char(char *s)
  */
 static void
 datum_to_json(Datum val, bool is_null, StringInfo result,
-			  TYPCATEGORY tcategory, Oid typoutputfunc)
+			  TYPCATEGORY tcategory, Oid typoutputfunc, bool key_scalar)
 {
 	char	   *outputstr;
 	text	   *jsontext;
@@ -1241,24 +1245,32 @@ datum_to_json(Datum val, bool is_null, StringInfo result,
 			composite_to_json(val, result, false);
 			break;
 		case TYPCATEGORY_BOOLEAN:
-			if (DatumGetBool(val))
-				appendStringInfoString(result, "true");
+			if (!key_scalar)
+				appendStringInfoString(result, DatumGetBool(val) ? "true" : "false");
 			else
-				appendStringInfoString(result, "false");
+				escape_json(result, DatumGetBool(val) ? "true" : "false");
 			break;
 		case TYPCATEGORY_NUMERIC:
 			outputstr = OidOutputFunctionCall(typoutputfunc, val);
-
-			/*
-			 * Don't call escape_json here if it's a valid JSON number.
-			 */
-			dummy_lex.input = *outputstr == '-' ? outputstr + 1 : outputstr;
-			dummy_lex.input_length = strlen(dummy_lex.input);
-			json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error);
-			if (!numeric_error)
-				appendStringInfoString(result, outputstr);
-			else
+			if (key_scalar)
+			{
+				/* always quote keys */
 				escape_json(result, outputstr);
+			}
+			else
+			{
+				/*
+				 * Don't call escape_json for a non-key if it's a valid JSON
+				 * number.
+				 */
+				dummy_lex.input = *outputstr == '-' ? outputstr + 1 : outputstr;
+				dummy_lex.input_length = strlen(dummy_lex.input);
+				json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error);
+				if (!numeric_error)
+					appendStringInfoString(result, outputstr);
+				else
+					escape_json(result, outputstr);
+			}
 			pfree(outputstr);
 			break;
 		case TYPCATEGORY_JSON:
@@ -1276,6 +1288,10 @@ datum_to_json(Datum val, bool is_null, StringInfo result,
 			break;
 		default:
 			outputstr = OidOutputFunctionCall(typoutputfunc, val);
+			if (key_scalar && *outputstr == '\0')
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("key value must not be empty")));
 			escape_json(result, outputstr);
 			pfree(outputstr);
 			break;
@@ -1309,7 +1325,7 @@ array_dim_to_json(StringInfo result, int dim, int ndims, int *dims, Datum *vals,
 		if (dim + 1 == ndims)
 		{
 			datum_to_json(vals[*valcount], nulls[*valcount], result, tcategory,
-						  typoutputfunc);
+						  typoutputfunc, false);
 			(*valcount)++;
 		}
 		else
@@ -1490,13 +1506,85 @@ composite_to_json(Datum composite, StringInfo result, bool use_line_feeds)
 		else
 			tcategory = TypeCategory(tupdesc->attrs[i]->atttypid);
 
-		datum_to_json(val, isnull, result, tcategory, typoutput);
+		datum_to_json(val, isnull, result, tcategory, typoutput, false);
 	}
 
 	appendStringInfoChar(result, '}');
 	ReleaseTupleDesc(tupdesc);
 }
 
+static void
+add_json(Datum orig_val, bool is_null, StringInfo result, Oid val_type, bool key_scalar)
+{
+	Datum		val;
+	TYPCATEGORY tcategory;
+	Oid			typoutput;
+	bool		typisvarlena;
+	Oid			castfunc = InvalidOid;
+
+	if (val_type == InvalidOid)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("could not determine input data type")));
+
+	getTypeOutputInfo(val_type, &typoutput, &typisvarlena);
+
+	if (val_type > FirstNormalObjectId)
+	{
+		HeapTuple	tuple;
+		Form_pg_cast castForm;
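The rule the patch's `key_scalar` flag enforces can be shown with a toy illustration (not the patch's code, and no PostgreSQL internals): in key position every scalar must be emitted as a quoted string, even one that would be a bare JSON number or boolean in value position.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the key_scalar behaviour above: keys are always
 * quoted, values may be emitted bare (e.g. a valid JSON number).
 * Uses a static buffer purely for brevity. */
static const char *
emit_scalar(const char *outputstr, int key_scalar)
{
	static char buf[64];

	if (key_scalar)
		snprintf(buf, sizeof(buf), "\"%s\"", outputstr);	/* always quote keys */
	else
		snprintf(buf, sizeof(buf), "%s", outputstr);		/* bare in value position */
	return buf;
}
```

So `42` becomes `"42"` when used as an object key but stays `42` as a value, which is exactly the asymmetry the `key_scalar` parameter threads through `datum_to_json`.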
Re: [HACKERS] WAL replay should fdatasync() segments?
On Thu, Jan 23, 2014 at 2:08 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-23 02:05:48 +0900, Fujii Masao wrote: On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Yea, but if the walreceiver receives data and crashes/disconnects before fsync(), we'll read it from pg_xlog, right? And if we promote, we'll start inserting new records before establishing a new checkpoint. Yeah, true. Such unflushed WAL file can be read by the subsequent recovery... Regards, -- Fujii Masao
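The invariant under discussion can be sketched in miniature: a record must be fdatasync()ed before anything that depends on it is made durable, otherwise an OS crash can keep newer records while losing older ones. This is a minimal sketch, not PostgreSQL code; the path and record format are illustrative only.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one record and force it to stable storage before returning.
 * The proposal above amounts to doing this for replayed-but-unsynced
 * WAL before any *new* records are written after it. */
static int
append_record_durably(const char *path, const char *rec)
{
	int			fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, rec, strlen(rec)) != (ssize_t) strlen(rec) ||
		fdatasync(fd) != 0)		/* data (not all metadata) to the platter */
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```

fdatasync() rather than fsync() suffices here because WAL segments are preallocated, so only the data blocks, not file metadata, need flushing.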
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On Wed, Jan 22, 2014 at 10:48 AM, Andres Freund and...@2ndquadrant.com wrote: Yes, individual operations should be, but you cannot be sure whether a rename()/unlink() will survive a crash until the directory is fsync()ed. So, what is one going to do if the unlink succeeded, but the fsync didn't? Well, apparently, one is going to PANIC and reinitialize the system. I presume that upon reinitialization we'll decide that the slot is gone, and thus won't recreate it in shared memory. Of course, if the entire system suffers a hard power failure after that and before the directory is successfully fsync'd, then the slot could reappear on the next startup. Which is also exactly what would happen if we removed the slot from shared memory after doing the unlink, and then the system suffered a hard power failure before the directory contents made it to disk. Except that we also panicked. In the case of shared buffers, the way we handle fsync failures is by not allowing the system to checkpoint until all of the fsyncs succeed. If there's an OS-level reset before that happens, WAL replay will perform the same buffer modifications over again and the next checkpoint will again try to flush them to disk and will not complete unless it does. That forms a closed system where we never advance the redo pointer over the covering WAL record until the changes it covers are on the disk. But I don't think this code has any similar interlock; if it does, I missed it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
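Andres's durability point can be sketched directly: an unlink() is not guaranteed to survive a crash until the containing directory itself has been fsync()ed. A minimal sketch, with illustrative paths (not the slot code itself):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Helper for the demo below: create an empty file to remove. */
static int
create_empty(const char *path)
{
	int			fd = open(path, O_WRONLY | O_CREAT, 0600);

	return (fd < 0) ? -1 : close(fd);
}

/* Remove a file and persist the removal: the unlink() only becomes
 * crash-durable once the directory entry change is fsync()ed. */
static int
durable_unlink(const char *path, const char *dir)
{
	int			dirfd;

	if (unlink(path) != 0)
		return -1;
	dirfd = open(dir, O_RDONLY);	/* a directory fd can be fsync()ed */
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)			/* flush the removed directory entry */
	{
		close(dirfd);
		return -1;
	}
	return close(dirfd);
}
```

The failure mode in the thread sits between the two steps: if the process dies after `unlink()` but before the directory fsync, a power failure can resurrect the file on the next boot.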
Re: [HACKERS] proposal: hide application_name from other users
On 01/21/2014 05:22 PM, Mark Kirkwood wrote: If said malicious attacker can log into postgres and issue its own queries, and connect to other database then you are in serious trouble already. I also wonder that if such an attacker knows the application name, that would suggest that they have access to the application server and are able to read its config files...which would probably also contain the host and db name too (and possibly the password in some unfortunate cases)! Common case: Multitenant shared hosting on a public cloud. 1. attacker writes a tool which exploits VulnerableApplication and takes it over. 2. they exploit SiteX, running that web app. 3. using SiteX's database credentials, they check pg_stat_activity and see what other hosts are running VulnerableApplication. 4. They then infect the other hosts running VulnerableApplication. Alternately: 4. They use VulnerableApplication's predictable password-generating flaw to log into the other databases, or to try the default password which ships with the app. However, thinking about the case above, there are a number of caveats and workarounds which make the above not that interesting of an exploit case: A. it's easy to find VulnerableApplication simply by scanning the web. Easier, in fact, than the above, if you have an IP block to start with, and you would. B. Most applications don't actually set application-specific application names anyway (psycopg2, libpq). C. It would be trivially easy for a DBA concerned about security to obfuscate application names in a way which would not be easy for an attacker to analyze. D. Don't use default passwords. Also, the attacker merely needs to try each database in turn anyway. 
Given the above, I think this specific patch falls into the broad class of things we would like to have in a multi-tenant toolkit (and is far from the most useful of those), which would include: * hiding application_name, user_name, and database_names from users of other databases * local superuser who can create per-database users and extensions from an approved list * ability to block users from changing some resource limits (work_mem, for example). * per-database logging (could be done with log hooks, just needs a contrib). It seems to me that it's not terribly useful to fix one item on the above list without having at least a *plan* to address the others. This really needs to be part of a comprehensive system, not piecework, or we'll end up with a bunch of little security options which don't work together. Probably Heroku has some more specific exploit case to be concerned about here; if so, might I suggest taking it up with the -security list? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: Actually, I think the whole thing is rather badly designed, as warm cache or no you're still proposing to do thousands of kernel calls while holding a potentially contended LWLock. I suggest what we do is (1) read in the whole file, (2) acquire the lock, (3) re-read the whole file in the same buffer, (4) work from the buffer. fseeko() and fread() are part of the standard library, not any kernel. I don't believe that what I've done in qtext_fetch() implies thousands of kernel calls, or thousands of context switches. I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. It seems that glibc is actually content to always make lseek() and read() syscalls in the event of an fseek(), even though it does maintain its own buffer and could in principle have a lot more gumption about that [1]. I agree that we should copy everything into a buffer in one go. Yeah, that was what I thought might happen. Even if glibc were smarter, a lot of other libc implementations aren't. Also, ISTM that traversal of the hash table will generally lead to more-or-less-random access into the text file. So unless you assume that libc has mmap'd the whole file, it'd have to do a lot of kernel calls anyway to get different pages of the file into its buffer at different times. I think that read the whole file is probably going to net out not any more complex as far as our code goes, and it'll be making a lot fewer assumptions about how smart the libc level is and what's happening at the kernel level. I'll have a go at coding it, anyway. I think that making the LWLocking duration as brief as possible isn't critically important. Maybe, but I don't like the idea that calling pg_stat_statements() frequently could result in random glitches in response time for other queries. 
Automated tools will only be interested in new query texts (implying possible I/O while share locking) as they observe new entries. This argument would be more plausible if there were a way to fetch the text for a previously-observed entry without invoking pg_stat_statements(). I'm surprised you've not submitted a patch to add such a function. regards, tom lane
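Tom's "read the whole file" suggestion can be sketched simply: slurp the query-text file into one buffer before taking the lock (and re-read it into the same buffer after acquiring it), so no per-entry fseeko()/fread() syscalls happen while the LWLock is held. A minimal sketch with error handling trimmed; names and the `/tmp` path are illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read an entire file into one malloc'd, NUL-terminated buffer.
 * Hash-table traversal can then index into the buffer with plain
 * pointer arithmetic instead of seek-and-read per entry. */
static char *
slurp_file(const char *path, long *len)
{
	FILE	   *fp = fopen(path, "rb");
	char	   *buf = NULL;

	if (fp == NULL)
		return NULL;
	if (fseek(fp, 0, SEEK_END) == 0 && (*len = ftell(fp)) >= 0)
	{
		rewind(fp);
		buf = malloc(*len + 1);
		if (fread(buf, 1, *len, fp) == (size_t) *len)
			buf[*len] = '\0';
		else
		{
			free(buf);
			buf = NULL;
		}
	}
	fclose(fp);
	return buf;
}

/* Demo helper: write a file, slurp it back, compare. */
static int
roundtrip_ok(const char *path, const char *data)
{
	FILE	   *fp = fopen(path, "wb");
	long		len;
	char	   *buf;
	int			ok;

	if (fp == NULL || fputs(data, fp) < 0 || fclose(fp) != 0)
		return 0;
	buf = slurp_file(path, &len);
	ok = (buf != NULL && strcmp(buf, data) == 0);
	free(buf);
	return ok;
}
```

This also sidesteps the libc-buffering question entirely: one sequential read replaces the random-access pattern the hash-table order would otherwise impose.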
Re: [HACKERS] Incorrectly reporting config errors
Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. It doesn't bother me any. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. I'm not happy about complicating that logic even more. I think the reasonable choices here are to reword that message somehow, or just drop it completely. I agree. No strong preference which. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
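The "drop redundant assignments" idea Tom describes can be sketched as a toy (not the GUC code): an entry is reported only if no later line assigns the same name, so `shared_buffers = 'oops'` followed by `shared_buffers = '128MB'` draws no error. The naive scan is O(N^2) in the number of entries, exactly as noted; names and values are illustrative.

```c
#include <assert.h>
#include <string.h>

typedef struct ConfEntry
{
	const char *name;
	const char *value;
} ConfEntry;

/* An entry is "effective" (and worth checking/reporting) only if no
 * later line assigns the same variable. */
static int
is_effective(const ConfEntry *list, int n, int i)
{
	int			j;

	for (j = i + 1; j < n; j++)
		if (strcmp(list[j].name, list[i].name) == 0)
			return 0;			/* overridden by a later assignment */
	return 1;
}

/* Demo: only the last shared_buffers line survives. */
static int
demo_last_wins(void)
{
	static const ConfEntry conf[] = {
		{"shared_buffers", "oops"},
		{"work_mem", "4MB"},
		{"shared_buffers", "128MB"},
	};

	return !is_effective(conf, 3, 0) && is_effective(conf, 3, 2);
}
```

For a few hundred config lines the quadratic scan is negligible, which is presumably why Tom considers it "still cheap enough in practice".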
Re: [HACKERS] proposal: hide application_name from other users
Stephen Frost sfr...@snowman.net wrote: Harold Giménez (har...@heroku.com) wrote: This is just an example of how running monitoring as superuser is not necessarily the worst thing, and there are other reasons to do it already. It's a horrible thing and that isn't a good reason- if my database isn't accepting connections, I probably don't care one bit how bloated a table is. Indeed, I care *more* that I'm out of connections and would want to know that ASAP. +1 -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi, attached you will find a new version of the patch containing more comments. On 22/01/14 12:00, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it: Would that not break the compiler's ability to verify the format codes in the string? Not to mention make it harder for people to compare format to arguments, too? I agree. However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. Fixed in attached patch. I changed it from calling errorcontext(ngettext()) to calling errdetail_plural(). Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 122afb2..552c5a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		 */
 		if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 		{
-			StringInfoData buf;
+			StringInfoData buf,
+						lock_waiters_sbuf,
+						lock_holders_sbuf;
 			const char *modename;
 			long		secs;
 			int			usecs;
 			long		msecs;
+			SHM_QUEUE  *procLocks;
+			PROCLOCK   *proclock;
+			bool		first_holder = true,
+						first_waiter = true;
+			int			lockHoldersNum = 0;
 
 			initStringInfo(&buf);
+			initStringInfo(&lock_waiters_sbuf);
+			initStringInfo(&lock_holders_sbuf);
+
 			DescribeLockTag(&buf, &locallock->tag.lock);
 			modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
 									   lockmode);
@@ -1211,10 +1221,67 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			msecs = secs * 1000 + usecs / 1000;
 			usecs = usecs % 1000;
 
+			LWLockAcquire(partitionLock, LW_SHARED);
+
+			/*
+			 * we loop over the lock's procLocks to gather a list of all
+			 * holders and waiters. Thus we will be able to provide more
+			 * detailed information for lock debugging purposes.
+			 *
+			 * lock->procLocks contains all processes which hold or wait for
+			 * this lock.
+			 */
+
+			procLocks = &(lock->procLocks);
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+												 offsetof(PROCLOCK, lockLink));
+
+			while (proclock)
+			{
+				/*
+				 * we are a waiter if myProc->waitProcLock == proclock; we are
+				 * a holder if it is NULL or something different
+				 */
+				if (proclock->tag.myProc->waitProcLock == proclock)
+				{
+					if (first_waiter)
+					{
+						appendStringInfo(&lock_waiters_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_waiter = false;
+					}
+					else
+						appendStringInfo(&lock_waiters_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+				}
+				else
+				{
+					if (first_holder)
+					{
+						appendStringInfo(&lock_holders_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_holder = false;
+					}
+					else
+						appendStringInfo(&lock_holders_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+
+					lockHoldersNum++;
+				}
+
+				proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+													 offsetof(PROCLOCK, lockLink));
+			}
+
+			LWLockRelease(partitionLock);
+
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
 						(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			else if (deadlock_state == DS_HARD_DEADLOCK)
 			{
 				/*
@@ -1226,13 +1293,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 				 */
 				ereport(LOG,
 						(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			}
 
 			if (myWaitStatus == STATUS_WAITING)
 				ereport(LOG,
 						(errmsg("process %d still waiting for %s on %s after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum,
Re: [HACKERS] plpgsql.consistent_into
On 1/15/14, 12:35 AM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: Do we actually support = right now? We already support v_field := field FROM table ... ; and I think it's a bad idea to have different meaning for = and :=. That ship sailed a *very* long time ago. See other thread about documenting rather than ignoring this more-or-less-aboriginal behavior of plpgsql. Yeah, I had no idea that was even supported... I have no idea if this is related or not, but I would REALLY like for this to work (doesn't in 8.4, AFAIK not in 9.1 either...) Hm ... too tired to be sure, but I think the issue about inlining a function of this kind has to do with whether you get the same answers in corner cases such as subselect fetching no rows. There was some discussion about this a few years ago and I think that was essentially the issue. What I think would work is essentially a macro that would dump the function definition right into the query and then let the planner deal with it. So SELECT blah, ( SELECT status_code FROM status_code WHERE status_code_id = blah_status_code_id ) FROM blah; can become simply SELECT blah, status_code__get_text( blah_status_code_id ) FROM blah; but have it translate to the same raw SQL, same as views. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Christian Kruse christ...@2ndquadrant.com writes: However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. Fixed in attached patch. I changed it from calling errorcontext(ngettext()) to calling errdetail_plural(). Well, is it context or detail? Those fields have reasonably well defined meanings IMO. If we need errcontext_plural, let's add it, not adopt inferior solutions just because that isn't there for lack of previous need. But having said that, I think this is indeed detail not context. (I kinda wonder whether some of the stuff that's now in the primary message shouldn't be pushed to errdetail as well. It looks like some previous patches in this area have been lazy.) While I'm griping, this message isn't even trying to follow the project's message style guidelines. Detail or context messages are supposed to be complete sentence(s), with capitalization and punctuation to match. Lastly, is this information that we want to be shipping to clients? Perhaps from a security standpoint that's not such a wise idea, and errdetail_log() is what should be used. regards, tom lane
Re: [HACKERS] proposal: hide application_name from other users
On Tue, Jan 21, 2014 at 7:38 PM, Stephen Frost sfr...@snowman.net wrote: * Harold Giménez (har...@heroku.com) wrote: Definitely agree with you. This is just an example of how running monitoring as superuser is not necessarily the worst thing, and there are other reasons to do it already. It's a horrible thing and that isn't a good reason- if my database isn't accepting connections, I probably don't care one bit how bloated a table is. Indeed, I care *more* that I'm out of connections and would want to know that ASAP. This is a silly argument. When things are failing that's precisely when you need your monitoring system to work. Of course you are interested in the fact that you're out of connections but you're trying to figure out *why* you're out of connections. Maybe you have queries consuming all iops, maybe disks have filled up and that's cascaded into a postgres problem, maybe you have lots of similar queries running, etc. The first step in debugging is always gathering data which is why it's so frustrating if the first thing to go down is your monitoring software. -- greg
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: On Fri 17-01-14 08:57:25, Robert Haas wrote: On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote: So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below. Also, is the WAL actually ever read under normal (non-recovery) conditions or is it write-only under normal operation? If it's seldom read, then using DIO for them also avoids some double buffering since they wouldn't go through pagecache. This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink. OK, yes, this is hard to fix with direct IO. Actually, it's not. Block level caching is the time-honoured answer to this problem, and it's been used very successfully on a large scale by many organisations. e.g. facebook with MySQL, O_DIRECT, XFS and flashcache sitting on an SSD in front of rotating storage. There's multiple choices for this now - bcache, dm-cache, flahscache, etc, and they all solve this same problem. And in many cases do it better than using the page cache because you can independently scale the size of the block level cache... And given the size of SSDs these days, being able to put half a TB of flash cache in front of spinning disks is a pretty inexpensive way of solving such IO problems If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. 
But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime ctime of files will be changed often, the will be a case very often. Therefore: O_DATASYNC. Maybe it'll be useful to have hints that say always write this file to disk as quick as you can and always postpone writing this file to disk for as long as you can for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. And I don't see it being implemented anywhere else because it's the priority aware scheduling infrastructure in CFQ that causes all the problems with IO concurrency and scalability... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed 22-01-14 09:07:19, Dave Chinner wrote: On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime ctime of files will be changed often, the will be a case very often. Therefore: O_DATASYNC. O_DSYNC to be exact. Maybe it'll be useful to have hints that say always write this file to disk as quick as you can and always postpone writing this file to disk for as long as you can for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. And I don't see it being implemented anywhere else because it's the priority aware scheduling infrastructure in CFQ that causes all the problems with IO concurrency and scalability... 
So CFQ has all sorts of problems but I never had the impression that priority aware scheduling is the culprit. It is all just complex - sync idling, seeky writer detection, cooperating threads detection, sometimes even sync vs async distinction isn't exactly what one would want. And I'm not speaking about the cgroup stuff... So it doesn't seem to me that some other IO scheduler couldn't reasonably efficiently implement stuff like IO scheduling classes. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Fri 17-01-14 08:57:25, Robert Haas wrote: On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote: So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below. Also, is the WAL actually ever read under normal (non-recovery) conditions or is it write-only under normal operation? If it's seldom read, then using DIO for them also avoids some double buffering since they wouldn't go through pagecache. This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink. OK, yes, this is hard to fix with direct IO. However, if you configure wal_sync_method=open_sync and disable replication, then you will in fact get O_DIRECT|O_SYNC behavior. But that still doesn't work out very well, because now the guy who does the write() has to wait for it to finish before he can do anything else. That's not always what we want, because WAL gets written out from our internal buffers for multiple different reasons. Well, you can always use AIO (io_submit) to submit direct IO without waiting for it to finish. But then you might need to track the outstanding IO so that you can watch with io_getevents() when it is finished. If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. 
Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime & ctime of files will be changed often, that will be the case very often. Again, I think this discussion would really benefit from an outline of the different files used by pgsql, and what sort of data access patterns you expect with them. I think I more or less did that in my previous email, but here it is again in briefer form:
- WAL files are written (and sometimes read) sequentially and fsync'd very frequently, and it's always good to write the data out to disk as soon as possible
- Temp files are written and read sequentially and never fsync'd. They should only be written to disk when memory pressure demands it (but are a good candidate when that situation comes up)
- Data files are read and written randomly. They are fsync'd at checkpoint time; between checkpoints, it's best not to write them sooner than necessary, but when the checkpoint arrives, they all need to get out to the disk without bringing the system to a standstill
We have other kinds of files, but off-hand I'm not thinking of any that are really very interesting, apart from those. Maybe it'll be useful to have hints that say "always write this file to disk as quickly as you can" and "always postpone writing this file to disk for as long as you can" for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it.
We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoints might actually be the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with the CFQ IO scheduler. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On 22/01/14 03:16, Jon Nelson wrote: Greetings -hackers: I have worked up a patch to PostgreSQL which elides tuples during an external sort. The primary use case is when sorted input is being used to feed a DISTINCT operation. The idea is to throw out tuples that compare as identical whenever it's convenient, predicated on the assumption that even a single I/O is more expensive than some number of (potentially extra) comparisons. Obviously, this is where a cost model comes in, which has not been implemented. This patch is a work-in-progress. Dedup-in-sort is also done by my WIP internal merge sort, and extended (in much the same ways as Jon's) to the external merge. https://github.com/j47996/pgsql_sorb I've not done a cost model either, but the dedup capability is exposed from tuplesort.c to the executor, and downstream uniq nodes removed. I've not worked out yet how to eliminate upstream hashagg nodes, which would be worthwhile from testing results. -- Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: A rough summary of the patch follows:
- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of distinct columns is equal to the number of sort columns (and both are greater than zero)
- in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply throw out the new tuple if it compares as identical to the tuple at the top of the heap. Since we're already performing this comparison, this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical to the *last written tuple*. This is a comparison that did not take place before, so it's not free, but it saves a write I/O.
  + we perform the same logic in dumptuples
[ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. -- Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Jeremy Harris j...@wizmail.org writes: On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: - in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first [ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. Hm, OK, maybe I misread what he said there. Still, if we've told tuplesort to remove duplicates, why shouldn't we expect it to have done the job? Passing the data through a useless Unique step is not especially cheap. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... I simply ran the standard regression tests, and then straced a backend as it executed the pgss-calling query. I'm not sure that I've understood your question. As previously noted, my approach to testing this patch involved variations of:
- running make installcheck-parallel
- concurrently, running pgbench with multiple clients each calling pg_stat_statements() and pg_stat_statements_reset()
The latter was sometimes rigged to do a direct garbage collection to stress test things. My expectation was that the sanity checking done everywhere would complain if there were any race conditions or other bugs. This was pretty effective as a smoke test during development. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... I simply ran the standard regression tests, and then straced a backend as it executed the pgss-calling query. I'm not sure that I've understood your question. Hm, are there 5K distinct queries in the regression tests? I'd have thought they weren't enough to fill a large pgss hash. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeremy Harris j...@wizmail.org writes: On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: - in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first [ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. Hm, OK, maybe I misread what he said there. Still, if we've told tuplesort to remove duplicates, why shouldn't we expect it to have done the job? Passing the data through a useless Unique step is not especially cheap. That's correct - I do not propose to drop the unique step. Duplicates are only dropped if it's convenient to do so. In one case, it's a zero-cost drop (no extra comparison is made). In most other cases, an extra comparison is made, typically right before writing a tuple to tape. If it compares as identical to the previously-written tuple, it's thrown out instead of being written. The output of the modified code is still sorted, still *might* (and in most cases, probably will) contain duplicates, but will (probably) contain fewer duplicates. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Wed, Jan 22, 2014 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, are there 5K distinct queries in the regression tests? I'd have thought they weren't enough to fill a large pgss hash. Well, yes. They're a really bad case for pg_stat_statements, in that almost all distinct queries are only executed once or twice and yet there are enough of them. I think this is generally useful for testing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime & ctime of files will be changed often, that will be the case very often. Also, there is the issue of writes that don't need syncing being synced because sync is set on the file descriptor. Here is output from our pg_test_fsync tool when run on an SSD with a BBU:

$ pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                         n/a
        fdatasync                8424.785 ops/sec     119 usecs/op
        fsync                    7127.072 ops/sec     140 usecs/op
        fsync_writethrough                    n/a
        open_sync               10548.469 ops/sec      95 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                         n/a
        fdatasync                4367.375 ops/sec     229 usecs/op
        fsync                    4427.761 ops/sec     226 usecs/op
        fsync_writethrough                    n/a
        open_sync                4303.564 ops/sec     232 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write open_sync sizes.)
         1 * 16kB open_sync write        4938.711 ops/sec     202 usecs/op
         2 *  8kB open_sync writes       4233.897 ops/sec     236 usecs/op
         4 *  4kB open_sync writes       2904.710 ops/sec     344 usecs/op
         8 *  2kB open_sync writes       1736.720 ops/sec     576 usecs/op
        16 *  1kB open_sync writes        935.917 ops/sec    1068 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        write, fsync, close      7626.783 ops/sec     131 usecs/op
        write, close, fsync      6492.697 ops/sec     154 usecs/op

Non-Sync'ed 8kB writes:
        write                  351517.178 ops/sec       3 usecs/op

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 14:25, Simon Riggs si...@2ndquadrant.com wrote: On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 21, 2014 at 3:20 AM, Florian Pflug f...@phlo.org wrote: On Jan20, 2014, at 08:42 , David Rowley dgrowle...@gmail.com wrote: On Mon, Jan 20, 2014 at 2:45 PM, Florian Pflug f...@phlo.org wrote: * I've also renamed INVFUNC to INVSFUNC. That's a pretty invasive change, and it's the last commit, so if you object to that, then you can merge up to eafa72330f23f7c970019156fcc26b18dd55be27 instead of de3d9148be9732c4870b76af96c309eaf1d613d7. Seems like sfunc really should be tfunc; then we could have invtfunc. I'd probably understand this better if I knew what the 's' was for in sfunc. I've not applied this just yet. Do you have a reason why you think it's better? My issue with just invfunc is mainly that it's too generic - it doesn't tell you what it's supposed to be the inverse of. I've always assumed that 's' in 'sfunc' and 'stype' stands for 'state', and that the naming is inspired by control theory, where the function which acts on the state space is often called S. Ok, that makes more sense now and it seems like a reasonable idea. I'm not quite sure yet, as when someone said upthread that these "negative transition functions", as I was calling them at the time, should really be called "inverse transition functions", I then posted that I was going to call the CREATE AGGREGATE option invfunc, which nobody seemed to object to. I just don't want to go and change that now. It is very possible this will come up again when the committer is looking at the patch. It would be a waste if it ended up back at invfunc after we changed it to invsfunc. Regards David Rowley
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Wed, Jan 22, 2014 at 12:46 AM, Florian Pflug f...@phlo.org wrote: On Jan21, 2014, at 10:53 , David Rowley dgrowle...@gmail.com wrote: It came to me that it might be possible to implement inverse transitions for floating point aggregates by just detecting if precision has been lost during forward transitions. I've written the test to do this as:
IF state.value + value = state.value AND value <> 0 THEN
    newstate.precision_lost := true;
    newstate.value := state.value;
ELSE
    newstate.precision_lost := false;
    newstate.value := state.value + value;
END IF;
The inverse transition function checks precision_lost and if it's true it returns NULL. The core code is now implemented (thanks to Florian) to re-aggregate when NULL is returned from the inverse transition function. That's not sufficient, I fear. You can lose all significant digits of the value and still have precision_lost = false afterwards. Try summing over 1e16, 1.01. SELECT 1e16::float8 + 1.01::float8 = 1e16::float8 returns FALSE, yet SELECT 1e16::float8 + 1.01::float8 - 1e16::float8 returns 2 where 1.01 would have been correct. That's still too much precision loss. I'm quite certain that the general idea has merit, but the actual invertibility condition is going to be more involved. If you want to play with this, I think the first step has to be to find a set of guarantees that SUM(float) is supposed to meet. Currently, SUM(float) guarantees that if the summands all have the same sign, the error is bound by C * N, where C is some (machine-specific?) constant (about 1e-15 or so), and N is the number of input rows. Or at least so I think from looking at SUMs over floats in descending order, which I guess is the worst case. You could, for example, try to see if you can find an invertibility condition which guarantees the same, but allows C to be larger. That would put a bound on the number of digits lost by the new SUM(float) compared to the old one.
I guess if the case is that normal (non-window) sum(float) results are undefined unless you add an order by to the aggregate then I guess there is no possible logic to put in for inverse transitions that will make them behave the same as an undefined behaviour. I don't have high hopes for this getting into 9.4, though. Yeah I'm going to drop it. I need to focus on documents and performance testing. If it seems sound enough, then I may implement it in C to see how much overhead it adds to forward aggregation for floating point types, but even if it did add too much overhead to forward aggregation it might be worth allowing aggregates to have 2 forward transition functions and if the 2nd one exists then it could be used in windowing functions where the frame does not have unbounded following. I don't think adding yet another type of aggregation function is the solution here. Why not? I thought, if in the cases where we need to burden the forward transition functions with extra code to make inverse transitions possible, then why not have an extra transition function so that it does not slow down normal aggregation? My testing of sum(numeric) last night does not show any slow down by that extra dscale tracking code that was added, but if it did then I'd be thinking seriously about 2 forward transition functions to get around the problem. I don't understand what would be wrong with that. The core code could just make use of the 2nd function if it existed and inverse transitions were thought to be required, i.e. in a window context and does not have UNBOUNDED PRECEDING. Regards David Rowley
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan23, 2014, at 01:17 , David Rowley dgrowle...@gmail.com wrote: On Wed, Jan 22, 2014 at 12:46 AM, Florian Pflug f...@phlo.org wrote: If you want to play with this, I think the first step has to be to find a set of guarantees that SUM(float) is supposed to meet. Currently, SUM(float) guarantees that if the summands all have the same sign, the error is bound by C * N, where C is some (machine-specific?) constant (about 1e-15 or so), and N is the number of input rows. Or at least so I think from looking at SUMs over floats in descending order, which I guess is the worst case. You could, for example, try to see if you can find a invertibility conditions which guarantees the same, but allows C to be larger. That would put a bound on the number of digits lost by the new SUM(float) compared to the old one. I guess if the case is that normal (non-window) sum(float) results are undefined unless you add an order by to the aggregate then I guess there is no possible logic to put in for inverse transitions that will make them behave the same as an undefined behaviour. Actually, if sum(float) was really undefined, it'd be *very* easy to provide an inverse transition function - just make it a no-op. Heck, you could then even make the forward transition function a no-op, since the very definition of undefined behaviour is result can be anything, including setting your house on fire. The point is, it's *not* undefined - it's just imprecise. And the imprecision can even be quantified, it just depends on the number of input rows (the equal-sign requirement is mostly there to make imprecision equivalent to relative error). 
If it seems sound enough, then I may implement it in C to see how much overhead it adds to forward aggregation for floating point types, but even if it did add too much overhead to forward aggregation it might be worth allowing aggregates to have 2 forward transition functions and if the 2nd one exists then it could be used in windowing functions where the frame does not have unbounded following. I don't think adding yet another type of aggregation function is the solution here. Why not? I thought, if in the cases where we need to burden the forward transition functions with extra code to make inverse transitions possible, then why not have an extra transition function so that it does not slow down normal aggregation? My testing of sum(numeric) last night does not show any slow down by that extra dscale tracking code that was added, but if it did then I'd be thinking seriously about 2 forward transition functions to get around the problem. I don't understand what would be wrong with that. The core code could just make use of the 2nd function if it existed and inverse transitions were thought to be required, i.e. in a window context and does not have UNBOUNDED PRECEDING. First, every additional function increases the maintenance burden, and at some point the expected benefit simply isn't worth it. IMHO at least. Secondly, you'd really need a second aggregate definition - usually, the non-invertible aggregate will get away with a much smaller state representation. Aggregates with state type internal could hack their way around that by simply not using the additional parts of their state structure, but e.g. plpgsql aggregates would still incur overhead due to repeated copying of an unnecessarily large state value. If it's possible to represent the forward-only but not the invertible state with a copy-by-value type, that overhead could easily be a large part of the total overhead you paid for having an inverse transition function.
And lastly, we could achieve much of the benefit of a second transition function by simply telling the forward transition function whether to expect inverse transition function calls. We already expose things like the aggregation memory context to C-language transition functions, so the basic mechanism is already there. In fact, I pondered whether to do this - but then figured I'd leave it to whoever needs that facility first. Though it wouldn't be much code - basically, setting a flag in the WindowAggState, and exporting a function to be used by aggregates to read it, similar to what AggCheckCallContext does. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/17/14, 7:57 AM, Robert Haas wrote: - WAL files are written (and sometimes read) sequentially and fsync'd very frequently and it's always good to write the data out to disk as soon as possible - Temp files are written and read sequentially and never fsync'd. They should only be written to disk when memory pressure demands it (but are a good candidate when that situation comes up) - Data files are read and written randomly. They are fsync'd at checkpoint time; between checkpoints, it's best not to write them sooner than necessary, but when the checkpoint arrives, they all need to get out to the disk without bringing the system to a standstill For sake of completeness... there are also data files that are temporary and don't need to be written to disk unless the kernel thinks there's better things to use that memory for. AFAIK those files are never fsync'd. In other words, these are the same as the temp files Robert describes except they also have random access. Dunno if that matters. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: hide application_name from other users
On Wed, Jan 22, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: Probably Heroku has some more specific exploit case to be concerned about here; if so, might I suggest taking it up with the -security list? I don't think there's a specific vulnerability that needs to be kept secret here. Here's an example. I just created a new hobby database which is on a multi-tenant cluster and ran select * from pg_stat_activity. Here are two of the more interesting examples:

463752 | de5nmf0gbii3u5 | 32250 | 463751 | qspfkgrwgqtbcu | unicorn worker[1] -p 30390 -c ./config/unicorn.rb | | | | | | | | insufficient privilege
463752 | de5nmf0gbii3u5 | 32244 | 463751 | qspfkgrwgqtbcu | unicorn worker[0] -p 30390 -c ./config/unicorn.rb | | | | | | | | insufficient privilege

Note that the contents of the ARGV array are being set by the unicorn task queuing library. It knows it's making this information visible to other users with shell access on this machine. But the decision to stuff the ARGV information into the application_name is being made by the Pg driver. Neither is under the control of the application author, who may not even be aware this is happening. Neither component has the complete information to make a competent decision about whether this information is safe to be in application_name or not. Note that the query is showing as insufficient privilege even though it is listed in the ps output -- the same ps output that is listing the unicorn ARGV that is being shown in the application_name. You might say that the Pg gem is at fault for making a blanket policy decision for applications that the ARGV is safe to show to other database users, but realistically it's so useful to see this information for your own connections that it's probably the right decision. Without it it's awfully hard to tell which worker is on which connection. It would just be nice to be able to treat application_name the same as query.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
My review:

Clearly, everyone likes this feature. I'm tempted to think it should be mandatory to specify this option in plain mode when tablespaces are present. Otherwise, creating a base backup is liable to create random files all over the place. Obviously, there would be backward compatibility concerns.

I'm not totally happy with the choice of : as the mapping separator, because that would always require escaping on Windows. We could make it analogous to the path handling and make ; the separator on Windows. Then again, this is not a path, so maybe it shouldn't look like one. We could pick something else altogether, like =.

The option name --tablespace isn't very clear. It ought to be something like --tablespace-mapping.

I don't think we should require the new directory to be an absolute path. It should be relative to the current directory, just like the -D option does it.

I'm not so worried about the atooid() and linked-list duplication. That can be addressed at some later point.

I would try to write this patch without using MAXPGPATH. I know existing code uses it, but we should try to use it less, because it overallocates memory and requires handling additional error conditions. For example, you catch overflow in tablespace_list_append() but later report that as invalid tablespace format. We now have psprintf() to make coding with dynamic memory allocation easier.

It looks like when you ignore an escaped : as a separator, you don't actually unescape it for use as a directory.

OLDDIR and NEWDIR should be capitalized in the help message.

Somehow, I had the subconscious assumption that this feature would do prefix matching on the directories, not only complete string matching. So if I had tablespaces in /mnt/data1, /mnt/data2, /mnt/data3, I could map them all with -T /mnt:mnt and be done. Not sure if that is useful to many, but it's worth a thought.
Review style guide for error messages: http://www.postgresql.org/docs/devel/static/error-style-guide.html

We need to think about how to handle this on platforms without symlinks. I don't like just printing an error message and moving on. It should be either pass or fail, or an option to choose between them.

Please put the options in the getopt handling in alphabetical order. It's not very alphabetical now, but between D and F is probably not the best place. ;-) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/17/14, 2:24 PM, Gregory Smith wrote: I am skeptical that the database will take over very much of this work and perform better than the Linux kernel does. My take is that our most useful role would be providing test cases kernel developers can add to a performance regression suite. Ugly "we never thought that would happen" situations seem to be at the root of many of the kernel performance regressions people here get nailed by. FWIW, there are some scenarios where we could potentially provide additional info to the kernel scheduler; stuff that we know but that it never will. For example, if we have a LIMIT clause we can (sometimes) provide a rough estimate of how many pages we'll need to read from a relation. Probably more useful is the case of index scans; if we pre-read more data from the index we could hand the kernel a list of base relation blocks that we know we'll need. There are some other things that have been mentioned, such as cases where files will only be accessed sequentially. Outside of that though, the kernel is going to be in a far better position to schedule I/O than we will ever be. Not only does it understand the underlying hardware, it can also see everything else that's going on. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/19/14, 5:51 PM, Dave Chinner wrote: Postgres is far from being the only application that wants this; many people resort to tmpfs because of this: https://lwn.net/Articles/499410/ Yes, we covered the possibility of using tmpfs much earlier in the thread, and came to the conclusion that temp files can be larger than memory, so tmpfs isn't the solution here. :) Although... instead of inventing new APIs and foisting this work onto applications, perhaps it would be better to modify tmpfs so that it can handle a temp space that's larger than memory... possibly backing it with X amount of real disk and allowing it/the kernel to decide when to passively move files out of the in-memory tmpfs and onto disk. Of course that's theoretically what swapping is supposed to do, but if that's not up to the job... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Wed, Jan 22, 2014 at 5:28 PM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. Well, that's a pretty easy theory to test. Just stop calling them (and do something similar to what we do for current counter fields instead) and see how much difference it makes. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On Wed, Jan 22, 2014 at 1:44 PM, Yugo Nagata nag...@sraoss.co.jp wrote: On Wed, 22 Jan 2014 20:04:12 +0900 (JST) Tatsuo Ishii is...@postgresql.org wrote: parseTypeString() is called by some other functions, and I avoided influencing them by modifying its definition, since this should raise errors in most cases. This is the same reason as for the other *MissingOk functions in parse_type.c. Is it better to rewrite the definitions of these functions and all their callers? Yes, for parseTypeString certainly. There have been many refactorings like that in the past and all of them use this pattern. typenameTypeIdAndMod is less clear since the code paths differ so much; maybe keep 2 versions (merging back to 1 function is OK too, but in any case you don't need 3). typenameTypeIdAndModMissingOk(...) { Type tup = LookupTypeName(pstate, typeName, typmod_p); if (tup == NULL || !((Form_pg_type) GETSTRUCT(tup))->typisdefined) *typeid_p = InvalidOid; else *typeid_p = HeapTupleGetOid(tup); if (tup) ReleaseSysCache(tup); } typenameTypeIdAndMod(...) { Type tup = typenameType(pstate, typeName, typmod_p); *typeid_p = HeapTupleGetOid(tup); ReleaseSysCache(tup); } Also, there's no need for an else here: if (raiseError) ereport(ERROR, ...); else return InvalidOid; Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 22, 2014 at 10:08 PM, Jim Nasby j...@nasby.net wrote: Probably more useful is the case of index scans; if we pre-read more data from the index we could hand the kernel a list of base relation blocks that we know we'll need. Actually, I've already tried this. The most important part is fetching heap pages, not index pages. Tried that too. Currently, fadvising those pages works *to the detriment* of physically correlated scans. That's a kernel bug I've reported to LKML, and I could probably come up with a patch. I've just never had time to set up the testing machinery to test the patch myself. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
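For readers following along, the kind of hint being discussed here is expressible today via posix_fadvise(). A minimal, illustrative sketch (not the code being debated, and block numbers are invented) of advising the kernel about a batch of heap blocks might look like:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Illustrative only: hint the kernel about a batch of relation blocks we
 * expect to need soon, e.g. after reading ahead in an index.  This is the
 * same interface (POSIX_FADV_WILLNEED) that PostgreSQL's existing
 * effective_io_concurrency prefetching uses for bitmap heap scans. */
static void
prefetch_blocks(int fd, const long *blocknos, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocknos[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
}
```

The interaction Claudio describes is exactly the risk with this call: issuing WILLNEED hints for blocks the kernel's own readahead would have fetched anyway can defeat its sequential-readahead heuristics.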
[HACKERS] Confusing documentation of ordered-set aggregates?
Hi After reading through the relevant parts of syntax.sgml, create_aggregate.sgml and xaggr.sgml, I think I understand how these work - they work exactly like regular aggregates, except that some arguments are evaluated only once and passed to the final function instead of the transition function. The whole ORDER BY thing is just crazy syntax the standard mandates - a saner alternative would have been ordered_set_agg(direct1,...,directN, WITHIN(arg1,...,argM)) or something like that, right? So whether ORDER BY implies any actual ordering is up to the ordered-set aggregate's final function. Or at least that's what xaggr.sgml seems to say: "Unlike the case for normal aggregates, the sorting of input rows for an ordered-set aggregate is <emphasis>not</emphasis> done behind the scenes, but is the responsibility of the aggregate's support functions." But that seems to contradict syntax.sgml, which says: "The expressions in the <replaceable>order_by_clause</replaceable> are evaluated once per input row just like normal aggregate arguments, sorted as per the <replaceable>order_by_clause</replaceable>'s requirements, and fed to the aggregate function as input arguments." Also, xaggr.sgml has the following to explain why NULLs are passed for all aggregated arguments to the final function, instead of simply not passing them at all: "While the null values seem useless at first sight, they are important because they make it possible to include the data types of the aggregated input(s) in the final function's signature, which may be necessary to resolve the output type of a polymorphic aggregate." Why do ordered-set aggregates require that, when plain aggregates are fine without it? array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray.
best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan23, 2014, at 01:07 , David Rowley dgrowle...@gmail.com wrote: On Tue, Jan 21, 2014 at 3:20 AM, Florian Pflug f...@phlo.org wrote: On Jan20, 2014, at 08:42 , David Rowley dgrowle...@gmail.com wrote: On Mon, Jan 20, 2014 at 2:45 PM, Florian Pflug f...@phlo.org wrote: * I've also renamed INVFUNC to INVSFUNC. That's a pretty invasive change, and it's the last commit, so if you object to that, then you can merge up to eafa72330f23f7c970019156fcc26b18dd55be27 instead of de3d9148be9732c4870b76af96c309eaf1d613d7. Seems like sfunc really should be tfunc; then we could have invtfunc. I'd probably understand this better if I knew what the 's' was for in sfunc. I've not applied this just yet. Do you have a reason why you think it's better? My issue with just invfunc is mainly that it's too generic - it doesn't tell you what it's supposed to be the inverse of. I've always assumed that 's' in 'sfunc' and 'stype' stands for 'state', and that the naming is inspired by control theory, where the function which acts on the state space is often called S. Ok, that makes more sense now and it seems like a reasonable idea. I'm just not quite sure yet: when someone said upthread that these "negative transition functions", as I was calling them at the time, should really be called inverse transition functions, I posted that I was going to call the CREATE AGGREGATE option invfunc, and nobody seemed to object. I just don't want to go and change that now. It is very possible this will come up again when the committer is looking at the patch. It would be a waste if it ended up back at invfunc after we changed it to invsfunc. Since we already settled on "inverse transition function", I kinda doubt that calling the parameter invsfunc is going to meet a lot of resistance. But we can put that off a little longer still... I've pushed a few additional things to https://github.com/fgp/postgres/tree/invtrans.
* I updated the CREATE AGGREGATE documentation, trying to include the description of the various modes of inverse transition functions in the paragraphs which already explained about STRICT for transition functions and such. * I've also updated the list of window functions to include a list of those aggregates which potentially need to restart the computation, i.e. MIN/MAX and the like. * I've changed nodeWindowAgg.c to use per-aggregate aggregation contexts for the invertible aggregates. Without that, the aggregate context is potentially never reset, because that previously required *all* the aggregates to restart at the same time. That would be OK if we were sure not to leak unbounded amounts of stuff stored in that context, but unfortunately we sometimes do. For example, whenever a strict, invertible aggregate ends up with only NULL inputs, we re-initialize the aggregation, which leaks the old state value. We could pfree() that of course, but that state value might reference other stuff that we don't know about and thus cannot free. Separating the aggregation contexts is the only solution I came up with, so I did that. * I've also tweaked an if to flag aggregates as invertible only if the frame head can actually move, i.e. if the frame start clause is something other than UNBOUNDED PRECEDING. Since the choice of whether to use a private aggregation context is driven by that flag, that also makes the above apply only to aggregates where the inverse transition function is actually used. I hope to find some time tomorrow or so to complete my pass through the documentation - what's still missing is an explanation of the EXPLAIN VERBOSE ANALYZE field and maybe some cleanup of xaggr.sgml. Do you have any additional things pending? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
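For readers following this subthread, the division of labor between a forward and an inverse transition function can be sketched in miniature. This is conceptual only: the names sfunc/invsfunc mirror the naming debate above, not the patch's actual C API, and SUM is chosen because it is trivially invertible (unlike MIN/MAX, which is why those appear in the restart list).

```c
/* Conceptual sketch of inverse transition functions: a moving SUM over a
 * sliding frame maintained incrementally, instead of recomputing the
 * whole frame whenever the frame head moves. */
typedef struct { long state; } agg;

static void sfunc(agg *a, long v)    { a->state += v; }  /* value enters frame */
static void invsfunc(agg *a, long v) { a->state -= v; }  /* value leaves frame */

/* Moving sum over a window of `width` trailing rows: O(n) total work,
 * versus O(n * width) if each frame were aggregated from scratch. */
static void
moving_sum(const long *in, long *out, int n, int width)
{
    agg a = {0};

    for (int i = 0; i < n; i++)
    {
        sfunc(&a, in[i]);                 /* frame tail advances */
        if (i >= width)
            invsfunc(&a, in[i - width]);  /* frame head advances */
        out[i] = a.state;
    }
}
```

The NULL-handling leak described above arises precisely because a strict aggregate in this scheme must sometimes throw away its running state and start over, rather than applying the inverse function.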
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and it should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements). See for example http://www.johndcook.com/standard_deviation.html cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Wed, Jan 22, 2014 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for registering the event source, which might be done before the server starts. So? Anything which can know the value of a GUC parameter can certainly know the selected port number. 1. When registering the event source, either the user has to pass the name or it uses a hard-coded default value, so if we use the version number along with 'PostgreSQL', it can be consistent. I don't see any way pgevent.c can know the port number to append it to the default value; am I missing something here? 2. If we use the port number in pg_ctl, and the user changes it across server restarts, that can also create a problem, as the event source name used will differ from the name used when registering the event source. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing documentation of ordered-set aggregates?
Florian Pflug f...@phlo.org writes: After reading through the relevant parts of syntax.sgml, create_aggregate.sgml and xaggr.sgml, I think I understand how these work - they work exactly like regular aggregates, except that some arguments are evaluated only once and passed to the final function instead of the transition function. Yeah, that statement is correct. The whole ORDER BY thing is just crazy syntax the standard mandates - a saner alternative would have been ordered_set_agg(direct1,...,directN, WITHIN(arg1,...,argM)) or something like that, right? Not sure. The syntax is certainly something out of far left field (which is pretty much par for the course with the SQL committee :-(). But the concept basically is "to the extent that your results depend on an assumed ordering of the input rows, this is what to use". That seems sane enough, at least for aggregates where the input ordering does matter. So whether ORDER BY implies any actual ordering is up to the ordered-set aggregate's final function. Yes, the committed patch intentionally doesn't force the aggregate to do any ordering, though all the built-in aggregates do so. but that seems to contradict syntax.sgml which says The expressions in the <replaceable>order_by_clause</replaceable> are evaluated once per input row just like normal aggregate arguments, sorted as per the <replaceable>order_by_clause</replaceable>'s requirements, and fed to the aggregate function as input arguments. Well, syntax.sgml is just trying to explain the user's-eye view. I'm not sure that it'd be helpful to say here that the implementation might choose not to do a physical sort.
Also, xaggr.sgml has the following to explain why NULLs are passed for all aggregated arguments to the final function, instead of simply not passing them at all: "While the null values seem useless at first sight, they are important because they make it possible to include the data types of the aggregated input(s) in the final function's signature, which may be necessary to resolve the output type of a polymorphic aggregate." Why do ordered-set aggregates require that, when plain aggregates are fine without it? Actually, if polymorphic types had existed when the original aggregate infrastructure was designed, it might well have been done like that. I was thinking while working on the ordered-set patch that this would be a really nifty thing for regular polymorphic aggregates too. Right now, the only safe way to make a polymorphic plain aggregate is to use a polymorphic state type, and that type has to be sufficient to determine the result type. If you'd like to define the state type as internal, you lose --- there's no connection between the input and result types. So I was wondering if we shouldn't think about how to allow regular aggregates to use final functions defined in this style. But it's not something I've got time to pursue at the moment. array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray. Yeah. So it's a complete leap of faith on the type system's part that this function is an appropriate final function for array_agg(). I'm not sure offhand if CREATE AGGREGATE would even allow this combination to be created, or if it only works because we manually jammed those rows into the catalogs at initdb time. But it would certainly be safer if CREATE AGGREGATE *didn't* allow it.
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing documentation of ordered-set aggregates?
I wrote: Florian Pflug f...@phlo.org writes: array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray. Yeah. So it's a complete leap of faith on the type system's part that this function is an appropriate final function for array_agg(). I'm not sure offhand if CREATE AGGREGATE would even allow this combination to be created, or if it only works because we manually jammed those rows into the catalogs at initdb time. But it would certainly be safer if CREATE AGGREGATE *didn't* allow it. Actually, after a little bit of experimentation, the irreproducible manual catalog hack is the very existence of array_agg_finalfn(). If you try to reproduce it via CREATE FUNCTION, the system will object: regression=# create function foo(internal) returns anyarray as regression-# 'array_agg_finalfn' language internal; ERROR: cannot determine result data type DETAIL: A function returning a polymorphic type must have at least one polymorphic argument. So what the ordered-set-aggregate patch has done is introduce a principled way to define polymorphic aggregates with non-polymorphic state types, something we didn't have before. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 12:00), Andrew Dunstan wrote: On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and it should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements). See for example http://www.johndcook.com/standard_deviation.html Thanks for your advice. I have read your example briefly; however, I think calculating the variance is not so heavy in my patch. The double-precision sqrt() calculation is the heaviest part, to my mind. And I have found a fast square root algorithm that is used in 3D games. http://en.wikipedia.org/wiki/Fast_inverse_square_root This page describes an inverse square root algorithm, but it can also be used to compute a normal square root, and it is faster than the general algorithm at the price of precision.
I think we want a fast algorithm, so it may be suitable. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
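For reference, the Wikipedia page linked above describes roughly the following routine (the classic Quake III version; a sketch, not code from the patch). Note the trade-off Kondo mentions: with one Newton-Raphson step the relative error is about 0.2%, which may or may not be acceptable for a reported standard deviation.

```c
#include <stdint.h>
#include <string.h>

/* Classic "fast inverse square root": approximates 1/sqrt(x) for x > 0.
 * sqrt(x) is then obtained as x * fast_rsqrt(x). */
static float
fast_rsqrt(float x)
{
    float       half = 0.5f * x;
    uint32_t    i;

    memcpy(&i, &x, sizeof i);        /* reinterpret the bits without UB */
    i = 0x5f3759df - (i >> 1);       /* magic-constant initial guess */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - half * x * x);   /* one Newton-Raphson refinement */
    return x;
}
```

Whether this beats the hardware sqrt instruction on modern CPUs is far from certain, which reinforces the upthread suggestion to avoid any per-statement sqrt() entirely rather than to approximate it.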
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: So? Anything which can know the value of a GUC parameter can certainly know the selected port number. 1. In case of registration of event source, either user has to pass the name or it uses hard coded default value, so if we use version number along with 'PostgreSQL', it can be consistent. I don't see any way pgevent.c can know port number to append it to default value, am I missing something here? [ shrug... ] But the same problem applies just as much to any attempt by pg_ctl to know *any* postmaster parameter. In particular, this entire patch is bogus, because the value it extracts from the postgresql.conf file does not necessarily have anything to do with the setting the live postmaster is actually using (which might be determined by a command-line parameter or environment variable instead). If the technique could be relied on, then it could be relied on just as much to extract the postmaster's port setting. I don't necessarily object to teaching pg_ctl to make a best-effort estimate of a postmaster parameter in this way. But it's complete folly to suppose that you can get an accurate value of event_source but not the port number. I think what we might want to do is redefine the server's behavior as creating an event named after the concatenation of event_source and port number, or maybe even get rid of event_source entirely and just say it's PostgreSQL followed by the port number. If we did the latter then the problem would reduce to whether pg_ctl knows the port number, which I think we're already assuming for other reasons. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We have already spent a lot of time on compression. Now we need to figure out what result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed very many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications. I propose making the final decisions now and concentrating our forces on producing a committable patch with those decisions - and not changing them even if somebody has an idea of how to improve code readability 100 times over and potential extensibility 1000 times over. I propose the following decisions: 1) Leave the uncompressed area, allowing duplicates in it. 2) Don't introduce Items on the page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in the 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move on to the fast scan patch now, but please do test and review the committed version to see if I missed something. Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain GinDataLeafMaxContentSize bytes. A patch is attached. My test suite doesn't run correctly; I'm debugging now. -- With best regards, Alexander Korotkov. gin-assert-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch (v2) for updatable security barrier views
(2014/01/21 18:18), Dean Rasheed wrote: On 21 January 2014 01:09, KaiGai Kohei kai...@ak.jp.nec.com wrote: (2014/01/13 22:53), Craig Ringer wrote: On 01/09/2014 11:19 PM, Tom Lane wrote: Dean Rasheed dean.a.rash...@gmail.com writes: My first thought was that it should just preprocess any security barrier quals in subquery_planner() in the same way as other quals are preprocessed. But thinking about it further, those quals are destined to become the quals of subqueries in the range table, so we don't actually want to preprocess them at that stage --- that will happen later when the new subquery is planned by recursion back into subquery_planner(). So I think the right answer is to make adjust_appendrel_attrs() handle recursion into sublink subqueries. TBH, this sounds like doubling down on a wrong design choice. I see no good reason that updatable security views should require any fundamental rearrangements of the order of operations in the planner. In that case, would you mind offering a quick sanity check on the following alternative idea: - Add sourceRelation to Query. This refers to the RTE that supplies tuple projections to feed into ExecModifyTable, with appropriate resjunk ctid and (if required) oid cols present. - When expanding a target view into a subquery, set sourceRelation on the outer view to the index of the RTE of the newly expanded subquery. I have the same opinion. The existing assumption, that resultRelation is the RTE of the table to be read and not only modified, makes the implementation complicated. If we had a separate sourceRelation, it would allow a flexible form, including a sub-query with security_barrier on the reader side. Could you tell me which direction you're inclined toward right now? I wonder whether I should take the latest patch submitted on 09-Jan for deep code review and testing. Yes, please review the patch from 09-Jan (http://www.postgresql.org/message-id/CAEZATCUiKxOg=voovja2s6g-sixzzxg18totggp8zobq6qn...@mail.gmail.com).
The idea behind that patch is to avoid a lot of the complication that leads to and then arises from having a separate sourceRelation in the query. If you go down the route of expanding the subquery in the rewriter and making it the sourceRelation, then you have to make extensive changes to preprocess_targetlist so that it recursively descends into the subquery to pull out variables needed by an update. Also you would probably need additional changes in the rewriter so that later stages didn't trample on the sourceRelation, and correctly updated it in the case of views on top of other views. Also, you would need to make changes to the inheritance planner code, because you'd now have 2 RTEs referring to the target relation (resultRelation and sourceRelation wrapping it in a subquery). Referring to the comment in inheritance_planner(): * Source inheritance is expanded at the bottom of the * plan tree (see allpaths.c), but target inheritance has to be expanded at * the top. except that in the case of the sourceRelation, it is actually effectively the target, which means it shouldn't be expanded, otherwise you'd get plans like the ones I complained about upthread (see the final explain output in http://www.postgresql.org/message-id/caezatcu3vcdkjpnhgfkrmrkz0axkcuh4ce_kqq3z2hzknhi...@mail.gmail.com). Perhaps all of that could be made to work, but I think that it would end up being a far more invasive patch than my 09-Jan patch. My patch avoids both those issues by not doing the subquery expansion until after inheritance expansion, and after targetlist preprocessing. Probably, I could get your point. Even though the sub-query being replaced from a relation with securityQuals is eventually planned according to the usual manner, usual order as a part of regular sub-query planning, however, adjust_appendrel_attrs() is called by inheritance_planner() earlier than sub-query's planning. 
Maybe we originally had this problem but it did not appear on the surface, because child relations don't have qualifiers at this phase. (They are distributed later.) But now this assumption is broken, because a relation with securityQuals is replaced by a sub-query with a SubLink. So I'm inclined to revise the assumption side, rather than the existing assertion checks. Shall we investigate which assumptions should be revised if a child relation, expanded at expand_inherited_tables(), were a sub-query containing a SubLink? A minor comment is below: ! /* !* Make sure that the query is marked correctly if the added qual !* has sublinks. !*/ ! if (!parsetree->hasSubLinks) ! parsetree->hasSubLinks = checkExprHasSubLink(viewqual); Is this logic really needed? This flag says whether the top-level query contains SubLinks, but the above condition checks whether the sub-query to be constructed will contain SubLinks. Also, securityQuals is not attached to the parse tree at this point. Thanks, -- OSS
Re: [HACKERS] [PATCH] Support for pg_stat_archiver view
On Thu, Jan 9, 2014 at 6:36 AM, Gabriele Bartolini <gabriele.bartol...@2ndquadrant.it> wrote:
> On 08/01/14 18:42, Simon Riggs wrote:
>> Not sure I see why it needs to be an SRF. It only returns one row.
> Attached is version 4, which removes management of SRF stages.

I have been looking at v4 a bit more, and found a couple of small things:
- a warning in pgstatfuncs.c
- some typos and format fixes in the SGML docs
- some corrections in a couple of comments
- a fix for an error message related to pg_stat_reset_shared, which referred only to bgwriter and not to the new archiver target. Here is how the new message looks:

=# select pg_stat_reset_shared('popo');
ERROR:  22023: unrecognized reset target: "popo"
HINT:  Target must be "bgwriter" or "archiver".

A new version is attached containing those fixes. I also played with the patch and pgbench, simulating some archive failures and successes while pgbench was running; the reports given by pg_stat_archiver were correct, so I am marking this patch as Ready for Committer.

Regards,
-- Michael

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..eb5131f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -270,11 +270,19 @@
      </row>

      <row>
+      <entry><structname>pg_stat_archiver</><indexterm><primary>pg_stat_archiver</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the
+       WAL archiver process's activity. See
+       <xref linkend="pg-stat-archiver-view"> for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_bgwriter</><indexterm><primary>pg_stat_bgwriter</primary></indexterm></entry>
       <entry>One row only, showing statistics about the
        background writer process's activity. See
        <xref linkend="pg-stat-bgwriter-view"> for details.
-      </entry>
+      </entry>
      </row>

      <row>
@@ -648,6 +656,63 @@
    </para>
   </note>

+  <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver">
+   <title><structname>pg_stat_archiver</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>archived_wals</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of WAL files that have been successfully archived</entry>
+     </row>
+     <row>
+      <entry><structfield>last_archived_wal</></entry>
+      <entry><type>text</type></entry>
+      <entry>Name of the last WAL file successfully archived</entry>
+     </row>
+     <row>
+      <entry><structfield>last_archived_wal_time</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time of the last successful archive operation</entry>
+     </row>
+     <row>
+      <entry><structfield>failed_attempts</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of failed attempts for archiving WAL files</entry>
+     </row>
+     <row>
+      <entry><structfield>last_failed_wal</></entry>
+      <entry><type>text</type></entry>
+      <entry>Name of the WAL file of the last failed archival operation</entry>
+     </row>
+     <row>
+      <entry><structfield>last_failed_wal_time</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time of the last failed archival operation</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_archiver</structname> view will always have a
+   single row, containing data about the archiver process of the cluster.
+  </para>
+
   <table id="pg-stat-bgwriter-view" xreflabel="pg_stat_bgwriter">
    <title><structname>pg_stat_bgwriter</structname> View</title>

@@ -1613,6 +1678,8 @@
         argument (requires superuser privileges).
         Calling <literal>pg_stat_reset_shared('bgwriter')</> will zero all the
         counters shown in the <structname>pg_stat_bgwriter</> view.
+        Calling <literal>pg_stat_reset_shared('archiver')</> will zero all the
+        counters shown in the <structname>pg_stat_archiver</> view.
        </entry>
       </row>

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 043d118..c8a89fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -672,6 +672,17 @@ CREATE VIEW pg_stat_xact_user_functions AS
     WHERE P.prolang != 12  -- fast check to eliminate built-in functions
           AND pg_stat_get_xact_function_calls(P.oid) IS NOT NULL;

+CREATE VIEW pg_stat_archiver AS
+    SELECT
+        s.archived_wals,
+        s.last_archived_wal,
+        s.last_archived_wal_time,
+        s.failed_attempts,
+        s.last_failed_wal,
+        s.last_failed_wal_time,
+        s.stats_reset
+    FROM
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi,

On 22/01/14 14:45, Tom Lane wrote:
> Well, is it context or detail? Those fields have reasonably well defined
> meanings IMO.

I find the distinction somewhat blurry and think both would be appropriate. But since I wasn't sure, I changed it to detail.

> If we need errcontext_plural, let's add it, not adopt inferior solutions
> just because that isn't there for lack of previous need.

I would have added it had I been sure. But having said that, I think this is indeed detail, not context.

> (I kinda wonder whether some of the stuff that's now in the primary
> message shouldn't be pushed to errdetail as well. It looks like some
> previous patches in this area have been lazy.)

I agree, the primary message is not very well worded. On the other hand, finding an appropriate alternative seems hard to me.

> While I'm griping, this message isn't even trying to follow the
> project's message style guidelines. Detail or context messages are
> supposed to be complete sentence(s), with capitalization and punctuation
> to match.

Hm, I hope I fixed it in this version of the patch.

> Lastly, is this information that we want to be shipping to clients?
> Perhaps from a security standpoint that's not such a wise idea, and
> errdetail_log() is what should be used.

Fixed. I added an errdetail_log_plural() for this, too.
Best regards,
-- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 122afb2..552c5a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 	{
-		StringInfoData buf;
+		StringInfoData buf,
+					lock_waiters_sbuf,
+					lock_holders_sbuf;
 		const char *modename;
 		long		secs;
 		int			usecs;
 		long		msecs;
+		SHM_QUEUE  *procLocks;
+		PROCLOCK   *proclock;
+		bool		first_holder = true,
+					first_waiter = true;
+		int			lockHoldersNum = 0;

 		initStringInfo(&buf);
+		initStringInfo(&lock_waiters_sbuf);
+		initStringInfo(&lock_holders_sbuf);
+
 		DescribeLockTag(&buf, &locallock->tag.lock);
 		modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
								   lockmode);
@@ -1211,10 +1221,67 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		msecs = secs * 1000 + usecs / 1000;
 		usecs = usecs % 1000;

+		LWLockAcquire(partitionLock, LW_SHARED);
+
+		/*
+		 * we loop over the lock's procLocks to gather a list of all
+		 * holders and waiters. Thus we will be able to provide more
+		 * detailed information for lock debugging purposes.
+		 *
+		 * lock->procLocks contains all processes which hold or wait for
+		 * this lock.
+		 */
+
+		procLocks = &(lock->procLocks);
+		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+											 offsetof(PROCLOCK, lockLink));
+
+		while (proclock)
+		{
+			/*
+			 * we are a waiter if myProc->waitProcLock == proclock; we are
+			 * a holder if it is NULL or something different
+			 */
+			if (proclock->tag.myProc->waitProcLock == proclock)
+			{
+				if (first_waiter)
+				{
+					appendStringInfo(&lock_waiters_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_waiter = false;
+				}
+				else
+					appendStringInfo(&lock_waiters_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+			}
+			else
+			{
+				if (first_holder)
+				{
+					appendStringInfo(&lock_holders_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_holder = false;
+				}
+				else
+					appendStringInfo(&lock_holders_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+
+				lockHoldersNum++;
+			}
+
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+												 offsetof(PROCLOCK, lockLink));
+		}
+
+		LWLockRelease(partitionLock);
+
 		if (deadlock_state == DS_SOFT_DEADLOCK)
 			ereport(LOG,
 					(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errdetail_plural("process owning lock: %s request queue: %s",
+									   "processes owning lock: %s request queue: %s",
+									   lockHoldersNum, lock_holders_sbuf.data,
+									   lock_waiters_sbuf.data))));
 		else if (deadlock_state == DS_HARD_DEADLOCK)
 		{
 			/*
@@ -1226,13 +1293,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 */
 			ereport(LOG,
 					(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errdetail_plural("process owning lock: %s request queue: %s",
+									   "processes owning lock: %s request queue: %s",
+									   lockHoldersNum, lock_holders_sbuf.data,