Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 9:34), Simon Riggs wrote: AFAICS, all that has happened is that people have given their opinions and we've got almost the identical patch, with a rush-rush comment to commit even though we've waited months. If you submit a patch, then you need to listen to feedback and be clear about what you will do next; if you don't, people will learn to ignore you and nobody wants that. I believe I already replied that it would be too heavy. If we implemented a histogram in pg_stat_statements, we would have to store the histogram data in double precision arrays, and on every statement we would have to search those arrays to find where the response time falls. That costs a lot of memory, and the update would be heavier than any benefit we would get from it. So I added only stddev, to keep pg_stat_statements as fast as the current version. Several people agreed with that, as you say. On 21 January 2014 21:19, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 21, 2014 at 11:48 AM, Simon Riggs si...@2ndquadrant.com wrote: I agree with people saying that stddev is better than nothing at all, so I am inclined to commit this, in spite of the above. I could live with stddev. But we really ought to be investing in making pg_stat_statements work well with third-party tools. I am very wary of enlarging the counters structure, because it is protected by a spinlock. There has been no attempt to quantify that cost, nor has anyone even theorized that it is not likely to be appreciable. OK, Kondo, please demonstrate benchmarks that show we have less than 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, I am testing with DBT-2 now. However, the benchmark's error margin may itself be higher than 1%, so I will show you the detailed HTML results. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
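For context on why stddev is so much cheaper than a histogram here: a running standard deviation needs only three counters per entry, updated in constant time per statement with Welford's method, with no per-bucket array to search. A sketch in Python (illustrative names, not the patch's actual code):

```python
# Welford's online algorithm: track mean and variance of execution
# times in O(1) per statement, using only three counters -- no
# per-entry bucket array, unlike a histogram.
class RunningStats:
    def __init__(self):
        self.n = 0          # number of executions seen
        self.mean = 0.0     # running mean of execution time
        self.m2 = 0.0       # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # population standard deviation; 0.0 until two samples exist
        if self.n < 2:
            return 0.0
        return (self.m2 / self.n) ** 0.5
```

This is the kind of accumulation that fits inside a spinlock-protected counters struct, which is exactly why the histogram alternative was rejected.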
Re: [HACKERS] Trigger information for auto_explain.
Hello, I came back with a doc patch and a revised 0002 patch. I think documentation is the only thing that stops this patch from being committable... can you add it? Agreed. I have pushed patch 0001 for now. Thank you, I'll add it soon. While writing the doc I found that the default setting for log_triggers was true in the last patch, but false is better regarding backward compatibility. The attached 0002 patch has been changed there. - 0002_auto_explain_triggers_v2_20140122.patch: default value for log_triggers changed from 'true' to 'false', as the added documentation says. - 0003_auto_explain_triggers_doc_v1_20140122.patch: documentation. regards, -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index c9b8192..53f38cb 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -124,6 +124,24 @@ LOAD 'auto_explain';
   <varlistentry>
    <term>
+     <varname>auto_explain.log_triggers</varname> (<type>boolean</type>)
+    </term>
+    <indexterm>
+     <primary><varname>auto_explain.log_triggers</varname> configuration parameter</primary>
+    </indexterm>
+    <listitem>
+     <para>
+      <varname>auto_explain.log_triggers</varname> causes additional trigger
+      statistics output to be printed when an execution plan is logged. This
+      parameter is off by default. Only superusers can change this
+      setting. This parameter has no effect
+      unless the <varname>auto_explain.log_analyze</varname> parameter is set.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
     <varname>auto_explain.log_format</varname> (<type>enum</type>)
    </term>
    <indexterm>

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index af68479..2354327 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -25,6 +25,7 @@
 static int  auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = false;
 static int  auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
@@ -113,6 +114,17 @@ _PG_init(void)
                             NULL,
                             NULL);
 
+   DefineCustomBoolVariable("auto_explain.log_triggers",
+                            "Collect trigger stats, available when log_analyze.",
+                            NULL,
+                            &auto_explain_log_triggers,
+                            false,
+                            PGC_SUSET,
+                            0,
+                            NULL,
+                            NULL,
+                            NULL);
+
    DefineCustomEnumVariable("auto_explain.log_format",
                             "EXPLAIN format to be used for plan logging.",
                             NULL,
@@ -295,6 +307,8 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
            ExplainBeginOutput(&es);
            ExplainQueryText(&es, queryDesc);
            ExplainPrintPlan(&es, queryDesc);
+           if (es.analyze && auto_explain_log_triggers)
+               ExplainPrintTriggers(&es, queryDesc);
            ExplainEndOutput(&es);
 
            /* Remove last line break */
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 09:25 AM, Alexander Korotkov wrote: On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I don't see why we can't just insert duplicate TIDs into the uncompressed area and eliminate them during re-encoding. It would mean somewhat more work during updates, more work during scans, and more WAL records. But is that really significant for real-life workloads?
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In-memory each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (though we haven't found such an encoding yet). Is that right? The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).
One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add or remove posting lists on the page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing. - Heikki
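For readers unfamiliar with the encoding being debated above: each item pointer in a posting list is stored as the delta from its predecessor, packed into 7-bit groups with a continuation bit. A minimal Python sketch of the idea (the real code in ginpostinglists.c packs 48-bit ItemPointers, not plain integers):

```python
def varbyte_encode(items):
    """Encode a sorted list of non-negative integers as deltas,
    each delta split into 7-bit groups, high bit = continuation."""
    out = bytearray()
    prev = 0
    for item in items:
        delta = item - prev
        prev = item
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # more bytes follow for this delta
            else:
                out.append(byte)         # last byte of this delta
                break
    return bytes(out)

def varbyte_decode(data):
    """Inverse of varbyte_encode: rebuild the absolute item values."""
    items, prev, delta, shift = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation: accumulate next 7 bits
        else:
            prev += delta       # delta complete: emit absolute value
            items.append(prev)
            delta, shift = 0, 0
    return items
```

Dense TID lists have tiny deltas, so most items cost a single byte, which is where the on-disk size reduction discussed in this thread comes from.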
Re: [HACKERS] Proposal: variant of regclass
On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Regards, Marti
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: Actually, I think the whole thing is rather badly designed, as warm cache or no you're still proposing to do thousands of kernel calls while holding a potentially contended LWLock. I suggest what we do is (1) read in the whole file, (2) acquire the lock, (3) re-read the whole file in the same buffer, (4) work from the buffer. fseeko() and fread() are part of the standard library, not any kernel. I don't believe that what I've done in qtext_fetch() implies thousands of kernel calls, or thousands of context switches. I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. It seems that glibc is actually content to always make lseek() and read() syscalls in the event of an fseek(), even though it does maintain its own buffer and could in principle have a lot more gumption about that [1]. I agree that we should copy everything into a buffer in one go. I think that making the LWLocking duration as brief as possible isn't critically important. Automated tools will only be interested in new query texts (implying possible I/O while share locking) as they observe new entries. But new entries are the only thing worth considering that may potentially block on that shared lock (although, not as they write out their query texts, which almost always just requires a shared lock). New entries ought to be a very rare occurrence compared to the execution of queries - certainly, on the busy production systems I use pg_stat_statements on, it may be weeks before a new query is observed (uh, with the possible exception of *my* ad-hoc pg_stat_statements queries). On my laptop, explain analyze select * from pg_stat_statements indicates a total runtime of ~9ms with 5,000 regression test entries, even with the (still modest) overhead of doing all those read()/lseek() calls.
So we're talking about a small impact on a new entry, that will only be affected once, that was always going to have to write out its query text to file anyway. On top of all this, in general it's very improbable that any given new entry will be affected at all, because two unlikely things need to happen at the same time for that to be possible. The most important aspect of keeping the overhead of pg_stat_statements low is that there not be too much cache pressure. Obviously the huge reduction in its shared memory footprint that this patch allows helps with that. [1] http://www.pixelbeat.org/programming/stdio_buffering/ -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
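The buffer-once approach agreed on above can be pictured like this: read the whole external query-text file into memory in one go, then serve each entry's text by (offset, length) slicing, with no per-entry seeks while the lock is held. This is only a sketch of the idea; the file layout and names are illustrative, not pg_stat_statements' actual format:

```python
def load_query_texts(path, entries):
    """entries: list of (offset, length) pairs, as a hash-table entry
    might store them. One bulk read() replaces an lseek()+read() pair
    per entry, so no I/O happens while iterating the entries."""
    with open(path, "rb") as f:
        buf = f.read()                    # single bulk read
    texts = []
    for off, length in entries:
        if off + length > len(buf):       # stale entry vs. truncated file
            texts.append(None)
        else:
            texts.append(buf[off:off + length].decode())
    return texts
```

The stale-entry check matters because, as in the real patch, the file can be garbage-collected out from under an old offset.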
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: BTW shouldn't there be an fflush() in qtext_store? I don't think so, no. Whenever qtext_store() callers don't fflush() before releasing their exclusive lock, they FreeFile() before doing so instead. In the case of pgss_shmem_startup() there is no lock held anyway, because it doesn't matter, for the same reason it doesn't matter that we're modifying shmem structures in a way that ordinarily necessitates an exclusive lock. Why fflush() every qtext_store() call if there is expected to be more than one, as is the case for several callers? Having said all that, it occurs to me now that I could have been smarter about where I fflush(). I'm pretty sure pgss_store() should have two fflush() calls. The first can occur with just a shared LWLock held, before the lock strength promotion interim (you probably noticed that I no longer generate a normalized query text in that interim, for reasons that are obvious). The second fflush() may occur only in the very unlikely event of a garbage collection necessitating a new qtext_store() call with an exclusive lock held, a few lines after the first fflush(). No need to fflush() with the exclusive lock the vast majority of the time. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? Originally the proposal checked for errors such as syntax errors in addition to missing objects, so I think raiseError was more appropriate at that time. Now the functions only check for missing objects, so renaming it to missing_ok would be more appropriate. I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. Leave the discussion to Yugo.. if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Not sure. There's already at least one counterexample: pg_my_temp_schema() oid OID of session's temporary schema, or 0 if none Best regards, -- Tatsuo Ishii SRA OSS, Inc.
Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
On 22 January 2014 04:42, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 31st December 2013, Christian Kruse Wrote: Hi there, I created two patches improving the log messages generated by log_lock_waits. The first patch shows the process IDs holding a lock we try to acquire (show_pids_in_lock_log.patch), sample output (log_lock_waits=on required): session 1: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2: BEGIN; LOCK TABLE foo IN SHARE MODE; session 3: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; Output w/o patch: LOG: process 13777 still waiting for ExclusiveLock on relation 16385 of database 16384 after 1000.030 ms Output with patch: LOG: process 13777 still waiting for ExclusiveLock on relation 16385 of database 16384 after 1000.030 ms CONTEXT: processes owning lock: 13775, 13776 I am reviewing this patch. The idea seems reasonable. Following are my first-level observations: 1. I found an issue in the following scenario: session 1 with process id X: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2 with process id Y: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; session 3 with process id Z: BEGIN; LOCK TABLE foo IN SHARE MODE; On execution of LOCK in session-3, the log will display: processes owning lock: X, Y But actually Y has not yet acquired the lock; it is still waiting, with higher priority. That may mislead the user. Maybe we should change the message to convey both: which processes own the lock and which are already in the queue. Perhaps this? CONTEXT: lock owner request queue XXX, XXX, XXX, etc 2. Can we give a better name to the new variable 'buf1'? 3. Do we need to take performance readings to see if there is any impact? Don't think so. Diagnosing problems will help performance, not hinder it. 4. Do we require documentation? Don't think so.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
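Rajeev's point, that a queued-but-not-granted process must not be listed as an owner, amounts to building the CONTEXT line from two separate sets. A hypothetical sketch of the message formatting (names invented for illustration, not the patch's code):

```python
def lock_context(owner_pids, waiting_pids):
    """Build a log CONTEXT line naming both the lock owners and the
    already-queued waiters, so a waiter with higher priority is never
    reported as if it already owned the lock."""
    parts = ["processes owning lock: " + ", ".join(map(str, owner_pids))]
    if waiting_pids:
        parts.append("wait queue: " + ", ".join(map(str, waiting_pids)))
    return "; ".join(parts)
```

In the session-1/2/3 scenario above, X would appear as the owner and Y and Z in the wait queue, which removes the ambiguity Rajeev describes.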
Re: [HACKERS] Proposal: variant of regclass
On Wed, 22 Jan 2014 20:04:12 +0900 (JST) Tatsuo Ishii is...@postgresql.org wrote: On Tue, Jan 14, 2014 at 9:28 AM, Yugo Nagata nag...@sraoss.co.jp wrote: Here is the patch to implement to_regclass, to_regproc, to_regoper, and to_regtype. + static Datum regclass_guts(char *class_name_or_oid, bool raiseError); Minor bikeshedding, a lot of code currently uses an argument named missing_ok for this purpose (with inverse meaning of course). Any reasons why you chose raiseError instead? Originally the proposal checked for errors such as syntax errors in addition to missing objects, so I think raiseError was more appropriate at that time. Now the functions only check for missing objects, so renaming it to missing_ok would be more appropriate. I only had a brief look at the patch, so maybe I'm missing something. But I don't think you should create 3 variants of these functions: * parseTypeString calls parseTypeString_guts with false * parseTypeStringMissingOk calls parseTypeString_guts with true * parseTypeString_guts And this is just silly: if (raiseError) parseTypeString(typ_name_or_oid, &result, &typmod); else parseTypeStringMissingOk(typ_name_or_oid, &result, &typmod); Just add an argument to parseTypeString and patch all the callers. Leave the discussion to Yugo.. parseTypeString() is called by some other functions, and I avoided affecting them by changing its definition, since it should raise errors in most cases. The same reasoning applies to the other *MissingOk functions in parse_type.c. Would it be better to change the definitions of these functions and update all their callers? if requested object is not found, returns InvalidOid, rather than raises an error. I thought the consensus was that returning NULL is better than InvalidOid? From an earlier message: On Thu, Dec 5, 2013 at 5:25 PM, Robert Haas robertmh...@gmail.com wrote: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL.
On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Not sure. There's already at least one counterexample: pg_my_temp_schema() oid OID of session's temporary schema, or 0 if none Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Yugo Nagata nag...@sraoss.co.jp
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:23, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:30, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills, WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
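The control flow of step 2 above can be sketched as a toy: a fixed-capacity log whose writer, on hitting the limit, runs a checkpoint callback (which, per step 1, must itself write no WAL) and then recycles space, instead of PANICking. All names here are invented for illustration:

```python
class BoundedLog:
    """Toy analogue of a bounded WAL: appenders never fail on a full
    log; they trigger a checkpoint and recycle space instead."""

    def __init__(self, capacity, checkpoint):
        self.capacity = capacity      # max bytes of log we may keep
        self.records = []             # records not yet checkpointed
        self.used = 0
        self.checkpoint = checkpoint  # flushes data; must not append here

    def append(self, record):
        if self.used + len(record) > self.capacity:
            # Analogue of: signal the checkpointer, wait for completion,
            # then clean out old segments -- rather than PANIC.
            self.checkpoint(self.records)
            self.records.clear()
            self.used = 0
            if len(record) > self.capacity:
                raise IOError("single record exceeds log capacity")
        self.records.append(record)
        self.used += len(record)
```

The final guard also illustrates why the thread worries about very large records (e.g. huge commit records): no amount of checkpointing helps if one record alone exceeds the space available.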
GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
On 01/22/2014 03:39 AM, Tomas Vondra wrote: What annoys me a bit is the huge size difference between the index updated incrementally (by a sequence of INSERT commands), and the index rebuilt from scratch using VACUUM FULL. It's a bit better with the patch (2288 vs. 2035 MB), but is there a chance to improve this? Hmm. What seems to be happening is that pending item list pages that the fast update mechanism uses are not getting recycled. When enough list pages are filled up, they are flushed into the main index and the list pages are marked as deleted. But they are not recorded in the FSM, so they won't be recycled until the index is vacuumed. Almost all of the difference can be attributed to deleted pages left behind like that. So this isn't actually related to the packed postinglists patch at all. It just makes the bloat more obvious, because it makes the actual size of the index, excluding deleted pages, smaller. But it can be observed on git master as well: I created a simple test table and index like this: create table foo (intarr int[]); create index i_foo on foo using gin(intarr) with (fastupdate=on); I filled the table like this: insert into foo select array[-1] from generate_series(1, 1000) g;

postgres=# \d+
                    List of relations
 Schema | Name | Type  | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
 public | foo  | table | heikki | 575 MB |
(1 row)

postgres=# \di+
                        List of relations
 Schema | Name  | Type  | Owner  | Table |  Size  | Description
--------+-------+-------+--------+-------+--------+-------------
 public | i_foo | index | heikki | foo   | 251 MB |
(1 row)

I wrote a little utility that scans all pages in a gin index, and prints out the flags indicating what kind of a page it is. The distribution looks like this:

    19 DATA
  7420 DATA LEAF
 24701 DELETED
     1 LEAF
     1 META

I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage, after the list pages have been marked as deleted, but unfortunately that didn't help.
The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. The upper levels are not updated until the FSM is vacuumed, so the pages are still not visible to GetFreeIndexPage calls until the next vacuum. The simplest fix would be to add a call to IndexFreeSpaceMapVacuum after flushing the pending list, per attached patch. I'm slightly worried about the performance impact of the IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update the upper levels of the FSM in a retail fashion instead. - Heikki

*** a/src/backend/access/gin/ginfast.c
--- b/src/backend/access/gin/ginfast.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/gin_private.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
+ #include "storage/indexfsm.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"

***************
*** 434,440 **** ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
--- 435,453 ----
  	END_CRIT_SECTION();
  
  	if (needCleanup)
+ 	{
  		ginInsertCleanup(ginstate, false, NULL);
+ 
+ 		/*
+ 		 * Vacuum the FSM to make the deleted list pages available for re-use.
+ 		 *
+ 		 * ginInsertCleanup marked them as free in the FSM by calling
+ 		 * RecordFreeIndexPage, but that only updates the bottom FSM level.
+ 		 * Since GetFreeIndexPage scans from the top, they won't be visible
+ 		 * for reuse until the upper levels have been updated.
+ 		 */
+ 		IndexFreeSpaceMapVacuum(index);
+ 	}
  }

***************
*** 599,608 **** shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
  		}
  	}
  
  	for (i = 0; i < data.ndeleted; i++)
  		UnlockReleaseBuffer(buffers[i]);
! 
! 	END_CRIT_SECTION();
  } while (blknoToDelete != newHead);
  
  	return false;
--- 612,625 ----
  		}
  	}
  
+ 	END_CRIT_SECTION();
+ 
  	for (i = 0; i < data.ndeleted; i++)
+ 	{
+ 		BlockNumber blkno = BufferGetBlockNumber(buffers[i]);
  		UnlockReleaseBuffer(buffers[i]);
! 		RecordFreeIndexPage(index, blkno);
! 	}
  } while (blknoToDelete != newHead);
  
  	return false;
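The FSM behaviour Heikki describes, that RecordFreeIndexPage touches only the leaf level while searches descend from the root, can be modeled with a toy two-level map. The class and names are purely illustrative, not PostgreSQL's FSM code:

```python
class TinyFSM:
    """Two-level free space map: leaves hold per-page free space,
    each root slot holds the max over a group of leaves. Searches
    descend from the root, so a stale root hides fresh leaf updates."""
    GROUP = 4

    def __init__(self, npages):
        self.leaf = [0] * npages
        self.root = [0] * ((npages + self.GROUP - 1) // self.GROUP)

    def record_free(self, page, amount):
        # bottom level only, like RecordFreeIndexPage
        self.leaf[page] = amount

    def vacuum(self):
        # re-aggregate upper level, like IndexFreeSpaceMapVacuum
        for g in range(len(self.root)):
            chunk = self.leaf[g * self.GROUP:(g + 1) * self.GROUP]
            self.root[g] = max(chunk)

    def search(self, need):
        # top-down descent, like GetFreeIndexPage
        for g, avail in enumerate(self.root):
            if avail >= need:
                lo, hi = g * self.GROUP, min((g + 1) * self.GROUP, len(self.leaf))
                for p in range(lo, hi):
                    if self.leaf[p] >= need:
                        return p
        return None
```

A freed page recorded via record_free stays invisible to search until vacuum refreshes the root, which is exactly why the deleted GIN list pages were not reused until the next index vacuum.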
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 09:25 AM, Alexander Korotkov wrote: On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page is modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I didn't get why insertion of duplicate TIDs to uncompressed area and eliminate them of re-encoding? It would be somewhat more work during updates, more work during scan, more WAL records. But is it really significant for real-life workloads? 
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (but we haven't found that something yet). Is that right? The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment). 
One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs to. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing. We already spent a lot of time with compression. Now we need to figure out the result we want see. I spent quite
Re: [HACKERS] improve the help message about psql -F
OK, I will do it. Thanks. Jov blog: http://amutu.com/blog 2014/1/21 Marti Raudsepp ma...@juffo.org On Mon, Jan 20, 2014 at 2:04 PM, Jov am...@amutu.com wrote: reasonable, I removed the set, patch attached. Hi Jov, A new commitfest was just opened, due on 2014-06. Please add your patch here: https://commitfest.postgresql.org/action/commitfest_view?id=22 (You'll need a community account if you don't already have one) Sometimes simple fixes like yours are merged outside a CommitFest, but adding it there makes sure it won't get lost. Regards, Marti
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Tue, Jan 21, 2014 at 11:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jan 22, 2014 at 9:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 6:57 PM, MauMau maumau...@gmail.com wrote: To follow this, we have the line as: #event_source = 'PostgreSQL 9.4' But this requires us to change this line for each major release. That's a maintenance headache. What I had in mind was to change it during initdb, we are already doing it for some other parameter (unix_socket_directories), What happens when somebody copies their postgresql.conf from an older version? That's hardly uncommon, even though it might be considered bad practice. I don't think it's a good idea to try to insert a version identifier this way. But ... what's the point of including the PG version in this string anyhow? If you're trying to make the string unique across different installations on the same machine, it's surely insufficient, and if that's not the point then what is? Well, certainly it cannot handle all the different scenarios (particularly same-version installations), but the original report for this case was for different versions of the server. I think the chances of having different versions of the server are much higher, and that case will be handled. I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, error range of benchmark might be 1% higher. So I show you detail HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
Heikki Linnakangas wrote: I wrote a little utility that scans all pages in a gin index, and prints out the flags indicating what kind of a page it is. The distribution looks like this:

   19 DATA
 7420 DATA LEAF
24701 DELETED
    1 LEAF
    1 META

Hah. I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage, after the list pages have been marked as deleted, but unfortunately that didn't help. The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. Interesting. I think the idea of having an option for RecordFreeIndexPage to update upper levels makes sense (no need to force it for other users.) Some time ago I proposed an index-only cleanup for vacuum. That would help GIN get this kind of treatment (vacuuming its FSM and processing the pending list) separately from vacuuming the index. It's probably too late for 9.4 though. One other thing worth considering in this area is that making the pending list size depend on work_mem appears to have been a really bad idea. I know one case where the server is really large and seems to run mostly OLAP type stuff with occasional updates, so they globally set work_mem=2GB; they have GIN indexes for text search, and the result is horrible performance 90% of the time, then a vacuum cleans the pending list and it is blazing fast until the pending list starts getting big again. Now you can argue that setting work_mem to that value is a bad idea, but as it turns out, in this case other than the GIN pending list it seems to work fine. Not related to the patch at hand, but I thought I would put it out for consideration, 'cause I'm not gonna start a new thread about it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Tom Lane t...@sss.pgh.pa.us wrote: Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER SYSTEM SET typos and fix for temporary file name management
On Tue, Jan 21, 2014 at 7:02 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jan 22, 2014 at 5:29 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I agree with Michael that having pg_basebackup be aware of the .temp suffix is ugly; for instance if we were to fix it to .tmp in ALTER SYSTEM but forgot to change pg_basebackup, the check would be immediately broken. But on the other hand I'm not sure why it's such a problem for pg_basebackup that it needs to be checking in the first place -- how about we rip that out? Coupled with the addition of a pgsql_tmp prefix to the temp configuration file name would work, yes we could remove this check part. Perhaps the temp file needs to be in pgsql_tmp? (Do we need to worry about cross-filesystem links if we do? I mean, do we support mounting pgsql_tmp separately?) Using pgsql_tmp looks a bit weird as well, as this prefix is used only for temporary files doing database-related operations, and ALTER SYSTEM is not one. Doing that this would need a mention in the docs at least: http://www.postgresql.org/docs/devel/static/storage-file-layout.html Thoughts? I think moving the temp file to a different directory than the permanent file is a non-starter. As Alvaro says, that could be on a different file system, and then attempting to rename() the file into place would fail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dynamic Shared Memory stuff
On Tue, Jan 21, 2014 at 2:58 PM, Noah Misch n...@leadboat.com wrote: What do people prefer? I recommend performing cleanup on the control segment named in PGShmemHeader just before shmdt() in PGSharedMemoryCreate(). No new ERROR or WARNING sites are necessary. Have dsm_postmaster_startup() continue to perform a cleanup on the control segment named in the state file. I think I'm on board with the first two sentences of this, but after Fujii Masao's email yesterday, I can't help thinking that what you propose the third sentence is a bad idea. He cloned a master to create a standby server on the same machine, and the standby startup ate the master's dynamic shared memory. We could teach pg_basebackup not to copy the state file, but that wouldn't help people who take base backups using the file system copy method, which is a lot of people. 5. Give up on this approach. We could keep what we have now, or make the DSM control segment land at a well-known address as we do for the main segment. How would having the DSM control segment at a well-known address affect the problem at hand? Did you mean a well-known dsm_handle? Yeah. So the idea is that we'd always use a dsm_handle of 100 + (100 * port) or something like that, and then search forward until we find a dsm_handle that works. This is basically the same algorithm we're using today for the main shared memory segment, but with a large additive constant so that they don't collide with each other. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 21, 2014 at 11:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jan 22, 2014 at 9:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 6:57 PM, MauMau maumau...@gmail.com wrote: To follow this, we have the line as: #event_source = 'PostgreSQL 9.4' But this requires us to change this line for each major release. That's a maintenance headache. What I had in mind was to change it during initdb, we are already doing it for some other parameter (unix_socket_directories), What happens when somebody copies their postgresql.conf from an older version? That's hardly uncommon, even though it might be considered bad practice. I don't think it's a good idea to try to insert a version identifier this way. But ... what's the point of including the PG version in this string anyhow? If you're trying to make the string unique across different installations on the same machine, it's surely insufficient, and if that's not the point then what is? Well, certainly it cannot handle all different scenario's (particularly same version installations), but the original report for this case was for different versions of server. I think chances of having different versions of server are much more, which will be handled by this case. I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for register of event source which might be done before start of server. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] The problems of PQhost()
Hi, I reported in other thread that PQhost() has three problems. http://www.postgresql.org/message-id/cahgqgwe77akyabywde5+8bjldopthp7cnswao_syedjogyv...@mail.gmail.com (1) PQhost() can return Unix-domain socket directory path even in the platform that doesn't support Unix-domain socket. (2) In the platform that doesn't support Unix-domain socket, when neither host nor hostaddr are specified, the default host 'localhost' is used to connect to the server and PQhost() must return that, but it doesn't. (3) PQhost() cannot return the hostaddr. As the result of these problems, you can see the incorrect result of \conninfo, for example. $ psql -d hostaddr=127.0.0.1 =# \conninfo You are connected to database postgres as user postgres via socket in /tmp at port 5432. Obviously /tmp should be 127.0.0.1. Attached patch fixes these problems. We can fix the problem (3) by changing PQhost() so that it also returns the hostaddr. But this change might break the existing application using PQhost(). So, I added new libpq function PQhostaddr() which returns the hostaddr, and changed \conninfo so that it reports correct connection information by using both PQhost() and PQhostaddr(). These problems exist in v9.3 or before. Basically we should backport the patch into those versions. But obviously it's too late to add new libpq function PQhostaddr() into those versions. Then, I'm thinking to backport only the change for (1) and (2) because we can fix them without adding new libpq function. BTW, I think that the patch also fixes the problem of \conninfo that MauMau reported in other thread. 
http://www.postgresql.org/message-id/8913B51B3B534F878D54B89E1EB964C6@maumau Regards, -- Fujii Masao

*** a/doc/src/sgml/libpq.sgml
--- b/doc/src/sgml/libpq.sgml
***************
*** 1464,1469 ****
--- 1464,1487 ----
  char *PQhost(const PGconn *conn);
       </listitem>
      </varlistentry>

+     <varlistentry id="libpq-pqhostaddr">
+      <term>
+       <function>PQhostaddr</function>
+       <indexterm>
+        <primary>PQhostaddr</primary>
+       </indexterm>
+      </term>
+
+      <listitem>
+       <para>
+        Returns the server numeric IP address or host name of the connection.
+ <synopsis>
+ char *PQhostaddr(const PGconn *conn);
+ </synopsis>
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="libpq-pqport">
       <term>
        <function>PQport</function>

*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
***************
*** 300,306 **** exec_command(const char *cmd,
  	else if (strcmp(cmd, "conninfo") == 0)
  	{
  		char	   *db = PQdb(pset.db);
! 		char	   *host = PQhost(pset.db);

  		if (db == NULL)
  			printf(_("You are currently not connected to a database.\n"));
--- 300,306 ----
  	else if (strcmp(cmd, "conninfo") == 0)
  	{
  		char	   *db = PQdb(pset.db);
! 		char	   *host = (PQhostaddr(pset.db) != NULL) ? PQhostaddr(pset.db) : PQhost(pset.db);

  		if (db == NULL)
  			printf(_("You are currently not connected to a database.\n"));

*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** 165,167 ****
--- 165,168 ----
  lo_lseek64                162
  lo_tell64                 163
  lo_truncate64             164
  PQconninfo                165
+ PQhostaddr                166

*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 5188,5194 ****
  PQhost(const PGconn *conn)
  {
  	if (!conn)
  		return NULL;
! 	return conn->pghost ? conn->pghost : conn->pgunixsocket;
  }

  char *
--- 5188,5211 ----
  {
  	if (!conn)
  		return NULL;
! 	if (conn->pghost != NULL && conn->pghost[0] != '\0')
! 		return conn->pghost;
! 	else
! 	{
! #ifdef HAVE_UNIX_SOCKETS
! 		return conn->pgunixsocket;
! #else
! 		return DefaultHost;
! #endif
! 	}
! }
!
! char *
! PQhostaddr(const PGconn *conn)
! {
! 	if (!conn)
! 		return NULL;
! 	return conn->pghostaddr;
  }

  char *

*** a/src/interfaces/libpq/libpq-fe.h
--- b/src/interfaces/libpq/libpq-fe.h
***************
*** 301,306 **** extern char *PQdb(const PGconn *conn);
--- 301,307 ----
  extern char *PQuser(const PGconn *conn);
  extern char *PQpass(const PGconn *conn);
  extern char *PQhost(const PGconn *conn);
+ extern char *PQhostaddr(const PGconn *conn);
  extern char *PQport(const PGconn *conn);
  extern char *PQtty(const PGconn *conn);
  extern char *PQoptions(const PGconn *conn);
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere in between (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. Just keeping the slot in memory won't help us in any way - we'll continue to reserve resources while the slot is half-gone. I don't think trying to handle errors we don't understand and don't routinely expect actually improves robustness. It just leads to harder-to-diagnose errors. It's not like the cases here are likely to be caused by anything but severe admin failure like removing the write permissions of the postgres directory while the server is running. Or do you see more valid causes? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
Hello 2014/1/22 Dean Rasheed dean.a.rash...@gmail.com On 21 January 2014 22:28, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I have been mulling over this patch, and I can't seem to come to terms with it. I first started making it look nicer here and there, thinking it was all mostly okay, but eventually arrived at the idea that it seems wrong to do what this does: basically, get_object_address() tries to obtain an object address, and if that fails, return InvalidOid; then, in RemoveObjects, we try rather hard to figure out why that failed, and construct an error message. It seems to me that it would make more sense to have get_object_address somehow return a code indicating what failed; then we don't have to go all over through the parser code once more. Perhaps, for example, when missing_ok is given as true to get_object_address it also needs to get a pointer to ObjectType and a string; if some object does not exist then fill the ObjectType with the failing object and the string with the failing name. Then RemoveObjects can construct a string more easily. Not sure how workable this exact idea is; maybe there is a better way. Yeah, when I initially started reviewing this patch I had a very similar thought. But when I looked more deeply at the code beneath get_object_address, I started to doubt whether it could be done without rather extensive changes all over the place. Also get_object_address is itself called from a lot of places (not necessarily all in our code) and all the other places (in our code, at least) pass missing_ok=false. So it seemed rather ugly to change its signature and force a matching change in all those other places, which actually don't care about missing objects. Perhaps the answer would be to have a separate get_object_address_if_exists function, and remove the missing_ok flag from get_object_address, but that all felt like a much larger patch. 
In the end, I felt that Pavel's approach wasn't adding that much new code, and it's all localised in the one place that does actually tolerate missing objects. I thought about it too, but I didn't continue - the reasons were named by Dean - and RemoveObjects is not difficult code - a lot of it is mechanical - and it is not on a critical path. Regards Pavel I admit though, that I didn't explore the other approach very deeply, so perhaps it might fall out more neatly than I feared. Regards, Dean
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere inbetween (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. This doesn't make sense to me. unlink/rmdir are atomic operations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Dynamic Shared Memory stuff
On Wed, Jan 22, 2014 at 09:32:09AM -0500, Robert Haas wrote: On Tue, Jan 21, 2014 at 2:58 PM, Noah Misch n...@leadboat.com wrote: What do people prefer? I recommend performing cleanup on the control segment named in PGShmemHeader just before shmdt() in PGSharedMemoryCreate(). No new ERROR or WARNING sites are necessary. Have dsm_postmaster_startup() continue to perform a cleanup on the control segment named in the state file. I think I'm on board with the first two sentences of this, but after Fujii Masao's email yesterday, I can't help thinking that what you propose the third sentence is a bad idea. He cloned a master to create a standby server on the same machine, and the standby startup ate the master's dynamic shared memory. We could teach pg_basebackup not to copy the state file, but that wouldn't help people who take base backups using the file system copy method, which is a lot of people. I agree we should not rely on folks learning to omit the state file from base backups. Abandoning the state file is one way to resolve that, and the reasons I outlined for preferring to keep it were not overriding concerns. We could instead store a postmaster PID in dsm_control_header and only clean if that PID is dead. We could make DSM startup aware of whether we're using a backup label, but that would be awkward thanks to StartupXLOG() happening a good bit later. Yeah, abandoning the state file is looking attractive. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Amit Kapila amit.kapil...@gmail.com writes: On Tue, Jan 21, 2014 at 8:25 PM, Robert Haas robertmh...@gmail.com wrote: While reviewing, I noted that the skipping missing configuration file message in ParseConfigFile() uses an elevel of LOG, while the other messages in the same file use elevel. I'm thinking that's a bug. It might not be right for all cases, but I think this is saving us in few cases when the caller (directly or indirectly) has sent elevel as ERROR or FATAL, in such cases we just want to LOG it, if strict is not true. Now it might not be appropriate if the caller has sent DEBUG2 and we use LOG, but to fix it (if this sounds an issue), some more analysis is required, so let's keep it as it is for now. The problem with this coding is that at SIGHUP, *all* the postgres processes will bleat about a missing file. To avoid that kind of log spam, the usual practice is to code the elog level as something like IsUnderPostmaster ? DEBUG2 : LOG; see the elevel setting in ProcessConfigFile() for precedent. In fact, I'm pretty sure that the elevel passed to ParseConfigFile should always have come from that logic. In any case, your argument is bogus because we could already have thrown an error at the given elevel further up in ParseConfigFile, or later in ParseConfigFp. None of this code is meant to receive an ERROR-or-higher elevel. So yes, this is broken and needs to be changed as Robert says. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: option --if-exists for pg_dump
Tomorrow I'll send updated version Regards Pavel 2014/1/21 Jeevan Chalke jeevan.cha...@enterprisedb.com Hi Pavel, Consider following test scenario: mydb=# \d emp Table public.emp Column | Type | Modifiers +-+--- empno | integer | not null deptno | integer | ename | text| Indexes: emp_pkey PRIMARY KEY, btree (empno) Foreign-key constraints: emp_deptno_fkey FOREIGN KEY (deptno) REFERENCES dept(deptno) mydb=# \d dept Table public.dept Column | Type | Modifiers +-+--- deptno | integer | not null dname | text| Indexes: dept_pkey PRIMARY KEY, btree (deptno) Referenced by: TABLE emp CONSTRAINT emp_deptno_fkey FOREIGN KEY (deptno) REFERENCES dept(deptno) mydb=# \q jeevan@ubuntu:~/pg_master$ ./install/bin/pg_dump -d mydb --if-exists -c mydb_ic.dmp I see following lines in dump which looks certainly wrong: === DROP FK CONSTRAINT IF EXISTS ublic.emp DROP CONSTRAINT emp_deptno_fkey; DROP CONSTRAINT IF EXISTS Y public.emp DROP CONSTRAINT emp_pkey; DROP CONSTRAINT IF EXISTS Y public.dept DROP CONSTRAINT dept_pkey; When try to restore, as expected it is throwing an error: === psql:mydb_ic.dmp:14: ERROR: syntax error at or near FK LINE 1: DROP FK CONSTRAINT IF EXISTS ublic.emp DROP CONSTRAINT emp_d... ^ psql:mydb_ic.dmp:15: ERROR: syntax error at or near CONSTRAINT LINE 1: DROP CONSTRAINT IF EXISTS Y public.emp DROP CONSTRAINT emp_p... ^ psql:mydb_ic.dmp:16: ERROR: syntax error at or near CONSTRAINT LINE 1: DROP CONSTRAINT IF EXISTS Y public.dept DROP CONSTRAINT dept... ^ Note: === Commands which are in form of ALTER TABLE ... DROP are failing. You need to test each and every object with DROP .. IF EXISTS command. Better write small test-case with all objects included. 
Following logic has flaw:
===
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 7fc0288..0677712 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -413,8 +413,30 @@ RestoreArchive(Archive *AHX)
 				/* Select owner and schema as necessary */
 				_becomeOwner(AH, te);
 				_selectOutputSchema(AH, te->namespace);
-				/* Drop it */
-				ahprintf(AH, "%s", te->dropStmt);
+
+				if (*te->dropStmt != '\0')
+				{
+					/* Inject IF EXISTS clause when it is required. */
+					if (ropt->if_exists)
+					{
+						char		buffer[40];
+						size_t		l;
+
+						/* But do it only, when it is not there yet. */
+						snprintf(buffer, sizeof(buffer), "DROP %s IF EXISTS",
+								 te->desc);
+						l = strlen(buffer);
+
+						if (strncmp(te->dropStmt, buffer, l) != 0)
+						{
+							ahprintf(AH, "DROP %s IF EXISTS %s",
+									 te->desc,
+									 te->dropStmt + l);
+						}
+						else
+							ahprintf(AH, "%s", te->dropStmt);
+					}
+				}
 			}
 		}

Also:
===
1. This is still out of sync.

@@ -348,6 +350,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(pgdumpopts, " --binary-upgrade");
 	if (column_inserts)
 		appendPQExpBufferStr(pgdumpopts, " --column-inserts");
+	if (if_exists)
+		appendPQExpBufferStr(pgdumpopts, " --if-exists");
 	if (disable_dollar_quoting)
 		appendPQExpBufferStr(pgdumpopts, " --disable-dollar-quoting");
 	if (disable_triggers)

2. Spell check required:

+	/* skip first n chars, and create a modifieble copy */
modifieble = modifiable

+	/* DROP IF EXISTS pattern is not appliable on dropStmt */
appliable = applicable

3.
+	/*
+	 * Object description is based on dropStmt statement. But
+	 * a drop statements can be enhanced about IF EXISTS clause.
+	 * We have to increase a offset in this case, IF EXISTS
+	 * should not be included on object description.
+	 */

Looks like you need to re-phrase these comments line. Something like:

/*
 * Object description is based on dropStmt statement which may have
 * IF EXISTS clause. Thus we need to update an offset such that it
 * won't be included in the object description.
 */

Or as per your choice. 
Need to have careful thought on a bug mentioned above. Thanks On Fri, Jan 17, 2014 at 6:23 PM, Pavel Stehule pavel.steh...@gmail.comwrote: Hello 2014/1/16 Jeevan Chalke jeevan.cha...@enterprisedb.com Hi Pavel, I have
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for register of event source which might be done before start of server. So? Anything which can know the value of a GUC parameter can certainly know the selected port number. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything = TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote inbetween replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. 
Which I don't think exists today, since we either read the commit record in its entirety or not at all. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Funny representation in pg_stat_statements.query.
Kyotaro HORIGUCHI wrote: Hello, Since this becomes more than a simple bug fix, I think it is too late for 9.4 now. I'll work on this in the longer term. OK. I will get around to it someday, but if you do it first, that's fine. Nevertheless of my words, the drive is getting diminished after the issue was settled:-p Anyway I'll keep this in my mind. This is a common theme with patch submitters: once you accept a good enough workaround, their motivation to write a real fix diminishes considerably, and as a result kluges live on. -- Álvaro Herrera   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On 2014-01-22 10:14:27 -0500, Robert Haas wrote: On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-18 08:35:47 -0500, Robert Haas wrote: I am not sure I understand that point. We can either update the in-memory bit before performing the on-disk operations or afterwards. Either way, there's a way to be inconsistent if the disk operation fails somewhere in between (it might fail but still have deleted the file/directory!). The normal way to handle that in other places is PANICing when we don't know so we recover from the on-disk state. I really don't see the problem here? Code doesn't get more robust by doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only ERROR, often that's not warranted. People get cranky when the database PANICs because of a filesystem failure. We should avoid that, especially when it's trivial to do so. The update to shared memory should be done second and should be set up to be no-fail. I don't see how that would help. If we fail during unlink/rmdir, we don't really know at which point we failed. This doesn't make sense to me. unlink/rmdir are atomic operations. Yes, individual operations should be, but you cannot be sure whether a rename()/unlink() will survive a crash until the directory is fsync()ed. So, what is one going to do if the unlink succeeded, but the fsync didn't? Deletion currently works like:

if (rename(path, tmppath) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not rename \"%s\" to \"%s\": %m",
					path, tmppath)));

/* make sure no partial state is visible after a crash */
fsync_fname(tmppath, false);
fsync_fname("pg_replslot", true);

if (!rmtree(tmppath, true))
{
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not remove directory \"%s\": %m", tmppath)));
}

If we fail between the rename() and the fsync_fname() we don't really know which state we are in.
We'd also have to add code to handle incomplete slot directories, which currently only exists for startup, to other places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara j...@suse.cz wrote: But that still doesn't work out very well, because now the guy who does the write() has to wait for it to finish before he can do anything else. That's not always what we want, because WAL gets written out from our internal buffers for multiple different reasons. Well, you can always use AIO (io_submit) to submit direct IO without waiting for it to finish. But then you might need to track the outstanding IO so that you can watch with io_getevents() when it is finished. Yeah. That wouldn't work well for us; the process that did the io_submit() would want to move on to other things, and how would it, or any other process, know that the I/O had completed? As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. IMHO, the problem is simpler than that: no single process should be allowed to completely screw over every other process on the system. When the checkpointer process starts calling fsync(), the system begins writing out the data that needs to be fsync()'d so aggressively that service times for I/O requests from other process go through the roof. It's difficult for me to imagine that any application on any I/O scheduler is ever happy with that behavior. We shouldn't need to sprinkle of fsync() calls with special magic juju sauce that says hey, when you do this, could you try to avoid causing the rest of the system to COMPLETELY GRIND TO A HALT?. That should be the *default* behavior, if not the *only* behavior. Now, that is not to say that we're unwilling to sprinkle magic juju sauce if that's what it takes to solve this problem. 
If calling fadvise() or sync_file_range() or some new API that you invent at some point prior to calling fsync() helps the kernel do the right thing, we're willing to do that. Or if you/the Linux community wants to invent a new API fsync_but_do_not_crush_system() and have us call that instead of the regular fsync(), we're willing to do that, too. But I think there's an excellent case to be made, at least as far as checkpoint I/O spikes are concerned, that the API is just fine as it is and Linux's implementation is simply naive. We'd be perfectly happy to wait longer for fsync() to complete in exchange for not starving the rest of the system - and really, who wouldn't? Linux is a multi-user system, and apportioning resources among multiple tasks is a basic function of a multi-user kernel. /rant Anyway, if CFQ or any other Linux I/O scheduler gets an option to lower the priority of the fsyncs, I'm sure somebody here will test it out and see whether it solves this problem. AFAICT, experiments to date have pretty much universally shown CFQ to be worse than not-CFQ and everything else to be more or less equivalent - but if that changes, I'm sure many PostgreSQL DBAs will be more than happy to flip CFQ back on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi, On 22/01/14 12:40, Simon Riggs wrote: 1. I find a issue in following scenario: session 1 with process id X: BEGIN; LOCK TABLE foo IN SHARE MODE; session 2 with process id Y: BEGIN; LOCK TABLE foo IN EXCLUSIVE MODE; session 3 with process id Z: BEGIN; LOCK TABLE foo IN SHARE MODE; On execution of LOCK in session-3, as part of log it will display as: processes owning lock: X, Y But actually if we see Y has not yet own the lock, it is still waiting with higher priority. It may mislead user. May be we should change message to give all meaning i.e. which process is owning lock and Which process is already in queue. Perhaps this? CONTEXT: lock owner request queue XXX, XXX, XXX, etc Fixed. 2. Can we give a better name to new variable 'buf1'? Fixed. 3. Do we need to take performance reading to see if any impact? Don't think so. Diagnosing problems will help performance, not hinder it I agree. And this code path will only get executed when log_lock_waits = on, which seems to be a debugging method to me. 
Best regards,

-- Christian Kruse   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..588c4ef 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 	{
-		StringInfoData buf;
+		StringInfoData buf,
+					lock_waiters_sbuf,
+					lock_holders_sbuf;
 		const char *modename;
 		long		secs;
 		int			usecs;
 		long		msecs;
+		SHM_QUEUE  *procLocks;
+		PROCLOCK   *proclock;
+		bool		first_holder = true,
+					first_waiter = true;
+		int			lockHoldersNum = 0;
 
 		initStringInfo(&buf);
+		initStringInfo(&lock_waiters_sbuf);
+		initStringInfo(&lock_holders_sbuf);
+
 		DescribeLockTag(&buf, &locallock->tag.lock);
 		modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
 								   lockmode);
@@ -1211,10 +1221,58 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		msecs = secs * 1000 + usecs / 1000;
 		usecs = usecs % 1000;
 
+		LWLockAcquire(partitionLock, LW_SHARED);
+
+		procLocks = &(lock->procLocks);
+		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+											 offsetof(PROCLOCK, lockLink));
+
+		while (proclock)
+		{
+			/*
+			 * we are a waiter if myProc->waitProcLock == proclock; we are
+			 * a holder if it is NULL or something different
+			 */
+			if (proclock->tag.myProc->waitProcLock == proclock)
+			{
+				if (first_waiter)
+				{
+					appendStringInfo(&lock_waiters_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_waiter = false;
+				}
+				else
+					appendStringInfo(&lock_waiters_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+			}
+			else
+			{
+				if (first_holder)
+				{
+					appendStringInfo(&lock_holders_sbuf, "%d",
									 proclock->tag.myProc->pid);
+					first_holder = false;
+				}
+				else
+					appendStringInfo(&lock_holders_sbuf, ", %d",
									 proclock->tag.myProc->pid);
+
+				lockHoldersNum++;
+			}
+
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+												 offsetof(PROCLOCK, lockLink));
+		}
+
+		LWLockRelease(partitionLock);
 
 		if (deadlock_state == DS_SOFT_DEADLOCK)
 			ereport(LOG,
 					(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
+								 lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 		else if (deadlock_state == DS_HARD_DEADLOCK)
 		{
 			/*
@@ -1226,13 +1284,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 */
 			ereport(LOG,
 					(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
+								 lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 		}
 
 		if (myWaitStatus == STATUS_WAITING)
 			ereport(LOG,
 					(errmsg("process %d still waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errcontext(ngettext("process owning lock: %s request queue: %s",
+										  "processes owning lock: %s request queue: %s",
+										  lockHoldersNum),
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Christian Kruse wrote: I think this could use some more comments -- for instance at the top of the while loop, explain what its purpose is.

if (deadlock_state == DS_SOFT_DEADLOCK)
	ereport(LOG,
			(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-					MyProcPid, modename, buf.data, msecs, usecs)));
+					MyProcPid, modename, buf.data, msecs, usecs),
+			 (errcontext(ngettext("process owning lock: %s request queue: %s",
+								  "processes owning lock: %s request queue: %s",
+								  lockHoldersNum),
+						 lock_holders_sbuf.data, lock_waiters_sbuf.data))));

This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it:

char *errcxt = NULL;

while ( ... )
{
	...
	errcxt = ngettext("processes owning lock: ..");
}

ereport(LOG,
		(errmsg("blah blah"),
		 errcxt != NULL ? errcontext(errcxt) : 0));

That would avoid the repetition. -- Álvaro Herrera   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WAL replay should fdatasync() segments?
Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. I think the correct way to handle this would be to fsync() segments we read from pg_xlog/ during recovery. Am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
Andres Freund and...@2ndquadrant.com wrote: A rather common and sensible configuration is to have a common configuration file used across servers, which then is overwritten by a per-server or per-cluster config file containing values specific to a server/cluster. Agreed. My preference would be to not generate noise for interim states; just report net changes. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On 2014-01-10 13:11:32 -0500, Robert Haas wrote: OK, I've implemented this: here's what I believe to be a complete patch, based on the previous lwlock-pointers.patch but now handling LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose ends. I think this should be adequate for allowing lwlocks to be stored elsewhere in the main shared memory segment as well as in dynamic shared memory segments. Thoughts? Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? Requiring code that has worked for several versions to sprinkle version-specific ifdefs seems unnecessary here. I see the comments still claim

+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)

I don't think that's true anymore with the addition of the tranche id. The new minimum size seems to be 20 on 32bit platforms. That could be fixed by making the tranche id a uint8, but I don't think that's necessary, so just removing the support for 32 seems sufficient. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no behavior that wasn't there, unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). 
So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay should fdatasync() segments?
On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Alvaro Herrera alvhe...@2ndquadrant.com writes: This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it: Would that not break the compiler's ability to verify the format codes in the string? Not to mention make it harder for people to compare format to arguments, too? However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay should fdatasync() segments?
On 2014-01-23 02:05:48 +0900, Fujii Masao wrote: On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Yea, but if the walreceiver receives data and crashes/disconnects before fsync(), we'll read it from pg_xlog, right? And if we promote, we'll start inserting new records before establishing a new checkpoint. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
On 2014-01-22 12:10:30 -0500, Tom Lane wrote: Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. Wouldn't bother me overly much. But it doesn't seem too hard to avoid it. Simply split the current ProcessConfigFile() into one more phase. Currently we seem to be parsing the config file, then iterate over the list returned from that to set options. If we'd instead have one pass over the items storing the new value in a new config_generic variable, after checking, and then a second pass over all gucs setting those that have changed we should end up at the same complexity since it could be merged with the reset pass. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrectly reporting config errors
Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. I'm not happy about complicating that logic even more. I think the reasonable choices here are to reword that message somehow, or just drop it completely. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We already spent a lot of time with compression. Now we need to figure out the result we want see. I spent quite long time debugging varbyte encoding without segments. Finally, it passed very many tests without any problems. Now, it is just piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because code doesn't look perfect. For sure, it's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we have also subsequent fast-scan which is very needed for hstore and other application. I propose to do final decisions now and concentrate forces on making committable patch with these decisions. And don't change these decisions even if somebody have idea how to improve code readability in 100 times and potential extendability in 1000 times. I propose following decisions: 1) Leave uncompressed area, allow duplicates in it 2) Don't introduce Items on page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. I thought about this but figured it was too much of a misnomer to refer to a pointer as an ID. But, if we're sure we want to go that route, I can go revise the patch along those lines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory and locks
On 2014-01-22 12:40:34 -0500, Robert Haas wrote: On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. I thought about this but figured it was too much of a misnomer to refer to a pointer as an ID. But, if we're sure we want to go that route, I can go revise the patch along those lines. I personally don't care either way for internal code as long as external code continues to work. There's the argument of making the commit better readable by having less noise and less divergence in the branches and there's your argument of that being less clear. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: hide application_name from other users
On 01/21/2014 05:21 PM, Andres Freund wrote: I think the only realistic thing is a monitoring capability, like we have replication. GRANT/REVOKE doesn't even come close to being able to generically allow to grant permissions of even the moderate complexity pg_stat_get_activity() has. That would work for me, personally. I don't know how it would work for anyone else. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] new json funcs
On 01/21/2014 06:21 PM, Marko Tiikkaja wrote: Hi Andrew, On 1/18/14, 10:05 PM, I wrote: But I'll continue with my review now that this has been sorted out. Sorry about the delay. I think the API for the new functions looks good. They are all welcome additions to the JSON family. The implementation side looks reasonable to me. I'm not sure there's need to duplicate so much code, though. E.g. json_to_recordset is almost identical to json_populate_recordset, and json_to_record has a bit of the same disease. Finally, (as I'm sure you know already), docs are still missing. Marking the patch Waiting on Author for the time being. New patch attached. Main change is I changed json_populate_record/json_to_record to call a common worker function, and likewise with json_populate_recordset/json_to_recordset. We're still finalizing the docs - should be ready in the next day or so. cheers andrew

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 481db16..ef5e125 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -68,6 +68,10 @@ static void array_dim_to_json(StringInfo result, int dim, int ndims, int *dims,
 				  bool use_line_feeds);
 static void array_to_json_internal(Datum array, StringInfo result,
 					   bool use_line_feeds);
+static void datum_to_json(Datum val, bool is_null, StringInfo result,
+			  TYPCATEGORY tcategory, Oid typoutputfunc, bool key_scalar);
+static void add_json(Datum orig_val, bool is_null, StringInfo result,
+		 Oid val_type, bool key_scalar);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -1219,7 +1223,7 @@ extract_mb_char(char *s)
  */
 static void
 datum_to_json(Datum val, bool is_null, StringInfo result,
-			  TYPCATEGORY tcategory, Oid typoutputfunc)
+			  TYPCATEGORY tcategory, Oid typoutputfunc, bool key_scalar)
 {
 	char	   *outputstr;
 	text	   *jsontext;
@@ -1241,24 +1245,32 @@ datum_to_json(Datum val, bool is_null, StringInfo result,
 			composite_to_json(val, result, false);
 			break;
 		case TYPCATEGORY_BOOLEAN:
-			if (DatumGetBool(val))
-				appendStringInfoString(result, "true");
+			if (!key_scalar)
+				appendStringInfoString(result, DatumGetBool(val) ? "true" : "false");
 			else
-				appendStringInfoString(result, "false");
+				escape_json(result, DatumGetBool(val) ? "true" : "false");
 			break;
 		case TYPCATEGORY_NUMERIC:
 			outputstr = OidOutputFunctionCall(typoutputfunc, val);
-
-			/*
-			 * Don't call escape_json here if it's a valid JSON number.
-			 */
-			dummy_lex.input = *outputstr == '-' ? outputstr + 1 : outputstr;
-			dummy_lex.input_length = strlen(dummy_lex.input);
-			json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error);
-			if (!numeric_error)
-				appendStringInfoString(result, outputstr);
-			else
+			if (key_scalar)
+			{
+				/* always quote keys */
 				escape_json(result, outputstr);
+			}
+			else
+			{
+				/*
+				 * Don't call escape_json for a non-key if it's a valid JSON
+				 * number.
+				 */
+				dummy_lex.input = *outputstr == '-' ? outputstr + 1 : outputstr;
+				dummy_lex.input_length = strlen(dummy_lex.input);
+				json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error);
+				if (!numeric_error)
+					appendStringInfoString(result, outputstr);
+				else
+					escape_json(result, outputstr);
+			}
 			pfree(outputstr);
 			break;
 		case TYPCATEGORY_JSON:
@@ -1276,6 +1288,10 @@ datum_to_json(Datum val, bool is_null, StringInfo result,
 			break;
 		default:
 			outputstr = OidOutputFunctionCall(typoutputfunc, val);
+			if (key_scalar && *outputstr == '\0')
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("key value must not be empty")));
 			escape_json(result, outputstr);
 			pfree(outputstr);
 			break;
@@ -1309,7 +1325,7 @@ array_dim_to_json(StringInfo result, int dim, int ndims, int *dims, Datum *vals,
 		if (dim + 1 == ndims)
 		{
 			datum_to_json(vals[*valcount], nulls[*valcount], result, tcategory,
-						  typoutputfunc);
+						  typoutputfunc, false);
 			(*valcount)++;
 		}
 		else
@@ -1490,13 +1506,85 @@ composite_to_json(Datum composite, StringInfo result, bool use_line_feeds)
 		else
 			tcategory = TypeCategory(tupdesc->attrs[i]->atttypid);
 
-		datum_to_json(val, isnull, result, tcategory, typoutput);
+		datum_to_json(val, isnull, result, tcategory, typoutput, false);
 	}
 
 	appendStringInfoChar(result, '}');
 	ReleaseTupleDesc(tupdesc);
 }
 
+static void
+add_json(Datum orig_val, bool is_null, StringInfo result, Oid val_type, bool key_scalar)
+{
+	Datum		val;
+	TYPCATEGORY tcategory;
+	Oid			typoutput;
+	bool		typisvarlena;
+	Oid			castfunc = InvalidOid;
+
+	if (val_type == InvalidOid)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("could not determine input data type")));
+
+	getTypeOutputInfo(val_type, &typoutput, &typisvarlena);
+
+	if (val_type > FirstNormalObjectId)
+	{
+		HeapTuple	tuple;
+		Form_pg_cast castForm;
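The rule the patch's `key_scalar` flag enforces can be shown with a toy illustration (not the patch's code, and no PostgreSQL internals): in key position every scalar must be emitted as a quoted string, even one that would be a bare JSON number or boolean in value position.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the key_scalar behaviour above: keys are always
 * quoted, values may be emitted bare (e.g. a valid JSON number).
 * Uses a static buffer purely for brevity. */
static const char *
emit_scalar(const char *outputstr, int key_scalar)
{
	static char buf[64];

	if (key_scalar)
		snprintf(buf, sizeof(buf), "\"%s\"", outputstr);	/* always quote keys */
	else
		snprintf(buf, sizeof(buf), "%s", outputstr);		/* bare in value position */
	return buf;
}
```

So `42` becomes `"42"` when used as an object key but stays `42` as a value, which is exactly the asymmetry the `key_scalar` parameter threads through `datum_to_json`.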
Re: [HACKERS] WAL replay should fdatasync() segments?
On Thu, Jan 23, 2014 at 2:08 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-23 02:05:48 +0900, Fujii Masao wrote: On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will write() data before fdatasync()ing them (duh, kinda obvious). But I think given the current recovery code that leaves a window where we can get into strange inconsistencies. Consider what happens if postgres (not the OS!) crashes after writing WAL data to the OS, but before fdatasync()ing it. Replay will happily read that record from disk and replay it, which is fine. At the end of recovery we then will start inserting new records, and those will be properly fsynced to disk. But if the *OS* crashes in that moment we might get into the strange situation where older records might be lost since they weren't fsync()ed, but newer records and the control file will persist. I think for a primary that window is relatively small, but I think it's a good bit bigger for a standby, especially if it's promoted. In normal streaming replication case, ISTM that window is not bigger for the standby because basically the standby replays only the WAL data which walreceiver fsync'd to the disk. But if it replays the WAL file which was fetched from the archive, that WAL file might not have been flushed to the disk yet. In this case, that window might become bigger... Yea, but if the walreceiver receives data and crashes/disconnects before fsync(), we'll read it from pg_xlog, right? And if we promote, we'll start inserting new records before establishing a new checkpoint. Yeah, true. Such unflushed WAL file can be read by the subsequent recovery... Regards, -- Fujii Masao
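The invariant under discussion can be sketched in miniature: a record must be fdatasync()ed before anything that depends on it is made durable, otherwise an OS crash can keep newer records while losing older ones. This is a minimal sketch, not PostgreSQL code; the path and record format are illustrative only.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one record and force it to stable storage before returning.
 * The proposal above amounts to doing this for replayed-but-unsynced
 * WAL before any *new* records are written after it. */
static int
append_record_durably(const char *path, const char *rec)
{
	int			fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, rec, strlen(rec)) != (ssize_t) strlen(rec) ||
		fdatasync(fd) != 0)		/* data (not all metadata) to the platter */
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```

fdatasync() rather than fsync() suffices here because WAL segments are preallocated, so only the data blocks, not file metadata, need flushing.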
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
On Wed, Jan 22, 2014 at 10:48 AM, Andres Freund and...@2ndquadrant.com wrote: Yes, individual operations should be, but you cannot be sure whether a rename()/unlink() will survive a crash until the directory is fsync()ed. So, what is one going to do if the unlink succeeded, but the fsync didn't? Well, apparently, one is going to PANIC and reinitialize the system. I presume that upon reinitialization we'll decide that the slot is gone, and thus won't recreate it in shared memory. Of course, if the entire system suffers a hard power failure after that and before the directory is successfully fsync'd, then the slot could reappear on the next startup. Which is also exactly what would happen if we removed the slot from shared memory after doing the unlink, and then the system suffered a hard power failure before the directory contents made it to disk. Except that we also panicked. In the case of shared buffers, the way we handle fsync failures is by not allowing the system to checkpoint until all of the fsyncs succeed. If there's an OS-level reset before that happens, WAL replay will perform the same buffer modifications over again and the next checkpoint will again try to flush them to disk and will not complete unless it does. That forms a closed system where we never advance the redo pointer over the covering WAL record until the changes it covers are on the disk. But I don't think this code has any similar interlock; if it does, I missed it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
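Andres's durability point can be sketched directly: an unlink() is not guaranteed to survive a crash until the containing directory itself has been fsync()ed. A minimal sketch, with illustrative paths (not the slot code itself):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Helper for the demo below: create an empty file to remove. */
static int
create_empty(const char *path)
{
	int			fd = open(path, O_WRONLY | O_CREAT, 0600);

	return (fd < 0) ? -1 : close(fd);
}

/* Remove a file and persist the removal: the unlink() only becomes
 * crash-durable once the directory entry change is fsync()ed. */
static int
durable_unlink(const char *path, const char *dir)
{
	int			dirfd;

	if (unlink(path) != 0)
		return -1;
	dirfd = open(dir, O_RDONLY);	/* a directory fd can be fsync()ed */
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)			/* flush the removed directory entry */
	{
		close(dirfd);
		return -1;
	}
	return close(dirfd);
}
```

The failure mode in the thread sits between the two steps: if the process dies after `unlink()` but before the directory fsync, a power failure can resurrect the file on the next boot.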
Re: [HACKERS] proposal: hide application_name from other users
On 01/21/2014 05:22 PM, Mark Kirkwood wrote: If said malicious attacker can log into postgres and issue its own queries, and connect to other database then you are in serious trouble already. I also wonder that if such an attacker knows the application name, that would suggest that they have access to the application server and are able to read its config files...which would probably also contain the host and db name too (and possibly the password in some unfortunate cases)! Common case: Multitenant shared hosting on a public cloud. 1. attacker writes a tool which exploits VulnerableApplication and takes it over. 2. they exploit SiteX, running that web app. 3. using SiteX's database credentials, they check pg_stat_activity and see what other hosts are running VulnerableApplication. 4. They then infect the other hosts running VulnerableApplication. Alternately: 4. They use VulnerableApplication's predictable password-generating flaw to log into the other databases, or to try the default password which ships with the app. However, thinking about the case above, there are a number of caveats and workarounds which make the above not that interesting of an exploit case: A. it's easy to find VulnerableApplication simply by scanning the web. Easier, in fact, than the above, if you have an IP block to start with, and you would. B. Most applications don't actually set application-specific application names anyway (psycopg2, libpq). C. It would be trivially easy for a DBA concerned about security to obfuscate application names in a way which would not be easy for an attacker to analyze. D. Don't use default passwords. Also, the attacker merely needs to try each database in turn anyway. 
Given the above, I think this specific patch falls into the broad class of things we would like to have in a multi-tenant toolkit (and is far from the most useful of those), which would include: * hiding application_name, user_name, and database_names from users of other databases * local superuser who can create per-database users and extensions from an approved list * ability to block users from changing some resource limits (work_mem, for example). * per-database logging (could be done with log hooks, just needs a contrib). It seems to me that it's not terribly useful to fix one item on the above list without having at least a *plan* to address the others. This really needs to be part of a comprehensive system, not piecework, or we'll end up with a bunch of little security options which don't work together. Probably Heroku has some more specific exploit case to be concerned about here; if so, might I suggest taking it up with the -security list? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan p...@heroku.com wrote: Actually, I think the whole thing is rather badly designed, as warm cache or no you're still proposing to do thousands of kernel calls while holding a potentially contended LWLock. I suggest what we do is (1) read in the whole file, (2) acquire the lock, (3) re-read the whole file in the same buffer, (4) work from the buffer. fseeko() and fread() are part of the standard library, not any kernel. I don't believe that what I've done in qtext_fetch() implies thousands of kernel calls, or thousands of context switches. I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. It seems that glibc is actually content to always make lseek() and read() syscalls in the event of an fseek(), even though it does maintain its own buffer and could in principle have a lot more gumption about that [1]. I agree that we should copy everything into a buffer in one go. Yeah, that was what I thought might happen. Even if glibc were smarter, a lot of other libc implementations aren't. Also, ISTM that traversal of the hash table will generally lead to more-or-less-random access into the text file. So unless you assume that libc has mmap'd the whole file, it'd have to do a lot of kernel calls anyway to get different pages of the file into its buffer at different times. I think that read the whole file is probably going to net out not any more complex as far as our code goes, and it'll be making a lot fewer assumptions about how smart the libc level is and what's happening at the kernel level. I'll have a go at coding it, anyway. I think that making the LWLocking duration as brief as possible isn't critically important. Maybe, but I don't like the idea that calling pg_stat_statements() frequently could result in random glitches in response time for other queries. 
Automated tools will only be interested in new query texts (implying possible I/O while share locking) as they observe new entries. This argument would be more plausible if there were a way to fetch the text for a previously-observed entry without invoking pg_stat_statements(). I'm surprised you've not submitted a patch to add such a function. regards, tom lane
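Tom's "read the whole file" suggestion can be sketched simply: slurp the query-text file into one buffer before taking the lock (and re-read it into the same buffer after acquiring it), so no per-entry fseeko()/fread() syscalls happen while the LWLock is held. A minimal sketch with error handling trimmed; names and the `/tmp` path are illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read an entire file into one malloc'd, NUL-terminated buffer.
 * Hash-table traversal can then index into the buffer with plain
 * pointer arithmetic instead of seek-and-read per entry. */
static char *
slurp_file(const char *path, long *len)
{
	FILE	   *fp = fopen(path, "rb");
	char	   *buf = NULL;

	if (fp == NULL)
		return NULL;
	if (fseek(fp, 0, SEEK_END) == 0 && (*len = ftell(fp)) >= 0)
	{
		rewind(fp);
		buf = malloc(*len + 1);
		if (fread(buf, 1, *len, fp) == (size_t) *len)
			buf[*len] = '\0';
		else
		{
			free(buf);
			buf = NULL;
		}
	}
	fclose(fp);
	return buf;
}

/* Demo helper: write a file, slurp it back, compare. */
static int
roundtrip_ok(const char *path, const char *data)
{
	FILE	   *fp = fopen(path, "wb");
	long		len;
	char	   *buf;
	int			ok;

	if (fp == NULL || fputs(data, fp) < 0 || fclose(fp) != 0)
		return 0;
	buf = slurp_file(path, &len);
	ok = (buf != NULL && strcmp(buf, data) == 0);
	free(buf);
	return ok;
}
```

This also sidesteps the libc-buffering question entirely: one sequential read replaces the random-access pattern the hash-table order would otherwise impose.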
Re: [HACKERS] Incorrectly reporting config errors
Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kgri...@ymail.com writes: My preference would be to not generate noise for interim states; just report net changes. Yeah. Is it worth explicitly detecting and dropping redundant assignments to the same variable? A naive check for that would be O(N^2) in the number of entries in the conf file, but perhaps that's still cheap enough in practice. This would mean for example that shared_buffers = 'oops' shared_buffers = '128MB' would not draw an error, which doesn't bother me but might bother somebody. It doesn't bother me any. And don't say that a file contains errors when we mean those options are ignored on reload; they will only take effect on restart. I'm not happy about complicating that logic even more. I think the reasonable choices here are to reword that message somehow, or just drop it completely. I agree. No strong preference which. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
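The "drop redundant assignments" idea Tom describes can be sketched as a toy (not the GUC code): an entry is reported only if no later line assigns the same name, so `shared_buffers = 'oops'` followed by `shared_buffers = '128MB'` draws no error. The naive scan is O(N^2) in the number of entries, exactly as noted; names and values are illustrative.

```c
#include <assert.h>
#include <string.h>

typedef struct ConfEntry
{
	const char *name;
	const char *value;
} ConfEntry;

/* An entry is "effective" (and worth checking/reporting) only if no
 * later line assigns the same variable. */
static int
is_effective(const ConfEntry *list, int n, int i)
{
	int			j;

	for (j = i + 1; j < n; j++)
		if (strcmp(list[j].name, list[i].name) == 0)
			return 0;			/* overridden by a later assignment */
	return 1;
}

/* Demo: only the last shared_buffers line survives. */
static int
demo_last_wins(void)
{
	static const ConfEntry conf[] = {
		{"shared_buffers", "oops"},
		{"work_mem", "4MB"},
		{"shared_buffers", "128MB"},
	};

	return !is_effective(conf, 3, 0) && is_effective(conf, 3, 2);
}
```

For a few hundred config lines the quadratic scan is negligible, which is presumably why Tom considers it "still cheap enough in practice".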
Re: [HACKERS] proposal: hide application_name from other users
Stephen Frost sfr...@snowman.net wrote: Harold Giménez (har...@heroku.com) wrote: This is just an example of how running monitoring as superuser is not necessarily the worst thing, and there are other reasons to do it already. It's a horrible thing and that isn't a good reason- if my database isn't accepting connections, I probably don't care one bit how bloated a table is. Indeed, I care *more* that I'm out of connections and would want to know that ASAP. +1 -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi, attached you will find a new version of the patch containing more comments. On 22/01/14 12:00, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: This ngettext() call is repeated four times in the new code, which is a bit annoying because it's not trivial. I think you could assign the ngettext() to a char * at the bottom of the loop, and then in the ereport() calls use it: Would that not break the compiler's ability to verify the format codes in the string? Not to mention make it harder for people to compare format to arguments, too? I agree. However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. Fixed in attached patch. I changed it from calling errorcontext(ngettext()) to calling errdetail_plural(). Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 122afb2..552c5a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		 */
 		if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 		{
-			StringInfoData buf;
+			StringInfoData buf,
+						lock_waiters_sbuf,
+						lock_holders_sbuf;
 			const char *modename;
 			long		secs;
 			int			usecs;
 			long		msecs;
+			SHM_QUEUE  *procLocks;
+			PROCLOCK   *proclock;
+			bool		first_holder = true,
+						first_waiter = true;
+			int			lockHoldersNum = 0;
 
 			initStringInfo(&buf);
+			initStringInfo(&lock_waiters_sbuf);
+			initStringInfo(&lock_holders_sbuf);
+
 			DescribeLockTag(&buf, &locallock->tag.lock);
 			modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
 									   lockmode);
@@ -1211,10 +1221,67 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			msecs = secs * 1000 + usecs / 1000;
 			usecs = usecs % 1000;
 
+			LWLockAcquire(partitionLock, LW_SHARED);
+
+			/*
+			 * we loop over the lock's procLocks to gather a list of all
+			 * holders and waiters. Thus we will be able to provide more
+			 * detailed information for lock debugging purposes.
+			 *
+			 * lock->procLocks contains all processes which hold or wait for
+			 * this lock.
+			 */
+
+			procLocks = &(lock->procLocks);
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+												 offsetof(PROCLOCK, lockLink));
+
+			while (proclock)
+			{
+				/*
+				 * we are a waiter if myProc->waitProcLock == proclock; we are
+				 * a holder if it is NULL or something different
+				 */
+				if (proclock->tag.myProc->waitProcLock == proclock)
+				{
+					if (first_waiter)
+					{
+						appendStringInfo(&lock_waiters_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_waiter = false;
+					}
+					else
+						appendStringInfo(&lock_waiters_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+				}
+				else
+				{
+					if (first_holder)
+					{
+						appendStringInfo(&lock_holders_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_holder = false;
+					}
+					else
+						appendStringInfo(&lock_holders_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+
+					lockHoldersNum++;
+				}
+
+				proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+													 offsetof(PROCLOCK, lockLink));
+			}
+
+			LWLockRelease(partitionLock);
+
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
 						(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			else if (deadlock_state == DS_HARD_DEADLOCK)
 			{
 				/*
@@ -1226,13 +1293,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 				 */
 				ereport(LOG,
 						(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			}
 
 			if (myWaitStatus == STATUS_WAITING)
 				ereport(LOG,
 						(errmsg("process %d still waiting for %s on %s after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_plural("process owning lock: %s request queue: %s",
+										   "processes owning lock: %s request queue: %s",
+										   lockHoldersNum,
Re: [HACKERS] plpgsql.consistent_into
On 1/15/14, 12:35 AM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: Do we actually support = right now? We already support v_field := field FROM table ... ; and I think it's a bad idea to have different meaning for = and :=. That ship sailed a *very* long time ago. See other thread about documenting rather than ignoring this more-or-less-aboriginal behavior of plpgsql. Yeah, I had no idea that was even supported... I have no idea if this is related or not, but I would REALLY like for this to work (doesn't in 8.4, AFAIK not in 9.1 either...) Hm ... too tired to be sure, but I think the issue about inlining a function of this kind has to do with whether you get the same answers in corner cases such as subselect fetching no rows. There was some discussion about this a few years ago and I think that was essentially the issue. What I think would work is essentially a macro that would dump the function definition right into the query and then let the planner deal with it. So SELECT blah, ( SELECT status_code FROM status_code WHERE status_code_id = blah_status_code_id ) FROM blah; can become simply SELECT blah, status_code__get_text( blah_status_code_id ) FROM blah; but have it translate to the same raw SQL, same as views. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Christian Kruse christ...@2ndquadrant.com writes: However, the real problem here is that you shouldn't be calling ngettext manually in an ereport context in the first place. There is infrastructure in place for that, and this isn't using it. Fixed in attached patch. I changed it from calling errorcontext(ngettext()) to calling errdetail_plural(). Well, is it context or detail? Those fields have reasonably well defined meanings IMO. If we need errcontext_plural, let's add it, not adopt inferior solutions just because that isn't there for lack of previous need. But having said that, I think this is indeed detail not context. (I kinda wonder whether some of the stuff that's now in the primary message shouldn't be pushed to errdetail as well. It looks like some previous patches in this area have been lazy.) While I'm griping, this message isn't even trying to follow the project's message style guidelines. Detail or context messages are supposed to be complete sentence(s), with capitalization and punctuation to match. Lastly, is this information that we want to be shipping to clients? Perhaps from a security standpoint that's not such a wise idea, and errdetail_log() is what should be used. regards, tom lane
Re: [HACKERS] proposal: hide application_name from other users
On Tue, Jan 21, 2014 at 7:38 PM, Stephen Frost sfr...@snowman.net wrote: * Harold Giménez (har...@heroku.com) wrote: Definitely agree with you. This is just an example of how running monitoring as superuser is not necessarily the worst thing, and there are other reasons to do it already. It's a horrible thing and that isn't a good reason- if my database isn't accepting connections, I probably don't care one bit how bloated a table is. Indeed, I care *more* that I'm out of connections and would want to know that ASAP. This is a silly argument. When things are failing that's precisely when you need your monitoring system to work. Of course you are interested in the fact that you're out of connections but you're trying to figure out *why* you're out of connections. Maybe you have queries consuming all iops, maybe disks have filled up and that's cascaded into a postgres problem, maybe you have lots of similar queries running, etc. The first step in debugging is always gathering data which is why it's so frustrating if the first thing to go down is your monitoring software. -- greg
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: On Fri 17-01-14 08:57:25, Robert Haas wrote: On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote: So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below. Also, is the WAL actually ever read under normal (non-recovery) conditions or is it write-only under normal operation? If it's seldom read, then using DIO for them also avoids some double buffering since they wouldn't go through pagecache. This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink. OK, yes, this is hard to fix with direct IO. Actually, it's not. Block level caching is the time-honoured answer to this problem, and it's been used very successfully on a large scale by many organisations. e.g. facebook with MySQL, O_DIRECT, XFS and flashcache sitting on an SSD in front of rotating storage. There's multiple choices for this now - bcache, dm-cache, flahscache, etc, and they all solve this same problem. And in many cases do it better than using the page cache because you can independently scale the size of the block level cache... And given the size of SSDs these days, being able to put half a TB of flash cache in front of spinning disks is a pretty inexpensive way of solving such IO problems If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. 
But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime ctime of files will be changed often, the will be a case very often. Therefore: O_DATASYNC. Maybe it'll be useful to have hints that say always write this file to disk as quick as you can and always postpone writing this file to disk for as long as you can for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. And I don't see it being implemented anywhere else because it's the priority aware scheduling infrastructure in CFQ that causes all the problems with IO concurrency and scalability... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed 22-01-14 09:07:19, Dave Chinner wrote: On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime ctime of files will be changed often, the will be a case very often. Therefore: O_DATASYNC. O_DSYNC to be exact. Maybe it'll be useful to have hints that say always write this file to disk as quick as you can and always postpone writing this file to disk for as long as you can for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoint might be actually the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with CFQ IO scheduler. And I don't see it being implemented anywhere else because it's the priority aware scheduling infrastructure in CFQ that causes all the problems with IO concurrency and scalability... 
So CFQ has all sorts of problems but I never had the impression that priority aware scheduling is the culprit. It is all just complex - sync idling, seeky writer detection, cooperating threads detection, sometimes even sync vs async distinction isn't exactly what one would want. And I'm not speaking about the cgroup stuff... So it doesn't seem to me that some other IO scheduler couldn't reasonably efficiently implement stuff like IO scheduling classes. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Fri 17-01-14 08:57:25, Robert Haas wrote: On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote: So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below. Also, is the WAL actually ever read under normal (non-recovery) conditions or is it write-only under normal operation? If it's seldom read, then using DIO for them also avoids some double buffering since they wouldn't go through pagecache. This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink. OK, yes, this is hard to fix with direct IO. However, if you configure wal_sync_method=open_sync and disable replication, then you will in fact get O_DIRECT|O_SYNC behavior. But that still doesn't work out very well, because now the guy who does the write() has to wait for it to finish before he can do anything else. That's not always what we want, because WAL gets written out from our internal buffers for multiple different reasons. Well, you can always use AIO (io_submit) to submit direct IO without waiting for it to finish. But then you might need to track the outstanding IO so that you can watch with io_getevents() when it is finished. If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. 
Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime & ctime of files will be changed often, that will be the case very often. Again, I think this discussion would really benefit from an outline of the different files used by pgsql, and what sort of data access patterns you expect with them. I think I more or less did that in my previous email, but here it is again in briefer form:
- WAL files are written (and sometimes read) sequentially and fsync'd very frequently, and it's always good to write the data out to disk as soon as possible
- Temp files are written and read sequentially and never fsync'd. They should only be written to disk when memory pressure demands it (but are a good candidate when that situation comes up)
- Data files are read and written randomly. They are fsync'd at checkpoint time; between checkpoints, it's best not to write them sooner than necessary, but when the checkpoint arrives, they all need to get out to the disk without bringing the system to a standstill
We have other kinds of files, but off-hand I'm not thinking of any that are really very interesting, apart from those. Maybe it'll be useful to have hints that say "always write this file to disk as quickly as you can" and "always postpone writing this file to disk for as long as you can" for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it.
We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. As I wrote in some other email in this thread, using IO priorities for data file checkpoints might actually be the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with the CFQ IO scheduler. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On 22/01/14 03:16, Jon Nelson wrote: Greetings -hackers: I have worked up a patch to PostgreSQL which elides tuples during an external sort. The primary use case is when sorted input is being used to feed a DISTINCT operation. The idea is to throw out tuples that compare as identical whenever it's convenient, predicated on the assumption that even a single I/O is more expensive than some number of (potentially extra) comparisons. Obviously, this is where a cost model comes in, which has not been implemented. This patch is a work-in-progress. Dedup-in-sort is also done by my WIP internal merge sort, and extended (in much the same ways as Jon's) to the external merge. https://github.com/j47996/pgsql_sorb I've not done a cost model either, but the dedup capability is exposed from tuplesort.c to the executor, and downstream uniq nodes removed. I've not worked out yet how to eliminate upstream hashagg nodes, which would be worthwhile from testing results. -- Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: A rough summary of the patch follows:
- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of distinct columns is equal to the number of sort columns (and both are greater than zero)
- in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply throw out the new tuple if it compares as identical to the tuple at the top of the heap. Since we're already performing this comparison, this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical to the *last written tuple*. This is a comparison that did not take place before, so it's not free, but it saves a write I/O.
  + we perform the same logic in dumptuples
[ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. -- Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: I ran an strace on a pg_stat_statements backend with a full ~5,000 entries. BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Jeremy Harris j...@wizmail.org writes: On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: - in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first [ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. Hm, OK, maybe I misread what he said there. Still, if we've told tuplesort to remove duplicates, why shouldn't we expect it to have done the job? Passing the data through a useless Unique step is not especially cheap. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... I simply ran the standard regression tests, and then straced a backend as it executed the pgss-calling query. I'm not sure that I've understood your question. As previously noted, my approach to testing this patch involved variations of:
- running make installcheck-parallel
- concurrently, running pgbench with multiple clients each calling pg_stat_statements() and pg_stat_statements_reset()
The latter was sometimes rigged to do a direct garbage collection to stress test things. My expectation was that the sanity checking done everywhere would complain if there were any race conditions or other bugs. This was pretty effective as a smoke test during development. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
Peter Geoghegan p...@heroku.com writes: On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, have you got any sort of test scenario you could share for this purpose? I'm sure I could build something, but if you've already done it ... I simply ran the standard regression tests, and then straced a backend as it executed the pgss-calling query. I'm not sure that I've understood your question. Hm, are there 5K distinct queries in the regression tests? I'd have thought they weren't enough to fill a large pgss hash. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeremy Harris j...@wizmail.org writes: On 22/01/14 03:53, Tom Lane wrote: Jon Nelson jnelson+pg...@jamponi.net writes: - in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first [ raised eyebrow ... ] And what happens if the planner drops the unique step and then the sort doesn't actually go to disk? I don't think Jon was suggesting that the planner drop the unique step. Hm, OK, maybe I misread what he said there. Still, if we've told tuplesort to remove duplicates, why shouldn't we expect it to have done the job? Passing the data through a useless Unique step is not especially cheap. That's correct - I do not propose to drop the unique step. Duplicates are only dropped if it's convenient to do so. In one case, it's a zero-cost drop (no extra comparison is made). In most other cases, an extra comparison is made, typically right before writing a tuple to tape. If it compares as identical to the previously-written tuple, it's thrown out instead of being written. The output of the modified code is still sorted, still *might* (and in most cases, probably will) contain duplicates, but will (probably) contain fewer duplicates. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Wed, Jan 22, 2014 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, are there 5K distinct queries in the regression tests? I'd have thought they weren't enough to fill a large pgss hash. Well, yes. They're a really bad case for pg_stat_statements, in that almost all distinct queries are only executed once or twice and yet there are enough of them. I think this is generally useful for testing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote: If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since mtime & ctime of files will be changed often, that will be the case very often. Also, there is the issue of writes that don't need syncing being synced because sync is set on the file descriptor. Here is output from our pg_test_fsync tool when run on an SSD with a BBU:

$ pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                         n/a
        fdatasync                8424.785 ops/sec     119 usecs/op
        fsync                    7127.072 ops/sec     140 usecs/op
        fsync_writethrough                    n/a
        open_sync               10548.469 ops/sec      95 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                         n/a
        fdatasync                4367.375 ops/sec     229 usecs/op
        fsync                    4427.761 ops/sec     226 usecs/op
        fsync_writethrough                    n/a
        open_sync                4303.564 ops/sec     232 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write open_sync sizes.)
         1 * 16kB open_sync write        4938.711 ops/sec     202 usecs/op
         2 *  8kB open_sync writes       4233.897 ops/sec     236 usecs/op
         4 *  4kB open_sync writes       2904.710 ops/sec     344 usecs/op
         8 *  2kB open_sync writes       1736.720 ops/sec     576 usecs/op
        16 *  1kB open_sync writes        935.917 ops/sec    1068 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        write, fsync, close      7626.783 ops/sec     131 usecs/op
        write, close, fsync      6492.697 ops/sec     154 usecs/op

Non-Sync'ed 8kB writes:
        write                  351517.178 ops/sec       3 usecs/op

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 14:25, Simon Riggs si...@2ndquadrant.com wrote: On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 21, 2014 at 3:20 AM, Florian Pflug f...@phlo.org wrote: On Jan20, 2014, at 08:42 , David Rowley dgrowle...@gmail.com wrote: On Mon, Jan 20, 2014 at 2:45 PM, Florian Pflug f...@phlo.org wrote: * I've also renamed INVFUNC to INVSFUNC. That's a pretty invasive change, and it's the last commit, so if you object to that, then you can merge up to eafa72330f23f7c970019156fcc26b18dd55be27 instead of de3d9148be9732c4870b76af96c309eaf1d613d7. Seems like sfunc really should be tfunc; then we could have invtfunc. I'd probably understand this better if I knew what the 's' was for in sfunc. I've not applied this just yet. Do you have a reason why you think it's better? My issue with just invfunc is mainly that it's too generic - it doesn't tell you what it's supposed to be the inverse of. I've always assumed that 's' in 'sfunc' and 'stype' stands for 'state', and that the naming is inspired by control theory, where the function which acts on the state space is often called S. Ok, that makes more sense now and it seems like a reasonable idea. I'm not quite sure yet, as when someone said upthread that these "negative transition functions", as I was calling them at the time, should really be called "inverse transition functions", I then posted that I was going to call the CREATE AGGREGATE option invfunc, which nobody seemed to object to. I just don't want to go and change that now. It is very possible this will come up again when the committer is looking at the patch. It would be a waste if it ended up back at invfunc after we changed it to invsfunc. Regards David Rowley
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Wed, Jan 22, 2014 at 12:46 AM, Florian Pflug f...@phlo.org wrote: On Jan21, 2014, at 10:53 , David Rowley dgrowle...@gmail.com wrote: It came to me that it might be possible to implement inverse transitions for floating point aggregates by just detecting if precision has been lost during forward transitions. I've written the test to do this as:
IF state.value + value = state.value AND value <> 0 THEN
    newstate.precision_lost := true;
    newstate.value := state.value;
ELSE
    newstate.precision_lost := false;
    newstate.value := state.value + value;
END IF;
The inverse transition function checks precision_lost and if it's true it returns NULL. The core code is now implemented (thanks to Florian) to re-aggregate when NULL is returned from the inverse transition function. That's not sufficient, I fear. You can lose all significant digits of the value and still have precision_lost = false afterwards. Try summing over 1e16, 1.01. SELECT 1e16::float8 + 1.01::float8 = 1e16::float8 returns FALSE, yet SELECT 1e16::float8 + 1.01::float8 - 1e16::float8 returns 2 where 1.01 would have been correct. That's still too much precision loss. I'm quite certain that the general idea has merit, but the actual invertibility condition is going to be more involved. If you want to play with this, I think the first step has to be to find a set of guarantees that SUM(float) is supposed to meet. Currently, SUM(float) guarantees that if the summands all have the same sign, the error is bound by C * N, where C is some (machine-specific?) constant (about 1e-15 or so), and N is the number of input rows. Or at least so I think from looking at SUMs over floats in descending order, which I guess is the worst case. You could, for example, try to see if you can find an invertibility condition which guarantees the same, but allows C to be larger. That would put a bound on the number of digits lost by the new SUM(float) compared to the old one.
I guess if the case is that normal (non-window) sum(float) results are undefined unless you add an order by to the aggregate then I guess there is no possible logic to put in for inverse transitions that will make them behave the same as an undefined behaviour. I don't have high hopes for this getting into 9.4, though. Yeah I'm going to drop it. I need to focus on documents and performance testing. If it seems sound enough, then I may implement it in C to see how much overhead it adds to forward aggregation for floating point types, but even if it did add too much overhead to forward aggregation it might be worth allowing aggregates to have 2 forward transition functions and if the 2nd one exists then it could be used in windowing functions where the frame does not have unbounded following. I don't think adding yet another type of aggregation function is the solution here. Why not? I thought, if in the cases where we need to burden the forward transition functions with extra code to make inverse transitions possible, then why not have an extra transition function so that it does not slow down normal aggregation? My testing of sum(numeric) last night does not show any slow down by that extra dscale tracking code that was added, but if it did then I'd be thinking seriously about 2 forward transition functions to get around the problem. I don't understand what would be wrong with that. The core code could just make use of the 2nd function if it existed and inverse transitions were thought to be required, i.e. in a window context and does not have UNBOUNDED PRECEDING. Regards David Rowley
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan23, 2014, at 01:17 , David Rowley dgrowle...@gmail.com wrote: On Wed, Jan 22, 2014 at 12:46 AM, Florian Pflug f...@phlo.org wrote: If you want to play with this, I think the first step has to be to find a set of guarantees that SUM(float) is supposed to meet. Currently, SUM(float) guarantees that if the summands all have the same sign, the error is bound by C * N, where C is some (machine-specific?) constant (about 1e-15 or so), and N is the number of input rows. Or at least so I think from looking at SUMs over floats in descending order, which I guess is the worst case. You could, for example, try to see if you can find a invertibility conditions which guarantees the same, but allows C to be larger. That would put a bound on the number of digits lost by the new SUM(float) compared to the old one. I guess if the case is that normal (non-window) sum(float) results are undefined unless you add an order by to the aggregate then I guess there is no possible logic to put in for inverse transitions that will make them behave the same as an undefined behaviour. Actually, if sum(float) was really undefined, it'd be *very* easy to provide an inverse transition function - just make it a no-op. Heck, you could then even make the forward transition function a no-op, since the very definition of undefined behaviour is result can be anything, including setting your house on fire. The point is, it's *not* undefined - it's just imprecise. And the imprecision can even be quantified, it just depends on the number of input rows (the equal-sign requirement is mostly there to make imprecision equivalent to relative error). 
If it seems sound enough, then I may implement it in C to see how much overhead it adds to forward aggregation for floating point types, but even if it did add too much overhead to forward aggregation it might be worth allowing aggregates to have 2 forward transition functions and if the 2nd one exists then it could be used in windowing functions where the frame does not have unbounded following. I don't think adding yet another type of aggregation function is the solution here. Why not? I thought, if in the cases where we need to burden the forward transition functions with extra code to make inverse transitions possible, then why not have an extra transition function so that it does not slow down normal aggregation? My testing of sum(numeric) last night does not show any slow down by that extra dscale tracking code that was added, but if it did then I'd be thinking seriously about 2 forward transition functions to get around the problem. I don't understand what would be wrong with that. The core code could just make use of the 2nd function if it existed and inverse transitions were thought to be required, i.e. in a window context and does not have UNBOUNDED PRECEDING. First, every additional function increases the maintenance burden, and at some point the expected benefit simply isn't worth it. IMHO at least. Secondly, you'd really need a second aggregate definition - usually, the non-invertible aggregate will get away with a much smaller state representation. Aggregates with state type internal could hack their way around that by simply not using the additional parts of their state structure, but e.g. plpgsql aggregates would still incur overhead due to repeated copying of an unnecessarily large state value. If it's possible to represent the forward-only but not the invertible state with a copy-by-value type, that overhead could easily be a large part of the total overhead you paid for having an inverse transition function.
And lastly, we could achieve much of the benefit of a second transition function by simply telling the forward transition function whether to expect inverse transition function calls. We already expose things like the aggregation memory context to C-language transition functions, so the basic mechanism is already there. In fact, I pondered whether to do this - but then figured I'd leave it to whoever needs that facility first. Though it wouldn't be much code - basically, setting a flag in the WindowAggState, and exporting a function to be used by aggregates to read it, similar to what AggCheckCallContext does. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/17/14, 7:57 AM, Robert Haas wrote: - WAL files are written (and sometimes read) sequentially and fsync'd very frequently and it's always good to write the data out to disk as soon as possible - Temp files are written and read sequentially and never fsync'd. They should only be written to disk when memory pressure demands it (but are a good candidate when that situation comes up) - Data files are read and written randomly. They are fsync'd at checkpoint time; between checkpoints, it's best not to write them sooner than necessary, but when the checkpoint arrives, they all need to get out to the disk without bringing the system to a standstill For sake of completeness... there are also data files that are temporary and don't need to be written to disk unless the kernel thinks there's better things to use that memory for. AFAIK those files are never fsync'd. In other words, these are the same as the temp files Robert describes except they also have random access. Dunno if that matters. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: hide application_name from other users
On Wed, Jan 22, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: Probably Heroku has some more specific exploit case to be concerned about here; if so, might I suggest taking it up with the -security list? I don't think there's a specific vulnerability that needs to be kept secret here. Here's an example. I just created a new hobby database which is on a multi-tenant cluster and ran select * from pg_stat_activity. Here are two of the more interesting examples:

463752 | de5nmf0gbii3u5 | 32250 | 463751 | qspfkgrwgqtbcu | unicorn worker[1] -p 30390 -c ./config/unicorn.rb | | | | | | | | insufficient privilege
463752 | de5nmf0gbii3u5 | 32244 | 463751 | qspfkgrwgqtbcu | unicorn worker[0] -p 30390 -c ./config/unicorn.rb | | | | | | | | insufficient privilege

Note that the contents of the ARGV array are being set by the unicorn task queuing library. It knows it's making this information visible to other users with shell access on this machine. But the decision to stuff the ARGV information into the application_name is being made by the Pg driver. Neither is under the control of the application author, who may not even be aware this is happening. Neither component has the complete information to make a competent decision about whether this information is safe to be in application_name or not. Note that the query is showing as insufficient privilege even though it is listed in the ps output -- the same ps output that is listing the unicorn ARGV that is being shown in the application_name. You might say that the Pg gem is at fault for making a blanket policy decision for applications that the ARGV is safe to show to other database users, but realistically it's so useful to see this information for your own connections that it's probably the right decision. Without it it's awfully hard to tell which worker is on which connection. It would just be nice to be able to treat application_name the same as query.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
My review:

Clearly, everyone likes this feature. I'm tempted to think it should be mandatory to specify this option in plain mode when tablespaces are present. Otherwise, creating a base backup is liable to create random files all over the place. Obviously, there would be backward compatibility concerns.

I'm not totally happy with the choice of : as the mapping separator, because that would always require escaping on Windows. We could make it analogous to the path handling and make ; the separator on Windows. Then again, this is not a path, so maybe it shouldn't look like one. We could pick something else altogether, like =.

The option name --tablespace isn't very clear. It ought to be something like --tablespace-mapping.

I don't think we should require the new directory to be an absolute path. It should be relative to the current directory, just like the -D option does it.

I'm not so worried about the atooid() and linked-list duplication. That can be addressed at some later point.

I would try to write this patch without using MAXPGPATH. I know existing code uses it, but we should try to use it less, because it overallocates memory and requires handling additional error conditions. For example, you catch overflow in tablespace_list_append() but later report that as invalid tablespace format. We now have psprintf() to make coding with dynamic memory allocation easier.

It looks like when you ignore an escaped : as a separator, you don't actually unescape it for use as a directory.

OLDDIR and NEWDIR should be capitalized in the help message.

Somehow, I had the subconscious assumption that this feature would do prefix matching on the directories, not only complete string matching. So if I had tablespaces in /mnt/data1, /mnt/data2, /mnt/data3, I could map them all with -T /mnt:mnt and be done. Not sure if that is useful to many, but it's worth a thought.
Review style guide for error messages: http://www.postgresql.org/docs/devel/static/error-style-guide.html

We need to think about how to handle this on platforms without symlinks. I don't like just printing an error message and moving on. It should be either pass or fail, or an option to choose between them.

Please put the options in the getopt handling in alphabetical order. It's not very alphabetical now, but between D and F is probably not the best place. ;-) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/17/14, 2:24 PM, Gregory Smith wrote: I am skeptical that the database will take over very much of this work and perform better than the Linux kernel does. My take is that our most useful role would be providing test cases kernel developers can add to a performance regression suite. Ugly "we never thought that would happen" situations seem to be at the root of many of the kernel performance regressions people here get nailed by. FWIW, there are some scenarios where we could potentially provide additional info to the kernel scheduler; stuff that we know but that it never will. For example, if we have a LIMIT clause we can (sometimes) provide a rough estimate of how many pages we'll need to read from a relation. Probably more useful is the case of index scans; if we pre-read more data from the index we could hand the kernel a list of base relation blocks that we know we'll need. There are some other things that have been mentioned, such as cases where files will only be accessed sequentially. Outside of that though, the kernel is going to be in a far better position to schedule I/O than we will ever be. Not only does it understand the underlying hardware, it can also see everything else that's going on. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/19/14, 5:51 PM, Dave Chinner wrote: Postgres is far from being the only application that wants this; many people resort to tmpfs because of this: https://lwn.net/Articles/499410/ Yes, we covered the possibility of using tmpfs much earlier in the thread, and came to the conclusion that temp files can be larger than memory, so tmpfs isn't the solution here. :) Although... instead of inventing new APIs and foisting this work onto applications, perhaps it would be better to modify tmpfs so that it can handle a temp space that's larger than memory... possibly backing it with X amount of real disk and allowing it/the kernel to decide when to passively move files out of the in-memory tmpfs and onto disk. Of course that's theoretically what swapping is supposed to do, but if that's not up to the job... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Wed, Jan 22, 2014 at 5:28 PM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. Well, that's a pretty easy theory to test. Just stop calling them (and do something similar to what we do for current counter fields instead) and see how much difference it makes. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On Wed, Jan 22, 2014 at 1:44 PM, Yugo Nagata nag...@sraoss.co.jp wrote: On Wed, 22 Jan 2014 20:04:12 +0900 (JST) Tatsuo Ishii is...@postgresql.org wrote: parseTypeString() is called by some other functions, and I avoided influencing them by modifying its definition, since this should raise errors in most cases. This is the same reason as for the other *MissingOk functions in parse_type.c. Is it better to rewrite the definitions of these functions and all their callers? Yes, for parseTypeString certainly. There have been many refactorings like that in the past and all of them use this pattern. typenameTypeIdAndMod is less clear since the code paths differ so much; maybe keep 2 versions (merging back to 1 function is OK too, but in any case you don't need 3). typenameTypeIdAndModMissingOk(...) { Type tup = LookupTypeName(pstate, typeName, typmod_p); if (tup == NULL || !((Form_pg_type) GETSTRUCT(tup))->typisdefined) *typeid_p = InvalidOid; else *typeid_p = HeapTupleGetOid(tup); if (tup) ReleaseSysCache(tup); } typenameTypeIdAndMod(...) { Type tup = typenameType(pstate, typeName, typmod_p); *typeid_p = HeapTupleGetOid(tup); ReleaseSysCache(tup); } Also, there's no need for an else here: if (raiseError) ereport(ERROR, ...); else return InvalidOid; Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 22, 2014 at 10:08 PM, Jim Nasby j...@nasby.net wrote: Probably more useful is the case of index scans; if we pre-read more data from the index we could hand the kernel a list of base relation blocks that we know we'll need. Actually, I've already tried this. The most important part is fetching heap pages, not index pages. Tried that too. Currently, fadvising those pages works *to the detriment* of physically correlated scans. That's a kernel bug I've reported to LKML, and I could probably come up with a patch. I've just never had time to set up the testing machinery to test the patch myself. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
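For readers following along, the kind of hint being discussed here is expressible today via posix_fadvise(). A minimal, illustrative sketch (not the code being debated, and block numbers are invented) of advising the kernel about a batch of heap blocks might look like:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Illustrative only: hint the kernel about a batch of relation blocks we
 * expect to need soon, e.g. after reading ahead in an index.  This is the
 * same interface (POSIX_FADV_WILLNEED) that PostgreSQL's existing
 * effective_io_concurrency prefetching uses for bitmap heap scans. */
static void
prefetch_blocks(int fd, const long *blocknos, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocknos[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
}
```

The interaction Claudio describes is exactly the risk with this call: issuing WILLNEED hints for blocks the kernel's own readahead would have fetched anyway can defeat its sequential-readahead heuristics.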
[HACKERS] Confusing documentation of ordered-set aggregates?
Hi After reading through the relevant parts of syntax.sgml, create_aggregate.sgml and xaggr.sgml, I think I understand how these work - they work exactly like regular aggregates, except that some arguments are evaluated only once and passed to the final function instead of the transition function. The whole ORDER BY thing is just crazy syntax the standard mandates - a saner alternative would have been ordered_set_agg(direct1,...,directN, WITHIN(arg1,...,argM)) or something like that, right? So whether ORDER BY implies any actual ordering is up to the ordered-set aggregate's final function. Or at least that's what xaggr.sgml seems to say: "Unlike the case for normal aggregates, the sorting of input rows for an ordered-set aggregate is <emphasis>not</emphasis> done behind the scenes, but is the responsibility of the aggregate's support functions." But that seems to contradict syntax.sgml, which says: "The expressions in the <replaceable>order_by_clause</replaceable> are evaluated once per input row just like normal aggregate arguments, sorted as per the <replaceable>order_by_clause</replaceable>'s requirements, and fed to the aggregate function as input arguments." Also, xaggr.sgml has the following to explain why NULLs are passed for all aggregated arguments to the final function, instead of simply not passing them at all: "While the null values seem useless at first sight, they are important because they make it possible to include the data types of the aggregated input(s) in the final function's signature, which may be necessary to resolve the output type of a polymorphic aggregate." Why do ordered-set aggregates require that, when plain aggregates are fine without it? array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray.
best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan23, 2014, at 01:07 , David Rowley dgrowle...@gmail.com wrote: On Tue, Jan 21, 2014 at 3:20 AM, Florian Pflug f...@phlo.org wrote: On Jan20, 2014, at 08:42 , David Rowley dgrowle...@gmail.com wrote: On Mon, Jan 20, 2014 at 2:45 PM, Florian Pflug f...@phlo.org wrote: * I've also renamed INVFUNC to INVSFUNC. That's a pretty invasive change, and it's the last commit, so if you object to that, then you can merge up to eafa72330f23f7c970019156fcc26b18dd55be27 instead of de3d9148be9732c4870b76af96c309eaf1d613d7. Seems like sfunc really should be tfunc; then we could have invtfunc. I'd probably understand this better if I knew what the 's' was for in sfunc. I've not applied this just yet. Do you have a reason why you think it's better? My issue with just invfunc is mainly that it's too generic - it doesn't tell you what it's supposed to be the inverse of. I've always assumed that 's' in 'sfunc' and 'stype' stands for 'state', and that the naming is inspired by control theory, where the function which acts on the state space is often called S. Ok, that makes more sense now and it seems like a reasonable idea. I'm just not quite sure yet: when someone said upthread that these "negative transition functions", as I was calling them at the time, should really be called inverse transition functions, I posted that I was going to call the CREATE AGGREGATE option invfunc, and nobody seemed to object. I just don't want to go and change that now. It is very possible this will come up again when the committer is looking at the patch. It would be a waste if it ended up back at invfunc after we changed it to invsfunc. Since we already settled on "inverse transition function", I kinda doubt that calling the parameter invsfunc is going to meet a lot of resistance. But we can put that off a little longer still... I've pushed a few additional things to https://github.com/fgp/postgres/tree/invtrans.
* I updated the CREATE AGGREGATE documentation, trying to include the description of the various modes of inverse transition functions in the paragraphs which already explained about STRICT for transition functions and such. * I've also updated the list of window functions to include a list of those aggregates which potentially need to restart the computation, i.e. MIN/MAX and the like. * I've changed nodeWindowAgg.c to use per-aggregate aggregation contexts for the invertible aggregates. Without that, the aggregate context is potentially never reset, because that previously required *all* the aggregates to restart at the same time. That would be OK if we were sure not to leak unbounded amounts of stuff stored in that context, but unfortunately we sometimes do. For example, whenever a strict, invertible aggregate ends up with only NULL inputs, we re-initialize the aggregation, which leaks the old state value. We could pfree() that of course, but that state value might reference other stuff that we don't know about and thus cannot free. Separating the aggregation contexts is the only solution I came up with, so I did that. * I've also tweaked an if to flag aggregates as invertible only if the frame head can actually move, i.e. if the frame start clause is something other than UNBOUNDED PRECEDING. Since the choice of whether to use a private aggregation context is driven by that flag, that also makes the above apply only to aggregates where the inverse transition function is actually used. I hope to find some time tomorrow or so to complete my pass through the documentation - what's still missing is an explanation of the EXPLAIN VERBOSE ANALYZE field and maybe some cleanup of xaggr.sgml. Do you have any additional things pending? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
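For readers following this subthread, the division of labor between a forward and an inverse transition function can be sketched in miniature. This is conceptual only: the names sfunc/invsfunc mirror the naming debate above, not the patch's actual C API, and SUM is chosen because it is trivially invertible (unlike MIN/MAX, which is why those appear in the restart list).

```c
/* Conceptual sketch of inverse transition functions: a moving SUM over a
 * sliding frame maintained incrementally, instead of recomputing the
 * whole frame whenever the frame head moves. */
typedef struct { long state; } agg;

static void sfunc(agg *a, long v)    { a->state += v; }  /* value enters frame */
static void invsfunc(agg *a, long v) { a->state -= v; }  /* value leaves frame */

/* Moving sum over a window of `width` trailing rows: O(n) total work,
 * versus O(n * width) if each frame were aggregated from scratch. */
static void
moving_sum(const long *in, long *out, int n, int width)
{
    agg a = {0};

    for (int i = 0; i < n; i++)
    {
        sfunc(&a, in[i]);                 /* frame tail advances */
        if (i >= width)
            invsfunc(&a, in[i - width]);  /* frame head advances */
        out[i] = a.state;
    }
}
```

The NULL-handling leak described above arises precisely because a strict aggregate in this scheme must sometimes throw away its running state and start over, rather than applying the inverse function.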
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and it should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements). See for example http://www.johndcook.com/standard_deviation.html cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Wed, Jan 22, 2014 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 6:54 PM, Robert Haas robertmh...@gmail.com wrote: I wonder if the port number wouldn't be a better choice. And that could even be done without adding a parameter. We need this for registering the event source, which might be done before the server starts. So? Anything which can know the value of a GUC parameter can certainly know the selected port number. 1. When registering the event source, either the user has to pass the name or it uses a hard-coded default value, so if we use the version number along with 'PostgreSQL', it can be consistent. I don't see any way pgevent.c can know the port number to append it to the default value; am I missing something here? 2. If we use the port number in pg_ctl, and the user changes it across server restarts, that can also create a problem, as the event source name used will differ from the name used when registering the event source. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing documentation of ordered-set aggregates?
Florian Pflug f...@phlo.org writes: After reading through the relevant parts of syntax.sgml, create_aggregate.sgml and xaggr.sgml, I think I understand how these work - they work exactly like regular aggregates, except that some arguments are evaluated only once and passed to the final function instead of the transition function. Yeah, that statement is correct. The whole ORDER BY thing is just crazy syntax the standard mandates - a saner alternative would have been ordered_set_agg(direct1,...,directN, WITHIN(arg1,...,argM)) or something like that, right? Not sure. The syntax is certainly something out of far left field (which is pretty much par for the course with the SQL committee :-(). But the concept basically is "to the extent that your results depend on an assumed ordering of the input rows, this is what to use". That seems sane enough, at least for aggregates where the input ordering does matter. So whether ORDER BY implies any actual ordering is up to the ordered-set aggregate's final function. Yes, the committed patch intentionally doesn't force the aggregate to do any ordering, though all the built-in aggregates do so. but that seems to contradict syntax.sgml which says The expressions in the <replaceable>order_by_clause</replaceable> are evaluated once per input row just like normal aggregate arguments, sorted as per the <replaceable>order_by_clause</replaceable>'s requirements, and fed to the aggregate function as input arguments. Well, syntax.sgml is just trying to explain the user's-eye view. I'm not sure that it'd be helpful to say here that the implementation might choose not to do a physical sort.
Also, xaggr.sgml has the following to explain why NULLs are passed for all aggregated arguments to the final function, instead of simply not passing them at all: "While the null values seem useless at first sight, they are important because they make it possible to include the data types of the aggregated input(s) in the final function's signature, which may be necessary to resolve the output type of a polymorphic aggregate." Why do ordered-set aggregates require that, when plain aggregates are fine without it? Actually, if polymorphic types had existed when the original aggregate infrastructure was designed, it might well have been done like that. I was thinking while working on the ordered-set patch that this would be a really nifty thing for regular polymorphic aggregates too. Right now, the only safe way to make a polymorphic plain aggregate is to use a polymorphic state type, and that type has to be sufficient to determine the result type. If you'd like to define the state type as internal, you lose --- there's no connection between the input and result types. So I was wondering if we shouldn't think about how to allow regular aggregates to use final functions defined in this style. But it's not something I've got time to pursue at the moment. array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray. Yeah. So it's a complete leap of faith on the type system's part that this function is an appropriate final function for array_agg(). I'm not sure offhand if CREATE AGGREGATE would even allow this combination to be created, or if it only works because we manually jammed those rows into the catalogs at initdb time. But it would certainly be safer if CREATE AGGREGATE *didn't* allow it.
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing documentation of ordered-set aggregates?
I wrote: Florian Pflug f...@phlo.org writes: array_agg(), for example, also has a result type that is determined by the argument type, yet its final function doesn't take an argument of type anyelement, even though it returns anyarray. Yeah. So it's a complete leap of faith on the type system's part that this function is an appropriate final function for array_agg(). I'm not sure offhand if CREATE AGGREGATE would even allow this combination to be created, or if it only works because we manually jammed those rows into the catalogs at initdb time. But it would certainly be safer if CREATE AGGREGATE *didn't* allow it. Actually, after a little bit of experimentation, the irreproducible manual catalog hack is the very existence of array_agg_finalfn(). If you try to reproduce it via CREATE FUNCTION, the system will object: regression=# create function foo(internal) returns anyarray as regression-# 'array_agg_finalfn' language internal; ERROR: cannot determine result data type DETAIL: A function returning a polymorphic type must have at least one polymorphic argument. So what the ordered-set-aggregate patch has done is introduce a principled way to define polymorphic aggregates with non-polymorphic state types, something we didn't have before. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 12:00), Andrew Dunstan wrote: On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: OK, Kondo, please demonstrate benchmarks that show we have 1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the error range of the benchmark might be more than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with 32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem with my patch is a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of LWLocks or other locks. The improvement might take time, so please wait for a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and it should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements). See for example http://www.johndcook.com/standard_deviation.html Thanks for your advice. I have read your example briefly; however, I think calculating the variance is not so heavy in my patch. The double-precision sqrt() calculation is the heaviest part, to my mind. And I have found a fast square root algorithm that is used in 3D games. http://en.wikipedia.org/wiki/Fast_inverse_square_root This page describes an inverse square root algorithm, but it can also be used to compute a normal square root, and it is faster than the general algorithm at the price of precision.
I think we want a fast algorithm, so it may be suitable. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
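For reference, the Wikipedia page linked above describes roughly the following routine (the classic Quake III version; a sketch, not code from the patch). Note the trade-off Kondo mentions: with one Newton-Raphson step the relative error is about 0.2%, which may or may not be acceptable for a reported standard deviation.

```c
#include <stdint.h>
#include <string.h>

/* Classic "fast inverse square root": approximates 1/sqrt(x) for x > 0.
 * sqrt(x) is then obtained as x * fast_rsqrt(x). */
static float
fast_rsqrt(float x)
{
    float       half = 0.5f * x;
    uint32_t    i;

    memcpy(&i, &x, sizeof i);        /* reinterpret the bits without UB */
    i = 0x5f3759df - (i >> 1);       /* magic-constant initial guess */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - half * x * x);   /* one Newton-Raphson refinement */
    return x;
}
```

Whether this beats the hardware sqrt instruction on modern CPUs is far from certain, which reinforces the upthread suggestion to avoid any per-statement sqrt() entirely rather than to approximate it.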
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
Amit Kapila amit.kapil...@gmail.com writes: On Wed, Jan 22, 2014 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: So? Anything which can know the value of a GUC parameter can certainly know the selected port number. 1. In case of registration of event source, either user has to pass the name or it uses hard coded default value, so if we use version number along with 'PostgreSQL', it can be consistent. I don't see any way pgevent.c can know port number to append it to default value, am I missing something here? [ shrug... ] But the same problem applies just as much to any attempt by pg_ctl to know *any* postmaster parameter. In particular, this entire patch is bogus, because the value it extracts from the postgresql.conf file does not necessarily have anything to do with the setting the live postmaster is actually using (which might be determined by a command-line parameter or environment variable instead). If the technique could be relied on, then it could be relied on just as much to extract the postmaster's port setting. I don't necessarily object to teaching pg_ctl to make a best-effort estimate of a postmaster parameter in this way. But it's complete folly to suppose that you can get an accurate value of event_source but not the port number. I think what we might want to do is redefine the server's behavior as creating an event named after the concatenation of event_source and port number, or maybe even get rid of event_source entirely and just say it's PostgreSQL followed by the port number. If we did the latter then the problem would reduce to whether pg_ctl knows the port number, which I think we're already assuming for other reasons. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We have already spent a lot of time on compression. Now we need to figure out what result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed very many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications. I propose making the final decisions now and concentrating our forces on producing a committable patch with those decisions - and not changing them even if somebody has an idea of how to improve code readability 100 times over and potential extensibility 1000 times over. I propose the following decisions: 1) Leave the uncompressed area, allowing duplicates in it. 2) Don't introduce Items on the page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in the 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move on to the fast scan patch now, but please do test and review the committed version to see if I missed something. Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain GinDataLeafMaxContentSize bytes. A patch is attached. My test suite doesn't run correctly; I'm debugging now. -- With best regards, Alexander Korotkov. gin-assert-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch (v2) for updatable security barrier views
(2014/01/21 18:18), Dean Rasheed wrote: On 21 January 2014 01:09, KaiGai Kohei kai...@ak.jp.nec.com wrote: (2014/01/13 22:53), Craig Ringer wrote: On 01/09/2014 11:19 PM, Tom Lane wrote: Dean Rasheed dean.a.rash...@gmail.com writes: My first thought was that it should just preprocess any security barrier quals in subquery_planner() in the same way as other quals are preprocessed. But thinking about it further, those quals are destined to become the quals of subqueries in the range table, so we don't actually want to preprocess them at that stage --- that will happen later when the new subquery is planned by recursion back into subquery_planner(). So I think the right answer is to make adjust_appendrel_attrs() handle recursion into sublink subqueries. TBH, this sounds like doubling down on a wrong design choice. I see no good reason that updatable security views should require any fundamental rearrangements of the order of operations in the planner. In that case, would you mind offering a quick sanity check on the following alternative idea: - Add sourceRelation to Query. This refers to the RTE that supplies tuple projections to feed into ExecModifyTable, with appropriate resjunk ctid and (if required) oid cols present. - When expanding a target view into a subquery, set sourceRelation on the outer view to the index of the RTE of the newly expanded subquery. I have the same opinion. The existing assumption, that resultRelation is the RTE of the table to be read and not only modified, makes the implementation complicated. If we had a separate sourceRelation, it would allow a flexible form, including a sub-query with security_barrier on the reader side. Could you tell me which direction you're inclined toward right now? I wonder whether I should take the latest patch submitted on 09-Jan for deep code review and testing. Yes, please review the patch from 09-Jan (http://www.postgresql.org/message-id/CAEZATCUiKxOg=voovja2s6g-sixzzxg18totggp8zobq6qn...@mail.gmail.com).
The idea behind that patch is to avoid a lot of the complication that leads to and then arises from having a separate sourceRelation in the query. If you go down the route of expanding the subquery in the rewriter and making it the sourceRelation, then you have to make extensive changes to preprocess_targetlist so that it recursively descends into the subquery to pull out variables needed by an update. Also you would probably need additional changes in the rewriter so that later stages didn't trample on the sourceRelation, and correctly updated it in the case of views on top of other views. Also, you would need to make changes to the inheritance planner code, because you'd now have 2 RTEs referring to the target relation (resultRelation and sourceRelation wrapping it in a subquery). Referring to the comment in inheritance_planner(): * Source inheritance is expanded at the bottom of the * plan tree (see allpaths.c), but target inheritance has to be expanded at * the top. except that in the case of the sourceRelation, it is actually effectively the target, which means it shouldn't be expanded, otherwise you'd get plans like the ones I complained about upthread (see the final explain output in http://www.postgresql.org/message-id/caezatcu3vcdkjpnhgfkrmrkz0axkcuh4ce_kqq3z2hzknhi...@mail.gmail.com). Perhaps all of that could be made to work, but I think that it would end up being a far more invasive patch than my 09-Jan patch. My patch avoids both those issues by not doing the subquery expansion until after inheritance expansion, and after targetlist preprocessing. Probably, I could get your point. Even though the sub-query being replaced from a relation with securityQuals is eventually planned according to the usual manner, usual order as a part of regular sub-query planning, however, adjust_appendrel_attrs() is called by inheritance_planner() earlier than sub-query's planning. 
Maybe we originally had this problem but it did not appear on the surface, because child relations don't have qualifiers at this phase. (They are distributed later.) But now this assumption is broken, because a relation with securityQuals is replaced by a sub-query with a SubLink. So I'm inclined to revise the assumption side, rather than the existing assertion checks. Shall we investigate which assumptions should be revised if a child relation, expanded at expand_inherited_tables(), were a sub-query containing a SubLink? A minor comment is below: ! /* !* Make sure that the query is marked correctly if the added qual !* has sublinks. !*/ ! if (!parsetree->hasSubLinks) ! parsetree->hasSubLinks = checkExprHasSubLink(viewqual); Is this logic really needed? This flag says whether the top-level query contains SubLinks, but the above condition checks whether the sub-query to be constructed will contain SubLinks. Also, securityQuals is not attached to the parse tree at this point. Thanks, -- OSS
Re: [HACKERS] [PATCH] Support for pg_stat_archiver view
On Thu, Jan 9, 2014 at 6:36 AM, Gabriele Bartolini <gabriele.bartol...@2ndquadrant.it> wrote:
> On 08/01/14 18:42, Simon Riggs wrote:
>> Not sure I see why it needs to be an SRF. It only returns one row.
> Attached is version 4, which removes management of SRF stages.

I have been looking at v4 a bit more, and found a couple of small things:
- a warning in pgstatfuncs.c
- some typos and format fixes in the SGML docs
- some corrections in a couple of comments
- a fix for an error message related to pg_stat_reset_shared, which referred only to bgwriter and not to the new archiver target. Here is how the new message looks:

=# select pg_stat_reset_shared('popo');
ERROR:  22023: unrecognized reset target: "popo"
HINT:  Target must be "bgwriter" or "archiver".

A new version is attached containing those fixes. I also played with the patch and pgbench, simulating some archive failures and successes while pgbench was running; the reports given by pg_stat_archiver were correct, so I am marking this patch as Ready for Committer.

Regards,
-- Michael

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..eb5131f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -270,11 +270,19 @@
      </row>

      <row>
+      <entry><structname>pg_stat_archiver</><indexterm><primary>pg_stat_archiver</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the
+       WAL archiver process's activity. See
+       <xref linkend="pg-stat-archiver-view"> for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_bgwriter</><indexterm><primary>pg_stat_bgwriter</primary></indexterm></entry>
       <entry>One row only, showing statistics about the
        background writer process's activity. See
        <xref linkend="pg-stat-bgwriter-view"> for details.
-      </entry>
+      </entry>
      </row>

      <row>
@@ -648,6 +656,63 @@
    </para>
   </note>

+  <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver">
+   <title><structname>pg_stat_archiver</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>archived_wals</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of WAL files that have been successfully archived</entry>
+     </row>
+     <row>
+      <entry><structfield>last_archived_wal</></entry>
+      <entry><type>text</type></entry>
+      <entry>Name of the last WAL file successfully archived</entry>
+     </row>
+     <row>
+      <entry><structfield>last_archived_wal_time</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time of the last successful archive operation</entry>
+     </row>
+     <row>
+      <entry><structfield>failed_attempts</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of failed attempts for archiving WAL files</entry>
+     </row>
+     <row>
+      <entry><structfield>last_failed_wal</></entry>
+      <entry><type>text</type></entry>
+      <entry>Name of the WAL file of the last failed archival operation</entry>
+     </row>
+     <row>
+      <entry><structfield>last_failed_wal_time</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time of the last failed archival operation</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_archiver</structname> view will always have a
+   single row, containing data about the archiver process of the cluster.
+  </para>
+
   <table id="pg-stat-bgwriter-view" xreflabel="pg_stat_bgwriter">
    <title><structname>pg_stat_bgwriter</structname> View</title>

@@ -1613,6 +1678,8 @@
         argument (requires superuser privileges).
         Calling <literal>pg_stat_reset_shared('bgwriter')</> will zero all the
         counters shown in the <structname>pg_stat_bgwriter</> view.
+        Calling <literal>pg_stat_reset_shared('archiver')</> will zero all the
+        counters shown in the <structname>pg_stat_archiver</> view.
        </entry>
       </row>

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 043d118..c8a89fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -672,6 +672,17 @@ CREATE VIEW pg_stat_xact_user_functions AS
     WHERE P.prolang != 12  -- fast check to eliminate built-in functions
           AND pg_stat_get_xact_function_calls(P.oid) IS NOT NULL;

+CREATE VIEW pg_stat_archiver AS
+    SELECT
+        s.archived_wals,
+        s.last_archived_wal,
+        s.last_archived_wal_time,
+        s.failed_attempts,
+        s.last_failed_wal,
+        s.last_failed_wal_time,
+        s.stats_reset
+    FROM
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi,

On 22/01/14 14:45, Tom Lane wrote:
> Well, is it context or detail? Those fields have reasonably well defined
> meanings IMO.

I find the distinction somewhat blurry and think both would be appropriate. But since I wasn't sure, I changed it to detail.

> If we need errcontext_plural, let's add it, not adopt inferior solutions
> just because that isn't there for lack of previous need.

I would have added it had I been sure. But having said that, I think this is indeed detail, not context.

> (I kinda wonder whether some of the stuff that's now in the primary
> message shouldn't be pushed to errdetail as well. It looks like some
> previous patches in this area have been lazy.)

I agree, the primary message is not very well worded. On the other hand, finding an appropriate alternative seems hard to me.

> While I'm griping, this message isn't even trying to follow the
> project's message style guidelines. Detail or context messages are
> supposed to be complete sentence(s), with capitalization and punctuation
> to match.

Hm, I hope I fixed it in this version of the patch.

> Lastly, is this information that we want to be shipping to clients?
> Perhaps from a security standpoint that's not such a wise idea, and
> errdetail_log() is what should be used.

Fixed. I added an errdetail_log_plural() for this, too.
Best regards,
-- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 122afb2..552c5a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1195,13 +1195,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 	{
-		StringInfoData buf;
+		StringInfoData buf,
+					lock_waiters_sbuf,
+					lock_holders_sbuf;
 		const char *modename;
 		long		secs;
 		int			usecs;
 		long		msecs;
+		SHM_QUEUE  *procLocks;
+		PROCLOCK   *proclock;
+		bool		first_holder = true,
+					first_waiter = true;
+		int			lockHoldersNum = 0;

 		initStringInfo(&buf);
+		initStringInfo(&lock_waiters_sbuf);
+		initStringInfo(&lock_holders_sbuf);
+
 		DescribeLockTag(&buf, &locallock->tag.lock);
 		modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
								   lockmode);
@@ -1211,10 +1221,67 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		msecs = secs * 1000 + usecs / 1000;
 		usecs = usecs % 1000;

+		LWLockAcquire(partitionLock, LW_SHARED);
+
+		/*
+		 * we loop over the lock's procLocks to gather a list of all
+		 * holders and waiters. Thus we will be able to provide more
+		 * detailed information for lock debugging purposes.
+		 *
+		 * lock->procLocks contains all processes which hold or wait for
+		 * this lock.
+		 */
+
+		procLocks = &(lock->procLocks);
+		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+											 offsetof(PROCLOCK, lockLink));
+
+		while (proclock)
+		{
+			/*
+			 * we are a waiter if myProc->waitProcLock == proclock; we are
+			 * a holder if it is NULL or something different
+			 */
+			if (proclock->tag.myProc->waitProcLock == proclock)
+			{
+				if (first_waiter)
+				{
+					appendStringInfo(&lock_waiters_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_waiter = false;
+				}
+				else
+					appendStringInfo(&lock_waiters_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+			}
+			else
+			{
+				if (first_holder)
+				{
+					appendStringInfo(&lock_holders_sbuf, "%d",
+									 proclock->tag.myProc->pid);
+					first_holder = false;
+				}
+				else
+					appendStringInfo(&lock_holders_sbuf, ", %d",
+									 proclock->tag.myProc->pid);
+
+				lockHoldersNum++;
+			}
+
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+												 offsetof(PROCLOCK, lockLink));
+		}
+
+		LWLockRelease(partitionLock);
+
 		if (deadlock_state == DS_SOFT_DEADLOCK)
 			ereport(LOG,
 					(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errdetail_plural("process owning lock: %s request queue: %s",
+									   "processes owning lock: %s request queue: %s",
+									   lockHoldersNum, lock_holders_sbuf.data,
+									   lock_waiters_sbuf.data))));
 		else if (deadlock_state == DS_HARD_DEADLOCK)
 		{
 			/*
@@ -1226,13 +1293,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 */
 			ereport(LOG,
 					(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-							MyProcPid, modename, buf.data, msecs, usecs)));
+							MyProcPid, modename, buf.data, msecs, usecs),
+					 (errdetail_plural("process owning lock: %s request queue: %s",
+									   "processes owning lock: %s request queue: %s",
+									   lockHoldersNum, lock_holders_sbuf.data,