Re: [HACKERS] small patch to crypt.c

2013-06-09 Thread Stephen Frost
* Joshua D. Drake (j...@commandprompt.com) wrote:
 Well I was more referring to the default is:
 
 check if null, if true return ok
 check if valuntil < today, if true return error
 else return ok
 
 To me we don't need the null check. However, when I tested it,
 without the null check you can't login. So now I am curious about
 what is going on.

Erm, but what is valuntil set to when it's null?  I'd expect it to be
zero because it hasn't been changed since:

TimestampTz vuntil = 0;

Using your pseudo-code, you end up with:

check if 0 < today, if true return error
else return ok

Which is why you end up always getting an error when you get rid of the
explicit isnull check.  Looking at it too quickly, I had assumed that
the test was inverted and that your patch worked most of the time but
didn't account for GetCurrentTimestamp() going negative.

Regardless, setting vuntil to some magic value that really means it's
actually NULL, which is what you'd need to do in order to get rid of
that explicit check for null, doesn't strike me as a good idea.  When a
value is null, we shouldn't be looking at the data at all.
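
For concreteness, here is a standalone sketch of that check (not the real
crypt.c: the typedef, the STATUS_* values and GetCurrentTimestamp() are
stand-ins) showing why every such login fails once the isnull branch is
dropped:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the PostgreSQL definitions; the STATUS_* values match c.h. */
#define STATUS_OK	(0)
#define STATUS_ERROR	(-1)
typedef long long TimestampTz;

/* Stand-in for GetCurrentTimestamp(): any positive "now" will do here. */
static TimestampTz
GetCurrentTimestamp(void)
{
	return 423600000000000LL;
}

/*
 * Shape of the check in md5_crypt_verify().  vuntil is initialized to 0 and
 * only overwritten when rolvaliduntil is not null, so dropping the isnull
 * branch turns the test into "0 < now", which is always true.
 */
static int
check_valuntil(bool isnull, TimestampTz vuntil)
{
	if (isnull)
		return STATUS_OK;
	else if (vuntil < GetCurrentTimestamp())
		return STATUS_ERROR;
	else
		return STATUS_OK;
}

int
main(void)
{
	TimestampTz vuntil = 0;		/* stays 0 when rolvaliduntil is null */

	printf("with isnull check:    %d\n", check_valuntil(true, vuntil));	/* 0, OK */
	printf("without isnull check: %d\n", check_valuntil(false, vuntil));	/* -1, error */
	return 0;
}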

Thanks,

Stephen




Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-09 Thread Craig Ringer
On 06/09/2013 08:32 AM, MauMau wrote:

 - Failure of a disk containing data directory or tablespace
 If checkpoint can't write buffers to disk because of disk failure,
 checkpoint cannot complete, thus WAL files accumulate in pg_xlog/.
 This means that one disk failure will lead to postgres shutdown.
... which is why tablespaces aren't disposable, and why creating a
tablespace in a RAM disk is such an awful idea.

I'd rather like to be able to recover from this by treating the
tablespace as dead, so any attempt to get a lock on any table within it
fails with an error and already-in-WAL writes to it just get discarded.
It's the sort of thing that'd only be reasonable to do as a recovery
option (like zero_damaged_pages) since if applied by default it'd lead
to potentially severe and unexpected data loss.

I've seen a couple of people bitten by the misunderstanding that
tablespaces are a way to split up your data based on different
reliability requirements, and I really need to write a docs patch for
http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html
http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html
that adds a prominent warning like:

WARNING: Every tablespace must be present before the database can be
started. There is no easy way to recover the database if a tablespace is
lost to disk failure, deletion, use of volatile storage, etc. *Do not
put a tablespace on a RAM disk*; instead just use UNLOGGED tables.

(Opinions on the above?)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-09 Thread Craig Ringer
On 06/09/2013 03:02 AM, Jeff Janes wrote:

 It would be nice to have the ability to specify multiple log destinations
 with different log_min_messages for each one.  I'm sure syslog already must
 implement some kind of method for doing that, but I've been happy enough
 with the text logs that I've never bothered to look into it much.

Smarter syslog flavours like rsyslog certainly do this.

No alert system triggered by events within Pg will ever be fully
sufficient. "Oops, the postmaster crashed with stack corruption, I'll
just exec whatever's in this on_panic_exec GUC (if I can still read it
and it's still valid) to hopefully tell the sysadmin about my death."
Hmm, sounds as reliable and safe as a bicycle powered by a home-made
rocket engine.

External monitoring is IMO always necessary. Something like Icinga with
check_postgres can trivially poke Pg to make sure it's alive. It can
also efficiently check the 'pg_error.log' from rsyslog that contains
only severe errors and raise alerts if it doesn't like what it sees.

If I'm already doing external monitoring (which is necessary as outlined
above) then I see much less point having Pg able to raise alerts for
problems, and am more interested in better built-in functions and views
for exposing Pg's state. Easier monitoring of WAL build-up, ways to slow
the master if async replicas or archiving are getting too far behind, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bad error message on valuntil

2013-06-09 Thread Craig Ringer
On 06/08/2013 04:07 AM, Joshua D. Drake wrote:

 FATAL: Authentication failed: Check server log for specifics

 And then we make sure we log proper info?
FATAL: Authentication using method 'password' failed, possible reasons
are no/wrong password sent, account expired; see server log for details ?

We can tell them what they would already know (they tried the 'password'
method) and a little about what might be wrong, as well as where to go
for more info.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Proposed patch: remove hard-coded limit MAX_ALLOCATED_DESCS

2013-06-09 Thread Cédric Villemain
On Saturday 8 June 2013 23:27:16, Tom Lane wrote:
 Cédric Villemain ced...@2ndquadrant.com writes:
  I'm not sure of the expected value of max_safe_fds. Your patch now
  initializes with 5 slots instead of 10; if max_safe_fds is large, maybe it
  is better to double the size each time we need more instead of jumping
  directly to the largest size?
 
 I don't see the point particularly.  At the default value of
 max_files_per_process (1000), the patch can allocate at most 500 array
 elements, which on a 64-bit machine would probably be 16 bytes each
 or 8KB total.

The point was only whether the original comment was still relevant. It seems it
is just not accurate anymore.
With the patch I can union 492 CSV tables instead of 32 with beta1.
-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Development, Expertise and Training




Re: [HACKERS] pgbench: introduce a new automatic variable 'client_number'

2013-06-09 Thread Heikki Linnakangas

On 06.06.2013 06:53, Gurjeet Singh wrote:

Please find attached a patch for pgbench that introduces a new
auto-variable 'client_number'. Following in the footsteps of 'scale'
auto-variable, this is not declared if the user has specified this variable
using -D switch.

Since 'clientid' is a very common name a user can use for their own
script's variable, I chose to call this auto-variable client_number; just
to avoid conflicts.


Hmm, I'm not sure I care too much about that, to be honest. We have 
'scale' as an auto-variable as well, which is also a common word. Also, 
if there's an existing script out there that does \set client_id ..., 
it will override the automatic value, and work as it used to.


Another reason to name it client_id is that in the pgbench -l log 
format, the documentation calls the first column client_id. Makes 
sense to call the auto-variable the same.


I think you forgot to compile with the patch, because there's a 
semicolon missing ;-). I moved the code around a bit, setting the 
variable next to where :scale is set; that's more readable. In the docs, 
I split the descriptions of :scale and :client_id into a table.


I'll commit the attached as soon as the tree opens for 9.4 development.

- Heikki
commit 85ebed80396313f0b7f943a228421f75c68db2ab
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Sun Jun 9 11:33:16 2013 +0300

Add :client_id automatic variable for custom pgbench scripts.

This makes it easier to write custom scripts that have different logic for
each client.

Gurjeet Singh, with some changes by me.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..1303217 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -2544,6 +2544,20 @@ main(int argc, char **argv)
 		}
 	}
 
+	/*
+	 * Define a :client_id variable that is unique per connection. But don't
+	 * override an explicit -D switch.
+	 */
+	if (getVariable(&state[0], "client_id") == NULL)
+	{
+		for (i = 0; i < nclients; i++)
+		{
+			snprintf(val, sizeof(val), "%d", i);
+			if (!putVariable(&state[i], "startup", "client_id", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index e9900d3..8775606 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -600,13 +600,39 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
    Variables can be set by the command-line <option>-D</> option,
    explained above, or by the meta commands explained below.
    In addition to any variables preset by <option>-D</> command-line options,
-   the variable <literal>scale</> is preset to the current scale factor.
+   there are a few variables that are preset automatically, listed in
+   <xref linkend="pgbench-automatic-variables">. A value specified for these
+   variables using <option>-D</> takes precedence over the automatic presets.
    Once set, a variable's
    value can be inserted into a SQL command by writing
    <literal>:</><replaceable>variablename</>.  When running more than
    one client session, each session has its own set of variables.
   </para>
 
+   <table id="pgbench-automatic-variables">
+    <title>Automatic variables</title>
+    <tgroup cols="2">
+     <thead>
+      <row>
+       <entry>Variable</entry>
+       <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry> <literal>scale</literal> </entry>
+       <entry>current scale factor</entry>
+      </row>
+
+      <row>
+       <entry> <literal>client_id</literal> </entry>
+       <entry>unique number identifying the client session (starts from zero)</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   <para>
    Script file meta commands begin with a backslash (<literal>\</>).
    Arguments to a meta command are separated by white space.
Arguments to a meta command are separated by white space.



Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Simon Riggs
On 8 June 2013 22:25, Kevin Grittner kgri...@ymail.com wrote:
 Simon Riggs si...@2ndquadrant.com wrote:

 Comments please.

 How much of this problem space do you think could be addressed by
 providing OLD and NEW *relations* to AFTER EACH STATEMENT triggers?

It's a reasonable question because those two things sound a little
like they might be related.

Providing the proposed additional info costs almost nothing since the
work to calculate that info is already performed. I've written this
patch since it was trivial to do so, while inspecting the code to see
if it was possible. As it now turns out, I'll be putting most effort
into the WHEN clause approach for FKs, but there's no reason why
others like Slony or pgmemcache wouldn't benefit here also - hence
posting the patch. The proposed API changes don't conflict in any way
with the feature you propose.

Providing the whole OLD and NEW sets as relations to a trigger would
require significant resources and wouldn't be done for performance
reasons AFAICS. There are also difficulties in semantics, since when
we have OLD and NEW at row level we know we are discussing the same
row. With sets of OLD and NEW we'd need to be able to link the
relations back together somehow, which couldn't be done by PK since
that could change. So we'd need to invent some semantics for a
linking identifier of some description. Once we've done that, people
would use it to join them back together again, which is
what we already had in the first place. So my take is that it sounds
expressive, but definitely not performant.

Since my objective is performance, not standards, I don't see a reason
to work on that, yet. I might have time to work on it later, let's see.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Simon Riggs
On 9 June 2013 05:08, Stephen Frost sfr...@snowman.net wrote:
 * Simon Riggs (si...@2ndquadrant.com) wrote:
 While fiddling with FK tuning, Noah suggested batching trigger
 executions together to avoid execution overhead.

 I like the general idea, but I'd prefer a way to avoid having to queue
 up tons of trigger events to be executed in the first place.

There's already a thread on that exact topic, for FKs, which is what
spawned this thread.

 Aggregates do this by providing a way to store up information to be
 processed by an eventual 'final' function.

As I mentioned in my post, I did consider that and then chose not to
do that. However, having a final func is a major modification in the
way that we specify trigger functions. We'd also need to cope with
recursive trigger execution, which would mean the final func would get
called potentially many times, so there's no way of knowing if the
final func is actually the last call needed. That sounded complex and
confusing to me.

The proposed API allows you to do exactly that anyway, more easily, by
just waiting until tg_event_num == tg_tot_num_events.

 Another option, as Kevin
 asked about, would be statement level triggers which are able to see
 both the OLD and the NEW versions of the relation.

 The per-row trigger option with a way to be called immediately

... it already exists and is known as the WHEN clause. This is the
mechanism I expect to use to tune FKs

 and then
 store what it cares about for a later final function strikes me as very
 generalized and able to do things that the statement-level option
 couldn't,

 but I'm not sure if there's a use-case that could solve which
 the OLD/NEW statement trigger capability couldn't.

I think the two things are quite different, as I explained on a
separate post to Kevin.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] how to find out whether a view is updatable

2013-06-09 Thread Dean Rasheed
On 6 June 2013 08:09, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 5 June 2013 08:59, Dean Rasheed dean.a.rash...@gmail.com wrote:
 I'm still not happy with pg_view_is_updatable() et al. and the
 information_schema views. I accept that the information_schema views
 have to be the way they are because that's what's defined in the
 standard, but as it stands, the distinction between updatable and
 trigger-updatable makes it impossible in general to answer the simple
 question "does foo support UPDATEs?".

 I'm thinking what we really need is a single function with a slightly
 different signature, that can be used to support both the information
 schema views and psql's \d+ (and potentially other client apps).
 Perhaps something like:-

   pg_relation_is_updatable(include_triggers boolean)
   returns int


 OK, here's what it looks like using this approach:


  FUNCTION pg_relation_is_updatable(reloid oid,
include_triggers boolean)
  RETURNS integer


  FUNCTION pg_column_is_updatable(reloid oid,
  attnum integer,
  include_triggers boolean)
  RETURNS boolean


 These replace pg_view_is_updatable() and pg_view_is_insertable(). I
 think I definitely prefer this over the old API, because it gives much
 greater flexibility.

 The information schema views all pass include_triggers = false for
 compatibility with the standard. The return value from
 pg_relation_is_updatable() is now an integer bitmask reflecting
 whether or not the relation is insertable, updatable and/or deletable.

 psql and other clients can more usefully pass include_triggers = true
 to determine whether a relation actually supports INSERT, UPDATE and
 DELETE, including checks for INSTEAD OF triggers on the specified
 relation or any underlying base relations.

 I thought about having pg_relation_is_updatable() return text, like
 the GRANT support functions, but I thought that it would make the
 information schema views harder to write, using a single call to check
 for updatable+deletable, whereas integer bit operations are easy.

 There is a backwards-incompatible change to the information schema,
 reflected in the regression tests: if a view is updatable but not
 deletable, the relevant rows in information_schema.columns now say
 'YES' --- the columns are updatable, even though the relation as a
 whole isn't.

 I've initially defined matching FDW callback functions:


 int
 IsForeignRelUpdatable (Oid foreigntableid,
bool include_triggers);


 bool
 IsForeignColUpdatable (Oid foreigntableid,
int attnum,
bool include_triggers);


 but I'm now having second thoughts about whether we should bother
 passing include_triggers to the FDW. If we regard the foreign table as
 a black box, we only care about whether it is updatable, not *how*
 that update is performed.


Here's a more complete patch along those lines. It defines the
following pair of functions to test for updatability from SQL:

  FUNCTION pg_catalog.pg_relation_is_updatable(reloid oid,
   include_triggers boolean)
  RETURNS integer

  FUNCTION pg_catalog.pg_column_is_updatable(reloid oid,
 attnum smallint,
 include_triggers boolean)
  RETURNS boolean

and the following FDW functions:

  int IsForeignRelUpdatable (Oid foreigntableid);

  bool IsForeignColUpdatable (Oid foreigntableid,
  AttrNumber attnum);

As an initial implementation of this API in the postgres-fdw, I've
added a new option updatable (true by default), which can be
specified as a server option or as a per-table option, to give user
control over whether individual foreign tables are read-only or
updatable.

I called it updatable rather than writable or read-only because it
might perhaps be extended in the future with separate options for
insertable and deletable. It could also be extended to give
column-level control over updatability, or something like
use_remote_updatability could be added, but that all feels like 9.4
material.

Regards,
Dean


pg_relation_is_updatable.patch
Description: Binary data



Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)

2013-06-09 Thread Craig Ringer
On 05/31/2013 03:41 PM, Fabien COELHO wrote:

 However I'm not sure that pg_stat_replication currently has the
 necessary information on either side to measure the lag (in time
 transactions, but how do I know when a transaction was committed? or
 number of transactions?).

 The BDR codebase now has a handy function to report when a transaction
 was committed, pg_get_transaction_committime(xid) .

 This looks handy for monitoring a replication setup.
 It should really be in core...

 Any plans? Or is there other ways to get this kind of information in
 core?

Yes, it's my understanding that the idea is to eventually get all the
BDR functionality merged, piece by piece, including the commit time
tracking feature.

pg_get_transaction_committime isn't trivial to just add to core because
it requires a commit time to be recorded with commit records in the
transaction logs, among other changes.

I don't know if Andres or any of the others involved are planning on
trying to get this particular feature merged in 9.4, but I wouldn't be
too surprised since (AFAIK) it's fairly self-contained and would be
useful for monitoring streaming replication setups as well.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Optimising Foreign Key checks

2013-06-09 Thread Simon Riggs
On 9 June 2013 02:12, Noah Misch n...@leadboat.com wrote:
 On Sat, Jun 08, 2013 at 08:20:42PM -0400, Robert Haas wrote:
 On Sat, Jun 8, 2013 at 5:41 PM, Noah Misch n...@leadboat.com wrote:
  Likewise; I don't see why we couldn't perform an optimistic check ASAP and
  schedule a final after-statement check when an early check fails.  That
  changes performance characteristics without changing semantics.

 ...this seems like it might have some promise; but what if the action
 we're performing isn't idempotent?  And how do we know?

 The action discussed so far is RI_FKey_check_ins().  It acquires a KEY SHARE
 lock (idempotent by nature) on a row that it finds using B-tree equality
 (presumed IMMUTABLE, thus idempotent).  RI_FKey_check_upd() is nearly the same
 action, so the same argument holds.  Before treating any other operation in
 the same way, one would need to conduct similar analysis.

As long as we are talking about FKs only, then this approach can work.
All we are doing is applying the locks slightly earlier than before.
Once locked they will prevent any later violations, so we are safe
from anybody except *ourselves* from making changes that would
invalidate the earlier check.  Trouble is, there are various ways I
can see that as possible, so making a check early doesn't allow you to
avoid making the check later as well.

AFAICS there are weird cases where changing the way FKs execute will
change the way complex trigger applications will execute. I don't see
a way to avoid that other than do nothing. Currently, we execute the
checks following the normal order of execution rules for triggers.
Every idea we've had so far changes that in some way; variously in
major or minor ways, but changed nonetheless.

Even the approach of deferring checks to allow them to be applied in a
batch means we might change the way applications execute in detail.
However, since the only possible change there would be to decrease the
number of self-induced failures that seems OK.

So the question is how much change do we want to introduce? I'll guess
not much, rather than lots or none.

Proposal: Have a WHEN clause that accumulates values to be checked in
a hash table up to work_mem in size, allowing us to eliminate the most
common duplicates (just not *all* duplicates). If the value isn't a
duplicate (or at least the first seen tuple with that value), we will
queue up a check for later. That approach gives us *exactly* what we
have now and works with the two common cases: i) few, mostly
duplicated values, ii) many values, but clustered together. Then apply
changes in batches at end of statement.
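
A toy, self-contained sketch of that dedup-before-queue idea (the
open-addressing table, its fixed capacity standing in for the work_mem limit,
and the int64 keys are illustrative assumptions, not the proposed
implementation):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 8192			/* stand-in for a work_mem-derived limit */

static int64_t seen[TABLE_SIZE];
static bool    used[TABLE_SIZE];

/* Returns true if key was already seen, i.e. the check can be skipped. */
static bool
remember_value(int64_t key)
{
	uint64_t	h = (uint64_t) key * 0x9E3779B97F4A7C15ULL;

	for (int probe = 0; probe < TABLE_SIZE; probe++)
	{
		uint64_t	slot = (h + probe) % TABLE_SIZE;

		if (!used[slot])
		{
			used[slot] = true;	/* first sighting: remember it */
			seen[slot] = key;
			return false;
		}
		if (seen[slot] == key)
			return true;		/* duplicate: no need to queue again */
	}
	return false;				/* table full: fall back to queueing */
}

int
main(void)
{
	int64_t		fk_values[] = {42, 42, 42, 7, 7, 99, 42};
	int			queued = 0;

	for (size_t i = 0; i < sizeof(fk_values) / sizeof(fk_values[0]); i++)
		if (!remember_value(fk_values[i]))
			queued++;			/* queue an after-statement FK check */

	printf("queued %d checks for 7 inserted rows\n", queued);	/* 3 */
	return 0;
}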

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-09 Thread Craig Ringer
On 06/07/2013 04:38 AM, Jeff Janes wrote:
 On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 My database is slow
 -
 This autovacuum thing is using up lots of I/O and CPU, I'll increase
 this delay setting here

 Do you think this was the correct diagnosis but with the wrong action
 taken, or was the diagnosis incorrect in the first place (i.e. it may be
 using some IO and CPU, but that isn't what was  causing the initial
 problem)?  And if the diagnosis was correct, was it causing problems under
 default settings, or only because they already turned off the cost delay?

The problem is that vacuum running too slow tends to result in table and
index bloat. Which results in less efficient cache use, slower scans,
and generally worsening performance.

I've repeatedly seen the user attribute the resulting high I/O to
autovacuum (which is, after all, always working away trying to keep up)
- and solving the problem by further slowing autovacuum.

It is very counter-intuitive that to fix the problem the user needs to
make the background process that's doing the I/O take up *more*
resources, so that other queries take *even less*.

 -
 I'll whack in some manual VACUUM cron jobs during low load maintenance
 hours and hope that keeps the worst of the problem away, that's what
 random forum posts on the Internet say to do.
 - oh my, why did my DB just do an emergency shutdown?

 This one doesn't make much sense to me, unless they mucked around with
 autovacuum_freeze_max_age as well as turning autovacuum itself off
 (common practice?).

Unfortunately, yes, as an extension of the above reasoning people seem
to apply around autovacuum. The now horrifyingly bloated DB is being
kept vaguely functional by regular cron'd vacuum runs, but then
autovacuum kicks back in and starts thrashing the system. It's already
performing really badly because of all the bloat so this is more than it
can take and performance tanks critically. Particularly since it
probably has 1000 or more backends thrashing away if it's anything like
many of the systems I've been seeing in the wild.

The operator's response: Panic and find out how to make it stop. Once
autovacuum quits doing its thing the system returns to staggering along
and they go back to planning a hardware upgrade someday, then suddenly
it's emergency wraparound prevention time.

I suspect vacuum, autovacuum, autovacuum tuning, table and index bloat,
etc is just too complicated for a lot of people running Pg installs to
really understand. I'd really, really love to see some feedback-based
auto-tuning of vacuum.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Martijn van Oosterhout
On Sun, Jun 09, 2013 at 10:15:09AM +0100, Simon Riggs wrote:
 As I mentioned in my post, I did consider that and then chose not to
 do that. However, having a final func is a major modification in the
 way that we specify trigger functions. We'd also need to cope with
 recursive trigger execution, which would mean the final func would get
 called potentially many times, so there's no way of knowing if the
 final func is actually the last call needed. That sounded complex and
 confusing to me.
 
 The proposed API allows you to do exactly that anyway, more easily, by
 just waiting until tg_event_num == tg_tot_num_events.

Can you signal partial completion? For example, if a trigger know that
blocks of 10,000 are optimal and it sees tg_tot_num_events == 1,000,000
that it could do work every 10,000 entries, as in when:

(tg_event_num % 10000) == 0 || tg_event_num == tg_tot_num_events
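
In plain C, that flush rule looks like the sketch below (standalone and
illustrative only: BATCH_SIZE and the 1-based event numbering are assumptions,
and tg_event_num/tg_tot_num_events are the counters proposed in this thread,
not an existing API):

#include <stdbool.h>
#include <stdio.h>

#define BATCH_SIZE 10000

/* Flush after every BATCH_SIZE events and once more on the final event. */
static bool
time_to_flush(long event_num, long tot_num_events)
{
	return (event_num % BATCH_SIZE) == 0 || event_num == tot_num_events;
}

int
main(void)
{
	long	total = 1000000;
	long	flushes = 0;

	for (long n = 1; n <= total; n++)
		if (time_to_flush(n, total))
			flushes++;

	printf("%ld flushes for %ld events\n", flushes, total);		/* 100 */
	return 0;
}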

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Craig Ringer
On 06/09/2013 04:58 PM, Simon Riggs wrote:
 There are also difficulties in semantics, since when
 we have OLD and NEW at row level we know we are discussing the same
 row. With sets of OLD and NEW we'd need to be able to link the
 relations back together somehow, which couldn't be done by PK since
 that could change.

We don't currently have OLD and NEW relations so we're free to define
how this works pretty freely.

Rather than having OLD and NEW as separate relations, we could just have
one OLD_AND_NEW relation. In that relation we exploit Pg's composite
types to nest the old and new tuples in a single outer change record.

OLD_AND_NEW would look to PL/PgSQL as if it were:

CREATE TEMPORARY TABLE OLD_AND_NEW (
OLD tabletype NOT NULL,
NEW tabletype NOT NULL
);

...though presumably without the ability to create indexes on it and the
other things you can do to a real temp table. Though I can see cases
where that'd be awfully handy too.

For DELETE and INSERT we'd either provide different relations named OLD
and NEW respectively, or we'd use OLD_AND_NEW with one field or the
other blank. I'm not sure which would be best.

Alternately, we could break the usual rules for relations and define OLD
and NEW as ordered, so lock-step iteration would always return matching
pairs of rows. That's useless in SQL since there's no way to achieve
lock-step iteration, but if we provide a
for_each_changed_row('some_function'::regproc) that scans them in
lock-step and invokes `some_function` for each one...? (I haven't yet
done enough in the core to have any idea if this approach is completely
and absurdly impossible, or just ugly. Figured I'd throw it out there
anyway.)



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Optimising Foreign Key checks

2013-06-09 Thread Andres Freund
On 2013-06-01 09:41:13 +0100, Simon Riggs wrote:
 FK checks can be expensive, especially when loading large volumes of
 data into an existing table or partition. A couple of ideas for
 improving performance are discussed here:

Another idea would be to optimize away the row level locks if we have a
table level lock that already provides more protection than the asked
for row level lock. In bulk data loading scenarios that's not uncommon
and can matter quite a bit, especially if loading in parallel since
both, the wal traffic and the dirtying of pages, is considerably
reduced.

Should be implementable relatively easily in heap_lock_tuple without
being visible to the outside.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ALTER TABLE ... ALTER CONSTRAINT

2013-06-09 Thread Andres Freund
On 2013-06-08 21:45:24 +0100, Simon Riggs wrote:
 While fiddling with FK tuning, it was useful to be able to enable and
 disable the DEFERRED mode of constraints.
 
 That is not currently possible in SQL, so I wrote this patch. Without
 this you have to drop and then re-add a constraint, which is
 impractical for large tables.
 
 e.g.
 CREATE TABLE fktable (id integer, fk integer REFERENCES pktable (id));
 
 ALTER TABLE foo
ALTER CONSTRAINT fktable_fk_fkey DEFERRED INITIALLY IMMEDIATE;
 
 Includes docs and tests.
 
 Currently works for FKs only. Potentially other constraints can be
 supported in future.

I haven't looked at the patch in detail, but I am very, very much in
favor of the feature in general… I have wished for this more than once,
and it certainly cost me more time working around it than it would have
cost to implement it.

Thanks,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-09 Thread Andres Freund
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote:
 At the points where the XLogInsert()s happens we're in critical sections
 out of which we *cannot* ERROR out because we already may have made
 modifications that cannot be allowed to be performed
 partially/unlogged. That's why we're throwing a PANIC which will force a
 cluster wide restart including *NOT* writing any further buffers from
 s_b out.
 
 
 Does this preclude (sorry I don't know this part of the code very well) my
 suggestion of on log create?

Well, yes. We create log segments some layers below XLogInsert() if
necessary, and as I said above, we're in a critical section at that
point, so just rolling back isn't one of the options.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] on existing update construct

2013-06-09 Thread Craig Ringer
On 05/16/2013 02:44 AM, Dev Kumkar wrote:
 Hello,
 
 Is there an alternative to Sybase's "on existing update" construct in pgsql?

No.

See:

http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] MVCC catalog access

2013-06-09 Thread Andres Freund
On 2013-06-06 12:49:14 -0400, Robert Haas wrote:
 On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund and...@2ndquadrant.com wrote:
  + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer
  + * true.  Are there other good reasons to hard-code this, or should we revisit
  + * that decision?
 
  We could just implement the function by looking in the shared
  relmapper. Everything that can be mapped via it is shared.
 
 I suspect there are several possible sources for this information, but
 it's hard to beat a hard-coded list for efficiency, so I wasn't sure
 if we should tinker with this or not.

I can tell from experience that it makes adding a new shared relation
more of a pain than it already is, but we're hopefully not doing that
all that often.
I just don't think that the mvcc angle has much to do with the decision.

  --- a/src/backend/commands/cluster.c
  +++ b/src/backend/commands/cluster.c
  @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD
   * against concurrent SnapshotNow scans of pg_index.  Therefore this is unsafe
   * to execute with less than full exclusive lock on the parent table;
   * otherwise concurrent executions of RelationGetIndexList could miss indexes.
  + *
  + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index
  + * shouldn't be common enough to worry about.  The above comment needs
  + * to be updated, and it may be possible to simplify the logic here in other
  + * ways also.
    */
 
  You're right, the comment needs to be changed, but I don't think the
  effect can. A non-inplace upgrade changes the xmin of the row which is
  relevant for indcheckxmin.
 
 OK.
 
  (In fact, isn't this update possibly causing problems like delaying the
  use of such an index already)

 Well, maybe.  In general, the ephemeral snapshot taken for a catalog
 scan can't be any older than the primary snapshot already held.  But
 there could be some corner case where that's not true, if we use this
 technique somewhere that such a snapshot hasn't already been acquired.

I wasn't talking about catalog scans or this patch; I wonder whether the
update we do there won't cause the index not to be used for concurrent
(normal) scans, since now the xmin is newer while it might have been far in
the past before. I.e. we might need to unset indexcheckxmin if xmin is far
enough in the past.

  Hm. Looks like this should also change the description of SecondarySnapshot:
 
  /*
   * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
   * mode, and to the latest one taken in a read-committed transaction.
   * SecondarySnapshot is a snapshot that's always up-to-date as of the current
   * instant, even in transaction-snapshot mode.  It should only be used for
   * special-purpose code (say, RI checking.)
   *
 
 I think that's still more or less true, though we could add catalog
 scans as another example.

I guess my feeling is that once catalog scans use it, it's not so much
special purpose anymore ;). But I admit that the frequency of usage
doesn't say much about its specificity...

  I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None
  of the callers seem to rely on it's behaviour from a quick look and it
  seems rather confusing to have both.
 
 I assume Tom had some reason for making GetLatestSnapshot() behave the
 way it does, so I refrained from doing that.  I might be wrong.

At least I don't see any on a quick look - which doesn't say very
much. I think I just dislike having *instant and *latest functions in
there, seems to be confusing to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)

2013-06-09 Thread Andres Freund
On 2013-06-09 17:50:13 +0800, Craig Ringer wrote:
 On 05/31/2013 03:41 PM, Fabien COELHO wrote:
 
  However I'm not sure that pg_stat_replication currently has the
  necessary information on either side to measure the lag (in time
  transactions, but how do I know when a transaction was committed? or
  number of transactions?).
 
  The BDR codebase now has a handy function to report when a transaction
  was committed, pg_get_transaction_committime(xid) .
 
  This looks handy for monitoring a replication setup.
  It should really be in core...
 
  Any plans? Or is there other ways to get this kind of information in
  core?

 pg_get_transaction_committime isn't trivial to just add to core because
 it requires a commit time to be recorded with commit records in the
 transaction logs, among other changes.

The commit records actually already have that information available
(c.f. xl_xact_commit(_compact) in xact.h); the problem is having a
data structure which collects all of it.
That's why the committs patch (written by Alvaro) added an SLRU mapping xids
to timestamps. And yes, we want to submit that sometime.

The pg_xlog_wait_remote_apply(), pg_xlog_wait_remote_receive() functions
however don't need any additional infrastructure, so I think those are
easier and less controversial to add.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

2013-06-09 Thread Andrew Dunstan


On 06/09/2013 12:38 AM, Noah Misch wrote:

On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote:

On 06/08/2013 10:52 PM, Noah Misch wrote:

Let's return to the drawing board on this one.  I would be inclined to keep
the current bad behavior until we implement the i18n-aware case folding
required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
only ASCII characters regardless of the encoding.  That at least gives
consistent application behavior.

I apologize for not noticing to comment on this week's thread.


The behaviour which this fixes is an unambiguous bug. Calling tolower()
on the individual bytes of a multi-byte character can't possibly produce
any sort of correct result. A database that contains such corrupted
names, probably not valid in any encoding at all, is almost certainly
not restorable, and I'm not sure if it's dumpable either.

I agree with each of those points.  However, since any change here breaks
compatibility, we should fix it right the first time.  A second compatibility
break would be all the more onerous once this intermediate step helps more
users to start using unquoted, non-ASCII object names.


It's already
produced several complaints in recent months, so ISTM that returning to
it for any period of time is unthinkable.

PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
problem, certainly, but a bandage fix brings its own trouble.



If you have a better fix I am all ears. I can recall at least one 
discussion of this area (concerning Turkish I quite a few years ago) 
where we failed to come up with anything.


I have a fairly hard time believing in your "relies on this and somehow
works" scenario.


cheers

andrew





Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 On 8 June 2013 22:25, Kevin Grittner kgri...@ymail.com wrote:
 Simon Riggs si...@2ndquadrant.com wrote:

 There are also difficulties in semantics, since when
 we have OLD and NEW at row level we know we are discussing the same
 row. With sets of OLD and NEW we'd need to be able to link the
 relations back together somehow, which couldn't be done by PK since
 that could change. So we'd need to invent some semantics for a
 linking identifier of some description. Which once we've done it
 would be used by people to join them back together again, which is
 what we already had in the first place. So my take is that it sounds
 expressive, but definitely not performant.

I have used a feature like this in other database products, and can
say from experience that these relations in a statement trigger can
be very useful without the linkage you propose.  I can see how the
linkage could potentially be useful, but if that is the only
hang-up, we would be adding a powerful feature without it.

 Since my objective is performance, not standards, I don't see a reason
 to work on that, yet. I might have time to work on it later, lets see.

This seems like it has some overlap with the delta relations I will
need to generate for incremental maintenance of materialized views,
so we should coordinate on those efforts if they happen to occur
around the same time.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Optimising Foreign Key checks

2013-06-09 Thread Greg Stark
On Sun, Jun 9, 2013 at 10:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 AFAICS there are weird cases where changing the way FKs execute will
 change the way complex trigger applications will execute. I don't see
 a way to avoid that other than do nothing. Currently, we execute the
 checks following the normal order of execution rules for triggers.
 Every idea we've had so far changes that in some way; variously in
 major or minor ways, but changed nonetheless.

The obvious case to handle would be if someone has a trigger to
automatically create any missing references.


-- 
greg




Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-09 Thread Kevin Grittner
Craig Ringer cr...@2ndquadrant.com wrote:
 On 06/07/2013 04:38 AM, Jeff Janes wrote:
 Craig Ringer cr...@2ndquadrant.com

 The problem is that vacuum running too slow tends to result in
 table and index bloat. Which results in less efficient cache use,
 slower scans, and generally worsening performance.

 I've repeatedly seen the user attribute the resulting high I/O to
 autovacuum (which is, after all, always working away trying to
 keep up) - and solving the problem by further slowing
 autovacuum.

 It is very counter-intuitive that to fix the problem the user
 needs to make the background process that's doing the I/O take up
 *more* resources, so that other queries take *even less*.

Exactly.  It can be very hard to convince someone to make
autovacuum more aggressive when they associate its default
configuration with slowness.

 - I'll whack in some manual VACUUM cron jobs during low load
 maintenance hours and hope that keeps the worst of the problem
 away, that's what random forum posts on the Internet say to
 do.
 - oh my, why did my DB just do an emergency shutdown?

 This one doesn't make much sense to me, unless they mucked
 around with autovacuum_freeze_max_age as well as turning
 autovacuum itself off (common practice?).

 Unfortunately, yes, as an extension of the above reasoning people
 seem to apply around autovacuum. The now horrifyingly bloated DB
 is being kept vaguely functional by regular cron'd vacuum runs,
 but then autovacuum kicks back in and starts thrashing the
 system. It's already performing really badly because of all the
 bloat so this is more than it can take and performance tanks
 critically. Particularly since it probably has 1000 or more
 backends thrashing away if it's anything like many of the systems
 I've been seeing in the wild.

 The operator's response: Panic and find out how to make it stop.
 Once autovacuum quits doing its thing the system returns to
 staggering along and they go back to planning a hardware upgrade
 someday, then suddenly it's emergency wraparound prevention time.

I have seen exactly this pattern multiple times.  They sometimes
completely ignore all advice about turning on and tuning autovacuum
and instead want to know the exact formula for when the
wraparound prevention autovacuum will trigger, so they can run a
vacuum just in time to prevent it -- since they believe this will
minimize disk access and thus give them best performance.  They
often take this opportunity to run VACUUM FULL on the table and
don't see the point of following that with any other form of
VACUUM, so they wipe out their visibility map in the process.

 I suspect vacuum, autovacuum, autovacuum tuning, table and index
 bloat, etc is just too complicated for a lot of people running Pg
 installs to really understand.

The ones who suffer most are those who learn just enough to think
they know how to tune better than the defaults, but not enough to
really understand the full impact of the changes they are making.
I have no particular ideas on what to do about that observation,
unfortunately.

 I'd really, really love to see some feedback-based auto-tuning of
 vacuum.

+1

-- 
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ALTER TABLE ... ALTER CONSTRAINT

2013-06-09 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 On 2013-06-08 21:45:24 +0100, Simon Riggs wrote:

 ALTER TABLE foo
 ALTER CONSTRAINT fktable_fk_fkey DEFERRED INITIALLY IMMEDIATE;

 I haven't looked at the patch in detail, but I am very, very much in
 favor of the feature in general… I have wished for this more than once

+1

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

2013-06-09 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 06/09/2013 12:38 AM, Noah Misch wrote:
 PostgreSQL has lived with this wrong behavior since ... the beginning?  It's a
 problem, certainly, but a bandage fix brings its own trouble.

I don't see this as particularly bandage-y.  It's a subset of the
spec-required folding behavior, sure, but at least now it's a proper
subset of that behavior.  It preserves all cases in which the previous
coding did the right thing, while removing some cases in which it
didn't.

 If you have a better fix I am all ears. I can recall at least one 
 discussion of this area (concerning Turkish I quite a few years ago) 
 where we failed to come up with anything.

Yeah, Turkish handling of i/I messes up any attempt to use the locale's
case-folding rules straightforwardly.  However, I think we've already
fixed that with the rule that ASCII characters are folded manually.
The resistance to moving this code to use towlower() for non-ASCII
mainly comes from worries about speed, I think; although there was also
something about downcasing conversions that change the string's byte
length being problematic for some callers.

 I have a fairly hard time believing in your "relies on this and somehow
 works" scenario.

The key point for me is that if tolower() actually does anything in the
previous state of the code, it's more than likely going to produce
invalidly encoded data.  The consequences of that can't be good.
You can argue that there might be people out there for whom the
transformation accidentally produced a validly-encoded string, but how
likely is that really?  It seems much more likely that the only reason
we've not had more complaints is that on most popular platforms, the
code accidentally fails to fire on any UTF8 characters (or any common
ones, anyway).  On those platforms, there will be no change of behavior.
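
A small standalone illustration of that failure mode (the Latin-1 locale name
is an assumption and may not be installed, in which case the program falls
back to "C" and the bytes pass through unchanged, just as on the platforms
described above):

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	unsigned char ident[] = {0xC3, 0x89, 0};	/* UTF-8 for 'É' (U+00C9) */

	/* Try a single-byte locale whose tolower() remaps high bytes. */
	if (setlocale(LC_CTYPE, "en_US.ISO8859-1") == NULL)
		setlocale(LC_CTYPE, "C");

	/*
	 * Downcase the identifier byte by byte, as the pre-fix code effectively
	 * did.  If either byte changes, the result is no longer valid UTF-8.
	 */
	for (int i = 0; ident[i] != 0; i++)
	{
		unsigned int lowered = (unsigned char) tolower(ident[i]);

		printf("byte %d: 0x%02X -> 0x%02X\n", i, (unsigned int) ident[i], lowered);
	}
	return 0;
}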

regards, tom lane




Re: [HACKERS] small patch to crypt.c

2013-06-09 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Regardless, setting vuntil to some magic value that really means it's
 actually NULL, which is what you'd need to do in order to get rid of
 that explicit check for null, doesn't strike me as a good idea.  When a
 value is null, we shouldn't be looking at the data at all.

Even aside from that, the proposed change seems like a bad idea because
it introduces an unnecessary call of GetCurrentTimestamp() in the common
case where there's no valuntil limit.  On some platforms that call is
pretty slow.

regards, tom lane




Re: [HACKERS] small patch to crypt.c

2013-06-09 Thread Joshua D. Drake


On 06/09/2013 09:28 AM, Tom Lane wrote:


Even aside from that, the proposed change seems like a bad idea because
it introduces an unnecessary call of GetCurrentTimestamp() in the common
case where there's no valuntil limit.  On some platforms that call is
pretty slow.


And that would explain why we don't do something like this:

index f01d904..4935c9f 100644
--- a/src/backend/libpq/crypt.c
+++ b/src/backend/libpq/crypt.c
@@ -145,12 +145,10 @@ md5_crypt_verify(const Port *port, const char *role, char *client_pass)
 
 		/*
 		 * Password OK, now check to be sure we are not past rolvaliduntil
 		 */
-		if (isnull)
+		if (isnull || vuntil > GetCurrentTimestamp())
 			retval = STATUS_OK;
-		else if (vuntil < GetCurrentTimestamp())
-			retval = STATUS_ERROR;
 		else
-			retval = STATUS_OK;
+			retval = STATUS_ERROR;
 	}


Right. Ty for the feedback; I know it was just a little bit of code, but
it just looked off and I appreciate the explanation.


JD






Re: [HACKERS] Optimising Foreign Key checks

2013-06-09 Thread Simon Riggs
On 9 June 2013 14:59, Greg Stark st...@mit.edu wrote:
 On Sun, Jun 9, 2013 at 10:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 AFAICS there are weird cases where changing the way FKs execute will
 change the way complex trigger applications will execute. I don't see
 a way to avoid that other than do nothing. Currently, we execute the
 checks following the normal order of execution rules for triggers.
 Every idea we've had so far changes that in some way; variously in
 major or minor ways, but changed nonetheless.

 The obvious case to handle would be if someone has a trigger to
 automatically create any missing references.

Exactly my thoughts.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] postgres_fdw regression tests order dependency

2013-06-09 Thread Andrew Dunstan
It looks like the postgres_fdw's regression tests expect data back from 
the following statement in a given order, which presumably isn't guaranteed:


   UPDATE ft2 SET c2 = c2 + 600 WHERE c1 % 10 = 8 RETURNING *;

See 
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=frogmouth&dt=2013-06-08%2018%3A30%3A00


Maybe we need to wrap this in a CTE and then order the results for 
consistency?


cheers

andrew





[HACKERS] Valgrind Memcheck support

2013-06-09 Thread Noah Misch
Valgrind's Memcheck tool[1] is handy for finding bugs, but our use of a custom
allocator limits its ability to detect problems in unmodified PostgreSQL.
During the 9.1 beta cycle, I found some bugs[2] with a rough patch adding
instrumentation to aset.c and mcxt.c such that Memcheck understood our
allocator.  I've passed that patch around to a few people over time, and I've
now removed the roughness such that it's ready for upstream.  In hopes of
making things clearer in the commit history, I've split out a preliminary
refactoring patch from the main patch and attached each separately.

Besides the aset.c/mcxt.c instrumentation, this patch adds explicit checks for
undefined memory to PageAddItem() and printtup(); this has caught C-language
functions that fabricate a Datum without initializing all bits.  The inclusion
of all this is controlled by a pg_config_manual.h setting.  The patch also
adds a suppression file that directs Valgrind to silence nominal errors we
don't plan to fix.

To smoke-test the instrumentation, I used make installcheck runs on x86_64
GNU/Linux and ppc64 GNU/Linux.  This turned up various new and newly-detected
memory bugs, which I will discuss in a separate thread.  With those fixed,
make installcheck has a clean report (in my one particular configuration).
I designed the suppression file to work across platforms; I specifically
anticipated eventual use on x86_64 Darwin and x86_64 FreeBSD.  Valgrind 3.8.1
quickly crashed when running PostgreSQL on Darwin; I did not dig further.

Since aset.c and mcxt.c contain the hottest code paths in the backend, I
verified that a !USE_VALGRIND, non-assert build produces the same code before
and after the patch.  Testing that revealed the need to move up the
AllocSizeIsValid() check in repalloc(), though I don't understand why GCC
reacted that way.

Peter Geoghegan and Korry Douglas provided valuable feedback on earlier
versions of this code.


Possible future directions:

- Test make installcheck-world.  When I last did this in past years, contrib
  did trigger some errors.

- Test recovery, such as by running a streaming replica under Memcheck while
  the primary runs make installcheck-world.

- Test newer compilers and higher optimization levels.  I used GCC 4.2 at -O1.
  A brief look at -O2 results showed a new error that I have not studied.  GCC
  4.8 at -O3 might show still more due to increasingly-aggressive assumptions.

- A buildfarm member running its installcheck steps this way.

- Memcheck has support for detecting leaks.  I have not explored that side at
  all, always passing --leak-check=no.  We could add support for freeing
  everything at process exit, thereby making the leak detection meaningful.


Brief notes for folks reproducing my approach: I typically start the
Memcheck-hosted postgres like this:

  valgrind --leak-check=no --gen-suppressions=all \
--suppressions=src/tools/valgrind.supp --time-stamp=yes \
--log-file=$HOME/pg-valgrind/%p.log postgres

If that detected an error on which I desired more detail, I would rerun a
smaller test case with --track-origins=yes --read-var-info=yes.  That slows
things noticeably but gives more-specific messaging.  When even that left the
situation unclear, I would temporarily hack allocChunkLimit so every palloc()
turned into a malloc().

I strongly advise installing the latest-available Valgrind, particularly
because recent releases suffer far less of a performance drop processing the
instrumentation added by this patch.  A make installcheck run takes 273
minutes under Valgrind 3.6.0 but just 27 minutes under Valgrind 3.8.1.

Thanks,
nm

[1] http://valgrind.org/docs/manual/mc-manual.html
[2] 
http://www.postgresql.org/message-id/20110312133224.ga7...@tornado.gateway.2wire.net

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6a111c7..de64377 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -308,6 +308,37 @@ AllocSetFreeIndex(Size size)
return idx;
 }
 
+#ifdef CLOBBER_FREED_MEMORY
+
+/* Wipe freed memory for debugging purposes */
+static void
+wipe_mem(void *ptr, size_t size)
+{
+   memset(ptr, 0x7F, size);
+}
+#endif
+
+#ifdef MEMORY_CONTEXT_CHECKING
+static void
+set_sentinel(void *base, Size offset)
+{
+   char   *ptr = (char *) base + offset;
+
+   *ptr = 0x7E;
+}
+
+static bool
+sentinel_ok(const void *base, Size offset)
+{
+   const char *ptr = (const char *) base + offset;
+   boolret;
+
+   ret = *ptr == 0x7E;
+
+   return ret;
+}
+#endif
+
 #ifdef RANDOMIZE_ALLOCATED_MEMORY
 
 /*
@@ -492,8 +523,7 @@ AllocSetReset(MemoryContext context)
		char	   *datastart = ((char *) block) + ALLOC_BLOCKHDRSZ;
 
 #ifdef CLOBBER_FREED_MEMORY
-   /* Wipe freed memory for debugging purposes */
-   memset(datastart, 0x7F, 

[HACKERS] 9.3 crop of memory errors

2013-06-09 Thread Noah Misch
My make installcheck runs while completing the just-posted Valgrind Memcheck
patch revealed seven new and newly-detected (due to tighter checking) memory
errors.  Proposed patches attached.


* SP-GiST moveLeafs() and doPickSplit() read past the end of a palloc

These functions construct arrays of OffsetNumber in palloc'd chunks, which
they then place in WAL records.  The rdata entry is done with a MAXALIGN'd
size, but the associated palloc request may not extend to the next MAXALIGN
boundary.  This use of MAXALIGN is wasted space anyway; the arrays only need
SHORTALIGN, which the code maintains without assistance (so long as each xlrec
structure needs at least SHORTALIGN, which seems inevitable).  I've just
removed the explicit alignment bumps.

This behavior dates to the initial SP-GiST commit.  Since every palloc chunk
has MAXALIGN'd size under the hood, the excess read cannot cause a SIGSEGV or
other concrete bad outcome.  Therefore, I propose to change 9.4 only.


* GinFormTuple() leaves pad bytes uninitialized

When it repalloc()s an IndexTuple, this function does not zero the new space.
The increase has up to two regions of alignment padding, one byte after the
null category (if any) and up to six bytes at the end.  This is not, to my
knowledge, a concrete problem.  However, I've observed no other place in the
system where we send uninitialized data to PageAddItem().  This looks like an
oversight, and we should clear affected memory to bring consistency with the
other core bufpage consumers.

This behavior goes back to 8.4 or earlier.  Since it's a mere point of
hygiene, I intend this for 9.4 only.


* Uses of a NULL-terminated string as a Name datum

A Name is expected to have NAMEDATALEN bytes.  Presenting a shorter cstring as
a Name causes reading past the end of the string's allocation.  Switch to the
usual idiom.

New in 9.3 (765cbfdc9263bf7c90b9d1f1044c6950b8b7088c,
3855968f328918b6cd1401dd11d109d471a54d40).
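
The usual idiom is roughly the following (sketch with placeholder names, not
the committed fix): copy the cstring into a fixed-size NameData first, so all
NAMEDATALEN bytes of the Datum are initialized.

    NameData    name;

    namestrcpy(&name, cstring);        /* "cstring" is a placeholder here */
    values[0] = NameGetDatum(&name);   /* likewise "values" */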


* HaveVirtualXIDsDelayingChkpt() wrongly assumes a terminated array

This function looks for a !VirtualTransactionIdIsValid() terminator, but
GetVirtualXIDsDelayingChkpt() does not add one.  Patch makes the function's
looping logic more like its 9.2 version; I cannot find discussion of or
discern a benefit of that aspect of the change.  It also reverts the addition
of an unused variable, presumably a debugging relic, by the same commit.

New in 9.3 (f21bb9cfb5646e1793dcc9c0ea697bab99afa523).


* Passing oidvector by value

oidvector ends with a flexible array, so this is almost always wrong.

New in 9.3 (7ac5760fa283bc090c25e4ea495a0d2bb41db7b5).


* hba.c tokenize_file can read backward past the beginning of a stack variable

This arises when a file like pg_hba.conf contains an empty line.

New in 9.3 (7f49a67f954db3e92fd96963169fb8302959576e).


* json parser can read one byte past the datum end

I swapped order of tests like if (has_some_property(*p) && p < end).
Perhaps this was intended as a micro-optimization, putting the more-selective
check first.  If that is important, we might arrange to have a known byte
after the end to make it safe.

New in 9.3 (a570c98d7fa0841f17bbf51d62d02d9e493c7fcc).
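
The swapped form is essentially this (illustrative sketch with a made-up
helper, not the actual diff): test the bound before dereferencing, so the
byte one past the end of the datum is never read.

    #include <ctype.h>
    #include <stdbool.h>

    static bool
    next_char_is_space(const char *p, const char *end)
    {
        /* before: isspace(*p) && p < end  -- may read *end */
        /* after: check the bound first, then dereference */
        return p < end && isspace((unsigned char) *p);
    }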


Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/backend/access/spgist/spgdoinsert.c 
b/src/backend/access/spgist/spgdoinsert.c
index 5f6bcdd..1199916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -535,9 +535,9 @@ moveLeafs(Relation index, SpGistState *state,
{
XLogRecPtr  recptr;
 
-   ACCEPT_RDATA_DATA(xlrec, MAXALIGN(sizeof(xlrec)), 0);
-	ACCEPT_RDATA_DATA(toDelete, MAXALIGN(sizeof(OffsetNumber) * nDelete), 1);
-	ACCEPT_RDATA_DATA(toInsert, MAXALIGN(sizeof(OffsetNumber) * nInsert), 2);
+   ACCEPT_RDATA_DATA(xlrec, sizeof(xlrec), 0);
+   ACCEPT_RDATA_DATA(toDelete, sizeof(OffsetNumber) * nDelete, 1);
+   ACCEPT_RDATA_DATA(toInsert, sizeof(OffsetNumber) * nInsert, 2);
ACCEPT_RDATA_DATA(leafdata, leafptr - leafdata, 3);
	ACCEPT_RDATA_BUFFER(current->buffer, 4);
ACCEPT_RDATA_BUFFER(nbuf, 5);
@@ -1118,7 +1118,7 @@ doPickSplit(Relation index, SpGistState *state,
 
leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
-   ACCEPT_RDATA_DATA(xlrec, MAXALIGN(sizeof(xlrec)), 0);
+   ACCEPT_RDATA_DATA(xlrec, sizeof(xlrec), 0);
	ACCEPT_RDATA_DATA(innerTuple, innerTuple->size, 1);
nRdata = 2;
 
@@ -1154,7 +1154,7 @@ doPickSplit(Relation index, SpGistState *state,
{
xlrec.nDelete = nToDelete;
ACCEPT_RDATA_DATA(toDelete,
-						  MAXALIGN(sizeof(OffsetNumber) * nToDelete),
+						  sizeof(OffsetNumber) * nToDelete,
 

Re: [HACKERS] Valgrind Memcheck support

2013-06-09 Thread Andres Freund
On 2013-06-09 17:25:59 -0400, Noah Misch wrote:
 Valgrind's Memcheck tool[1] is handy for finding bugs, but our use of a custom
 allocator limits its ability to detect problems in unmodified PostgreSQL.
 During the 9.1 beta cycle, I found some bugs[2] with a rough patch adding
 instrumentation to aset.c and mcxt.c such that Memcheck understood our
 allocator.  I've passed that patch around to a few people over time, and I've
 now removed the roughness such that it's ready for upstream.  In hopes of
 making things clearer in the commit history, I've split out a preliminary
 refactoring patch from the main patch and attached each separately.

 Besides the aset.c/mcxt.c instrumentation, this patch adds explicit checks for
 undefined memory to PageAddItem() and printtup(); this has caught C-language
 functions that fabricate a Datum without initializing all bits.  The inclusion
 of all this is controlled by a pg_config_manual.h setting.  The patch also
 adds a suppression file that directs Valgrind to silence nominal errors we
 don't plan to fix.

Very nice work. I started doing this quite some time back to smoke
out some bugs in code of mine, but never got remotely to a point where
it was submittable. But I already found some bugs with it, so I'd be very
happy if this could get committed.

Will take a look.

 I strongly advise installing the latest-available Valgrind, particularly
 because recent releases suffer far less of a performance drop processing the
 instrumentation added by this patch.  A make installcheck run takes 273
 minutes under Valgrind 3.6.0 but just 27 minutes under Valgrind 3.8.1.

At least on linux amd64 I'd strongly suggest using something newer than
(afair) 3.8.1, i.e. the svn version. Up to that version it corrupts the
stack alignment inside signal handlers which doesn't get fixed up even
after a fork(). This leads to mysterious segfaults, e.g. during elog()s
due to the usage of sse registers which have stronger alignment
requirements.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-09 Thread MauMau

From: Craig Ringer cr...@2ndquadrant.com

On 06/09/2013 08:32 AM, MauMau wrote:


- Failure of a disk containing data directory or tablespace
If checkpoint can't write buffers to disk because of disk failure,
checkpoint cannot complete, thus WAL files accumulate in pg_xlog/.
This means that one disk failure will lead to postgres shutdown.


I've seen a couple of people bitten by the misunderstanding that
tablespaces are a way to split up your data based on different
reliability requirements, and I really need to write a docs patch for
http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html
http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html
that adds a prominent warning like:

WARNING: Every tablespace must be present before the database can be
started. There is no easy way to recover the database if a tablespace is
lost to disk failure, deletion, use of volatile storage, etc. bDo not
put a tablespace on a RAM disk/b; instead just use UNLOGGED tables.

(Opinions on the above?)


Yes, I'm sure this is useful for DBAs to know how postgres behaves and to 
prepare accordingly.  However, this does not apply to my case, because I'm 
using tablespaces for I/O distribution across multiple disks and simply for 
database capacity.


The problem is that the reliability of the database system decreases with 
more disks, because failure of any one of those disks would result in a 
database PANIC shutdown.




I'd rather like to be able to recover from this by treating the
tablespace as dead, so any attempt to get a lock on any table within it
fails with an error and already-in-WAL writes to it just get discarded.
It's the sort of thing that'd only be reasonable to do as a recovery
option (like zero_damaged_pages) since if applied by default it'd lead
to potentially severe and unexpected data loss.


I'm in favor of taking a tablespace offline when I/O failure is encountered, 
and continue running the database server.  But WAL must not be discarded 
because committed transactions must be preserved for durability of ACID.


Postgres needs to take these steps when it encounters an I/O error:

1. Take the tablespace offline, so that subsequent read/write against it 
returns an error without actually issuing read/write against data files.


2. Discard shared buffers containing data in the tablespace.

WAL is not affected by the offlining of tablespaces.  WAL records already 
written on the WAL buffer will be written to pg_xlog/ and archived as usual. 
Those WAL records will be used to recover committed transactions during 
archive recovery.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON and unicode surrogate pairs

2013-06-09 Thread Andrew Dunstan


On 06/06/2013 12:53 PM, Robert Haas wrote:

On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan and...@dunslane.net wrote:

In 9.2, the JSON parser didn't check the validity of the use of unicode
escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
that is still the case. However, the JSON accessor functions and operators
also try to turn JSON strings into text in the server encoding, and this
includes de-escaping \u sequences. This works fine except when there is a
pair of sequences representing a UTF-16 type surrogate pair, something that
is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is
turned into the correct code point before converting it to whatever the
server encoding is.

Note that this would mean we can still put JSON with incorrect use of
surrogates into the database, as now (9.2 and later), and they will cause
almost all the accessor functions to raise an error, as now (9.3). All this
does is allow JSON that uses surrogates correctly not to fail when applying
the accessor functions and operators. That's a possible violation of POLA,
and at least worth of a note in the docs, but I'm not sure what else we can
do now - adding this check to the input lexer would possibly cause restores
to fail, which users might not thank us for.

I think the approach you've proposed here is a good one.





I did that, but it's evident from the buildfarm that there's more work 
to do. The problem is that we do the de-escaping as we lex the json to 
construct the look ahead token, and at that stage we don't know whether 
or not it's really going to be needed. That means we can cause errors to 
be raised in far too many places. It's failing on this line:


   converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);

even though the operator in use (->) doesn't even use the de-escaped 
value.


The real solution is going to be to delay the de-escaping of the string 
until it is known to be wanted. That's unfortunately going to be a bit 
invasive, but I can't see a better solution. I'll work on it ASAP. 
Getting it to work well without a small API change might be pretty hard, 
though.
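
For anyone following along without the patch: the de-escaping step being
discussed combines a UTF-16 surrogate pair into a single code point before
the conversion to the server encoding.  A minimal sketch of that step
(made-up function name, not the patch's actual code):

    /* hi is in 0xD800..0xDBFF, lo is in 0xDC00..0xDFFF */
    static unsigned int
    combine_surrogates(unsigned int hi, unsigned int lo)
    {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }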


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw regression tests order dependency

2013-06-09 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 It looks like the postgres_fdw's regression tests expect data back from 
 the following statement in a given order, which presumably isn't guaranteed:

 UPDATE ft2 SET c2 = c2 + 600 WHERE c1 % 10 = 8 RETURNING *;

 See 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=frogmouth&dt=2013-06-08%2018%3A30%3A00

I think what happened here is that autovacuum ran while the test was in
process, and freed up some space near the start of the table that was
then used by the two INSERTs just above this.  I was able to duplicate
that diff by adding a VACUUM "S 1"."T 1" command just ahead of the
INSERTs.

 Maybe we need to wrap this in a CTE and then order the results for 
 consistency?

In principle, we'd have to do that to every update in the test, which
doesn't seem either painless or conducive to thorough testing.  What I'm
a bit inclined to do instead is to modify the test case so that the rows
inserted by the two INSERTs aren't touched by this UPDATE.  It's
probably easier to guarantee that no rows are updated twice in this test
sequence than to make the query results totally order-independent.

I'm going to go experiment with adding more VACUUMs to the test script
just to see if any other results change ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-09 Thread Craig Ringer
On 06/10/2013 06:39 AM, MauMau wrote:

 The problem is that the reliability of the database system decreases
 with more disks, because failure of any one of those disks would result
 in a database PANIC shutdown

More specifically, with more independent sets of disks / file systems.

 I'd rather like to be able to recover from this by treating the
 tablespace as dead, so any attempt to get a lock on any table within it
 fails with an error and already-in-WAL writes to it just get discarded.
 It's the sort of thing that'd only be reasonable to do as a recovery
 option (like zero_damaged_pages) since if applied by default it'd lead
 to potentially severe and unexpected data loss.
 
 I'm in favor of taking a tablespace offline when I/O failure is
 encountered, and continue running the database server.  But WAL must not
 be discarded because committed transactions must be preserved for
 durability of ACID.
[snip]
 WAL is not affected by the offlining of tablespaces.  WAL records
 already written on the WAL buffer will be written to pg_xlog/ and
 archived as usual. Those WAL records will be used to recover committed
 transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so
the below's just my understanding):

The problem is that WAL for all tablespaces is mixed together in the
archives. If you lose your tablespace then you have to keep *all* WAL
around and replay *all* of it again when the tablespace comes back
online. This would be very inefficient, would require a lot of tricks to
cope with applying WAL to a database that has an on-disk state in the
future as far as the archives are concerned. It's not as simple as just
replaying all WAL all over again - as I understand it, things like
CLUSTER or TRUNCATE will result in relfilenodes not being where they're
expected to be as far as old WAL archives are concerned. Selective
replay would be required, and that leaves the door open to all sorts of
new and exciting bugs in areas that'd hardly ever get tested.

To solve the massive disk space explosion problem I imagine we'd have to
have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
and loss of the rather nice property that WAL writes are nice sequential
writes. It'd be complicated and probably cause nightmares during
recovery, for archive-based replication, etc.

The only other thing I can think of is: When a tablespace is offline,
write WAL records to a separate tablespace recovery log as they're
encountered. Replay this log when the tablespace is restored,
before applying any other new WAL to the tablespace. This wouldn't
affect archive-based recovery since it'd already have the records from
the original WAL.

None of these options seem exactly simple or pretty, especially given
the additional complexities that'd be involved in allowing WAL records
to be applied out-of-order, something that AFAIK _never_ happens at the
moment.

The key problem, of course, is that this all sounds like a lot of
complicated work for a case that's not really supposed to happen. Right
now, the answer is your database is unrecoverable, switch to your
streaming warm standby and re-seed it from the standby. Not pretty, but
at least there's the option of using a sync standby and avoiding data loss.

How would you approach this?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Revisit items marked 'NO' in sql_features.txt

2013-06-09 Thread Robins Tharakan
Hi,

While reviewing sql_features.txt, I found a few items marked NO ('Not
supported') that, at first glance, seem to be supported. Apologies if this
has already been considered and/or they are still marked 'NO' for a reason,
but here is a list of such items (a couple of quick checks are sketched
after the list):

F202    TRUNCATE TABLE: identity column restart option    NO

F263    Comma-separated predicates in simple CASE expression    NO

T041    Basic LOB data type support    01    BLOB data type    NO

T051    Row types    NO

T174    Identity columns    NO

T176    Sequence generator support    NO

T177    Sequence generator support: simple restart option    NO

T522    Default values for IN parameters of SQL-invoked procedures    NO

T571    Array-returning external SQL-invoked functions    NO
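
For example, both of these run fine on a current server, which is why the
entries looked questionable (quick manual checks only; the table and
sequence names below are placeholders):

    -- F202: TRUNCATE TABLE, identity column restart option
    TRUNCATE TABLE some_table RESTART IDENTITY;

    -- T177: sequence generator support, simple restart option
    ALTER SEQUENCE some_seq RESTART;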

--
Robins Tharakan


Re: [HACKERS] JSON and unicode surrogate pairs

2013-06-09 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 I did that, but it's evident from the buildfarm that there's more work 
 to do. The problem is that we do the de-escaping as we lex the json to 
 construct the look ahead token, and at that stage we don't know whether 
 or not it's really going to be needed. That means we can cause errors to 
 be raised in far too many places. It's failing on this line:
 converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
  even though the operator in use (->) doesn't even use the de-escaped 
 value.

 The real solution is going to be to delay the de-escaping of the string 
 until it is known to be wanted. That's unfortunately going to be a bit 
 invasive, but I can't see a better solution. I'll work on it ASAP. 

Not sure that this idea isn't a dead end.  IIUC, you're proposing to
jump through hoops in order to avoid complaining about illegal JSON
data, essentially just for backwards compatibility with 9.2's failure to
complain about it.  If we switch over to a pre-parsed (binary) storage
format for JSON values, won't we be forced to throw these errors anyway?
If so, maybe we should just take the compatibility hit now while there's
still a relatively small amount of stored JSON data in the wild.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Server side lo-funcs name

2013-06-09 Thread Tatsuo Ishii
Recently we got a complaint about server side large object function
names described in the doc:
http://www.postgresql.org/message-id/51b2413f.8010...@gmail.com

In the doc:
http://www.postgresql.org/docs/9.3/static/lo-funcs.html

There are server-side functions callable from SQL that correspond to
each of the client-side functions described above; indeed, for the
most part the client-side functions are simply interfaces to the
equivalent server-side functions

From the description it is hard for users to find the server side
functions loread and lowrite, because they are looking for
lo_read and lo_write. So I think his complaint is fair. The included
patches attempt to fix the problem.
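
For example, from SQL (illustrative only; 16385 merely stands in for a real
large object OID, and 131072/262144 are INV_WRITE/INV_READ from libpq-fs.h):

    BEGIN;
    SELECT lo_create(16385);
    SELECT lowrite(lo_open(16385, 131072), 'hello'::bytea);
    SELECT loread(lo_open(16385, 262144), 5);
    COMMIT;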
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/doc/src/sgml/lobj.sgml b/doc/src/sgml/lobj.sgml
index db5bc10..6dbc84c 100644
--- a/doc/src/sgml/lobj.sgml
+++ b/doc/src/sgml/lobj.sgml
@@ -527,10 +527,14 @@ int lo_unlink(PGconn *conn, Oid lobjId);
 
   <para>
    There are server-side functions callable from SQL that correspond to
-   each of the client-side functions described above; indeed, for the
-   most part the client-side functions are simply interfaces to the
-   equivalent server-side functions.  The ones that are actually useful
-   to call via SQL commands are
+   each of the client-side functions described above (please note
+   that the server side function
+   for <function>lo_read</function> is <function>loread</function> and
+   the server side function for <function>lo_write</function>
+   is <function>lowrite</function>); indeed, for the most part the
+   client-side functions are simply interfaces to the equivalent
+   server-side functions.  The ones that are actually useful to call
+   via SQL commands are
    <function>lo_creat</function><indexterm><primary>lo_creat</></>,
    <function>lo_create</function><indexterm><primary>lo_create</></>,
    <function>lo_unlink</function><indexterm><primary>lo_unlink</></>,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump with postgis extension dumps rules separately

2013-06-09 Thread Joe Conway

On 06/03/2013 07:57 PM, Tom Lane wrote:
 I'd have put the getRules call where getEventTriggers is now, or
 at least adjacent to getTriggers in one direction or the other.
 I'm not sure there is anything functionally wrong with what you
 have here; but IMO rules and table-level triggers are pretty much
 the same kind of critters so far as pg_dump is concerned, to wit,
 they're table properties not free-standing objects (which is
 exactly the point of this change).  So it seems to me to make sense
 to process them together.

OK, done this way and committed.

 BTW, don't forget that the getRules move needs to be back-patched.

This too.

Thanks,

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Placing hints in line pointers

2013-06-09 Thread Jeff Davis
On Sat, 2013-06-01 at 15:45 +0100, Simon Riggs wrote:
 Doing this would have two purposes:
 
 * We wouldn't need to follow the pointer if the row is marked aborted.
 This would save a random memory access for that tuple

That's quite similar to LP_DEAD, right? You could potentially set this
new hint sooner, which may be an advantage.

 We wouldn't need to do a FPW when a hint changes, we would only need
 to take a copy of the ItemId array, which is much smaller. And it
 could be protected by its own checksum.

I like the direction of this idea.

I don't see exactly how you want to make this safe, though. During
replay, we can't just replace the ItemId array, because the other
contents on the page might be newer (e.g. some tuples may have been
cleaned out, leaving the item pointers pointing to the wrong place).

 (In addition, if we wanted, this could be used to extend block size to
 64kB if we used 8-byte alignment for tuples)

I think that supporting 64KB blocks has merit.

Interesting observation about the extra bits available. That would be an
on-disk format change, so perhaps this is something to add to the list of
things waiting for a format change?
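
For reference, the line pointer layout being discussed is currently
(src/include/storage/itemid.h):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of item pointer, see below */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;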

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-09 Thread Jeff Davis
Attached is a first draft of an update to pg_filedump for 9.3. I know
pg_filedump is a pgfoundry project, but that seems like it's just there
to host the download; so please excuse the slightly off-topic post here
on -hackers.

I made a few changes to support 9.3, which were mostly fixes related to two
things:

 * new htup_details.h and changes related to FK concurrency improvements
 * XLogRecPtr is now a uint64

And, of course, I added support for checksums. They are always calculated
and displayed, but an error is thrown only if you pass -k. Only the
user knows whether checksums are enabled, because we removed page-level
bits indicating the presence of a checksum.
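
A typical invocation would then look something like this (hypothetical data
directory path and block range, shown only to illustrate where -k fits in):

    pg_filedump -k -R 0 0 $PGDATA/base/12345/16384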

The patch is a bit ugly: I had to copy some code, and copy the entire
checksum.c file (minus some Asserts, which don't work in an external
program). Suggestions welcome.

Regards,
Jeff Davis

diff -Nc pg_filedump-9.2.0/checksum.c pg_filedump-9.3.0j/checksum.c
*** pg_filedump-9.2.0/checksum.c	1969-12-31 16:00:00.0 -0800
--- pg_filedump-9.3.0j/checksum.c	2013-06-09 21:20:34.036176831 -0700
***
*** 0 
--- 1,157 
+ /*-
+  *
+  * checksum.c
+  *	  Checksum implementation for data pages.
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/storage/page/checksum.c
+  *
+  *-
+  *
+  * Checksum algorithm
+  *
+  * The algorithm used to checksum pages is chosen for very fast calculation.
+  * Workloads where the database working set fits into OS file cache but not
+  * into shared buffers can read in pages at a very fast pace and the checksum
+  * algorithm itself can become the largest bottleneck.
+  *
+  * The checksum algorithm itself is based on the FNV-1a hash (FNV is shorthand
+  * for Fowler/Noll/Vo) The primitive of a plain FNV-1a hash folds in data 1
+  * byte at a time according to the formula:
+  *
+  * hash = (hash ^ value) * FNV_PRIME
+  *
+  * FNV-1a algorithm is described at http://www.isthe.com/chongo/tech/comp/fnv/
+  *
+  * PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
+  * high bits - high order bits in input data only affect high order bits in
+  * output data. To resolve this we xor in the value prior to multiplication
+  * shifted right by 17 bits. The number 17 was chosen because it doesn't
+  * have common denominator with set bit positions in FNV_PRIME and empirically
+  * provides the fastest mixing for high order bits of final iterations quickly
+  * avalanche into lower positions. For performance reasons we choose to combine
+  * 4 bytes at a time. The actual hash formula used as the basis is:
+  *
+  * hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17)
+  *
+  * The main bottleneck in this calculation is the multiplication latency. To
+  * hide the latency and to make use of SIMD parallelism multiple hash values
+  * are calculated in parallel. The page is treated as a 32 column two
+  * dimensional array of 32 bit values. Each column is aggregated separately
+  * into a partial checksum. Each partial checksum uses a different initial
+  * value (offset basis in FNV terminology). The initial values actually used
+  * were chosen randomly, as the values themselves don't matter as much as that
+  * they are different and don't match anything in real data. After initializing
+  * partial checksums each value in the column is aggregated according to the
+  * above formula. Finally two more iterations of the formula are performed with
+  * value 0 to mix the bits of the last value added.
+  *
+  * The partial checksums are then folded together using xor to form a single
+  * 32-bit checksum. The caller can safely reduce the value to 16 bits
+  * using modulo 2^16-1. That will cause a very slight bias towards lower
+  * values but this is not significant for the performance of the
+  * checksum.
+  *
+  * The algorithm choice was based on what instructions are available in SIMD
+  * instruction sets. This meant that a fast and good algorithm needed to use
+  * multiplication as the main mixing operator. The simplest multiplication
+  * based checksum primitive is the one used by FNV. The prime used is chosen
+  * for good dispersion of values. It has no known simple patterns that result
+  * in collisions. Test of 5-bit differentials of the primitive over 64bit keys
+  * reveals no differentials with 3 or more values out of 10 random keys
+  * colliding. Avalanche test shows that only high order bits of the last word
+  * have a bias. Tests of 1-4 uncorrelated bit errors, stray 0 and 0xFF bytes,
+  * overwriting page from random position to end with 0 bytes, and overwriting
+  * random segments of page with 0x00, 0xFF and random data all show optimal
+  * 2e-16 

Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-09 Thread Alvaro Herrera
Jeff Davis wrote:

 --- 1000,1015 ----
       strcat (flagString, "HASEXTERNAL|");
     if (infoMask & HEAP_HASOID)
       strcat (flagString, "HASOID|");
 +   if (infoMask & HEAP_XMAX_KEYSHR_LOCK)
 +     strcat (flagString, "XMAX_KEYSHR_LOCK|");
     if (infoMask & HEAP_COMBOCID)
       strcat (flagString, "COMBOCID|");
     if (infoMask & HEAP_XMAX_EXCL_LOCK)
       strcat (flagString, "XMAX_EXCL_LOCK|");
 !   if (infoMask & HEAP_XMAX_SHR_LOCK)
 !     strcat (flagString, "XMAX_SHR_LOCK|");
 !   if (infoMask & HEAP_XMAX_LOCK_ONLY)
 !     strcat (flagString, "XMAX_LOCK_ONLY|");
     if (infoMask & HEAP_XMIN_COMMITTED)
       strcat (flagString, "XMIN_COMMITTED|");
     if (infoMask & HEAP_XMIN_INVALID)

Hm, note that XMAX_SHR_LOCK is two bits, so when that flag is present
you will get the three lock modes displayed with the above code, which is
probably going to be misleading.  htup_details.h does this:

/*
 * Use these to test whether a particular lock is applied to a tuple
 */
#define HEAP_XMAX_IS_SHR_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_SHR_LOCK)
#define HEAP_XMAX_IS_EXCL_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_EXCL_LOCK)
#define HEAP_XMAX_IS_KEYSHR_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_KEYSHR_LOCK)

Presumably it'd be better to do something similar.
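
I.e., something along these lines (sketch only, reusing the variable names
from the quoted code):

    if (infoMask & HEAP_XMAX_LOCK_ONLY)
      strcat (flagString, "XMAX_LOCK_ONLY|");
    if (HEAP_XMAX_IS_SHR_LOCKED (infoMask))
      strcat (flagString, "XMAX_SHR_LOCK|");
    else if (HEAP_XMAX_IS_EXCL_LOCKED (infoMask))
      strcat (flagString, "XMAX_EXCL_LOCK|");
    else if (HEAP_XMAX_IS_KEYSHR_LOCKED (infoMask))
      strcat (flagString, "XMAX_KEYSHR_LOCK|");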

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers