Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-27 Thread Amit Langote
On 27-01-2015 AM 05:46, Jim Nasby wrote:
> On 1/25/15 7:42 PM, Amit Langote wrote:
>> On 21-01-2015 PM 07:26, Amit Langote wrote:
>>> Ok, I will limit myself to focusing on following things at the moment:
>>>
>>> * Provide syntax in CREATE TABLE to declare partition key
>>
>> While working on this, I stumbled upon the question of how we deal with
>> any index definitions following from constraints defined in a CREATE
>> statement. I think we do not want to have a physical index created for a
>> table that is partitioned (in other words, has no heap of its own). As
>> the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
>> EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
>> two decisions to make here:
>>
>> 1) how do we deal with any index definitions (either explicit or
>> implicit following from constraints defined on it) - do we allow them by
>> marking them specially, say, in pg_index, as being mere
>> placeholders/templates or invent some other mechanism?
>>
>> 2) As a short-term solution, do we simply reject creating any indexes
>> (/any constraints that require them) on a table whose definition also
>> includes PARTITION ON clause? Instead define them on its partitions (or
>> any relations in hierarchy that are not further partitioned).
>>
>> Or maybe I'm missing something...
> 
> Wasn't the idea that the parent table in a partitioned table wouldn't
> actually have a heap of its own? If there's no heap there can't be an
> index.
>

Yes, that's right. Perhaps we should not look at the heap-less partitioned
relation thingy so soon, as you say below.

> That said, I think this is premature optimization that could be done later.

It seems so.

Thanks,
Amit





Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Michael Paquier
On Sat, Jan 17, 2015 at 11:06 PM, Robert Haas  wrote:
> On Fri, Jan 16, 2015 at 10:56 AM, Alvaro Herrera
>  wrote:
>> So how about something like
>>
>> #define ALLOCFLAG_HUGE  0x01
>> #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
>> void *
>> MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
>
> That sounds good, although personally I'd rather have the name be
> something like MemoryContextAllocExtended; we have precedent for using
> "Extended" for this sort of thing elsewhere.  Also, I'd suggest trying
> to keep the flag name short, e.g. ALLOC_HUGE and ALLOC_NO_OOM (or
> ALLOC_SOFT_FAIL?).
Yes, I think that this name makes more sense (LockAcquire[Extended],
RangeVarGetRelid[Extended]), as well as keeping the flag names
short.
-- 
Michael




Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-27 Thread Andres Freund
On 2015-01-27 08:21:57 +0100, Andreas Karlsson wrote:
> On 01/23/2015 02:58 AM, Petr Jelinek wrote:
> >On 23/01/15 00:40, Andreas Karlsson wrote:
> >>- Renamed some things from int12 to int128, there are still some places
> >>with int16 which I am not sure what to do with.
> >
> >I'd vote for renaming them to int128 too, there are enough C functions
> >that use int16 for 16-bit integers that this is going to be confusing
> >otherwise.
> 
> Do you also think the SQL functions should be named numeric_int128_sum,
> numeric_int128_avg, etc?

I'm pretty sure we already decided upthread that the SQL interface is
going to keep using int4/8 and by extension int16.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Hot Standby WAL replay uses heavyweight session locks, but doesn't have enough infrastructure set up

2015-01-27 Thread Andres Freund
On 2015-01-27 16:23:53 +0900, Michael Paquier wrote:
> On Tue, Jan 27, 2015 at 6:24 AM, Andres Freund  wrote:
> > Unfortunately that Assert()s when there's a lock conflict because
> > e.g. another backend is currently connecting. That's because ProcSleep()
> > does a enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout) - and
> > there's no deadlock timeout (or lock timeout) handler registered.

> Yes, that could logically happen if there is a conflicting lock, as
> RowExclusiveLock or lower locks can be taken in recovery.

I don't think this specific lock (it's an object lock, not a relation lock)
can easily be taken directly by a user except during authentication.

> > [...]
> > afaics, that should work? Not pretty, but probably easier than starting
> > to reason about the deadlock detector in the startup process.

> Wouldn't it be cleaner to simply register a dedicated handler in
> StartupXlog instead, obviously something else than DEADLOCK_TIMEOUT as
> it is reserved for backend operations? For back-branches, we may even
> consider using DEADLOCK_TIMEOUT..

What would that timeout handler actually do? Two problems:
a) We really don't want the startup process to be killed or to error out, as that
   likely means a postmaster restart. So the default deadlock detector
   strategy is something that's really useless for us.
b) Pretty much all other (if not actually all other) heavyweight lock
   acquisitions in the startup process acquire locks using dontWait =
   true - which means that the deadlock detector isn't actually
   run. That's essentially fine because we simply kill everything in our
   way. C.f. StandbyAcquireAccessExclusiveLock() et al.

There's a dedicated 'deadlock detector' like infrastructure around
ResolveRecoveryConflictWithBufferPin(), but it deals with a class of
deadlocks that's not handled in the deadlock detector anyway.


I think this isn't particularly pretty, but it seems to be working well
enough, and changing it would be pretty invasive. So keeping in line
with all that code seems to be easier.

> > We probably should also add a Assert(!InRecovery || sessionLock) to
> > LockAcquireExtended() - these kind of problems are otherwise hard to
> > find in a developer setting.

> So this means that locks other than session ones cannot be taken while
> a node is in recovery, but RowExclusiveLock can be taken while in
> recovery. Don't we have a problem with this assertion then?

Note that InRecovery doesn't mean what you probably think it means:
/*
 * Are we doing recovery from XLOG?
 *
 * This is only ever true in the startup process; it should be read as meaning
 * "this process is replaying WAL records", rather than "the system is in
 * recovery mode".  It should be examined primarily by functions that need
 * to act differently when called from a WAL redo function (e.g., to skip WAL
 * logging).  To check whether the system is in recovery regardless of which
 * process you're running in, use RecoveryInProgress() but only after shared
 * memory startup and lock initialization.
 */
bool		InRecovery = false;

The assertion actually should be even stricter:
Assert(!InRecovery || (sessionLock && dontWait));
i.e. we never acquire a heavyweight lock in the startup process unless
it's a session lock (as we don't have resource managers/a xact to track
locks) and we don't wait (as we don't have the deadlock detector
infrastructure set up).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-27 Thread Andreas Karlsson

On 01/27/2015 09:05 AM, Andres Freund wrote:
> On 2015-01-27 08:21:57 +0100, Andreas Karlsson wrote:
>> On 01/23/2015 02:58 AM, Petr Jelinek wrote:
>>> On 23/01/15 00:40, Andreas Karlsson wrote:
>>>> - Renamed some things from int12 to int128, there are still some places
>>>> with int16 which I am not sure what to do with.
>>>
>>> I'd vote for renaming them to int128 too, there are enough C functions
>>> that use int16 for 16-bit integers that this is going to be confusing
>>> otherwise.
>>
>> Do you also think the SQL functions should be named numeric_int128_sum,
>> numeric_int128_avg, etc?
>
> I'm pretty sure we already decided upthread that the SQL interface is
> going to keep using int4/8 and by extension int16.


Excellent, then I will leave my latest patch as-is and let the reviewer 
say what he thinks.


--
Andreas Karlsson




Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Michael Paquier
Alvaro Herrera wrote:
>> So how about something like
>>
>> #define ALLOCFLAG_HUGE  0x01
>> #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
>> void *
>> MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
The flag for huge allocations may be useful, but I don't actually see
much value in the flag ALLOC_NO_OOM if the code in aset.c unconditionally
returns NULL in case of an OOM and we let palloc complain about an OOM
when the allocation returns NULL. Am I missing something, perhaps?

>> I definitely do not want to push the nofail stuff via the
>> MemoryContextData-> API into aset.c. Imo aset.c should always return
>> NULL and then mcxt.c should throw the error if in the normal palloc()
>> function.
>
> Sure, that seems reasonable ...
Yes, this would reduce the footprint of this patch on aset.c to a
minimum, by replacing the ereport calls with returning NULL in a couple of places.
-- 
Michael




Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Andres Freund
On 2015-01-27 17:27:53 +0900, Michael Paquier wrote:
> Alvaro Herrera wrote:
> >> So how about something like
> >>
> >> #define ALLOCFLAG_HUGE  0x01
> >> #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
> >> void *
> >> MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
> The flag for huge allocations may be useful, but I don't actually see
> much value in the flag ALLOC_NO_OOM if the stuff in aset.c returns
> unconditionally NULL in case of an OOM and we let palloc complain
> about an OOM when allocation returns NULL. Something I am missing
> perhaps?

I guess the idea is to have a *user facing* MemoryContextAllocExtended()
that can do both huge and no-oom allocations. Otherwise we need palloc-like
wrappers for all the combinations.
We're certainly not just going to ignore memory allocation failures
generally in MemoryContextAllocExtended().
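
To illustrate the shape of that (just a sketch -- the flag names and the
MemoryContextAllocExtended name come from the suggestions upthread, and it
assumes aset.c has been changed to return NULL on OOM as discussed; this is
not an actual patch):

#define ALLOC_HUGE		0x01	/* allow result > MaxAllocSize */
#define ALLOC_NO_OOM	0x02	/* return NULL instead of ERROR on OOM */

void *
MemoryContextAllocExtended(MemoryContext context, Size size, int flags)
{
	void	   *ret;

	/* size sanity check, relaxed when a huge allocation is requested */
	if (((flags & ALLOC_HUGE) != 0 && !AllocHugeSizeIsValid(size)) ||
		((flags & ALLOC_HUGE) == 0 && !AllocSizeIsValid(size)))
		elog(ERROR, "invalid memory alloc request size %zu", size);

	/* assumes the context's alloc method now returns NULL on OOM */
	ret = (*context->methods->alloc) (context, size);

	if (ret == NULL && (flags & ALLOC_NO_OOM) == 0)
		ereport(ERROR,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory")));

	return ret;					/* NULL is possible only with ALLOC_NO_OOM */
}

The point being that this one entry point is the only place deciding whether
an OOM becomes an ERROR, so we don't end up with palloc_huge(),
palloc_noerror(), palloc_huge_noerror() and so on.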

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 17:45:52 -0500, robertmh...@gmail.com wrote:
>
> > Based on the recent emails, it appears there's been a shift of
> > preference to having it be in-core […]
> 
> Well, I'm not sure that anyone else here agreed with me on that

Sure, an in-core AUDIT command would be great. Stephen has always said
that would be the most preferable solution; and if we had the code to
implement it, I doubt anyone would prefer the contrib module instead.
But we don't have that code now, and we won't have it in time for 9.5.

We had an opportunity to work on pgaudit in its current form, and I like
to think that the result is useful. To me, the question has always been
whether some variant of that code would be acceptable for 9.5's contrib.
If so, I had some time to work on that. If not… well, hard luck. But the
option to implement AUDIT was not available to me, which is why I have
not commented much on it recently.

> The basic dynamic here seems to be you asking for changes and Abhijit
> making them but without any real confidence, and I don't feel good
> about that. 

I understand how I might have given you that impression, but I didn't
mean to, and I don't really feel that way.

I appreciate Stephen's suggestions and, although it took me some time to
understand them fully, I think the use of GRANT to provide finer-grained
auditing configuration has improved pgaudit. I am slightly concerned by
the resulting complexity, but I think that can be addressed by examples
and so on. I wouldn't be unhappy if this code were to go into contrib.

(I should point out that it is also not the case that I do not hold any
opinions and would be happy with anything pgaudit-shaped being included.
For example, I strongly prefer GRANT to the 'alice:*:*' approach.)

Anyway, I think it's reasonably clear now that pgaudit is unlikely to
make it into 9.5 in any form, so I'll find something else to do.

-- Abhijit




Re: [HACKERS] alter user/role CURRENT_USER

2015-01-27 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

> Looking at this a bit, I'm not sure completely replacing RoleId with a
> node is the best idea; some of the users of that production in the
> grammar are okay with accepting only normal strings as names, and don't
> need all the CURRENT_USER etc stuff.

You're right. The productions that don't need RoleSpec already use
char* for the role member in their *Stmt structs. I might have done
a bit too much.

Adding the new nonterminal RoleId and using it duplicates a bit of the
check code for "public"/"none" and the like, but it removes the redundant
"!= CSTRING" checks from some productions: CREATE ROLE/USER/GROUP and
ALTER ROLE/USER/GROUP RENAME.

RoleId in the patch still has rule components for CURRENT_USER,
SESSION_USER, and CURRENT_ROLE. Without them, the parser prints
an error that users cannot understand.

| =# alter role current_user rename to "PuBlic";
| ERROR:  syntax error at or near "rename"
| LINE 1: alter role current_user rename to "PuBlic";
| ^

With the components, the error message becomes like this.

| =# alter role current_role rename to "PuBlic";
| ERROR:  reserved word cannot be used
| LINE 1: alter role current_role rename to "PuBlic";
|^


>   Maybe we need a new production,
> say RoleSpec, and we modify the few productions that need the extra
> flexibility?  For instance we could have ALTER USER RoleSpec instead of
> ALTER USER RoleId.  But we would keep CREATE USER RoleId, because it
> doesn't make any sense to accept CREATE USER CURRENT_USER.
> I think that would lead to a less invasive patch also.
> Am I making sense?

I think I did what you expected. It was good in that the code got
simpler, but the invasiveness doesn't seem to be reduced..

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From d90b7e09968f32a5b543242469fb65e304df3318 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 14 Nov 2014 17:37:22 +0900
Subject: [PATCH] ALTER USER CURRENT_USER v4

Added RoleId for use where CURRENT_USER-like stuff is not used.
CREATE ROLE/USER/GROUP, ALTER ROLE/USER/GROUP RENAME are changed to
use RoleId.
---
 src/backend/catalog/aclchk.c   |  30 +++--
 src/backend/commands/alter.c   |   2 +-
 src/backend/commands/extension.c   |   2 +-
 src/backend/commands/foreigncmds.c |  57 -
 src/backend/commands/schemacmds.c  |  30 -
 src/backend/commands/tablecmds.c   |   4 +-
 src/backend/commands/tablespace.c  |   2 +-
 src/backend/commands/user.c|  86 +++---
 src/backend/nodes/copyfuncs.c  |  37 --
 src/backend/nodes/equalfuncs.c |  35 --
 src/backend/parser/gram.y  | 229 +
 src/backend/parser/parse_utilcmd.c |   4 +-
 src/backend/utils/adt/acl.c|  34 ++
 src/include/commands/user.h|   2 +-
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |  48 ++--
 src/include/utils/acl.h|   2 +-
 17 files changed, 400 insertions(+), 205 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 1e3888e..1c90626 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -430,13 +430,16 @@ ExecuteGrantStmt(GrantStmt *stmt)
 	foreach(cell, stmt->grantees)
 	{
 		PrivGrantee *grantee = (PrivGrantee *) lfirst(cell);
+		Oid grantee_uid = ACL_ID_PUBLIC;
 
-		if (grantee->rolname == NULL)
-			istmt.grantees = lappend_oid(istmt.grantees, ACL_ID_PUBLIC);
-		else
-			istmt.grantees =
-				lappend_oid(istmt.grantees,
-			get_role_oid(grantee->rolname, false));
+		/* "public" is mapped to ACL_ID_PUBLIC */
+		if (grantee->role->roltype != ROLESPEC_PUBLIC)
+		{
+			grantee_uid = get_rolespec_oid(grantee->role, false);
+			if (!OidIsValid(grantee_uid))
+				grantee_uid = ACL_ID_PUBLIC;
+		}
+		istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
 	}
 
 	/*
@@ -913,13 +916,16 @@ ExecAlterDefaultPrivilegesStmt(AlterDefaultPrivilegesStmt *stmt)
 	foreach(cell, action->grantees)
 	{
 		PrivGrantee *grantee = (PrivGrantee *) lfirst(cell);
+		Oid grantee_uid = ACL_ID_PUBLIC;
 
-		if (grantee->rolname == NULL)
-			iacls.grantees = lappend_oid(iacls.grantees, ACL_ID_PUBLIC);
-		else
-			iacls.grantees =
-				lappend_oid(iacls.grantees,
-			get_role_oid(grantee->rolname, false));
+		/* "public" is mapped to ACL_ID_PUBLIC */
+		if (grantee->role->roltype != ROLESPEC_PUBLIC)
+		{
+			grantee_uid = get_rolespec_oid(grantee->role, false);
+			if (!OidIsValid(grantee_uid))
+				grantee_uid = ACL_ID_PUBLIC;
+		}
+		iacls.grantees = lappend_oid(iacls.grantees, grantee_uid);
 	}
 
 	/*
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 78b54b4..1d8799b 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -679,7 +679,7 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
 Oid
 ExecAlt

Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC

2015-01-27 Thread Giuseppe Broccolo
Hi Marco,

On 16/01/15 16:55, Marco Nenciarini wrote:
> On 14/01/15 17:22, Gabriele Bartolini wrote:
> >
> > My opinion, Marco, is that for version 5 of this patch, you:
> >
> > 1) update the information on the wiki (it is outdated - I know you have
> > been busy with LSN map optimisation)
>
> Done.
>
> > 2) modify pg_basebackup in order to accept a directory (or tar file) and
> > automatically detect the LSN from the backup profile
>
> New version of patch attached. The -I parameter now requires a backup
> profile from a previous backup. I've added a sanity check that forbid
> incremental file level backups if the base timeline is different from
> the current one.
>
> > 3) add the documentation regarding the backup profile and pg_basebackup
> >
>
> Next on my TODO list.
>
> > Once we have all of this, we can continue trying the patch. Some
> > unexplored paths are:
> >
> > * tablespace usage
>
> I've improved my pg_restorebackup python PoC. It now supports tablespaces.

About tablespaces, I noticed that any reference to tablespace locations is
lost during the recovery of an incremental backup that changes the tablespace
mapping (-T option). Here are the steps I followed:

   - creating and filling a test database obtained through pgbench

   psql -c "CREATE DATABASE pgbench"
   pgbench -U postgres -i -s 5 -F 80 pgbench

   - a first base backup with pg_basebackup:

   mkdir -p backups/$(date '+%d%m%y%H%M')/data && pg_basebackup -v -F
p -D backups/$(date '+%d%m%y%H%M')/data -x

   - creation of a new tablespace, alter the table "pgbench_accounts" to
   set the new tablespace:

   mkdir -p /home/gbroccolo/pgsql/tbls
   psql -c "CREATE TABLESPACE tbls LOCATION '/home/gbroccolo/pgsql/tbls'"
   psql -c "ALTER TABLE pgbench_accounts SET TABLESPACE tbls" pgbench

   - Doing some work on the database:

   pgbench -U postgres -T 120 pgbench

   - a second incremental backup with pg_basebackup specifying the new
   location for the tablespace through the tablespace mapping:

   mkdir -p backups/$(date '+%d%m%y%H%M')/data backups/$(date
'+%d%m%y%H%M')/tbls && pg_basebackup -v -F p -D backups/$(date
'+%d%m%y%H%M')/data -x -I backups/2601151641/data/backup_profile -T
/home/gbroccolo/pgsql/tbls=/home/gbroccolo/pgsql/backups/$(date
'+%d%m%y%H%M')/tbls

   - a recovery based on the tool pg_restorebackup.py attached in
   http://www.postgresql.org/message-id/54b9428e.9020...@2ndquadrant.it

   ./pg_restorebackup.py backups/2601151641/data
backups/2601151707/data /tmp/data -T
/home/gbroccolo/pgsql/backups/2601151707/tbls=/tmp/tbls


In the last step, I obtained the following stack trace:

Traceback (most recent call last):
  File "./pg_restorebackup.py", line 74, in 
shutil.copy2(base_file, dest_file)
  File "/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py",
line 130, in copy2
copyfile(src, dst)
  File "/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py",
line 82, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory:
'backups/2601151641/data/base/16384/16406_fsm'


Any idea on what's going wrong?

Thanks,
Giuseppe.
-- 
Giuseppe Broccolo - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
giuseppe.brocc...@2ndquadrant.it | www.2ndQuadrant.it


Re: [HACKERS] [POC] FETCH limited by bytes.

2015-01-27 Thread Kyotaro HORIGUCHI
Thank you for the comment.

The automatic way to determine the fetch_size seems to have become too
much for the purpose. An example of a non-automatic way is a new
foreign table option like 'fetch_size', but this exposes the internals
too much... Which do you think is preferable?

Thu, 22 Jan 2015 11:17:52 -0500, Tom Lane  wrote in 
<24503.1421943...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > Hello, as the discuttion on async fetching on postgres_fdw, FETCH
> > with data-size limitation would be useful to get memory usage
> > stability of postgres_fdw.
> 
> > Is such a feature and syntax could be allowed to be added?
> 
> This seems like a lot of work, and frankly an incredibly ugly API,
> for a benefit that is entirely hypothetical.  Have you got numbers
> showing any actual performance win for postgres_fdw?

The API is rushed work to make a path for the new parameter
(but, yes, I did too much for the purpose of use from
postgres_fdw..) and it could have any saner syntax, but it's not
the time to do so yet.

The data-size limitation, whatever size the limit is, would give a
significant gain, especially for small-sized rows.

This patch began from the fact that it runs about twice as fast
with fetch size = 10000 as with 100.

http://www.postgresql.org/message-id/20150116.171849.109146500.horiguchi.kyot...@lab.ntt.co.jp

I took exec times to get 1M rows from localhost via postgres_fdw
and it showed the following numbers.

=# SELECT a from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)                           0.75s
100            60                 6.2s       6000 (0.006)
10000          60                 2.7s     600000 (0.6  )
30000          60                 2.2s    1800000 (2.0  )
60000          60                 2.4s    3600000 (4.0  )

=# SELECT a, b, c from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)                           0.8s
100           204                12  s      20400 (0.02 )
1000          204                10  s     204000 (0.2  )
10000         204                 5.8s    2040000 (2)
20000         204                 5.9s    4080000 (4)

=# SELECT a, b, d from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)                           0.8s
100          1356                17  s     135600 (0.136)
1000         1356                15  s    1356000 (1.356)
1475         1356                13  s    2000100 (2.0  )
2950         1356                13  s    4000200 (4.0  )

The definitions of the environment are the following.

CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', 
dbname 'postgres');
CREATE USER MAPPING FOR PUBLIC SERVER sv1;
CREATE TABLE lt1 (a int, b timestamp, c text, d text);
CREATE FOREIGN TABLE ft1 (a int, b timestamp, c text, d text) SERVER sv1 
OPTIONS (table_name 'lt1');
INSERT INTO lt1 (SELECT a, now(), repeat('x', 128), repeat('x', 1280) FROM 
generate_series(0, 999999) a);

The "avg row size" is alloced_mem/fetch_size and the alloced_mem
is the sum of HeapTuple[fetch_size] and (HEAPTUPLESIZE +
tup->t_len) for all stored tuples in the receiver side,
fetch_more_data() in postgres_fdw.

They are about a 50% gain for the smaller tuple size and 25% for
the larger. They look to be optimal where alloced_mem is
around 2MB, for a reason unknown to me. Anyway, the difference
seems to be significant.

> Even if we wanted to do something like this, I strongly object to
> measuring size by heap_compute_data_size.  That's not a number that users
> would normally have any direct knowledge of; nor does it have anything
> at all to do with the claimed use-case, where what you'd really need to
> measure is bytes transmitted down the wire.  (The difference is not small:
> for instance, toasted values would likely still be toasted at the point
> where you're measuring.)

Sure. Finally, the attached patch #1 does the following
things:

 - Sender limits the number of tuples using the sum of the net
   length of the column values to be sent, not including protocol
   overhead. It is calculated in the added function
   slot_compute_attr_size(), using raw length for compressed
   values.

 - postgres_fdw calculates fetch limit bytes by the following
   formula,

   MAX_FETCH_MEM - MAX_FETCH_SIZE * (estimated overhead per tuple);
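
In rough C terms those two pieces would look something like this (a sketch
only: slot_compute_attr_size, MAX_FETCH_MEM and MAX_FETCH_SIZE are the names
from the description above, fetch_limit_bytes is invented here, and the real
patch computes the per-tuple overhead at execution time):

/* needs postgres.h, access/tuptoaster.h, executor/tuptable.h */

#define MAX_FETCH_MEM	(2 * 1024 * 1024)	/* 2 MiB budget per fetch */
#define MAX_FETCH_SIZE	30000				/* hard cap on rows per fetch */

/* net size of one tuple's column values; raw length for compressed ones */
static Size
slot_compute_attr_size(TupleTableSlot *slot)
{
	TupleDesc	tupdesc = slot->tts_tupleDescriptor;
	Size		total = 0;
	int			i;

	slot_getallattrs(slot);
	for (i = 0; i < tupdesc->natts; i++)
	{
		Form_pg_attribute att = tupdesc->attrs[i];

		if (slot->tts_isnull[i])
			continue;
		if (att->attlen == -1)	/* varlena: count the uncompressed length */
			total += toast_raw_datum_size(slot->tts_values[i]);
		else if (att->attlen > 0)	/* fixed width (cstring omitted here) */
			total += att->attlen;
	}
	return total;
}

/* byte limit that postgres_fdw would hand to the remote side per FETCH */
static Size
fetch_limit_bytes(Size est_overhead_per_tuple)
{
	Size		overhead = MAX_FETCH_SIZE * est_overhead_per_tuple;

	return (overhead < MAX_FETCH_MEM) ? MAX_FETCH_MEM - overhead : 0;
}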

The result of the patch is as follows. MAX_FETCH_MEM = 2MiB and
MAX_FETCH_SIZE = 30000.

fetch_size,   avg row size(*1),   time,   max alloced_mem/fetch(Mbytes)
(auto)          60                2.4s    1080000 (1.08)
(auto)         204                7.3s     536400 (0.54)
(auto)        1356               15  s     430236 (0.43)

This is meaningfully fast but the patch looks too big and the
meaning of the new parameter is hard to understand..:(


On the other hand, the cause of the discrepancies in alloced_mem
shown above is the per-tuple overhead, the sum of which is unknown
before execution.  The second patch makes FE

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote:
>
> Anything happen with this?

Nothing so far.

-- Abhijit




Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Pavel Stehule
2015-01-26 23:29 GMT+01:00 Jim Nasby :

> On 1/26/15 4:17 PM, Pavel Stehule wrote:
>
>> Any way to reduce the code duplication between the array and
>> non-array versions? Maybe factor out the operator caching code?
>>
>>
>> I thought about it - but there are different checks, different result
>> processing, different result types.
>>
>> I didn't find any readable and reduced form :(
>>
>
> Yeah, that's why I was thinking specifically of the operator caching
> code... isn't that identical? That would at least remove a dozen lines...


It is only partially identical - I would like to use a cache for array_offset,
but it is not necessary for array_offsets .. it depends on how we would
modify the current API to support externally cached data.

Regards

Pavel



>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-26 21:44 GMT+01:00 Jim Nasby :

> On 1/25/15 4:23 AM, Pavel Stehule wrote:
>
>>
>> I tested a concept iteration over array in format [key1, value1, key2,
>> value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
>> ...] too
>>
>> It is only a few lines more to current code, and this change doesn't
>> break a compatibility.
>>
>> Do you think, so this patch is acceptable?
>>
>> Ideas, comments?
>>
>
> Aside from fixing the comments... I think this needs more tests on corner
> cases. For example, what happens when you do
>
> foreach a, b, c in array(array(1,2),array(3,4)) ?
>

the behaviour is relatively simple -- missing values are NULL

array(1,2),array(3,4) -- do you mean ARRAY[[1,2],[3,4]]? It is effectively
ARRAY[1,2,3,4]


>
> Or the opposite case of
>
> foreach a,b in array(array(1,2,3))
>
> Also, what about:
>
> foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?



 postgres=# select array(select
unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
   array
---
 {1,2,3,4,5,6,7,8}
(1 row)

so it generates the pairs {1,2},{3,4},{5,6},{7,8}

Regards

Pavel Stehule


> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund  wrote:
>> I basically have two ideas to fix this.
>>
>> 1) Make do_pg_start_backup() acquire a SHARE lock on
>>pg_database. That'll prevent it from starting while a movedb() is
>>still in progress. Then additionally add pg_backup_in_progress()
>>function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
>>XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
>>movedb() to error out if a backup is in progress.
>
> Attached is a patch trying to do this. Doesn't look too bad, and it led me to
> discover missing recovery conflicts during an AD ST.
>
> But: It doesn't actually work on standbys, because lock.c prevents any
> stronger lock than RowExclusive from being acquired. And we need a
> lock that can conflict with WAL replay of DBASE_CREATE, to handle base
> backups that are executed on the primary. Those obviously can't detect
> whether any standby is currently doing a base backup...
>
> I currently don't have a good idea how to mangle lock.c to allow
> this. I've played with doing it like in the second patch, but that
> doesn't actually work because of some asserts around ProcSleep - leading
> to locks on database objects not working in the startup process (despite
> already being used).
>
> The easiest thing would be to just use a lwlock instead of a heavyweight
> lock - but those aren't canceleable...

How about just wrapping an lwlock around a flag variable?  movedb()
increments the variable when starting and decrements it when done
(must use PG_ENSURE_ERROR_CLEANUP).  Starting a backup errors out (or
waits in 1-second increments) if it's non-zero.
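
Roughly something of this shape (a sketch of the idea only, with invented
names -- MoveDbSharedState and friends are not from any patch):

/* needs postgres.h, storage/lwlock.h, storage/ipc.h */

typedef struct MoveDbSharedState
{
	LWLock	   *lock;			/* assumed to be assigned at shmem init */
	int			movedb_count;	/* movedb()/createdb() calls in flight */
} MoveDbSharedState;

static MoveDbSharedState *MoveDbState;	/* assumed to point into shmem */

static void
movedb_count_cleanup(int code, Datum arg)
{
	LWLockAcquire(MoveDbState->lock, LW_EXCLUSIVE);
	MoveDbState->movedb_count--;
	LWLockRelease(MoveDbState->lock);
}

static void
movedb_guarded(void)
{
	LWLockAcquire(MoveDbState->lock, LW_EXCLUSIVE);
	MoveDbState->movedb_count++;
	LWLockRelease(MoveDbState->lock);

	/* make sure the count is dropped again even on error */
	PG_ENSURE_ERROR_CLEANUP(movedb_count_cleanup, (Datum) 0);
	{
		/* ... the actual movedb() work would go here ... */
	}
	PG_END_ENSURE_ERROR_CLEANUP(movedb_count_cleanup, (Datum) 0);

	/* normal exit: drop the count ourselves */
	movedb_count_cleanup(0, (Datum) 0);
}

/* called from do_pg_start_backup(): error out (or sleep and retry) */
static void
check_no_movedb_in_progress(void)
{
	bool		busy;

	LWLockAcquire(MoveDbState->lock, LW_SHARED);
	busy = (MoveDbState->movedb_count != 0);
	LWLockRelease(MoveDbState->lock);

	if (busy)
		ereport(ERROR,
				(errmsg("a database is being created or moved, retry the base backup later")));
}

Since the lock is only held across the increment/decrement and the check, the
fact that LWLocks aren't cancelable shouldn't matter here.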

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund  wrote:
> On 2015-01-22 20:54:47 -0500, Stephen Frost wrote:
>> * Bruce Momjian (br...@momjian.us) wrote:
>> > On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
>> > > Or do you - as the text edited in your patch, but not the quote above -
>> > > mean to run pg_upgrade just on the primary and then rsync?
>> >
>> > No, I was going to run it on both, then rsync.
>>
>> I'm pretty sure this is all a lot easier than you believe it to be.  If
>> you want to recreate what pg_upgrade does to a cluster then the simplest
>> thing to do is rsync before removing any of the hard links.  rsync will
>> simply recreate the same hard link tree that pg_upgrade created when it
>> ran, and update files which were actually changed (the catalog tables).
>
> I don't understand why that'd be better than simply fixing (yes, that's
> imo the correct term) pg_upgrade to retain relfilenodes across the
> upgrade. Afaics there's no conflict risk and it'd make the clusters much
> more similar, which would be good; independent of rsyncing standbys.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Sat, Jan 24, 2015 at 10:04 PM, Bruce Momjian  wrote:
> On Fri, Jan 23, 2015 at 02:34:36PM -0500, Stephen Frost wrote:
>> > > You'd have to replace the existing data directory on the master to do
>> > > that, which pg_upgrade was designed specifically to not do, in case
>> > > things went poorly.
>> >
>> > Why? Just rsync the new data directory onto the old directory on the
>> > standbys. That's fine and simple.
>>
>> That still doesn't address the need to use --size-only, it would just
>> mean that you don't need to use -H.  If anything the -H part is the
>> aspect which worries me the least about this approach.
>
> I can now confirm that it works, just as Stephen said.  I was able to
> upgrade a standby cluster that contained the regression database, and
> the pg_dump output was perfect.
>
> I am attaching doc instruction that I will add to all branches as soon
> as someone else confirms my results.  You will need to use rsync
> --itemize-changes to see the hard links being created, e.g.:
>
>hf+ pgsql/data/base/16415/28188 => pgsql.old/data/base/16384/28188

My rsync manual page (on two different systems) mentions nothing about
remote_dir, so I'd be quite unable to follow your proposed directions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Windows buildfarm animals are still not happy with abbreviated keys patch

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 9:52 PM, Michael Paquier
 wrote:
> On Tue, Jan 27, 2015 at 4:21 AM, Robert Haas  wrote:
>> That's what I hope to find out.  :-)
> Buildfarm seems happy now. I just gave a try to that on one of my
> small Windows VMs and compared the performance with 9.4 for this
> simple test case when building with MSVC 2010:
> create table aa as select random()::text as a, 'filler filler filler'
> as b from generate_series(1,1000000) a;
> create index aai on aa(a);
> On 9.4, the index creation took 26.5s, while on master it took 18s.
> That's nice, particularly for things like a restore from a dump.

Cool.  That's a little bit smaller win than I would have expected
given my Linux results, but it's certainly respectable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] GetLockConflicts() and thus recovery conflicts seem pretty broken

2015-01-27 Thread Andres Freund
Hi,

While investigating other bugs I noticed that
ResolveRecoveryConflictWithLock() wasn't really working. Turns out
GetLockConflicts() violates it's API contract which says:

 * The result array is palloc'd and is terminated with an invalid VXID.

Problem 1:
We don't actually put the terminator there. It happens to more or less
accidentally work on a master because the array is palloc0()ed there, and
while 0 is a valid backend id, it happens not to be a valid local
transaction id.  In HS we don't actually allocate the array every time;
it's instead statically allocated, without zeroing.

I have no idea why this doesn't crash
ResolveRecoveryConflictWithLock(), which does:
	while (!lock_acquired)
	{
		if (++num_attempts < 3)
			backends = GetLockConflicts(locktag, AccessExclusiveLock);
		...
		ResolveRecoveryConflictWithVirtualXIDs(backends,
											 PROCSIG_RECOVERY_CONFLICT_LOCK);

and ResolveRecoveryConflictWithVirtualXIDs does:

	...
	while (VirtualTransactionIdIsValid(*waitlist))
	{
		/* kill kill kill */

		/* The virtual transaction is gone now, wait for the next one */
		waitlist++;
	}

I guess we just accidentally happen to come across appropriately
set memory at some point.

I'm really baffled that this hasn't caused significant problems so far.
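
(The minimal fix for problem 1 is presumably just to write the terminator
explicitly once the array has been filled -- sketch only, the helper name is
invented and this is not necessarily how it should be committed:)

/*
 * After filling vxids[0..count-1] in GetLockConflicts(), terminate the
 * array explicitly so that callers looping with
 * VirtualTransactionIdIsValid() stop reliably.
 */
static void
terminate_vxid_array(VirtualTransactionId *vxids, int count)
{
	vxids[count].backendId = InvalidBackendId;
	vxids[count].localTransactionId = InvalidLocalTransactionId;
}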


Problem 2:
Since bcd8528f001 and 29eedd312274 the "the result array is palloc'd" is
wrong because we're now doing:

	static VirtualTransactionId *vxids;

	/*
	 * Allocate memory to store results, and fill with InvalidVXID.  We only
	 * need enough space for MaxBackends + a terminator, since prepared xacts
	 * don't count. InHotStandby allocate once in TopMemoryContext.
	 */
	if (InHotStandby)
	{
		if (vxids == NULL)
			vxids = (VirtualTransactionId *)
				MemoryContextAlloc(TopMemoryContext,
								   sizeof(VirtualTransactionId) * (MaxBackends + 1));
	}
	else
		vxids = (VirtualTransactionId *)
			palloc0(sizeof(VirtualTransactionId) * (MaxBackends + 1));

Obviously that violates the API contract. I'm inclined to rip the HS
special case out and add a pfree() to the single HS caller. The commit
message introducing it says:

Use malloc() in GetLockConflicts() when called InHotStandby to avoid
repeated palloc calls. Current code assumed this was already true, so
this is a bug fix.
But I can't really believe this is relevant.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Jan 24, 2015 at 10:04 PM, Bruce Momjian  wrote:
> > On Fri, Jan 23, 2015 at 02:34:36PM -0500, Stephen Frost wrote:
> >> > > You'd have to replace the existing data directory on the master to do
> >> > > that, which pg_upgrade was designed specifically to not do, in case
> >> > > things went poorly.
> >> >
> >> > Why? Just rsync the new data directory onto the old directory on the
> >> > standbys. That's fine and simple.
> >>
> >> That still doesn't address the need to use --size-only, it would just
> >> mean that you don't need to use -H.  If anything the -H part is the
> >> aspect which worries me the least about this approach.
> >
> > I can now confirm that it works, just as Stephen said.  I was able to
> > upgrade a standby cluster that contained the regression database, and
> > the pg_dump output was perfect.
> >
> > I am attaching doc instruction that I will add to all branches as soon
> > as someone else confirms my results.  You will need to use rsync
> > --itemize-changes to see the hard links being created, e.g.:
> >
> >hf+ pgsql/data/base/16415/28188 => 
> > pgsql.old/data/base/16384/28188
> 
> My rsync manual page (on two different systems) mentions nothing about
> remote_dir, so I'd be quite unable to follow your proposed directions.

The example listed works, but only when it's a local rsync:

rsync --archive --hard-links --size-only old_dir new_dir remote_dir

Perhaps a better example (or additional one) would be with a remote
rsync, including clarification of old and new dir, like so:

(run in /var/lib/postgresql)
rsync --archive --hard-links --size-only \
  9.3/main \
  9.4/main \
  server:/var/lib/postgresql/

Note that 9.3/main and 9.4/main are two source directories for rsync to
copy over, while server:/var/lib/postgresql/ is a remote destination
directory.  The above directories match a default Debian/Ubuntu install.

Thanks!

Stephen




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund  wrote:
>> I don't understand why that'd be better than simply fixing (yes, that's
>> imo the correct term) pg_upgrade to retain relfilenodes across the
>> upgrade. Afaics there's no conflict risk and it'd make the clusters much
>> more similar, which would be good; independent of rsyncing standbys.

> +1.

That's certainly impossible for the system catalogs, which means you
have to be able to deal with relfilenode discrepancies for them, which
means that maintaining the same relfilenodes for user tables is of
dubious value.

regards, tom lane




Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread David Fetter
On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:
> Hi PG devs!
> 
> Tom Lane  writes:
> 
> >> Wait for first IO, issue second IO request
> >> Compute
> >> Already have second IO request, issue third
> >> ...
> >
> >> We'd be a lot less sensitive to IO latency.
> >
> > It would take about five minutes of coding to prove or disprove this:
> > stick a PrefetchBuffer call into heapgetpage() to launch a request for the
> > next page as soon as we've read the current one, and then see if that
> > makes any obvious performance difference.  I'm not convinced that it will,
> > but if it did then we could think about how to make it work for real.
> 
> Sorry for dropping in so late...
> 
> I have done all this two years ago.  For TPC-H Q8, Q9, Q17, Q20, and Q21
> I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops
> Look-Ahead (the outer loop!).
> (On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Would you be so kind as to pass along any patches (ideally applicable
to git master), tests, and specific measurements you made?
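
For reference, the quick experiment Tom describes above would be roughly of
this shape (a sketch only -- it wraps heapgetpage() rather than editing it so
the fragment stands alone, and PrefetchBuffer() is a no-op unless the build
has posix_fadvise support):

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

static void
heapgetpage_and_prefetch(HeapScanDesc scan, BlockNumber page)
{
	/* existing work: read and pin the requested block */
	heapgetpage(scan, page);

	/* start the kernel on the next block while we process this one */
	if (page + 1 < scan->rs_nblocks)
		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, page + 1);
}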

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund  
>> wrote:
>>> I don't understand why that'd be better than simply fixing (yes, that's
>>> imo the correct term) pg_upgrade to retain relfilenodes across the
>>> upgrade. Afaics there's no conflict risk and it'd make the clusters much
>>> more similar, which would be good; independent of rsyncing standbys.
>
>> +1.
>
> That's certainly impossible for the system catalogs, which means you
> have to be able to deal with relfilenode discrepancies for them, which
> means that maintaining the same relfilenodes for user tables is of
> dubious value.

Why is that impossible for the system catalogs?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 9:36 AM, Stephen Frost  wrote:
> The example listed works, but only when it's a local rsync:
>
> rsync --archive --hard-links --size-only old_dir new_dir remote_dir
>
> Perhaps a better example (or additional one) would be with a remote
> rsync, including clarification of old and new dir, like so:
>
> (run in /var/lib/postgresql)
> rsync --archive --hard-links --size-only \
>   9.3/main \
>   9.4/main \
>   server:/var/lib/postgresql/
>
> Note that 9.3/main and 9.4/main are two source directories for rsync to
> copy over, while server:/var/lib/postgresql/ is a remote destination
> directory.  The above directories match a default Debian/Ubuntu install.

My point is that Bruce's patch suggests looking for "remote_dir" in
the rsync documentation, but no such term appears there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane  wrote:
>> That's certainly impossible for the system catalogs, which means you
>> have to be able to deal with relfilenode discrepancies for them, which
>> means that maintaining the same relfilenodes for user tables is of
>> dubious value.

> Why is that impossible for the system catalogs?

New versions aren't guaranteed to have the same system catalogs, let alone
the same relfilenodes for them.

regards, tom lane




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Jan 27, 2015 at 9:36 AM, Stephen Frost  wrote:
> > The example listed works, but only when it's a local rsync:
> >
> > rsync --archive --hard-links --size-only old_dir new_dir remote_dir
> >
> > Perhaps a better example (or additional one) would be with a remote
> > rsync, including clarification of old and new dir, like so:
> >
> > (run in /var/lib/postgresql)
> > rsync --archive --hard-links --size-only \
> >   9.3/main \
> >   9.4/main \
> >   server:/var/lib/postgresql/
> >
> > Note that 9.3/main and 9.4/main are two source directories for rsync to
> > copy over, while server:/var/lib/postgresql/ is a remote destination
> > directory.  The above directories match a default Debian/Ubuntu install.
> 
> My point is that Bruce's patch suggests looking for "remote_dir" in
> the rsync documentation, but no such term appears there.

Ah, well, perhaps we could simply add a bit of clarification to this:

for details on specifying remote_dir

like so:

for details on specifying the destination remote_dir

?

On my system, the rsync man page has '[DEST]' in the synopsis, but it
doesn't actually go on to specifically define what 'DEST' is, rather
referring to it later as 'destination' or 'remote directory'.

I'm sure other suggestions would be welcome if they'd help clarify.

Thanks,

Stephen




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Andres Freund
On 2015-01-27 10:20:48 -0500, Robert Haas wrote:
> On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund  
> >> wrote:
> >>> I don't understand why that'd be better than simply fixing (yes, that's
> >>> imo the correct term) pg_upgrade to retain relfilenodes across the
> >>> upgrade. Afaics there's no conflict risk and it'd make the clusters much
> >>> more similar, which would be good; independent of rsyncing standbys.
> >
> >> +1.
> >
> > That's certainly impossible for the system catalogs, which means you
> > have to be able to deal with relfilenode discrepancies for them, which
> > means that maintaining the same relfilenodes for user tables is of
> > dubious value.
> 
> Why is that impossible for the system catalogs?

Maybe it's not impossible for existing catalogs, but it's certainly
complicated. But I don't think it's all that desirable anyway - they're
not the same relations after the pg_upgrade (initdb/pg_dump
filled them). That's different for the user-defined relations.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas  wrote:
> This developed a slight merge conflict.  I've rebased the attached
> version, and I also took the step of getting rid of buf_table.c, as I
> think I proposed somewhere upthread.  This avoids the overhead of
> constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> some function calls and so forth.  A quick-and-dirty test suggests
> this might not have cut down on the 1-client overhead much, but I
> think it's worth doing anyway: it's certainly saving a few cycles, and
> I don't think it's complicating anything measurably.

So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.

clients - master@4b2a25 - master+chash-buftable-v2
1 8319.254199 8105.479632
2 15485.561948 15138.227533
3 23690.186576 23264.702784
4 32157.346740 31536.870881
5 40879.532747 40455.794841
6 49063.279970 49625.681573
7 57518.374517 57492.275197
8 65351.415323 65622.211763
16 126166.416498 126668.793798
24 155727.916112 155587.414299
32 180329.012019 179543.424754
40 201384.186317 200109.614362
48 222352.265595 225688.574611
56 240400.659338 254398.144976
64 253699.031075 266624.224699
72 261198.336625 270652.578322
80 264569.862257 270332.738084

That's extremely unimpressive.  You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000)
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Andres Freund
On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
> On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas  wrote:
> > This developed a slight merge conflict.  I've rebased the attached
> > version, and I also took the step of getting rid of buf_table.c, as I
> > think I proposed somewhere upthread.  This avoids the overhead of
> > constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> > some function calls and so forth.  A quick-and-dirty test suggests
> > this might not have cut down on the 1-client overhead much, but I
> > think it's worth doing anyway: it's certainly saving a few cycles, and
> > I don't think it's complicating anything measurably.
> 
> So here are median-of-three results for 5-minute read-only pgbench
> runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
> various client counts.

> That's extremely unimpressive.  You (Andres) showed a much bigger
> benefit on a different machine with a much higher scale factor (5000)
> so I don't know whether the modest benefit here is because of the
> different machine, the different scale factor, or what.

Based on my test on hydra some time back the bottleneck is nearly
entirely in GetSnapshotData() at higher client counts. So I'm not that
surprised you don't see the big benefit here.  IIRC on hydra the results
for using a large enough shared_buffers setting for a fully cached run
at that scale are pretty close to your master results - so there's really
not much performance increase to be expected by making the buf table
lockless.

I guess you would see a slightly bigger difference at a different
shared_buffer/scale combination. IIRC a scale 1000 is about 15GB large?
So about half of the data set fit in shared buffers. In the scale 5k
results I posted it was about 1/5.

I had also tested on a four-socket x86 machine where GetSnapshotData()
was a big bottleneck, but not the sole one.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane  wrote:
> >> That's certainly impossible for the system catalogs, which means you
> >> have to be able to deal with relfilenode discrepancies for them, which
> >> means that maintaining the same relfilenodes for user tables is of
> >> dubious value.
> 
> > Why is that impossible for the system catalogs?
> 
> New versions aren't guaranteed to have the same system catalogs, let alone
> the same relfilenodes for them.

Indeed, new versions almost certainly have wholly new system catalogs.

While there might be a reason to keep the relfilenodes the same, it
doesn't actually help with the pg_upgrade use-case we're currently
discussing (at least, not without additional help).  The problem is that
we certainly must transfer all the new catalogs, but how would rsync
know that those catalog files have to be transferred but not the user
relations?  Using --size-only would mean that system catalogs whose
sizes happen to match after the upgrade wouldn't be transferred and that
would certainly lead to a corrupt situation.

Andres proposed a helper script which would go through the entire tree
on the remote side and set all the timestamps there to
match those on the local side (prior to the pg_upgrade).  If all the
relfilenodes remained the same and the timestamps on the catalog tables
all changed then it might work to do (without using --size-only):

stop-cluster
set-timestamp-script
pg_upgrade
rsync new_data_dir -> remote:existing_cluster

This would mean that any other files which happened to be changed by
pg_upgrade beyond the catalog tables would also get copied across.  The
issue that I see with that is that if the pg_upgrade process does touch
anything outside of the system catalogs, then its documented revert
mechanism (rename the control file and start the old cluster back up,
prior to having started the new cluster) wouldn't be valid.  Requiring
an extra script which runs around changing timestamps on files is a bit
awkward too, though I suppose possible, and then we'd also have to
document that this process only works with $version of pg_upgrade that
does the preservation of the relfilenodes.

I suppose there's also technically a race condition to consider, if the
whole thing is scripted and pg_upgrade manages to change an existing
file in the same second that the old cluster did then that file wouldn't
be recognized by the rsync as having been updated.  That's not too hard
to address though- just wait a second somewhere in there.  Still, I'm
not really sure that this approach really gains us much over the
approach that Bruce is proposing.

Thanks,

Stephen




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:24 AM, Noah Misch wrote:

>> For the moment, maybe I could commit the fix for the non U+0000 case for
>> escape_json, and we could think some more about detecting and warning about
>> the problem strings.
>
> +1 for splitting development that way.  Fixing the use of escape_json() is
> objective, but the choices around the warning are more subtle.




OK, so something like this patch? I'm mildly concerned that you and I 
are the only ones commenting on this.


cheers

andrew


diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void  escape_json_work(StringInfo buf, const char *str,
+			  bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
  * unicode escape that should be present is \u0000, all the
  * other unicode escapes will have been resolved.
  */
-if (p[1] == 'u' &&
+if (jsonb_unicode && strncmp(p+1, "u0000", 5) == 0)
+	appendStringInfoCharMacro(buf, *p);
+else if (!jsonb_unicode && p[1] == 'u' &&
 	isxdigit((unsigned char) p[2]) &&
 	isxdigit((unsigned char) p[3]) &&
 	isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..22e1263 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -305,7 +305,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			appendBinaryStringInfo(out, "null", 4);
 			break;
 		case jbvString:
-			escape_json(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
+			escape_jsonb(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
 			break;
 		case jbvNumeric:
 			appendStringInfoString(out,
diff --git a/src/include/utils/json.h b/src/include/utils/json.h
index 6f69403..661d7bd 100644
--- a/src/include/utils/json.h
+++ b/src/include/utils/json.h
@@ -42,7 +42,13 @@ extern Datum json_build_array_noargs(PG_FUNCTION_ARGS);
 extern Datum json_object(PG_FUNCTION_ARGS);
 extern Datum json_object_two_arg(PG_FUNCTION_ARGS);
 
+/*
+ * escape_jsonb is more strict about unicode escapes, and will only not
+ * escape \u0000, as that is the only unicode escape that should still be
+ * present.
+ */
 extern void escape_json(StringInfo buf, const char *str);
+extern void escape_jsonb(StringInfo buf, const char *str);
 
 extern Datum json_typeof(PG_FUNCTION_ARGS);
 
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..b5457c4 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u"', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--
+ "\u" | "\u"
+(1 row)
+
+-- any other unicode-looking escape - backslash is escaped
+-- all unicode characters should have been resolved
 select to_jsonb(text '\uabcd');
- to_jsonb 
-----------
- "\uabcd"
+ to_jsonb  
+-----------
+ "\\uabcd"
 (1 row)
 
 -- any other backslash is escaped
@@ -338,6 +347,14 @@ select to_jsonb(text '\abcd');
  "\\abcd"
 (1 row)
 
+-- doubled backslash should be reproduced
+-- this isn't a unicode escape, it's an escaped backslash followed by 'u'
+select jsonb '"\\uabcd"';
+   jsonb   
+---
+ "\\uabcd"
+(1 row)
+
 --jsonb_agg
 CREATE TEMP TABLE rows AS
 SELECT x, 'txt' || x as y
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..10aaac8 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u"', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--

Re: [HACKERS] numeric access out of bounds

2015-01-27 Thread Tom Lane
I wrote:
> Andrew Gierth  writes:
>> I can see two possible fixes: one to correct the assumptions in the
>> macros, the other to check for NaN before calling init_var_from_num in
>> numeric_send (all the other functions seem to do this check explicitly).
>> Which would be preferable?

> I'm inclined to think special-casing NaN in numeric_send is the thing to
> do, since it is that way everywhere else.  If we could push down all the
> special casing into init_var_from_num then that would probably be better,
> but in most cases that looks unattractive.

After taking a closer look I realized that you were right with your first
alternative: the field access macros are just plain wrong, and we should
fix 'em.  Done now.
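
The problem case boils down to a plain binary-send call on a NaN, roughly
as below (illustrative only; whether the misread is actually visible depends
on the build and on memory-checking tools):

select numeric_send('NaN'::numeric);
-- before the fix, the field access macros could read past the short NaN
-- datum inside init_var_from_num(); the visible result itself looks normal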

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan  writes:
> On 01/27/2015 12:24 AM, Noah Misch wrote:
>> +1 for splitting development that way.  Fixing the use of escape_json() is
>> objective, but the choices around the warning are more subtle.

> OK, so something like this patch? I'm mildly concerned that you and I 
> are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Abbreviated keys for Numeric

2015-01-27 Thread Andrew Gierth
> "Peter" == Peter Geoghegan  writes:

 Peter> What I find particularly interesting about this patch is that it
 Peter> makes sorting numerics significantly faster than even sorting
 Peter> float8 values,

Played some more with this. Testing on some different gcc versions
showed that the results were not consistent between versions; the latest
I tried (4.9) showed float8 as somewhat faster, while 4.7 showed float8
as slightly slower; the difference was all in the time of the float8
case, the time for numeric was virtually the same.

For one specific test query, taking the best time of multiple runs,

float8:   gcc4.7 = 980ms, gcc4.9 = 833ms
numeric:  gcc4.7 = 940ms, gcc4.9 = 920ms

(vs. 650ms for bigint on either version)

So honestly I think abbreviation for float8 is a complete red herring.

Also, I couldn't get any detectable benefit from inlining
DatumGetFloat8, though I may have to play more with that to be certain
(above tests did not have any float8-related modifications at all, just
the datum and numeric abbrevs patches).

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] File based incremental backup v6

2015-01-27 Thread Marco Nenciarini
Hi,

here it is another version of the file based incremental backup patch.

Changelog from the previous one:

* the pg_basebackup --incremental option now takes the directory containing
  the base backup instead of the backup profile file
* rename the backup_profile file at the same time as the backup_label file
  when starting for the first time from a backup
* handle "pg_basebackup -D -" by appending the backup profile to the
  resulting tar stream
* added documentation for the -I/--incremental option to the pg_basebackup doc
* updated the replication protocol documentation


The rationale for moving the backup_profile out of the way during
recovery is to avoid using a data directory that has already been
started as the base of a backup.

I've also lightly improved the pg_restorebackup python PoC, implementing the
syntax advised by Gabriele:

./pg_restorebackup.py DESTINATION BACKUP_1 BACKUP_2 [BACKUP_3, ...]

It also supports relocation of tablespaces with the -T option.
The -T option is mandatory if any tablespace was defined in the
PostgreSQL instance when the incremental backup was taken.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From 56fed6e250280f8e5d5c17252db631f33a3c9d8f Mon Sep 17 00:00:00 2001
From: Marco Nenciarini 
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v6

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 doc/src/sgml/protocol.sgml |  86 -
 doc/src/sgml/ref/pg_basebackup.sgml|  31 ++-
 src/backend/access/transam/xlog.c  |  18 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 191 +--
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   5 +
 10 files changed, 639 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index efe75ea..fc24648 100644
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
*** The commands accepted in walsender mode 
*** 1882,1888 

  

! BASE_BACKUP [LABEL 
'label'] [PROGRESS] 
[FAST] [WAL] [NOWAIT] 
[MAX_RATE rate]
   BASE_BACKUP
  
  
--- 1882,1888 

  

! BASE_BACKUP [LABEL 
'label'] [INCREMENTAL 
'start_lsn'] [PROGRESS] 
[FAST] [WAL] [NOWAIT] 
[MAX_RATE rate]
   BASE_BACKUP
  
  
*** The commands accepted in walsender mode 
*** 1905,1910 
--- 1905,1928 
 
  
 
+ INCREMENTAL 
'start_lsn'
+ 
+  
+   Requests a file-level incremental backup of all files changed after
+   start_lsn. When operating with
+   INCREMENTAL, the content of every block-organised
+   file will be analyzed and the file will be sent if at least one
+   block has a LSN higher than or equal to the provided
+   start_lsn.
+  
+  
+   The backup_profile will contain information on
+   every file that has been analyzed, even those that have not been 
sent.
+  
+ 
+
+ 
+
  PROGRESS
  
   
*** The commands accepted in walsender mode 
*** 2022,2028 
ustar interchange format specified in the POSIX 1003.1-2008
standard) dump of the tablespace contents, except that the two trailing
blocks of zeroes specified in the standard are omitted.
!   After the tar data is complete, a final ordinary result set will be 
sent,
containing the WAL end position of the backup, in the same format as
the start position.
   
--- 2040,2046 
ustar interchange format specified in the POSIX 1003.1-2008
standard) dump of the tablespace contents, except that the two trailing
blocks of zeroes specified in the standard are omitted.
!   After the tar data is complete, an ordinary result set will be sent,
containing the WAL end position of the backup, in the same format as
the start position.
   
*** The commands accepted in walsender mode 
*** 2073,2082 
the server supports it.
   
   
!   Once all tablespaces have been sent, a final regular result set will
be sent. This result set contains the end position of the
backup, given in XLogRecPtr format as a single column in a single row.
   
  

  
--- 2091,2162 
the server supports it.
   
   
!   Once all tablespaces have been sent, another regular result set will
be sent. This result set contains the end position of the
backup, given in XLogRecPtr format as a single column in a single row.
   
+  
+  

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:23 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 01/27/2015 12:24 AM, Noah Misch wrote:

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.

OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.





The input \\uabcd will be stored correctly as \uabcd, but this will in 
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
That's what the patch fixes.


There are two problems here and this addresses one of them. The other 
problem is the ambiguity regarding \\u0000 and \u0000.
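
For concreteness, the remaining ambiguity can be sketched in a short psql
session (hypothetical; behaviour as described in this thread):

-- the escape \u0000 is kept as-is, since jsonb cannot store a NUL byte
select jsonb '"\u0000"';
-- a literal backslash, 'u' and four zeros de-escapes to the same six characters
select jsonb '"\\u0000"';
-- so the two distinct JSON values become indistinguishable once stored
select jsonb '"\u0000"' = jsonb '"\\u0000"';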


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] File based Incremental backup v7

2015-01-27 Thread Marco Nenciarini
On 27/01/15 10:25, Giuseppe Broccolo wrote:
> Hi Marco,
>
> On 16/01/15 16:55, Marco Nenciarini wrote:
>> On 14/01/15 17:22, Gabriele Bartolini wrote:
>> >
>> > My opinion, Marco, is that for version 5 of this patch, you:
>> >
>> > 1) update the information on the wiki (it is outdated - I know you have
>> > been busy with LSN map optimisation)
>>
>> Done.
>>
>> > 2) modify pg_basebackup in order to accept a directory (or tar
file) and
>> > automatically detect the LSN from the backup profile
>>
>> New version of patch attached. The -I parameter now requires a backup
>> profile from a previous backup. I've added a sanity check that forbid
>> incremental file level backups if the base timeline is different from
>> the current one.
>>
>> > 3) add the documentation regarding the backup profile and pg_basebackup
>> >
>>
>> Next on my TODO list.
>>
>> > Once we have all of this, we can continue trying the patch. Some
>> > unexplored paths are:
>> >
>> > * tablespace usage
>>
>> I've improved my pg_restorebackup python PoC. It now supports
tablespaces.
>
> About tablespaces, I noticed that the references to tablespace locations
> are lost during the recovery of an incremental backup when changing the
> tablespace mapping (-T option). Here are the steps I followed:
>
>   * creating and filling a test database obtained through pgbench
>
> psql -c "CREATE DATABASE pgbench"
> pgbench -U postgres -i -s 5 -F 80 pgbench
>
>   * a first base backup with pg_basebackup:
>
> mkdir -p backups/$(date '+%d%m%y%H%M')/data && pg_basebackup -v -F
p -D backups/$(date '+%d%m%y%H%M')/data -x
>
>   * creation of a new tablespace, alter the table "pgbench_accounts" to
> set the new tablespace:
>
> mkdir -p /home/gbroccolo/pgsql/tbls
> psql -c "CREATE TABLESPACE tbls LOCATION
'/home/gbroccolo/pgsql/tbls'"
> psql -c "ALTER TABLE pgbench_accounts SET TABLESPACE tbls" pgbench
>
>   * Doing some work on the database:
>
> pgbench -U postgres -T 120 pgbench
>
>   * a second incremental backup with pg_basebackup specifying the new
> location for the tablespace through the tablespace mapping:
>
> mkdir -p backups/$(date '+%d%m%y%H%M')/data backups/$(date
'+%d%m%y%H%M')/tbls && pg_basebackup -v -F p -D backups/$(date
'+%d%m%y%H%M')/data -x -I backups/2601151641/data/backup_profile -T
/home/gbroccolo/pgsql/tbls=/home/gbroccolo/pgsql/backups/$(date
'+%d%m%y%H%M')/tbls
>
>   * a recovery based on the tool pg_restorebackup.py attached in
> http://www.postgresql.org/message-id/54b9428e.9020...@2ndquadrant.it
>
> ./pg_restorebackup.py backups/2601151641/data
backups/2601151707/data /tmp/data -T
/home/gbroccolo/pgsql/backups/2601151707/tbls=/tmp/tbls
>
> In the last step, I obtained the following stack trace:
>
> Traceback (most recent call last):
>   File "./pg_restorebackup.py", line 74, in 
> shutil.copy2(base_file, dest_file)
>   File
"/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py", line
130, in copy2
> copyfile(src, dst)
>   File
"/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py", line
82, in copyfile
> with open(src, 'rb') as fsrc:
> IOError: [Errno 2] No such file or directory:
'backups/2601151641/data/base/16384/16406_fsm'
>
>
> Any idea on what's going wrong?
>

I've done some tests and it looks like FSM pages always have
InvalidXLogRecPtr as their LSN.

I've updated the patch to always include files if all of their pages have
LSN == InvalidXLogRecPtr.
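
One way to see this is with the pageinspect extension (an illustrative
check against a pgbench table like the one above, not part of the patch):

create extension if not exists pageinspect;
-- FSM changes are not WAL-logged in the normal way, so the page header LSN
-- of an FSM page typically reads as 0/0, i.e. InvalidXLogRecPtr
select lsn from page_header(get_raw_page('pgbench_accounts', 'fsm', 0));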

Updated patch v7 attached.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From bffcdf0d5c3258c8848215011eb8e8b3377d9f18 Mon Sep 17 00:00:00 2001
From: Marco Nenciarini 
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v7

Add backup profiles and --incremental to pg_basebackup
---
 doc/src/sgml/protocol.sgml |  86 -
 doc/src/sgml/ref/pg_basebackup.sgml|  31 ++-
 src/backend/access/transam/xlog.c  |  18 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 344 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 191 --
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   5 +
 10 files changed, 648 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index efe75ea..fc24648 100644
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
*** The commands accepted in walsender mode 
*** 1882,1888 

  

! BASE_BACKUP [LABEL 
'label'] [PROGRESS] 
[FAST] [WAL] [NOWAIT] 
[MAX_RATE rate]
   BASE_BACKUP
  
  
--- 1882,1888 

  

! BASE_BACKUP [LABEL 
'label'] [INCREMENTAL 
'start_lsn'] [PROGRESS] 
[FAST] [WAL] [NOWAIT] 
[MAX_RATE rate]
   B

[HACKERS] Release notes

2015-01-27 Thread Joshua D. Drake


Where can one find the running copy of release notes for pending releases?
--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Andres Freund
On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:
> Where can one find the running copy of release notes for pending releases?

In the source tree, on the master branch, once it exists. It doesn't exist
yet for the next set of releases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dereferenced pointers checked as NULL in btree_utils_var.c

2015-01-27 Thread Tom Lane
Heikki Linnakangas  writes:
> I think you are confusing NULL pointers with an SQL NULL. 
> gistgetadjusted checks that it's not an SQL NULL (!oldisnull[i]), but it 
> does not check if it's a NULL pointer 
> (DatumGetPointer(oldentries[i].key) != NULL). The case we're worried 
> about is that the value is not an SQL NULL, i.e. isnull flag is false, 
> but the Datum is a NULL pointer.

Actually both of these deserve to be worried about; though it's fairly
clear from looking at the core GIST code that it avoids calling
gistKeyIsEQ on SQL NULLs.

> Nevertheless, looking at the code, I don't think that a NULL pointer can ever 
> be passed to the same-method, for any of the built-in or contrib 
> opclasses, but it's very confusing because some functions check for 
> that. My proof goes like this:

> 1. The key value passed as argument must've been originally formed by 
> the compress, union, or picksplit methods.

> 1.1. Some compress methods do nothing, and just return the passed-in 
> key, which comes from the table and cannot be a NULL pointer (the 
> compress method is never called for SQL NULLs). Other compress methods 
> don't check for a NULL pointer, and would crash if there was one. 
> gist_poly_compress() and gist_circle_compress() do check for a NULL, but 
> they only return a NULL key if the input key is NULL, which cannot 
> happen because the input comes from a table. So I believe the 
> NULL-checks in those functions are dead code, and none of the compress 
> methods ever return a NULL key.

> 1.2. None of the union methods return a NULL pointer (nor expect one as 
> input).

> 1.3. None of the picksplit methods return a NULL pointer. They all 
> return one of the original values, or one formed with the union method. 
> The picksplit method can return a NULL pointer in the spl_ldatum or 
> spl_rdatum field, in the degenerate case that it puts all keys on the 
> same half. However, the caller, gistUserPickSplit, checks for that and 
> immediately overwrites spl_ldatum and spl_rdatum with sane values in 
> that case.

Sounds good to me.

> I don't understand what this sentence in the documentation on the 
> compress method means:

>> Depending on your needs, you could also need to care about
>> compressing NULL values in there, storing for example (Datum) 0 like
>> gist_circle_compress does.

I believe you're right that I added this because there were checks for
null pointers in some but not all of the opclass support functions.
It looked to me like some opclasses might be intending to pass around null
pointers as valid (not-SQL-NULL) values.  I think your analysis above
eliminates that idea though.  It's a sufficiently weird concept that
I don't feel a need to document or support it.

So I'm fine with taking out both this documentation text and the dead
null-pointer checks; but let's do that all in one patch not piecemeal.
In any case, this is just cosmetic cleanup; there's no actual hazard
here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Tom Lane
Andres Freund  writes:
> On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:
>> Where can one find the running copy of release notes for pending releases?

> In the source tree on the master once it exists. It doesn't yet for the
> next set of releases.

The closest thing to a "running copy" is the git commit log.  Somebody
(usually Bruce or I) writes release notes from that shortly before a
release is made.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Joshua D. Drake


On 01/27/2015 10:17 AM, Tom Lane wrote:

Andres Freund  writes:

On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:

Where can one find the running copy of release notes for pending releases?



In the source tree on the master once it exists. It doesn't yet for the
next set of releases.


The closest thing to a "running copy" is the git commit log.  Somebody
(usually Bruce or I) writes release notes from that shortly before a
release is made.


Thanks for the info.

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan  writes:
> On 01/27/2015 12:23 PM, Tom Lane wrote:
>> I think coding anything is premature until we decide how we're going to
>> deal with the fundamental ambiguity.

> The input \\uabcd will be stored correctly as \uabcd, but this will in 
> turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
> That's what the patch fixes.
> There are two problems here and this addresses one of them. The other 
> problem is the ambiguity regarding \\u0000 and \u0000.

It's the same problem really, and until we have an answer about
what to do with \u0000, I think any patch is half-baked and possibly
counterproductive.

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000).  The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.  I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur.  While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.
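
As a rough illustration of what that would mean for users (purely
hypothetical behaviour, since nothing like this is implemented yet):

-- the value itself would remain storable and printable as JSON
select jsonb '{"k":"a\u0000b"}';
-- but extracting the string as text would have to raise an error rather
-- than silently truncating at the embedded NUL
select jsonb '{"k":"a\u0000b"}' ->> 'k';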

If we went down a path like that, the currently proposed patch would be
quite useless.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Abbreviated keys for Numeric

2015-01-27 Thread Gavin Flower

On 28/01/15 06:29, Andrew Gierth wrote:

"Peter" == Peter Geoghegan  writes:

  Peter> What I find particularly interesting about this patch is that it
  Peter> makes sorting numerics significantly faster than even sorting
  Peter> float8 values,

Played some more with this. Testing on some different gcc versions
showed that the results were not consistent between versions; the latest
I tried (4.9) showed float8 as somewhat faster, while 4.7 showed float8
as slightly slower; the difference was all in the time of the float8
case, the time for numeric was virtually the same.

For one specific test query, taking the best time of multiple runs,

float8:   gcc4.7 = 980ms, gcc4.9 = 833ms
numeric:  gcc4.7 = 940ms, gcc4.9 = 920ms

(vs. 650ms for bigint on either version)

So honestly I think abbreviation for float8 is a complete red herring.

Also, I couldn't get any detectable benefit from inlining
DatumGetFloat8, though I may have to play more with that to be certain
(above tests did not have any float8-related modifications at all, just
the datum and numeric abbrevs patches).

Since gcc5.0 is due to be released in less than 3 months, it might be 
worth testing with that.



Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Hi

2015-01-27 11:41 GMT+01:00 Pavel Stehule :

>
>
> 2015-01-26 21:44 GMT+01:00 Jim Nasby :
>
>> On 1/25/15 4:23 AM, Pavel Stehule wrote:
>>
>>>
>>> I tested a concept iteration over array in format [key1, value1, key2,
>>> value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
>>> ...] too
>>>
>>> It is only a few lines more to current code, and this change doesn't
>>> break a compatibility.
>>>
>>> Do you think, so this patch is acceptable?
>>>
>>> Ideas, comments?
>>>
>>
>> Aside from fixing the comments... I think this needs more tests on corner
>> cases. For example, what happens when you do
>>
>> foreach a, b, c in array(array(1,2),array(3,4)) ?
>>
>
> it is relatively simple behavior -- empty values are NULL
>
> array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
> ARRAY[1,2,3,4]
>
>
>>
>> Or the opposite case of
>>
>> foreach a,b in array(array(1,2,3))
>>
>> Also, what about:
>>
>> foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?
>
>
>
>  postgres=# select array(select
> unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
>array
> ---
>  {1,2,3,4,5,6,7,8}
> (1 row)
>
> so it generate pairs {1,2}{3,4},{5,6},{7,8}
>

I fixed the situation when the array does not have enough elements.

Added more tests and simple documentation.
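
The not-enough-elements case now behaves like this (a sketch using the
syntax from the patch; the trailing target is simply left NULL):

do $$
declare a int; b int;
begin
  foreach a, b in array array[1,2,3]
  loop
    raise notice 'a: %, b: %', a, b;
  end loop;
end;
$$;
-- expected notices: "a: 1, b: 2" and then "a: 3, b: <NULL>"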

Regards

Pavel


>
> Regards
>
> Pavel Stehule
>
>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>>
>
>
diff --git a/contrib/hstore/expected/hstore.out b/contrib/hstore/expected/hstore.out
new file mode 100644
index 9749e45..e44532e
*** a/contrib/hstore/expected/hstore.out
--- b/contrib/hstore/expected/hstore.out
*** select %% 'aa=>1, cq=>l, b=>g, fg=>NULL'
*** 1148,1153 
--- 1148,1169 
   {b,g,aa,1,cq,l,fg,NULL}
  (1 row)
  
+ -- fast iteration over keys
+ do $$
+ declare
+   key text;
+   value text;
+ begin
+   foreach key, value in array hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore)
+   loop
+ raise notice 'key: %, value: %', key, value;
+   end loop;
+ end;
+ $$;
+ NOTICE:  key: b, value: g
+ NOTICE:  key: aa, value: 1
+ NOTICE:  key: cq, value: l
+ NOTICE:  key: fg, value: 
  select hstore_to_matrix('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  hstore_to_matrix 
  -
diff --git a/contrib/hstore/sql/hstore.sql b/contrib/hstore/sql/hstore.sql
new file mode 100644
index 5a9e9ee..7b9eb09
*** a/contrib/hstore/sql/hstore.sql
--- b/contrib/hstore/sql/hstore.sql
*** select avals('');
*** 257,262 
--- 257,275 
  select hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  select %% 'aa=>1, cq=>l, b=>g, fg=>NULL';
  
+ -- fast iteration over keys
+ do $$
+ declare
+   key text;
+   value text;
+ begin
+   foreach key, value in array hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore)
+   loop
+ raise notice 'key: %, value: %', key, value;
+   end loop;
+ end;
+ $$;
+ 
  select hstore_to_matrix('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  select %# 'aa=>1, cq=>l, b=>g, fg=>NULL';
  
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 69a0885..4ef0299
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** NOTICE:  row = {7,8,9}
*** 2490,2495 
--- 2490,2518 
  NOTICE:  row = {10,11,12}
  
  
+ 
+ 
+  The FOREACH cycle can also be used for iteration over a record; this
+  requires the hstore extension. In this case the clause
+  SLICE should not be used. FOREACH
+  statements support a list of target variables. When the source array is
+  an array of composites, each composite array element is saved to the target
+  variables. When the array is an array of scalar values, the target
+  variables are filled item by item.
+ 
+ CREATE FUNCTION trig_function() RETURNS TRIGGER AS $$
+ DECLARE
+   key text; value text;
+ BEGIN
+   FOREACH key, value IN ARRAY hstore_to_array(hstore(NEW))
+   LOOP
+ RAISE NOTICE 'key = %, value = %', key, value;
+   END LOOP;
+   RETURN NEW;
+ END;
+ $$ LANGUAGE plpgsql;
+ 
+ 
 
  
 
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
new file mode 100644
index ae5421f..4ab3d90
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** exec_stmt_foreach_a(PLpgSQL_execstate *e
*** 2242,2247 
--- 2242,2250 
  	Datum		value;
  	bool		isnull;
  
+ 
+ 	bool		multiassign = false;
+ 
  	/* get the value of the array expression */
  	value = exec_eval_expr(estate, stmt->expr, &isnull, &arrtype);
  	if (isnull)
*** exec_stmt_foreach_a(PLpgSQL_execstate *e
*** 2303,2308 
--- 2306,2328 
  (errcode(ERRCODE_DATATYPE_MISMATCH),
  			  errmsg("FOREACH loop variable must not be of an array type")));
  
+ 	/*
+ 	 * Proof concept -- multiassign in FOREACH cycle
+ 	 *
+ 	 * Motivation: FOREACH key, value IN ARRAY hstore_to_array(hstore(NEW)) ...
+ 	 */
+ 	if (loop_var->dtype == PLPGSQL_DTYPE_ROW
+ 		 &

Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Andres Freund
On 2015-01-27 07:16:27 -0500, Robert Haas wrote:
> On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund  wrote:
> >> I basically have two ideas to fix this.
> >>
> >> 1) Make do_pg_start_backup() acquire a SHARE lock on
> >>pg_database. That'll prevent it from starting while a movedb() is
> >>still in progress. Then additionally add pg_backup_in_progress()
> >>function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
> >>XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
> >>movedb() to error out if a backup is in progress.
> >
> > Attached is a patch trying to this. Doesn't look too bad and lead me to
> > discover missing recovery conflicts during a AD ST.
> >
> > But: It doesn't actually work on standbys, because lock.c prevents any
> > stronger lock than RowExclusive from being acquired. And we need need a
> > lock that can conflict with WAL replay of DBASE_CREATE, to handle base
> > backups that are executed on the primary. Those obviously can't detect
> > whether any standby is currently doing a base backup...
> >
> > I currently don't have a good idea how to mangle lock.c to allow
> > this. I've played with doing it like in the second patch, but that
> > doesn't actually work because of some asserts around ProcSleep - leading
> > to locks on database objects not working in the startup process (despite
> > already being used).
> >
> > The easiest thing would be to just use a lwlock instead of a heavyweight
> > lock - but those aren't canceleable...
> 
> How about just wrapping an lwlock around a flag variable?  movedb()
> increments the variable when starting and decrements it when done
> (must use PG_ENSURE_ERROR_CLEANUP).  Starting a backup errors out (or
> waits in 1-second increments) if it's non-zero.

That'd end up essentially being a re-emulation of locks. I don't find that
all that pretty. But maybe we have to go there.


Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists out of:
1) Make GetLockConflicts() actually work.
2) Allow the startup process to actually acquire locks other than
   AccessExclusiveLocks. There already is code acquiring other locks,
   but it's currently broken because they're acquired in blocking mode
   which isn't actually supported in the startup mode. Using this
   infrastructure we can actually fix a couple small races around
   database creation/drop.
3) Allow session locks during recovery to be heavier than
   RowExclusiveLock - since relation/object session locks aren't ever
   taken out by the user (without a plain lock first) that's safe.
4) Perform streaming base backups from within a transaction command, to
   provide error handling.
5) Make walsender actually react to recovery conflict interrupts
6) Prevent access to the template database of a CREATE DATABASE during
   WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
   concurrently. What we actually came here for.

Comments?

At the very least it's missing some documentation updates about the
locking changes in ALTER DATABASE, CREATE DATABASE and the base backup
sections.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
>From 6e196d17e3dc3ae923321c1b49eb46ccd5ac75b0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 27 Jan 2015 19:52:11 +0100
Subject: [PATCH] Fix various issues around WAL replay and ALTER DATABASE SET
 TABLESPACE.


Discussion: 20150120152819.gc24...@alap3.anarazel.de

Fix GetLockConflicts() to properly terminate the list of conflicting
backends. It's unclear why this hasn't caused more problems.

Discussion: 20150127142713.gd29...@awork2.anarazel.de

Don't acquire blocking locks on the database in dbase_redo(), not
enough state setup.
Discussion: 20150126212458.ga29...@awork2.anarazel.de

Don't allow access to the template database during the replay of a
CREATE DATABASE.
---
 src/backend/access/transam/xlog.c|  29 +
 src/backend/commands/dbcommands.c| 104 ++-
 src/backend/replication/basebackup.c |  15 +
 src/backend/replication/walsender.c  |  14 +
 src/backend/storage/ipc/standby.c| 117 ++-
 src/backend/storage/lmgr/lmgr.c  |  31 ++
 src/backend/storage/lmgr/lock.c  |  13 ++--
 src/backend/utils/init/postinit.c|   2 +-
 src/include/access/xlog.h|   1 +
 src/include/storage/lmgr.h   |   3 +
 src/include/storage/standby.h|   2 +-
 11 files changed, 240 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..38e7dff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 01:40 PM, Tom Lane wrote:


In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.




Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
the table what I suggested is unnecessary.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan  writes:
> On 01/27/2015 01:40 PM, Tom Lane wrote:
>> In particular, I would like to suggest that the current representation of
>> \u0000 is fundamentally broken and that we have to change it, not try to
>> band-aid around it.  This will mean an on-disk incompatibility for jsonb
>> data containing U+0000, but hopefully there is very little of that out
>> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
>> consider such solutions.

> Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
> the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Pavel Stehule
2015-01-26 22:34 GMT+01:00 Jim Nasby :

> On 1/22/15 2:01 PM, Pavel Stehule wrote:
>
>>
>> * I would like to simplify the behavior of evaluating the message
>> expression - probably I will disallow NULL there.
>>
>>
>> Well, the only thing I could see you doing there is throwing a
>> different error if the hint is null. I don't see that as an improvement.
>> I'd just leave it as-is.
>>
>>
>> I enabled a NULL - but enforced a WARNING before.
>>
>
> I don't see the separate warning as being helpful. I'd just do something
> like
>
> +(err_hint != NULL) ? errhint("%s",
> err_hint) : errhint("Message attached to failed assertion is null") ));
>

done


>
> There should also be a test case for a NULL message.
>

It is there, if I understand well.
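
For reference, a NULL-message test in the proposed syntax would look roughly
like this (hypothetical example, names as in the patch):

do $$
begin
  assert 1 = 2, null;
end;
$$;
-- expected: an assert_exception whose hint notes that the attached message is null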

Regards

Pavel


>
>  * GUC enable_asserts will be supported
>>
>>
>> That would be good. Would that allow for enabling/disabling on a
>> per-function basis too?
>>
>>
>> sure - the only question is whether we develop a #option
>> enable|disable_asserts. I have no strong opinion on it.
>>
>
> The option would be nice, but I don't think it's strictly necessary. The
> big thing is being able to control this on a per-function basis (which I
> think you can do with ALTER FUNCTION SET?)
>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 6bcb106..5663983
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** dynamic_library_path = 'C:\tools\postgre
*** 6912,6917 
--- 6912,6931 
  
  
  
+  
+   enable_user_asserts (boolean)
+   
+enable_user_asserts configuration parameter
+   
+   
+   
+
+ If true, any user assertions are evaluated.  By default, this 
+ is set to true.
+
+   
+  
+ 
   
exit_on_error (boolean)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 69a0885..26f7eba
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** END LOOP  label
 Errors and Messages
  
+   
+ RAISE statement
+ 
 
  RAISE
 
*** RAISE unique_violation USING MESSAGE = '
*** 3564,3570 
--- 3567,3599 
   the whole category.
  
 
+   
  
+   
+ ASSERT statement
+ 
+
+ ASSERT
+
+ 
+
+ assertions
+ in PL/pgSQL
+
+ 
+
+ Use the ASSERT statement to ensure that an expected
+ predicate is always true. If the predicate is false or null,
+ then an assertion exception is raised.
+ 
+ 
+ ASSERT expression , message expression ;
+ 
+ 
+ The user assertions can be enabled or disabled by the
+ enable_user_asserts configuration parameter.
+
+   
   
  
   
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
new file mode 100644
index 28c8c40..da12428
*** a/src/backend/utils/errcodes.txt
--- b/src/backend/utils/errcodes.txt
*** PEERRCODE_PLPGSQL_ERROR
*** 454,459 
--- 454,460 
  P0001EERRCODE_RAISE_EXCEPTIONraise_exception
  P0002EERRCODE_NO_DATA_FOUND  no_data_found
  P0003EERRCODE_TOO_MANY_ROWS  too_many_rows
+ P0004EERRCODE_ASSERT_EXCEPTION   assert_exception
  
  Section: Class XX - Internal Error
  
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
new file mode 100644
index c35867b..cd55e94
*** a/src/backend/utils/init/globals.c
--- b/src/backend/utils/init/globals.c
*** bool		IsBinaryUpgrade = false;
*** 99,104 
--- 99,105 
  bool		IsBackgroundWorker = false;
  
  bool		ExitOnAnyError = false;
+ bool		enable_user_asserts = true;
  
  int			DateStyle = USE_ISO_DATES;
  int			DateOrder = DATEORDER_MDY;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index f6df077..b3a2660
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*** static struct config_bool ConfigureNames
*** 980,985 
--- 980,994 
  	},
  
  	{
+ 		{"enable_user_asserts", PGC_USERSET, ERROR_HANDLING_OPTIONS,
+ 			gettext_noop("Enable user asserts checks."),
+ 			NULL
+ 		},
+ 		&enable_user_asserts,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"exit_on_error", PGC_USERSET, ERROR_HANDLING_OPTIONS,
  			gettext_noop("Terminate session on any error."),
  			NULL
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
new file mode 100644
index 6e33a17..fab3e8a
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*** extern bool IsBackgroundWorker;
*** 137,142 
--- 137,143 
  extern PGDLLIMPORT bool IsBinaryUpgrade;
  
  extern bool ExitOnAnyError;
+ extern bool enable_user_asserts;
  
  extern PG

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 02:28 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 01/27/2015 01:40 PM, Tom Lane wrote:

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.




I don't think we can be in the business of rejecting valid JSON.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Hello

here is an initial version of the row_to_array function - it transforms any
row to an array in the format proposed by Andrew.

Regards

Pavel

2015-01-27 19:58 GMT+01:00 Pavel Stehule :

> Hi
>
> 2015-01-27 11:41 GMT+01:00 Pavel Stehule :
>
>>
>>
>> 2015-01-26 21:44 GMT+01:00 Jim Nasby :
>>
>>> On 1/25/15 4:23 AM, Pavel Stehule wrote:
>>>

 I tested a concept iteration over array in format [key1, value1, key2,
 value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
 ...] too

 It is only a few lines more to current code, and this change doesn't
 break a compatibility.

 Do you think, so this patch is acceptable?

 Ideas, comments?

>>>
>>> Aside from fixing the comments... I think this needs more tests on
>>> corner cases. For example, what happens when you do
>>>
>>> foreach a, b, c in array(array(1,2),array(3,4)) ?
>>>
>>
>> it is relative simple behave -- empty values are NULL
>>
>> array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
>> ARRAY[1,2,3,4]
>>
>>
>>>
>>> Or the opposite case of
>>>
>>> foreach a,b in array(array(1,2,3))
>>>
>>> Also, what about:
>>>
>>> foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?
>>
>>
>>
>>  postgres=# select array(select
>> unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
>>array
>> ---
>>  {1,2,3,4,5,6,7,8}
>> (1 row)
>>
>> so it generate pairs {1,2}{3,4},{5,6},{7,8}
>>
>
> I fixed situation when array has not enough elements.
>
> More tests, simple doc
>
> Regards
>
> Pavel
>
>
>>
>> Regards
>>
>> Pavel Stehule
>>
>>
>>> --
>>> Jim Nasby, Data Architect, Blue Treble Consulting
>>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>>>
>>
>>
>
diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
new file mode 100644
index 3dc9a84..d758d2d
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
***
*** 21,26 
--- 21,27 
  #include "catalog/pg_type.h"
  #include "funcapi.h"
  #include "libpq/pqformat.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/typcache.h"
*** btrecordimagecmp(PG_FUNCTION_ARGS)
*** 1810,1812 
--- 1811,1898 
  {
  	PG_RETURN_INT32(record_image_cmp(fcinfo));
  }
+ 
+ /*
+  * transform any record to array in format [key1, value1, key2, value2 [, ...]]
+  */
+ Datum
+ row_to_array(PG_FUNCTION_ARGS)
+ {
+ 	HeapTupleHeader		rec = PG_GETARG_HEAPTUPLEHEADER(0);
+ 	TupleDesc		rectupdesc;
+ 	Oid			rectuptyp;
+ 	int32			rectuptypmod;
+ 	HeapTupleData		rectuple;
+ 	int	ncolumns;
+ 	Datum 		*recvalues;
+ 	bool  		*recnulls;
+ 	ArrayBuildState		*builder;
+ 	int	i;
+ 
+ 	/* Extract type info from the tuple itself */
+ 	rectuptyp = HeapTupleHeaderGetTypeId(rec);
+ 	rectuptypmod = HeapTupleHeaderGetTypMod(rec);
+ 	rectupdesc = lookup_rowtype_tupdesc(rectuptyp, rectuptypmod);
+ 	ncolumns = rectupdesc->natts;
+ 
+ 	/* Build a temporary HeapTuple control structure */
+ 	rectuple.t_len = HeapTupleHeaderGetDatumLength(rec);
+ 	ItemPointerSetInvalid(&(rectuple.t_self));
+ 	rectuple.t_tableOid = InvalidOid;
+ 	rectuple.t_data = rec;
+ 
+ 	recvalues = (Datum *) palloc(ncolumns * sizeof(Datum));
+ 	recnulls = (bool *) palloc(ncolumns * sizeof(bool));
+ 
+ 	/* Break down the tuple into fields */
+ 	heap_deform_tuple(&rectuple, rectupdesc, recvalues, recnulls);
+ 
+ 	/* Prepare target array */
+ 	builder = initArrayResult(TEXTOID, CurrentMemoryContext);
+ 
+ 	for (i = 0; i < ncolumns; i++)
+ 	{
+ 		Oid	columntyp = rectupdesc->attrs[i]->atttypid;
+ 		Datum		value;
+ 		bool		isnull;
+ 
+ 		/* Ignore dropped columns */
+ 		if (rectupdesc->attrs[i]->attisdropped)
+ 			continue;
+ 
+ 		builder = accumArrayResult(builder,
+ 			CStringGetTextDatum(NameStr(rectupdesc->attrs[i]->attname)),
+ 			false,
+ 			TEXTOID,
+ 			CurrentMemoryContext);
+ 
+ 		if (!recnulls[i])
+ 		{
+ 			char *outstr;
+ 			bool		typIsVarlena;
+ 			Oid		typoutput;
+ 			FmgrInfo		proc;
+ 
+ 			getTypeOutputInfo(columntyp, &typoutput, &typIsVarlena);
+ 			fmgr_info_cxt(typoutput, &proc, CurrentMemoryContext);
+ 			outstr = OutputFunctionCall(&proc, recvalues[i]);
+ 
+ 			value = CStringGetTextDatum(outstr);
+ 			isnull = false;
+ 		}
+ 		else
+ 		{
+ 			value = (Datum) 0;
+ 			isnull = true;
+ 		}
+ 
+ 		builder = accumArrayResult(builder,
+ 		value, isnull,
+ 		TEXTOID,
+ 		CurrentMemoryContext);
+ 	}
+ 
+ 	ReleaseTupleDesc(rectupdesc);
+ 
+ 	PG_RETURN_DATUM(makeArrayResult(builder, CurrentMemoryContext));
+ }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index 9edfdb8..a27cf4a
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*** DATA(insert OID = 376 (  string_to_array
*** 891,896 
--- 891,898 
  DESCR("split delimited text into text[], with null string");
  DATA(insert OID = 384 (  array_to_string   PGNSP PGUID 12 1 0 0 0 

Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Example:

postgres=# do $$
declare r record;
declare k text; v text;
begin
  for r in select * from foo loop
foreach k,v in array row_to_array(r) loop
  raise notice 'k: %, v: %', k, v;
end loop;
  end loop;
end;
$$;
NOTICE:  k: a, v: 2
NOTICE:  k: b, v: NAZDAR
NOTICE:  k: c, v: 2015-01-27
NOTICE:  k: a, v: 2
NOTICE:  k: b, v: AHOJ
NOTICE:  k: c, v: 2015-01-27
DO

Regards

Pavel

2015-01-27 21:26 GMT+01:00 Pavel Stehule :

> Hello
>
> here is a initial version of row_to_array function - transform any row to
> array in format proposed by Andrew.
>
> Regards
>
> Pavel
>
> 2015-01-27 19:58 GMT+01:00 Pavel Stehule :
>
>> Hi
>>
>> 2015-01-27 11:41 GMT+01:00 Pavel Stehule :
>>
>>>
>>>
>>> 2015-01-26 21:44 GMT+01:00 Jim Nasby :
>>>
 On 1/25/15 4:23 AM, Pavel Stehule wrote:

>
> I tested a concept iteration over array in format [key1, value1, key2,
> value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
> ...] too
>
> It is only a few lines more to current code, and this change doesn't
> break a compatibility.
>
> Do you think, so this patch is acceptable?
>
> Ideas, comments?
>

 Aside from fixing the comments... I think this needs more tests on
 corner cases. For example, what happens when you do

 foreach a, b, c in array(array(1,2),array(3,4)) ?

>>>
>>> it is relative simple behave -- empty values are NULL
>>>
>>> array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
>>> ARRAY[1,2,3,4]
>>>
>>>

 Or the opposite case of

 foreach a,b in array(array(1,2,3))

 Also, what about:

 foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?
>>>
>>>
>>>
>>>  postgres=# select array(select
>>> unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
>>>array
>>> ---
>>>  {1,2,3,4,5,6,7,8}
>>> (1 row)
>>>
>>> so it generate pairs {1,2}{3,4},{5,6},{7,8}
>>>
>>
>> I fixed situation when array has not enough elements.
>>
>> More tests, simple doc
>>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>> Regards
>>>
>>> Pavel Stehule
>>>
>>>
 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com

>>>
>>>
>>
>


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan  writes:
> On 01/27/2015 02:28 PM, Tom Lane wrote:
>> Well, we can either fix it now or suffer with a broken representation
>> forever.  I'm not wedded to the exact solution I described, but I think
>> we'll regret it if we don't change the representation.
>> 
>> The only other plausible answer seems to be to flat out reject \u0000.
>> But I assume nobody likes that.

> I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x.  At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls).  I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice.  Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.

Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat.  So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values.  (A function for direct conversion to bytea might be useful
too.)

I think the "it would mean rejecting valid JSON" argument is utter
hogwash.  We already reject, eg, "\u00A0" if you're not using a UTF8
encoding.  And we reject "1e1000000", not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type.  I don't see any moral superiority of that over rejecting "\u0000"
because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
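
Concretely, rejection would mean something like this at input time (a sketch
only; the exact error wording is to be decided):

select '"\u0000"'::jsonb;
-- ERROR:  unsupported Unicode escape sequence
-- DETAIL:  \u0000 cannot be converted to text.
select '"\u0041"'::jsonb;   -- other escapes keep working and resolve, here to "A"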

We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila  wrote:
> Script used to test is attached (parallel_count.sh)

Why does this use EXPLAIN ANALYZE instead of \timing ?
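
(That is, timing the bare query from psql, roughly:

\timing on
select count(*) from tbl_perf;

so the numbers aren't inflated by EXPLAIN ANALYZE's per-node instrumentation
overhead.)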

> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
>
> Table Size - 120GB
>
> Used below statements to create table -
> create table tbl_perf(c1 int, c2 char(1000));
> insert into tbl_perf values(generate_series(1,1000),'a');
> insert into tbl_perf values(generate_series(1001,3000),'a');
> insert into tbl_perf values(generate_series(3001,11000),'a');

I generated this table using this same method and experimented with
copying the whole file to the bit bucket using dd.  I did this on
hydra, which I think is the same machine you used.

time for i in `seq 0 119`; do if [ $i -eq 0 ]; then f=16388; else
f=16388.$i; fi; dd if=$f of=/dev/null bs=8k; done

There is a considerable amount of variation in the amount of time this
takes to run based on how much of the relation is cached.  Clearly,
there's no way for the system to cache it all, but it can cache a
significant portion, and that affects the results to no small degree.
dd on hydra prints information on the data transfer rate; on uncached
1GB segments, it runs at right around 400 MB/s, but that can soar to
upwards of 3GB/s when the relation is fully cached.  I tried flushing
the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
immediately after doing that, the above command took 5m21s to run -
i.e. ~321000 ms.  Most of your test times are faster than that, which
means they reflect some degree of caching.  When I immediately reran
the command a second time, it finished in 4m18s the second time, or
~258000 ms.  The rate was the same as the first test - about 400 MB/s
- for most of the files, but 27 of the last 28 files went much faster,
between 1.3 GB/s and 3.7 GB/s.

This tells us that the OS cache on this machine has anti-spoliation
logic in it, probably not dissimilar to what we have in PG.  If the
data were cycled through the system cache in strict LRU fashion, any
data that was leftover from the first run would have been flushed out
by the early part of the second run, so that all the results from the
second set of runs would have hit the disk.  But in fact, that's not
what happened: the last pages from the first run remained cached even
after reading an amount of new data that exceeds the size of RAM on
that machine.  What I think this demonstrates is that we're going to
have to be very careful to control for caching effects, or we may find
that we get misleading results.  To make this simpler, I've installed
a setuid binary /usr/bin/drop_caches that you (or anyone who has an
account on that machine) can use to drop the caches; run 'drop_caches
1'.

> Block-By-Block
>
> No. of workers/Time (ms)        0        2
> Run-1                      267798   295051
> Run-2                      276646   296665
> Run-3                      281364   314952
> Run-4                      290231   326243
> Run-5                      288890   295684

The next thing I did was run test with the block-by-block method after
having dropped the caches.  I did this with 0 workers and with 8
workers.  I dropped the caches and restarted postgres before each
test, but then ran each test a second time to see the effect of
caching by both the OS and by PostgreSQL.  I got these results:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached.  But that's not what happened.

I've also got a draft of a prefetching implementation here that I'd
like to test out, but I've just discovered that it's buggy, so I'm
going to send these results for now and work on fixing that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Stephen Frost
Robert, all,

* Robert Haas (robertmh...@gmail.com) wrote:
> There is a considerable amount of variation in the amount of time this
> takes to run based on how much of the relation is cached.  Clearly,
> there's no way for the system to cache it all, but it can cache a
> significant portion, and that affects the results to no small degree.
> dd on hydra prints information on the data transfer rate; on uncached
> 1GB segments, it runs at right around 400 MB/s, but that can soar to
> upwards of 3GB/s when the relation is fully cached.  I tried flushing
> the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
> immediately after doing that, the above command took 5m21s to run -
> i.e. ~321000 ms.  Most of your test times are faster than that, which
> means they reflect some degree of caching.  When I immediately reran
> the command a second time, it finished in 4m18s the second time, or
> ~258000 ms.  The rate was the same as the first test - about 400 MB/s
> - for most of the files, but 27 of the last 28 files went much faster,
> between 1.3 GB/s and 3.7 GB/s.

[...]

> With 0 workers, first run took 883465.352 ms, and second run took 295050.106 
> ms.
> With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
> ms.
> 
> This is a confusing result, because you expect parallelism to help
> more when the relation is partly cached, and make little or no
> difference when it isn't cached.  But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run.  If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs.  What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long?  It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency.  The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o -> cpu -> i/o -> cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.

Just some thoughts.

Thanks,

Stephen




Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Jim Nasby

On 1/26/15 6:11 PM, Greg Stark wrote:


On Tue, Jan 27, 2015 at 12:03 AM, Jim Nasby <jim.na...@bluetreble.com> wrote:

But one backend can effectively "pin" a buffer more than once, no? If so, 
then ISTM there's some risk that code path A pins and forgets to unpin, but path B 
accidentally unpins for A.


The danger is that there's a codepath that refers to memory it doesn't have a 
pin on but that there is no actual test in our regression suite that doesn't 
actually have a second pin on the same buffer. If there is in fact no reachable 
code path at all without the second pin then there's no active bug, only a bad 
coding practice. But if there are code paths that we just aren't testing then 
that's a real bug.

IIRC CLOBBER_FREED_MEMORY only affects palloc'd blocks. Do we not have a mode 
that automatically removes any buffer as soon as it's not pinned? That seems 
like it would be a valuable addition.


By setting BufferDesc.tag to 0?

On a related note... I'm confused by this part of UnpinBuffer. How is refcount 
ending up > 0??

    Assert(ref->refcount > 0);
    ref->refcount--;
    if (ref->refcount == 0)
    {
        /* I'd better not still hold any locks on the buffer */
        Assert(!LWLockHeldByMe(buf->content_lock));
        Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

        LockBufHdr(buf);

        /* Decrement the shared reference count */
        Assert(buf->refcount > 0);
        buf->refcount--;


BTW, I certainly see no evidence of CLOBBER_FREED_MEMORY coming into play here.


Fwiw I think our experience is that bugs where buffers are unpinned get exposed 
pretty quickly in production. I suppose the same might not be true for rarely 
called codepaths or in cases where the buffers are usually already pinned.


Yeah, that's what I was thinking. If there's some easy way to correctly 
associate pins with specific code paths (owners?) then maybe it's worth doing 
so; but I don't think it's worth much effort.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Providing catalog view to pg_hba.conf file - Patch submission

2015-01-27 Thread Jim Nasby

On 1/27/15 1:04 AM, Haribabu Kommi wrote:

On Mon, Jun 30, 2014 at 5:06 PM, Abhijit Menon-Sen  wrote:

I think having two columns would work. The columns could be called
"database" and "database_list" and "user" and "user_list" respectively.

The database column may contain one of "all", "sameuser", "samegroup",
"replication", but if it's empty, database_list will contain an array of
database names. Then ("all", {}) and ("", {all}) are easily separated.
Likewise for user and user_list.


Thanks for the review.

I corrected all the review comments except the one to add two columns
as (database, database_list and user, user_list). I feel this may cause
some confusion to the users.

Here I attached the latest version of the patch.
I will add this patch to the next commitfest.


Apologies if this was covered, but why isn't the IP address an inet instead of 
text?

Also, what happens if someone reloads the config in the middle of running the 
SRF? ISTM it'd be better to do something like process all of parsed_hba_lines 
into a tuplestore. Obviously there's still a race condition there, but at least 
it's a lot smaller, and AFAIK no worse than the pg_stats views.
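
On the inet point, a rough sketch of what a properly typed address column
would buy (the view and column names here are only illustrative
placeholders, not necessarily what the patch defines):

-- containment tests like this only work against a real inet value
SELECT * FROM pg_hba_settings WHERE address <<= '10.0.0.0/8'::inet;
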
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Jim Nasby

On 1/26/15 4:45 PM, Robert Haas wrote:

On Mon, Jan 26, 2015 at 5:21 PM, Stephen Frost  wrote:

I don't disagree with you about any of that.  I don't think you disagree
with my comment either.  What I'm not entirely clear on is how consensus
could be reached.  Leaving it dormant for the better part of a year
certainly doesn't appear to have helped that situation.  We've discussed
having it be part of the main server and having it be a contrib module
and until about a week ago, I had understood that having it in contrib
would be preferrable.  Based on the recent emails, it appears there's
been a shift of preference to having it be in-core, but clearly there's
no time left to do that in this release cycle.


Well, I'm not sure that anyone else here agreed with me on that, and
one person does not a consensus make no matter who it is.  The basic
problem here is that we don't seem to have even two people here who
agree on how this ought to be done.  The basic dynamic here seems to
be you asking for changes and Abhijit making them but without any real
confidence, and I don't feel good about that.  I'm willing to defer to
an emerging consensus here when there is one, but what Abhijit likes
best is not a consensus, and neither is what you like, and neither is
what I like.  What we need is some people agreeing with each other.


BTW, I know that at least earlier versions of EnterpriseDB's version of 
Postgres (circa 2007) had an auditing feature. I never dealt with any customers 
who were using it when I was there, but perhaps some other folks could shed 
some light on what customers wanted to see in an auditing feature... (I'm
looking at you, Jimbo!)
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Jim Nasby

On 1/27/15 3:56 AM, Abhijit Menon-Sen wrote:

At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote:


Anything happen with this?


Nothing so far.


It would be best to get this into a commit fest so it's not lost.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Fri, Jan 23, 2015 at 6:42 AM, Amit Kapila  wrote:
> Fixed-Chunks
>
> No. of workers/Time (ms) 0 2 4 8 16 24 32
> Run-1 250536 266279 251263 234347 87930 50474 35474
> Run-2 249587 230628 225648 193340 83036 35140 9100
> Run-3 234963 220671 230002 256183 105382 62493 27903
> Run-4 239111 245448 224057 189196 123780 63794 24746
> Run-5 239937 222820 219025 220478 114007 77965 39766

I cannot reproduce these results.  I applied your fixed-chunk size
patch and ran SELECT parallel_count('tbl_perf', 32) a few times.  The
first thing I notice is that, as I predicted, there's an issue with
different workers finishing at different times.  For example, from my
first run:

2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34700) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34698) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34701) exited with exit code 0
2015-01-27 22:13:10 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34699) exited with exit code 0
2015-01-27 22:15:00 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34683) exited with exit code 0
2015-01-27 22:15:29 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34673) exited with exit code 0
2015-01-27 22:15:58 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34679) exited with exit code 0
2015-01-27 22:16:38 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34689) exited with exit code 0
2015-01-27 22:16:39 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34671) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34677) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34672) exited with exit code 0
2015-01-27 22:16:48 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34680) exited with exit code 0
2015-01-27 22:16:50 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34686) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34670) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34690) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34674) exited with exit code 0
2015-01-27 22:16:52 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34684) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34675) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34682) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34691) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34676) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34685) exited with exit code 0
2015-01-27 22:16:55 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34692) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34687) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34678) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34681) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34688) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34694) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34693) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34695) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34697) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34696) exited with exit code 0

That run started at 22:13:01.  Within 4 seconds, 4 workers exited.  So
clearly we are not getting the promised 32-way parallelism for the
whole test.  Granted, in this instance, *most* of the workers run
until the end, but I think we'll find that there are
uncomfortably-frequent cases where we get significantly less
parallelism than we planned on because the work isn't divided evenly.

But leaving that aside, I've run this test 6 times in a row now, with
a warm cache, and the bes

Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Jim Nasby

On 1/27/15 4:36 AM, Pavel Stehule wrote:

2015-01-26 23:29 GMT+01:00 Jim Nasby <jim.na...@bluetreble.com>:

On 1/26/15 4:17 PM, Pavel Stehule wrote:

 Any way to reduce the code duplication between the array and 
non-array versions? Maybe factor out the operator caching code?


I thought about it - but there are different checks, different result
processing, different result types.

I didn't find any readable and reduced form :(


Yeah, that's why I was thinking specifically of the operator caching 
code... isn't that identical? That would at least remove a dozen lines...


It is only partially identical - I would use a cache for array_offset, but it
is not necessary for array_offsets... it depends on how we would modify the
current API to support externally cached data.


Externally cached data?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby  wrote:
> BTW, I know that at least earlier versions of EnterpriseDB's version of
> Postgres (circa 2007) had an auditing feature. I never dealt with any
> customers who were using it when I was there, but perhaps some other folks
> could shed some light on what customers wanted to see an an auditing
> feature... (I'm looking at you, Jimbo!)

It's still there, but it's nothing like pgaudit.  What I hear is that
our customers are looking for something that has the capabilities of
DBMS_FGA.  I haven't researched either that or pgaudit enough to know
how similar they are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Jim Nasby

On 1/27/15 9:29 AM, Stephen Frost wrote:

My point is that Bruce's patch suggests looking for "remote_dir" in
>the rsync documentation, but no such term appears there.

Ah, well, perhaps we could simply add a bit of clarification to this:

for details on specifying remote_dir


The whole remote_dir discussion made me think of something... would --link-dest 
be any help here?

   --link-dest=DIR
          This option behaves like --copy-dest, but unchanged files are
          hard linked from DIR to the destination directory.  The files
          must be identical in all preserved attributes (e.g. permissions,
          possibly ownership) in order for the files to be linked
          together.  An example:

rsync -av --link-dest=$PWD/prior_dir host:src_dir/ new_dir/

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Jim Nasby

On 1/27/15 1:30 PM, Pavel Stehule wrote:

I don't see the separate warning as being helpful. I'd just do something 
like

+(err_hint != NULL) ? errhint("%s", err_hint) : 
errhint("Message attached to failed assertion is null") ));


done


There should also be a test case for a NULL message.


is there, if I understand well


I see it now. Looks good.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Jim Nasby

On 1/27/15 2:26 PM, Pavel Stehule wrote:

here is an initial version of the row_to_array function - it transforms any
row to an array in the format proposed by Andrew.


Please start a new thread for this... does it depend on the key-value patch?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Tom Lane
Jim Nasby  writes:
> On 1/26/15 6:11 PM, Greg Stark wrote:
>> Fwiw I think our experience is that bugs where buffers are unpinned get 
>> exposed pretty quickly in production. I suppose the same might not be true 
>> for rarely called codepaths or in cases where the buffers are usually 
>> already pinned.

> Yeah, that's what I was thinking. If there's some easy way to correctly 
> associate pins with specific code paths (owners?) then maybe it's worth doing 
> so; but I don't think it's worth much effort.

If you have a working set larger than shared_buffers, then yeah it's
likely that reference-after-unpin bugs would manifest pretty quickly.
This does not necessarily translate into something reproducible or
debuggable, however; and besides that you'd really rather that such
bugs not get into the field in the first place.

The point of my Valgrind proposal was to provide a mechanism that would
have a chance of catching such bugs in a *development* context, and
provide some hint of where in the codebase the fault is, too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Merlin Moncure
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane  wrote:
> Andrew Dunstan  writes:
>> On 01/27/2015 12:23 PM, Tom Lane wrote:
>>> I think coding anything is premature until we decide how we're going to
>>> deal with the fundamental ambiguity.
>
>> The input \\uabcd will be stored correctly as \uabcd, but this will in
>> turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
>> That's what the patch fixes.
>> There are two problems here and this addresses one of them. The other
>> problem is the ambiguity regarding \\u0000 and \u0000.
>
> It's the same problem really, and until we have an answer about
> what to do with \u0000, I think any patch is half-baked and possibly
> counterproductive.
>
> In particular, I would like to suggest that the current representation of
> \u0000 is fundamentally broken and that we have to change it, not try to
> band-aid around it.  This will mean an on-disk incompatibility for jsonb
> data containing U+0000, but hopefully there is very little of that out
> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
> consider such solutions.
>
> The most obvious way to store such data unambiguously is to just go ahead
> and store U+0000 as a NUL byte (\000).  The only problem with that is that
> then such a string cannot be considered to be a valid value of type TEXT,
> which would mean that we'd need to throw an error if we were asked to
> convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes?   If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out?  How about a warning indicating the
string was truncated?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Jim Nasby

On 1/27/15 12:58 PM, Pavel Stehule wrote:

postgres=# select array(select unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
       array
-------------------
 {1,2,3,4,5,6,7,8}
(1 row)

so it generates pairs {1,2},{3,4},{5,6},{7,8}


I fixed the situation when the array does not have enough elements.

More tests, simple doc


Hrm, this wasn't what I was expecting:

+ select foreach_test_ab(array[1,2,3,4]);
+ NOTICE:  a: 1, b: 2
+ NOTICE:  a: 3, b: 4

I was expecting that foreach a, b over an array would require some dimension
of the array to be 2. :(

I think this is bad, because this:

foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);

will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense. Even if 
it did make sense, I'm more concerned that adding this will seriously paint us 
into a corner when it comes to the (to me) more rational case of returning 
{1,2,3},{4,5,6}.

I think we need to think some more about this, at least to make sure we're not 
painting ourselves into a corner for more appropriate array iteration.
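
A plain-SQL sketch (no patch required) of the pairing in question, showing
that consecutive pairs ignore the row boundaries of the 2-D array:

SELECT ceil(ord / 2.0) AS pair_no, elem
FROM unnest('{{1,2,3},{4,5,6}}'::int[]) WITH ORDINALITY AS t(elem, ord);
-- pair 1 -> (1,2), pair 2 -> (3,4), pair 3 -> (5,6); the pairs straddle
-- the original rows {1,2,3} and {4,5,6}
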
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Merlin Moncure  writes:
> On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane  wrote:
>> In particular, I would like to suggest that the current representation of
>> \u0000 is fundamentally broken and that we have to change it, not try to
>> band-aid around it.  This will mean an on-disk incompatibility for jsonb
>> data containing U+0000, but hopefully there is very little of that out
>> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
>> consider such solutions.
>> 
>> The most obvious way to store such data unambiguously is to just go ahead
>> and store U+0000 as a NUL byte (\000).  The only problem with that is that
>> then such a string cannot be considered to be a valid value of type TEXT,
>> which would mean that we'd need to throw an error if we were asked to
>> convert a JSON field containing such a character to text.

> Hm, does this include text out operations for display purposes?   If
> so, then any query selecting jsonb objects with null bytes would fail.
> How come we have to error out?  How about a warning indicating the
> string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u0000 again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the ->> operator.

Arguably, ->> is broken already with the current coding, in that
these results are entirely inconsistent:

regression=# select '{"a":"foo\u0040bar"}'::jsonb ->> 'a';
 ?column? 
----------
 foo@bar
(1 row)

regression=# select '{"a":"foo\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regression=# select '{"a":"foo\\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Jim Nasby

On 1/26/15 11:11 PM, Amit Kapila wrote:

On Tue, Jan 27, 2015 at 3:18 AM, Jim Nasby <jim.na...@bluetreble.com> wrote:
 >
 > On 1/23/15 10:16 PM, Amit Kapila wrote:
 >>
 >> Further, if we want to just get the benefit of parallel I/O, then
 >> I think we can get that by parallelising partition scan where different
 >> table partitions reside on different disk partitions, however that is
 >> a matter of separate patch.
 >
 >
 > I don't think we even have to go that far.
 >
 >
 > We'd be a lot less sensitive to IO latency.
 >
 > I wonder what kind of gains we would see if every SeqScan in a query spawned 
a worker just to read tuples and shove them in a queue (or shove a pointer to a 
buffer in the queue).
 >

Here, IIUC, you are saying that we should have one parallel worker do just
the read, and do all expression evaluation (qualification
and target list) in the main backend. It seems to me that, done
that way, the benefit of parallelisation will be lost due to tuple
communication overhead (the overhead may be smaller if we just
pass a pointer to the buffer, but that has other kinds of problems,
like holding buffer pins for a longer period of time).

I could see the advantage of testing along the lines suggested by Tom Lane,
but that seems not directly related to what we want to achieve with
this patch (parallel seq scan); if you think otherwise, let me know.


There's some low-hanging fruit when it comes to improving our IO performance 
(or more specifically, decreasing our sensitivity to IO latency). Perhaps the 
way to do that is with the parallel infrastructure, perhaps not. But I think 
it's premature to look at parallelism for increasing IO performance, or 
worrying about things like how many IO threads we should have before we at 
least look at simpler things we could do. We shouldn't assume there's nothing 
to be gained short of a full parallelization implementation.

That's not to say there's nothing else we could use parallelism for. Sort, 
merge and hash operations come to mind.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Jim Nasby

On 1/27/15 3:46 PM, Stephen Frost wrote:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
>With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
ms.
>
>This is a confusing result, because you expect parallelism to help
>more when the relation is partly cached, and make little or no
>difference when it isn't cached.  But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run.  If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs.  What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long?  It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency.  The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o -> cpu -> i/o -> cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.


This exactly mirrors what I've seen on production systems. On a single SeqScan 
I can't get anywhere close to the IO performance I could get with dd. Once I 
got up to 4-8 SeqScans of different tables running together, I saw iostat 
numbers that were similar to what a single dd bs=8k would do. I've tested this 
with iSCSI SAN volumes on both 1Gbit and 10Gbit ethernet.

This is why I think that when it comes to IO performance, before we start 
worrying about real parallelization we should investigate ways to do some kind 
of async IO.

I only have my SSD laptop and a really old server to test on, but I'll try 
Tom's suggestion of adding a PrefetchBuffer call into heapgetpage() unless 
someone beats me to it. I should be able to do it tomorrow.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Jim Nasby

On 1/27/15 5:06 PM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby  wrote:

BTW, I know that at least earlier versions of EnterpriseDB's version of
Postgres (circa 2007) had an auditing feature. I never dealt with any
customers who were using it when I was there, but perhaps some other folks
could shed some light on what customers wanted to see an an auditing
feature... (I'm looking at you, Jimbo!)


It's still there, but it's nothing like pgaudit.  What I hear is that
our customers are looking for something that has the capabilities of
DBMS_FGA.  I haven't researched either that or pgaudit enough to know
how similar they are.


Do they really need the full capabilities of DBMS_FGA? At first glance, it 
looks even more sophisticated than anything that's been discussed so far. To 
wit (from 
http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_fga.htm#CDEIECAG):

DBMS_FGA.ADD_POLICY(
   object_schema      VARCHAR2,
   object_name        VARCHAR2,
   policy_name        VARCHAR2,
   audit_condition    VARCHAR2,
   audit_column       VARCHAR2,
   handler_schema     VARCHAR2,
   handler_module     VARCHAR2,
   enable             BOOLEAN,
   statement_types    VARCHAR2,
   audit_trail        BINARY_INTEGER IN DEFAULT,
   audit_column_opts  BINARY_INTEGER IN DEFAULT);
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Jim Nasby

On 1/27/15 5:16 PM, Tom Lane wrote:

Jim Nasby  writes:

On 1/26/15 6:11 PM, Greg Stark wrote:

Fwiw I think our experience is that bugs where buffers are unpinned get exposed 
pretty quickly in production. I suppose the same might not be true for rarely 
called codepaths or in cases where the buffers are usually already pinned.



Yeah, that's what I was thinking. If there's some easy way to correctly 
associate pins with specific code paths (owners?) then maybe it's worth doing 
so; but I don't think it's worth much effort.


If you have a working set larger than shared_buffers, then yeah it's
likely that reference-after-unpin bugs would manifest pretty quickly.
This does not necessarily translate into something reproducible or
debuggable, however; and besides that you'd really rather that such
bugs not get into the field in the first place.

The point of my Valgrind proposal was to provide a mechanism that would
have a chance of catching such bugs in a *development* context, and
provide some hint of where in the codebase the fault is, too.


That's what I was looking for too; I was just wondering if there was an easy
way to also cover the case of one path forgetting to Unpin and a second path 
accidentally Unpinning (with neither dropping the refcount to 0). It sounds 
like it's just not worth worrying about that though.

Do you think there's merit to having bufmgr.c do something special when 
refcount hits 0 in a CLOBBER_FREED_MEMORY build? It seems like it's a lot 
easier to enable that than to set up valgrind (though I've never tried the
latter).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 6:54 PM, Jim Nasby  wrote:
> On 1/27/15 5:06 PM, Robert Haas wrote:
>>
>> On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby 
>> wrote:
>>>
>>> BTW, I know that at least earlier versions of EnterpriseDB's version of
>>> Postgres (circa 2007) had an auditing feature. I never dealt with any
>>> customers who were using it when I was there, but perhaps some other
>>> folks
>>> could shed some light on what customers wanted to see an an auditing
>>> feature... (I'm looking at you, Jimbo!)
>>
>>
>> It's still there, but it's nothing like pgaudit.  What I hear is that
>> our customers are looking for something that has the capabilities of
>> DBMS_FGA.  I haven't researched either that or pgaudit enough to know
>> how similar they are.
>
>
> Do they really need the full capabilities of DBMS_FGA? At first glance, it
> looks even more sophisticated than anything that's been discussed so far. To
> wit (from
> http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_fga.htm#CDEIECAG):
>
> DBMS_FGA.ADD_POLICY(
>    object_schema      VARCHAR2,
>    object_name        VARCHAR2,
>    policy_name        VARCHAR2,
>    audit_condition    VARCHAR2,
>    audit_column       VARCHAR2,
>    handler_schema     VARCHAR2,
>    handler_module     VARCHAR2,
>    enable             BOOLEAN,
>    statement_types    VARCHAR2,
>    audit_trail        BINARY_INTEGER IN DEFAULT,
>    audit_column_opts  BINARY_INTEGER IN DEFAULT);

I don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund  wrote:
> That'd end up essentially being a re-emulation of locks. Don't find that
> all that pretty. But maybe we have to go there.

The advantage of it is that it is simple.  The only thing we're really
giving up is the deadlock detector, which I think isn't needed in this
case.

> Here's an alternative approach. I think it generally is superior and
> going in the right direction, but I'm not sure it's backpatchable.

I tend to think this is changing too many things to back-patch.  It might
all be fine, but it's pretty wide-reaching, so the chances of
collateral damage seem non-trivial.  Even in master, I'm not sure I
see the point in rocking the apple cart to this degree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dereferenced pointers checked as NULL in btree_utils_var.c

2015-01-27 Thread Michael Paquier
On Wed, Jan 28, 2015 at 3:15 AM, Tom Lane  wrote:
> So I'm fine with taking out both this documentation text and the dead
> null-pointer checks; but let's do that all in one patch not piecemeal.
> In any case, this is just cosmetic cleanup; there's no actual hazard
> here.
Attached is a patch with all those things done. I also added an
assertion in gistKeyIsEQ checking that the input datums are not NULL. I
believe that this is still useful for developers, feel free to rip it
out from the patch if you think otherwise.
Regards,
-- 
Michael
From 95b051aac56de68412a7b0635484e46eb4e24ad0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 28 Jan 2015 09:36:11 +0900
Subject: [PATCH] Remove dead NULL-pointer checks in GiST code

The key value passed to the same method is generated by either the
compress, union or picksplit methods, but it happens that none of them
actually pass a NULL pointer. Some compress methods do not check for a
NULL pointer, which would have resulted in a crash. Note that
gist_poly_compress() and gist_circle_compress() do check for a NULL,
but they only return a NULL key if the input key is NULL, which cannot
happen because the input comes from a table.

This commit also removes a documentation note added in commit a0a3883
that has been introduced based on the fact that some routines of the
same method were doing NULL-pointer checks, and adds an assertion
in gistKeyIsEQ to ensure that no NULL keys are passed when calling the
same method.
---
 contrib/btree_gist/btree_utils_num.c |  8 ++
 contrib/btree_gist/btree_utils_var.c | 10 ++-
 doc/src/sgml/gist.sgml   |  5 
 src/backend/access/gist/gistproc.c   | 56 
 src/backend/access/gist/gistutil.c   |  3 ++
 5 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/contrib/btree_gist/btree_utils_num.c b/contrib/btree_gist/btree_utils_num.c
index 505633c..7c5b911 100644
--- a/contrib/btree_gist/btree_utils_num.c
+++ b/contrib/btree_gist/btree_utils_num.c
@@ -147,13 +147,9 @@ gbt_num_same(const GBT_NUMKEY *a, const GBT_NUMKEY *b, const gbtree_ninfo *tinfo
 	b2.lower = &(((GBT_NUMKEY *) b)[0]);
 	b2.upper = &(((GBT_NUMKEY *) b)[tinfo->size]);
 
-	if (
-		(*tinfo->f_eq) (b1.lower, b2.lower) &&
-		(*tinfo->f_eq) (b1.upper, b2.upper)
-		)
-		return TRUE;
-	return FALSE;
 
+	return ((*tinfo->f_eq) (b1.lower, b2.lower) &&
+			(*tinfo->f_eq) (b1.upper, b2.upper));
 }
 
 
diff --git a/contrib/btree_gist/btree_utils_var.c b/contrib/btree_gist/btree_utils_var.c
index b7dd060..6ad3347 100644
--- a/contrib/btree_gist/btree_utils_var.c
+++ b/contrib/btree_gist/btree_utils_var.c
@@ -337,7 +337,6 @@ bool
 gbt_var_same(Datum d1, Datum d2, Oid collation,
 			 const gbtree_vinfo *tinfo)
 {
-	bool		result;
 	GBT_VARKEY *t1 = (GBT_VARKEY *) DatumGetPointer(d1);
 	GBT_VARKEY *t2 = (GBT_VARKEY *) DatumGetPointer(d2);
 	GBT_VARKEY_R r1,
@@ -346,13 +345,8 @@ gbt_var_same(Datum d1, Datum d2, Oid collation,
 	r1 = gbt_var_key_readable(t1);
 	r2 = gbt_var_key_readable(t2);
 
-	if (t1 && t2)
-		result = ((*tinfo->f_cmp) (r1.lower, r2.lower, collation) == 0 &&
-  (*tinfo->f_cmp) (r1.upper, r2.upper, collation) == 0);
-	else
-		result = (t1 == NULL && t2 == NULL);
-
-	return result;
+	return ((*tinfo->f_cmp) (r1.lower, r2.lower, collation) == 0 &&
+			(*tinfo->f_cmp) (r1.upper, r2.upper, collation) == 0);
 }
 
 
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 5de282b..f9644b0 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -498,11 +498,6 @@ my_compress(PG_FUNCTION_ARGS)
course.
   
 
-  
-Depending on your needs, you could also need to care about
-compressing NULL values in there, storing for example
-(Datum) 0 like gist_circle_compress does.
-  
  
 
 
diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 4decaa6..1df6357 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -1039,25 +1039,15 @@ gist_poly_compress(PG_FUNCTION_ARGS)
 
 	if (entry->leafkey)
 	{
-		retval = palloc(sizeof(GISTENTRY));
-		if (DatumGetPointer(entry->key) != NULL)
-		{
-			POLYGON*in = DatumGetPolygonP(entry->key);
-			BOX		   *r;
+		POLYGON*in = DatumGetPolygonP(entry->key);
+		BOX		   *r;
 
-			r = (BOX *) palloc(sizeof(BOX));
-			memcpy((void *) r, (void *) &(in->boundbox), sizeof(BOX));
-			gistentryinit(*retval, PointerGetDatum(r),
-		  entry->rel, entry->page,
-		  entry->offset, FALSE);
-
-		}
-		else
-		{
-			gistentryinit(*retval, (Datum) 0,
-		  entry->rel, entry->page,
-		  entry->offset, FALSE);
-		}
+		retval = palloc(sizeof(GISTENTRY));
+		r = (BOX *) palloc(sizeof(BOX));
+		memcpy((void *) r, (void *) &(in->boundbox), sizeof(BOX));
+		gistentryinit(*retval, PointerGetDatum(r),
+	  entry->rel, entry->page,
+	  entry->offset, FALSE);
 	}
 	else
 		retval = entry;
@@ -1113,28 +1103,18 @@ gist_circle_compress(PG_FUNCTION

Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Andres Freund
On 2015-01-27 19:24:24 -0500, Robert Haas wrote:
> On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund  wrote:
> > That'd end up essentially being a re-emulation of locks. Don't find that
> > all that pretty. But maybe we have to go there.
> 
> The advantage of it is that it is simple.  The only thing we're really
> giving up is the deadlock detector, which I think isn't needed in this
> case.

I think it's more than just the deadlock detector. Consider
pg_locks/pg_stat_activity.waiting, cancelling a acquisition, error
cleanup and recursive acquisitions. Acquiring session locks + a
surrounding transaction command deals with with cleanups without
introducing PG_ENSURE blocks in a couple places. And we need recursive
acquisition so a streaming base backup can acquire the lock over the
whole runtime, while a manual pg_start_backup() does only for its own
time.

Especially the visibility in the system views is something I'd not like
to give up in additional blocks we introduce in the backbranches.

> > Here's an alternative approach. I think it generally is superior and
> > going in the right direction, but I'm not sure it's backpatchable.
> 
> I tend think this is changing too many things to back-patch.  It might
> all be fine, but it's pretty wide-reaching, so the chances of
> collateral damage seem non-trivial.  Even in master, I'm not sure I
> see the point in rocking the apple cart to this degree.

It's big, true. But a fair amount of it stuff I think we have to do
anyway. The current code acquiring session locks in dbase_redo() is
borked - we either assert or segfault if there's a connection at the
wrong moment because there's no deadlock handler. Plus it has races that
aren't that hard to hit :(. To fix the crashes (but not the race) we can
have a separate ResolveRecoveryConflictWithObjectLock that doesn't
record the existence of the lock, but doesn't ever do a ProcSleep(). Not
pretty either :(

Seems like a situation with no nice solutions so far :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 6:09 PM, Jim Nasby wrote:
> The whole remote_dir discussion made me think of something... would
> --link-dest be any help here?

I'm pretty sure --link-dest would not be effective in this case.  The
problem exists on the source side and --link-dest only operates on the
destination.

-- 
- David Steele
da...@pgmasters.net






Re: [HACKERS] Additional role attributes && superuser review

2015-01-27 Thread Adam Brightwell
All,

I have attached an updated patch for review.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index 62305d2..fd4b9ab
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 1445,1450 
--- 1445,1486 
   
  
   
+   rolbypassrls
+   bool
+   Role can bypass row level security
+  
+ 
+  
+   rolexclbackup
+   bool
+   Role can perform on-line exclusive backup operations
+  
+ 
+  
+   rolxlogreplay
+   bool
+   Role can control xlog recovery replay operations
+  
+ 
+  
+   rollogfile
+   bool
+   Role can rotate log files
+  
+ 
+  
+   rolmonitor
+   bool
+   Role can view pg_stat_* details
+  
+ 
+  
+   rolsignal
+   bool
+   Role can signal backend processes
+  
+ 
+  
rolconnlimit
int4

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index d57243a..096c770
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT set_config('log_statement_stats',
*** 16425,16438 
  pg_start_backup(label text , fast boolean )
  
 pg_lsn
!Prepare for performing on-line backup (restricted to superusers or replication roles)


 
  pg_stop_backup()
  
 pg_lsn
!Finish performing on-line backup (restricted to superusers or replication roles)


 
--- 16425,16438 
  pg_start_backup(label text , fast boolean )
  
 pg_lsn
!Prepare for performing on-line exclusive backup (restricted to superusers or replication roles)


 
  pg_stop_backup()
  
 pg_lsn
!Finish performing on-line exclusive backup (restricted to superusers or replication roles)


 
diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
new file mode 100644
index ea26027..0fc3b42
*** a/doc/src/sgml/ref/create_role.sgml
--- b/doc/src/sgml/ref/create_role.sgml
*** CREATE ROLE connlimit
  | [ ENCRYPTED | UNENCRYPTED ] PASSWORD 'password'
  | VALID UNTIL 'timestamp'
*** CREATE ROLE connlimit
diff --git a/doc/src/sgml/ref/create_user.sgml b/doc/src/sgml/ref/create_user.sgml
new file mode 100644
index 065999c..f7f10c7
*** a/doc/src/sgml/ref/create_user.sgml
--- b/doc/src/sgml/ref/create_user.sgml
*** CREATE USER connlimit
  | [ ENCRYPTED | UNENCRYPTED ] PASSWORD 'password'
  | VALID UNTIL 'timestamp'
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
new file mode 100644
index 2179bf7..12b8a17
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***
*** 27,32 
--- 27,33 
  #include "miscadmin.h"
  #include "replication/walreceiver.h"
  #include "storage/smgr.h"
+ #include "utils/acl.h"
  #include "utils/builtins.h"
  #include "utils/numeric.h"
  #include "utils/guc.h"
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 54,63 
  
  	backupidstr = text_to_cstring(backupid);
  
! 	if (!superuser() && !has_rolreplication(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 		   errmsg("must be superuser or replication role to run a backup")));
  
  	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
--- 55,65 
  
  	backupidstr = text_to_cstring(backupid);
  
! 	if (!has_replication_privilege(GetUserId())
! 		&& !has_exclbackup_privilege(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
!  errmsg("must be superuser, replication role or exclusive backup role to run a backup")));
  
  	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
*** pg_stop_backup(PG_FUNCTION_ARGS)
*** 82,91 
  {
  	XLogRecPtr	stoppoint;
  
! 	if (!superuser() && !has_rolreplication(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 		 (errmsg("must be superuser or replication role to run a backup";
  
  	stoppoint = do_pg_stop_backup(NULL, true, NULL);
  
--- 84,94 
  {
  	XLogRecPtr	stoppoint;
  
! 	if (!has_replication_privilege(GetUserId())
! 		&& !has_exclbackup_privilege(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
!  errmsg("must be superuser, replication role or exclusive backup role to run a backup")));
  
  	stoppoint = do_pg_stop_backup(NULL, true, NULL);
  
*** pg_switch_xlog(PG_FUNCTION_ARGS)
*** 100,109 
  {
  	XLogRecPtr	switchpoint;
  
! 	if (!superuser())
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 			 (errmsg("must be superuser to sw

Re: [HACKERS] [POC] FETCH limited by bytes.

2015-01-27 Thread Corey Huinker
Last year I was working on a patch to postgres_fdw where the fetch_size
could be set at the table level and the server level.

I was able to get the settings parsed and they would show up in
pg_foreign_table
and pg_foreign_servers. Unfortunately, I'm not very familiar with how
foreign data wrappers work, so I wasn't able to figure out how to get these
custom values passed from the PgFdwRelationInfo struct into the
query's PgFdwScanState
struct.

I bring this up only because it might be a simpler solution, in that the
table designer could set the fetch size very high for narrow tables, and
lower or default for wider tables. It's also a very clean syntax, just
another option on the table and/or server creation.
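
For example, the finished feature could plausibly be used like this (the
fetch_size option name comes from my incomplete patch; sv1 and ft1 are the
server and foreign table from the test setup quoted below, and the values
are arbitrary):

ALTER SERVER sv1 OPTIONS (ADD fetch_size '10000');
ALTER FOREIGN TABLE ft1 OPTIONS (ADD fetch_size '30000');
-- the per-table setting would override the per-server default, so a narrow
-- table can use a much larger fetch_size than a wide one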

My incomplete patch is attached.



On Tue, Jan 27, 2015 at 4:24 AM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Thank you for the comment.
>
> The automatic way to determine the fetch_size looks like too
> much for the purpose. An example of a non-automatic way is a new
> foreign table option like 'fetch_size', but this exposes the
> internals too much... Which do you think is preferable?
>
> Thu, 22 Jan 2015 11:17:52 -0500, Tom Lane  wrote in <
> 24503.1421943...@sss.pgh.pa.us>
> > Kyotaro HORIGUCHI  writes:
> > > Hello, as the discuttion on async fetching on postgres_fdw, FETCH
> > > with data-size limitation would be useful to get memory usage
> > > stability of postgres_fdw.
> >
> > > Is such a feature and syntax could be allowed to be added?
> >
> > This seems like a lot of work, and frankly an incredibly ugly API,
> > for a benefit that is entirely hypothetical.  Have you got numbers
> > showing any actual performance win for postgres_fdw?
>
> The API is rushed work to make a path for the new parameter
> (but, yes, I did too much for what postgres_fdw needs to use)
> and it could be any saner syntax, but it's not the
> time to do so yet.
>
> The data-size limitation, whatever size the limit is, would give a
> significant gain, especially for small-sized rows.
>
> This patch began from the fact that it runs about twice as fast
> when the fetch size is 10000 as when it is 100.
>
>
> http://www.postgresql.org/message-id/20150116.171849.109146500.horiguchi.kyot...@lab.ntt.co.jp
>
> I took exec times to get 1M rows from localhost via postgres_fdw
> and it showed the following numbers.
>
> =# SELECT a from ft1;
> fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
> (local)0.75s
> 10060  6.2s   6000 (0.006)
> 1  60  2.7s 60 (0.6  )
> 3  60  2.2s180 (2.0  )
> 6  60  2.4s360 (4.0  )
>
> =# SELECT a, b, c from ft1;
> fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
> (local)0.8s
> 100   204 12  s  20400 (0.02 )
> 1000  204 10  s 204000 (0.2  )
> 1 204  5.8s204 (2)
> 2 204  5.9s408 (4)
>
> =# SELECT a, b, d from ft1;
> fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
> (local)0.8s
> 100  1356 17  s 135600 (0.136)
> 1000 1356 15  s1356000 (1.356)
> 1475 1356 13  s2000100 (2.0  )
> 2950 1356 13  s4000200 (4.0  )
>
> The definitions of the environment are the following.
>
> CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host
> 'localhost', dbname 'postgres');
> CREATE USER MAPPING FOR PUBLIC SERVER sv1;
> CREATE TABLE lt1 (a int, b timestamp, c text, d text);
> CREATE FOREIGN TABLE ft1 (a int, b timestamp, c text, d text) SERVER sv1
> OPTIONS (table_name 'lt1');
> INSERT INTO lt1 (SELECT a, now(), repeat('x', 128), repeat('x', 1280) FROM
> generate_series(0, 99) a);
>
> The "avg row size" is alloced_mem/fetch_size and the alloced_mem
> is the sum of HeapTuple[fetch_size] and (HEAPTUPLESIZE +
> tup->t_len) for all stored tuples in the receiver side,
> fetch_more_data() in postgres_fdw.
>
> They are about a 50% gain for the smaller tuple size and 25% for
> the larger. They look to be optimal where alloced_mem is
> around 2MB, for reasons unknown to me. Anyway, the difference
> seems to be significant.
>
> > Even if we wanted to do something like this, I strongly object to
> > measuring size by heap_compute_data_size.  That's not a number that users
> > would normally have any direct knowledge of; nor does it have anything
> > at all to do with the claimed use-case, where what you'd really need to
> > measure is bytes transmitted down the wire.  (The difference is not
> small:
> > for instance, toasted values would likely still be toasted at the point
> > where you're measuring.)
>
> Sure. Finally, the attached patch #1 which does the following
> things.
>
>  - Sen

Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 4:46 PM, Stephen Frost  wrote:
>> With 0 workers, first run took 883465.352 ms, and second run took 295050.106 
>> ms.
>> With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
>> ms.
>>
>> This is a confusing result, because you expect parallelism to help
>> more when the relation is partly cached, and make little or no
>> difference when it isn't cached.  But that's not what happened.
>
> These numbers seem to indicate that the oddball is the single-threaded
> uncached run.  If I followed correctly, the uncached 'dd' took 321s,
> which is relatively close to the uncached-lots-of-workers and the two
> cached runs.  What in the world is the uncached single-thread case doing
> that it takes an extra 543s, or over twice as long?  It's clearly not
> disk i/o which is causing the slowdown, based on your dd tests.

Yeah, I'm wondering if the disk just froze up on that run for a long
while, which has been known to occasionally happen on this machine,
because I can't reproduce that crappy number.  I did the 0-worker test
a few more times, with the block-by-block method, dropping the caches
and restarting PostgreSQL each time, and got:

32.968 ms
322873.325 ms
322967.722 ms
321759.273 ms

After that last run, I ran it a few more times without restarting
PostgreSQL or dropping the caches, and got:

257629.348 ms
289668.976 ms
290342.970 ms
258035.226 ms
284237.729 ms

Then I redid the 8-client test.  Cold cache, I got 337312.554 ms.  On
the rerun, 323423.813 ms.  Third run, 324940.785.

There is more variability than I would like here.  Clearly, it goes a
bit faster when the cache is warm, but that's about all I can say with
any confidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas  wrote:
> Now, when you did what I understand to be the same test on the same
> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
> Clearly, there is some difference between our test setups.  Moreover,
> I'm kind of suspicious about whether your results are actually
> physically possible.  Even in the best case where you somehow had the
> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
> leaving no space for cache duplication between PG and the OS and no
> space for the operating system or postgres itself - the table is 120
> GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
> there could be some speedup from issuing I/O requests in parallel
> instead of serially, but that is a 15x speedup over dd, so I am a
> little suspicious that there is some problem with the test setup,
> especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1].  So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them.  Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6.  So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment.  When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments.  That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.

All that aside, I still can't account for the numbers you are seeing.
When I run with your patch and what I think is your test case, I get
different (slower) numbers.  And even if we've got 6 drives cranking
along at 400MB/s each, that's still only 2.4 GB/s, not >6 GB/s.  So
I'm still perplexed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Not my idea.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Mon, Jan 26, 2015 at 05:41:59PM -0500, Stephen Frost wrote:
> I've thought about it a fair bit actually and I agree that there is some
> risk to using rsync for *incremental* base backups.  That is, you have
> a setup where you loop with:
> 
> pg_start_backup
> rsync -> dest
> pg_stop_backup
> 
> without using -I, changing what 'dest' is, or making sure it's empty
> every time.  The problem is the 1s-level granularity used on the
> timestamp.  A possible set of operations, all within 1s, is:
> 
> file changed
> rsync starts copying the file
> file changed again (somewhere prior to where rsync is at)
> rsync finishes the file copy
> 
> Now, this isn't actually a problem for the first time that file is
> backed up- the issue is if that file isn't changed again.  rsync won't
> re-copy it, but that change that rsync missed won't be in the WAL
> history for the *second* backup that's done (only the first), leading to
> a case where that file would end up corrupted.

Interesting problem, but doesn't rsync use sub-second accuracy?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Tue, Jan 27, 2015 at 09:36:58AM -0500, Stephen Frost wrote:
> The example listed works, but only when it's a local rsync:
> 
> rsync --archive --hard-links --size-only old_dir new_dir remote_dir
> 
> Perhaps a better example (or additional one) would be with a remote
> rsync, including clarification of old and new dir, like so:
> 
> (run in /var/lib/postgresql)
> rsync --archive --hard-links --size-only \
>   9.3/main \
>   9.4/main \
>   server:/var/lib/postgresql/
> 
> Note that 9.3/main and 9.4/main are two source directories for rsync to
> copy over, while server:/var/lib/postgresql/ is a remote destination
> directory.  The above directories match a default Debian/Ubuntu install.

OK, sorry everyone was confused by 'remote_dir'.  Does this new patch
help?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/doc/src/sgml/pgupgrade.sgml b/doc/src/sgml/pgupgrade.sgml
new file mode 100644
index e1cd260..4e4fe64
*** a/doc/src/sgml/pgupgrade.sgml
--- b/doc/src/sgml/pgupgrade.sgml
*** pg_upgrade.exe
*** 409,414 
--- 409,486 
 
  
 
+ Upgrade any Log-Shipping Standby Servers
+ 
+ 
+  If you have Log-Shipping Standby Servers (), follow these steps to upgrade them (before
+  starting any servers):
+ 
+ 
+ 
+ 
+  
+   Install the new PostgreSQL binaries on standby servers
+ 
+   
+Make sure the new binaries and support files are installed
+on all the standby servers.  Do not run
+initdb.  If initdb was run, delete
+the standby server data directories.  Also, install any custom
+shared object files on the new standbys that you installed in the
+new master cluster.
+   
+  
+ 
+  
+   Run rsync
+ 
+   
+From a directory that is above the old and new database cluster
+directories, run this for each slave:
+ 
+ 
+rsync --archive --hard-links --size-only old_dir new_dir remote_dir
+ 
+ 
+where old_dir and new_dir are relative
+to the current directory, and remote_dir is
+above the old and new cluster directories on
+the standby server.  The old and new relative cluster paths
+must match on the master and standby server.  Consult the
+rsync manual page for details on specifying the
+remote directory, e.g. slavehost:/var/lib/postgresql/.
+rsync will be fast when --link mode is
+used because it will create hard links on the remote server rather
+than transferring user data.
+   
+  
+ 
+  
+   Configure log-shipping to standby servers
+ 
+   
+Configure the servers for log shipping.  (You do not need to run
+pg_start_backup() and pg_stop_backup()
+or take a file system backup as the slaves are still synchronized
+with the master.)
+   
+  
+ 
+ 
+ 
+
+ 
+
+ Start the new server
+ 
+ 
+  The new server and any rsync'ed standby servers can
+  now be safely started.
+ 
+
+ 
+
  Post-Upgrade processing
  
  
*** psql --username postgres --file script.s
*** 548,562 

  

-A Log-Shipping Standby Server () cannot
-be upgraded because the server must allow writes.  The simplest way
-is to upgrade the primary and use rsync to rebuild the
-standbys.  You can run rsync while the primary is down,
-or as part of a base backup ()
-which overwrites the old standby cluster.
-   
- 
-   
 If you want to use link mode and you do not want your old cluster
 to be modified when the new cluster is started, make a copy of the
 old cluster and upgrade that in link mode. To make a valid copy
--- 620,625 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 9:32 PM, Bruce Momjian wrote:
> Now, this isn't actually a problem for the first time that file is
> backed up- the issue is if that file isn't changed again.  rsync won't
> re-copy it, but that change that rsync missed won't be in the WAL
> history for the *second* backup that's done (only the first), leading to
> a case where that file would end up corrupted.
> Interesting problem, but doesn't rsync use sub-second accuracy?
>
According to my empirical testing on Linux and OSX the answer is no:
rsync does not use sub-second accuracy.  This seems to be true even on
file systems like ext4 that support millisecond mod times, at least it
was true on Ubuntu 12.04 running ext4.

Even on my laptop there is a full half-second of vulnerability for
rsync.  Faster systems may have a larger window.
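
For anyone who wants to verify this locally, a small standalone C program
along the following lines (an illustration of the comparison, not rsync's
code) shows why a whole-second mtime+size quick check cannot distinguish two
writes that land in the same second; it assumes a POSIX.1-2008 stat() that
exposes st_mtim:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
    struct stat a, b;

    if (argc != 3 || stat(argv[1], &a) != 0 || stat(argv[2], &b) != 0)
    {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }

    /* second-granularity comparison, blind to sub-second differences */
    printf("same at 1s granularity: %s\n",
           (a.st_size == b.st_size && a.st_mtime == b.st_mtime) ? "yes" : "no");

    /* nanosecond-granularity comparison */
    printf("same at ns granularity: %s\n",
           (a.st_size == b.st_size &&
            a.st_mtim.tv_sec == b.st_mtim.tv_sec &&
            a.st_mtim.tv_nsec == b.st_mtim.tv_nsec) ? "yes" : "no");
    return 0;
}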

-- 
- David Steele
da...@pgmasters.net




signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Tue, Jan 27, 2015 at 09:44:51PM -0500, David Steele wrote:
> On 1/27/15 9:32 PM, Bruce Momjian wrote
> > Now, this isn't actually a problem for the first time that file is
> > backed up- the issue is if that file isn't changed again.  rsync won't
> > re-copy it, but that change that rsync missed won't be in the WAL
> > history for the *second* backup that's done (only the first), leading to
> > a case where that file would end up corrupted.
> > Interesting problem, but doesn't rsync use sub-second accuracy?
> >
> According to my empirical testing on Linux and OSX the answer is no:
> rsync does not use sub-second accuracy.  This seems to be true even on
> file systems like ext4 that support millisecond mod times, at least it
> was true on Ubuntu 12.04 running ext4.
> 
> Even on my laptop there is a full half-second of vulnerability for
> rsync.  Faster systems may have a larger window.

OK, bummer.  Well, I don't think we ever recommend to run rsync without
checksums, but the big problem is that rsync doesn't do checksums by
default.  :-(

pg_upgrade recommends using two rsyncs:

   To make a valid copy of the old cluster, use rsync to create
   a dirty copy of the old cluster while the server is running, then shut
   down the old server and run rsync again to update the copy
   with any changes to make it consistent.  You might want to exclude some

I am afraid that will not work as it could miss changes, right?  When
would the default mod-time checking ever be safe?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 9:51 PM, Bruce Momjian wrote:
>> According to my empirical testing on Linux and OSX the answer is no:
>> rsync does not use sub-second accuracy.  This seems to be true even on
>> file systems like ext4 that support millisecond mod times, at least it
>> was true on Ubuntu 12.04 running ext4.
>>
>> Even on my laptop there is a full half-second of vulnerability for
>> rsync.  Faster systems may have a larger window.
> OK, bummer.  Well, I don't think we ever recommend to run rsync without
> checksums, but the big problem is that rsync doesn't do checksums by
> default.  :-(
>
> pg_upgrade recommends using two rsyncs:
>
>To make a valid copy of the old cluster, use rsync to create
>a dirty copy of the old cluster while the server is running, then shut
>down the old server and run rsync again to update the copy
>with any changes to make it consistent.  You might want to exclude some
>
> I am afraid that will not work as it could miss changes, right?  When
> would the default mod-time checking ever be safe?
>
According to my testing the default mod-time checking is never
completely safe in rsync.  I've worked around this in PgBackRest by
building the manifest and then waiting until the start of the next
second before starting to copy.  It was the only way I could make the
incremental backups reliable without requiring checksums (which, as in
rsync, are optional for performance reasons).  Of course, you still have to
trust the clock for this to work.
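
The wait itself is simple; a minimal C illustration of the idea (not
PgBackRest's actual code) might look like this, assuming POSIX
clock_nanosleep():

#include <stdio.h>
#include <time.h>

/*
 * Sleep until the start of the next whole second.  Called after the manifest
 * is built, so that any file modified once copying begins carries a newer
 * second-granularity mtime than anything recorded in the manifest.
 */
static void
wait_for_next_second(void)
{
    struct timespec now, until;

    clock_gettime(CLOCK_REALTIME, &now);
    until.tv_sec = now.tv_sec + 1;
    until.tv_nsec = 0;
    clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &until, NULL);
}

int
main(void)
{
    printf("before: %ld\n", (long) time(NULL));
    wait_for_next_second();
    printf("after:  %ld\n", (long) time(NULL));
    return 0;
}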

This is definitely an edge case.  Not only does the file have to be
modified in the same second *after* rsync has done the copy, but the
file also has to not be modified in *any other subsequent second* before
the next incremental backup.  If the file is busy enough to have a
collision with rsync in that second, then it is very likely to be
modified before the next incremental backup which is generally a day or
so later.  And, of course, the backup where the issue occurs is fine -
it's the next backup that is invalid.

However, the hot/cold backup scheme as documented does make the race
condition more likely since the two backups are done in close proximity
temporally.  Ultimately, the most reliable method is to use checksums.

For me the biggest issue is that there is no way to discover whether a db is
consistent, no matter how much time/resources you are willing to spend.
I could live with the idea of the occasional bad backup (since I keep as
many as possible), but having no way to know whether it is good or not
is very frustrating.  I know data checksums are a step in that
direction, but they are a long way from providing the optimal solution. 
I've implemented rigorous checksums in PgBackRest but something closer
to the source would be even better.

-- 
- David Steele
da...@pgmasters.net




signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-27 17:00:27 -0600, jim.na...@bluetreble.com wrote:
>
> It would be best to get this into a commit fest so it's not lost.

It's there already.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
On 28.1.2015 at 0:25, "Jim Nasby"  wrote:
>
> On 1/27/15 12:58 PM, Pavel Stehule wrote:
>>
>>   postgres=# select array(select
unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
>> array
>> ---
>>   {1,2,3,4,5,6,7,8}
>> (1 row)
>>
>> so it generate pairs {1,2}{3,4},{5,6},{7,8}
>>
>>
>> I fixed situation when array has not enough elements.
>>
>> More tests, simple doc
>
>
> Hrm, this wasn't what I was expecting:
>
> + select foreach_test_ab(array[1,2,3,4]);
> + NOTICE:  a: 1, b: 2
> + NOTICE:  a: 3, b: 4
>
> I was expecting that foreach a,b array would be expecting something in
the array to have a dimension of 2. :(

Your expectation is inconsistent with the current implementation of FOREACH.
It doesn't produce an array when SLICE is missing, and it doesn't take the
dimensions into account.

I would not change this rule. It is not ambiguous, and it allows working
with 1d, 2d, and 3d arrays. You can process Andrew's format well, and my
proposed format (a 2d array) well too.

The behavior can be different when SLICE is used; there we can iterate
exactly by dimensions. Should we design the behavior for that case?

>
> I think this is bad, because this:
>
> foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);
>
> will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense.
Even if it did make sense, I'm more concerned that adding this will
seriously paint us into a corner when it comes to the (to me) more rational
case of returning {1,2,3},{4,5,6}.
>
> I think we need to think some more about this, at least to make sure
we're not painting ourselves into a corner for more appropriate array
iteration.
>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Noah Misch
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 01/27/2015 02:28 PM, Tom Lane wrote:
> >> Well, we can either fix it now or suffer with a broken representation
> >> forever.  I'm not wedded to the exact solution I described, but I think
> >> we'll regret it if we don't change the representation.

> So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.

> Anybody who's seriously unhappy with that can propose a patch to fix it
> properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.

> We probably need to rethink the re-escaping behavior as well; I'm not
> sure if your latest patch is the right answer for that.

Yes, we do.  No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:

[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
 ?column? | ?column? 
----------+----------
 t        | f
(1 row)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Providing catalog view to pg_hba.conf file - Patch submission

2015-01-27 Thread Haribabu Kommi
On Wed, Jan 28, 2015 at 9:47 AM, Jim Nasby  wrote:
> On 1/27/15 1:04 AM, Haribabu Kommi wrote:
>>
>> Here I attached the latest version of the patch.
>> I will add this patch to the next commitfest.
>
>
> Apologies if this was covered, but why isn't the IP address an inet instead
> of text?

Corrected the address type to inet instead of text. The updated patch is attached.

> Also, what happens if someone reloads the config in the middle of running
> the SRF?

The hba entries are reloaded only in the postmaster process, not in every
backend, so there shouldn't be any problem with a config file reload. Am I
missing something?

Regards,
Hari Babu
Fujitsu Australia


Catalog_view_to_HBA_settings_patch_V4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Heikki Linnakangas

On 01/28/2015 04:16 AM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas  wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups.  Moreover,
I'm kind of suspicious about whether your results are actually
physically possible.  Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.


So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1].  So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them.  Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6.  So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment.  When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments.  That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.


OTOH, spreading the I/O across multiple files is not a good thing, if 
you don't have a RAID setup like that. With a single spindle, you'll 
just induce more seeks.


Perhaps the OS is smart enough to read in large-enough chunks that the 
occasional seek doesn't hurt much. But then again, why isn't the OS 
smart enough to read in large-enough chunks to take advantage of the 
RAID even when you read just a single file?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Pavel Stehule
2015-01-28 0:01 GMT+01:00 Jim Nasby :

> On 1/27/15 4:36 AM, Pavel Stehule wrote:
>
>> 2015-01-26 23:29 GMT+01:00 Jim Nasby > jim.na...@bluetreble.com>>:
>>
>> On 1/26/15 4:17 PM, Pavel Stehule wrote:
>>
>>  Any way to reduce the code duplication between the array and
>> non-array versions? Maybe factor out the operator caching code?
>>
>>
>> I though about it - but there is different checks, different
>> result processing, different result type.
>>
>> I didn't find any readable and reduced form :(
>>
>>
>> Yeah, that's why I was thinking specifically of the operator caching
>> code... isn't that identical? That would at least remove a dozen lines...
>>
>>
>> It is only partially identical - I would like to use a cache for array_offset,
>> but it is not necessary for array_offsets .. it depends on how we would modify
>> the current API to support externally cached data.
>>
>
> Externally cached data?


Some of these functions have their own caches to minimize access to the
typcache (array_map and array_cmp are examples). In the first case I am
trying to keep this information in fn_extra; in the second case I don't,
because I don't expect repeated calls (and I expect the type cache to be
enough).

I plan to do some benchmarking to determine whether we have to do this, or
whether the type cache alone is sufficient.
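
For reference, the fn_extra caching idiom under discussion looks roughly like
the sketch below.  It shows only the common pattern (cache the element type's
equality support across calls); it is not the actual array_offset patch, and
the function name is made up:

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/typcache.h"

PG_FUNCTION_INFO_V1(array_offset_sketch);

Datum
array_offset_sketch(PG_FUNCTION_ARGS)
{
    ArrayType      *array = PG_GETARG_ARRAYTYPE_P(0);
    Oid             element_type = ARR_ELEMTYPE(array);
    TypeCacheEntry *typentry;

    typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
    if (typentry == NULL || typentry->type_id != element_type)
    {
        /* first call, or element type changed: consult the typcache once */
        typentry = lookup_type_cache(element_type, TYPECACHE_EQ_OPR_FINFO);
        if (!OidIsValid(typentry->eq_opr_finfo.fn_oid))
            ereport(ERROR,
                    (errcode(ERRCODE_UNDEFINED_FUNCTION),
                     errmsg("could not identify an equality operator for type %s",
                            format_type_be(element_type))));
        fcinfo->flinfo->fn_extra = (void *) typentry;
    }

    /*
     * ... iterate over the array here, comparing elements with
     * FunctionCall2Coll(&typentry->eq_opr_finfo, PG_GET_COLLATION(), ...)
     */

    PG_RETURN_NULL();
}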

Pavel


>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-28 0:16 GMT+01:00 Jim Nasby :

> On 1/27/15 2:26 PM, Pavel Stehule wrote:
>
>> here is a initial version of row_to_array function - transform any row to
>> array in format proposed by Andrew.
>>
>
> Please start a new thread for this... does it depend on the key-value
> patch?


Partially - the selected format should be well supported by the FOREACH statement.

Regards

Pavel


>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] TABLESAMPLE patch

2015-01-27 Thread Kyotaro HORIGUCHI
Hi, I took a look at this and found it nice.

By the way, the parameter for REPEATABLE seems to be allowed to be an
expression in ParseTableSample, but the grammar rejects it.

The following change seems to be enough.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4578b5e..8cf09d5 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10590,7 +10590,7 @@ tablesample_clause:
;
 
 opt_repeatable_clause:
-   REPEATABLE '(' AexprConst ')'   { $$ = (Node *) $3; }
+   REPEATABLE '(' a_expr ')'   { $$ = (Node *) $3; }
| /*EMPTY*/ 
{ $$ = NULL; }
;


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-28 6:49 GMT+01:00 Pavel Stehule :

>
> On 28.1.2015 at 0:25, "Jim Nasby"  wrote:
> >
> > On 1/27/15 12:58 PM, Pavel Stehule wrote:
> >>
> >>   postgres=# select array(select
> unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
> >> array
> >> ---
> >>   {1,2,3,4,5,6,7,8}
> >> (1 row)
> >>
> >> so it generate pairs {1,2}{3,4},{5,6},{7,8}
> >>
> >>
> >> I fixed situation when array has not enough elements.
> >>
> >> More tests, simple doc
> >
> >
> > Hrm, this wasn't what I was expecting:
> >
> > + select foreach_test_ab(array[1,2,3,4]);
> > + NOTICE:  a: 1, b: 2
> > + NOTICE:  a: 3, b: 4
> >
> > I was expecting that foreach a,b array would be expecting something in
> the array to have a dimension of 2. :(
>
> Your expectation is inconsistent with the current implementation of FOREACH.
> It doesn't produce an array when SLICE is missing, and it doesn't take the
> dimensions into account.
>
> I would not change this rule. It is not ambiguous, and it allows working
> with 1d, 2d, and 3d arrays. You can process Andrew's format well, and my
> proposed format (a 2d array) well too.
>
one small example

CREATE OR REPLACE FUNCTION iterate_over_pairs(text[])
RETURNS void AS $$
DECLARE v1 text; v2 text; e text; i int := 0;
BEGIN
  FOREACH e IN ARRAY $1 LOOP
IF i % 2 = 0 THEN v1 := e;
ELSE v2 := e; RAISE NOTICE 'v1: %, v2: %', v1, v2; END IF;
i := i + 1;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

postgres=# SELECT iterate_over_pairs(ARRAY[1,2,3,4]::text[]);
NOTICE:  v1: 1, v2: 2
NOTICE:  v1: 3, v2: 4
 iterate_over_pairs 
---------------------
 
(1 row)

postgres=# SELECT iterate_over_pairs(ARRAY[[1,2],[3,4]]::text[]);
NOTICE:  v1: 1, v2: 2
NOTICE:  v1: 3, v2: 4
 iterate_over_pairs 
---------------------
 
(1 row)

I can use iterate_over_pairs for 1D or 2D arrays -- FOREACH was designed
in this direction: without SLICE, the array's dimensions are unimportant.

The discussed enhancement of FOREACH is a faster, shorter, and more readable
form of the iterate_over_pairs use case:

FOREACH v1, v2 IN ARRAY $1 LOOP
  ..
END LOOP;

It is consistent with the current design.

You can look at the patch - at the moment, SLICE > 0 is disallowed for the
situation where the target variable is a ROW and the source is not a ROW.

Regards

Pavel

> The behavior can be different when SLICE is used; there we can iterate
> exactly by dimensions. Should we design the behavior for that case?
>
> >
> > I think this is bad, because this:
> >
> > foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);
> >
> > will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense.
> Even if it did make sense, I'm more concerned that adding this will
> seriously paint us into a corner when it comes to the (to me) more rational
> case of returning {1,2,3},{4,5,6}.
> >
> > I think we need to think some more about this, at least to make sure
> we're not painting ourselves into a corner for more appropriate array
> iteration.
> >
> > --
> > Jim Nasby, Data Architect, Blue Treble Consulting
> > Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Pavel Stehule
2015-01-28 0:13 GMT+01:00 Jim Nasby :

> On 1/27/15 1:30 PM, Pavel Stehule wrote:
>
>> I don't see the separate warning as being helpful. I'd just do
>> something like
>>
>> +(err_hint != NULL) ? errhint("%s",
>> err_hint) : errhint("Message attached to failed assertion is null") ));
>>
>>
>> done
>>
>>
>> There should also be a test case for a NULL message.
>>
>>
>> is there, if I understand well
>>
>
> I see it now. Looks good.


Please, can you enhance the documentation part? I am sorry, my English
skills are not good enough.

Thank you

Pavel

>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Amit Kapila
On Wed, Jan 28, 2015 at 12:38 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:
>
> On 01/28/2015 04:16 AM, Robert Haas wrote:
>>
>> On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas 
wrote:
>>>
>>> Now, when you did what I understand to be the same test on the same
>>> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
>>> Clearly, there is some difference between our test setups.  Moreover,
>>> I'm kind of suspicious about whether your results are actually
>>> physically possible.  Even in the best case where you somehow had the
>>> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
>>> leaving no space for cache duplication between PG and the OS and no
>>> space for the operating system or postgres itself - the table is 120
>>> GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
>>> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
>>> there could be some speedup from issuing I/O requests in parallel
>>> instead of serially, but that is a 15x speedup over dd, so I am a
>>> little suspicious that there is some problem with the test setup,
>>> especially because I cannot reproduce the results.
>>
>>
>> So I thought about this a little more, and I realized after some
>> poking around that hydra's disk subsystem is actually six disks
>> configured in a software RAID5[1].  So one advantage of the
>> chunk-by-chunk approach you are proposing is that you might be able to
>> get all of the disks chugging away at once, because the data is
>> presumably striped across all of them.  Reading one block at a time,
>> you'll never have more than 1 or 2 disks going, but if you do
>> sequential reads from a bunch of different places in the relation, you
>> might manage to get all 6.  So that's something to think about.
>>
>> One could imagine an algorithm like this: as long as there are more
>> 1GB segments remaining than there are workers, each worker tries to
>> chug through a separate 1GB segment.  When there are not enough 1GB
>> segments remaining for that to work, then they start ganging up on the
>> same segments.  That way, you get the benefit of spreading out the I/O
>> across multiple files (and thus hopefully multiple members of the RAID
>> group) when the data is coming from disk, but you can still keep
>> everyone busy until the end, which will be important when the data is
>> all in-memory and you're just limited by CPU bandwidth.
>
>
> OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just
induce more seeks.
>

Yeah, if such a thing happens then there is less chance that the user
will get any major benefit from parallel sequential scan, unless the
qualification expressions or other expressions used in the statement
are costly.  So one way could be that the user either configures the
parallel sequential scan parameters in such a way that a parallel scan
is performed only when it can be beneficial (something like increasing
parallel_tuple_comm_cost, or we could have another parameter), or just
does not use parallel sequential scan (parallel_seqscan_degree=0).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up

2015-01-27 Thread Michael Paquier
Andres Freund wrote:
> I think this isn't particularly pretty, but it seems to be working well
> enough, and changing it would be pretty invasive. So keeping in line
> with all that code seems to be easier.
OK, I'm convinced by this part: remove the call of
LockSharedObjectForSession that uses dontWait and replace it with a loop
in ResolveRecoveryConflictWithDatabase.

> Note that InRecovery doesn't mean what you probably think it means:
> [stuff]
> boolInRecovery = false;
Yes, right. I confused it with RecoveryInProgress().

> The assertion actually should be even stricter:
> Assert(!InRecovery || (sessionLock && dontWait));
> i.e. we never acquire a heavyweight lock in the startup process unless
> it's a session lock (as we don't have resource managers/a xact to track
> locks) and we don't wait (as we don't have the deadlock detector
> infrastructure set up).
No problems with this assertion here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers