Re: [HACKERS] jsonb generator functions

2014-12-08 Thread Alvaro Herrera
Andrew Dunstan wrote:

 OK, here is a new patch version that
 
  * uses find_coercion_path() to find the cast function if any, as
discussed elsewhere
  * removes calls to getTypeOutputInfo() except where required
  * honors a cast to json only for rendering both json and jsonb
  * adds processing for the date type that was previously missing in
datum_to_jsonb

Did this go anywhere?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Simon Riggs
On 8 December 2014 at 11:46, Michael Paquier michael.paqu...@gmail.com wrote:

 * ideally we'd like to be able to differentiate the types of usage.
 which then allows the user to control the level of compression
 depending upon the type of action. My first cut at what those settings
 should be are ALL > LOGICAL > PHYSICAL > VACUUM.
 VACUUM - only compress while running vacuum commands
 PHYSICAL - only compress while running physical DDL commands (ALTER
 TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
 typically be used for logical decoding
 LOGICAL - compress FPIs for record types that change tables
 ALL - all user commands
 (each level includes all prior levels)

 Well, that's clearly an optimization so I don't think this should be
 done for a first shot but those are interesting fresh ideas.

It is important that we offer an option that retains user performance.
I don't see that as an optimisation, but as an essential item.

The current feature will reduce WAL volume, at the expense of
foreground user performance. Worse, that will all happen around the time
of a new checkpoint, so I expect this will have a large impact.
Presumably testing has been done to show the impact on user response
times? If not, we need that.

The most important distinction is between foreground and background tasks.

If you think the above is too complex, then we should make the
parameter into a USERSET, but set it to on in VACUUM, CLUSTER and
autovacuum.
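
For concreteness, a minimal standalone C sketch of how a level setting along
the lines of the ALL/LOGICAL/PHYSICAL/VACUUM proposal quoted above could gate
the compression decision; the enum and function are hypothetical, not an
existing GUC or PostgreSQL API:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical levels; each level also covers everything below it. */
typedef enum
{
    COMPRESS_NONE,
    COMPRESS_VACUUM,    /* vacuum / autovacuum work only */
    COMPRESS_PHYSICAL,  /* + physical DDL such as CREATE INDEX */
    COMPRESS_LOGICAL,   /* + record types that change tables */
    COMPRESS_ALL        /* all user commands */
} WalCompressionLevel;

/* action_class is the lowest setting that would compress this operation. */
static bool
should_compress_fpi(WalCompressionLevel setting,
                    WalCompressionLevel action_class)
{
    return setting != COMPRESS_NONE && setting >= action_class;
}

int
main(void)
{
    /* At PHYSICAL, vacuum FPIs are compressed ... */
    printf("%d\n", should_compress_fpi(COMPRESS_PHYSICAL, COMPRESS_VACUUM));
    /* ... but FPIs from ordinary user DML are not. */
    printf("%d\n", should_compress_fpi(COMPRESS_PHYSICAL, COMPRESS_ALL));
    return 0;
}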

 Technically speaking, note that we would need to support such things
 with a new API to switch a new context flag in registered_buffers of
 xloginsert.c for each block, and decide if the block is compressed
 based on this context flag, and the compression level wanted.

 * name should not be wal_compression - we're not compressing all wal
 records, just fpis. There is no evidence that we even want to compress
 other record types, nor that our compression mechanism is effective at
 doing so. Simple = keep name as compress_full_page_writes
 Though perhaps we should have it called wal_compression_level

 I don't really like those new names, but I'd prefer
 wal_compression_level if we go down that road with 'none' as default
 value. We may still decide in the future to support compression at the
 record level instead of context level, particularly if we have an API
 able to do palloc_return_null_at_oom, so the idea of WAL compression
 is not related only to FPIs IMHO.

We may yet decide, but the pglz implementation is not effective on
smaller record lengths. Nor has any testing been done to show that is
even desirable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] advance local xmin more aggressively

2014-12-08 Thread Heikki Linnakangas

On 12/05/2014 05:05 PM, Robert Haas wrote:

[ reviving a very old thread ]

On Tue, Feb 10, 2009 at 4:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Alvaro Herrera alvhe...@commandprompt.com writes:

For example, maybe we could keep track of counts of snapshots removed
since the last xmin calculation, and only run this routine if the number
is different from zero (or some small positive integer).


I think most of the callers of SnapshotResetXmin already know they
removed something.

It might be interesting for FreeSnapshot or something nearby to note
whether the snapshot being killed has xmin = proc's xmin, and only do
the update calculation if so.

I still dislike the assumption that all resource owners are children of
a known owner.  I suspect in fact that it's demonstrably wrong right
now, let alone in future (cf comments in PortalRun).  If we're going to
do this then snapmgr.c needs to track the snapshots for itself.  Of
course that's going to make the "is it worth it" question even more
pressing.


I've run into essentially the same problem Jeff originally complained
about with a large customer who has long-running transactions that
make extensive use of cursors.  Cursors are opened and closed over
time but it is rare for the number open to reach exactly zero, so what
ends up happening is that the backend xmin does not advance.  As you
can imagine, that doesn't work out well.

The approach I came up with initially was similar to Jeff's: start at
the topmost resource owner and traverse them all, visiting every
snapshot along the way.  But then I found this thread and saw the
comment that this might be demonstrably wrong and referring to the
comments in PortalRun.  Having reviewed those comments, which don't
seem to have changed much in the last five years, I can't understand
how they relate to this issue.  It's true that the TopTransaction
resource owner could get swapped out under us during an internal
commit, but why should SnapshotResetXmin() have to care? It just
traverses the one that is in effect at the time it gets called.  The
only real danger I see here is that there could be *more than one*
toplevel resource owner.  I wonder if we could solve that problem by
adding a registry of active toplevel resource owners, so that if we
have a forest rather than a tree we can still find everything.


I don't immediately see the problem either, but I have to say that 
grovelling through all the resource owners seems ugly anyway. Resource 
owners are not meant to be traversed like that. And there could be a lot 
of them, and you have to visit every one of them. That could get 
expensive if there are a lot of resource owners.


BTW, you could easily detect that you haven't seen all the registered 
snapshots, after traversing the resource owner, as we keep the counter 
of them. So you could just fall back to not advancing the xmin if it 
happens.



The problem I see with having snapmgr.c track the snapshots for itself
is that it is mostly duplicating of bookkeeping which is already being
done.  Since this problem doesn't affect the majority of users, it's
not desirable to add a lot of extra bookkeeping to cater to it - but
even if it did affect a lot of users, we still want it to be as cheap
as possible, and reusing the tracking that resource owners are already
doing seems like the way to get there.


I would prefer doing separate bookkeeping in snapmgr.c. It seems like it 
wouldn't be difficult to do. It has to be cheap, but I don't see any 
reason to believe that it'd be more expensive than traversing through 
all resource owners. A binary heap or some other priority queue 
implementation should work pretty well for this.
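
As a rough standalone illustration of that idea, a toy C binary min-heap
keyed on snapshot xmin; a real implementation would also need to remove
arbitrary entries when a snapshot is unregistered (e.g. by remembering each
element's heap position), which this sketch leaves out:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define HEAP_MAX 1024

/* Toy array-based binary min-heap of registered snapshot xmins. */
static TransactionId heap[HEAP_MAX];
static int  heap_size = 0;

static void
heap_swap(int a, int b)
{
    TransactionId tmp = heap[a];

    heap[a] = heap[b];
    heap[b] = tmp;
}

static void
heap_insert(TransactionId xmin)
{
    int     i = heap_size++;

    heap[i] = xmin;
    while (i > 0 && heap[(i - 1) / 2] > heap[i])
    {
        heap_swap(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static TransactionId
heap_remove_min(void)
{
    TransactionId min = heap[0];
    int     i = 0;

    heap[0] = heap[--heap_size];
    for (;;)
    {
        int     l = 2 * i + 1;
        int     r = 2 * i + 2;
        int     s = i;

        if (l < heap_size && heap[l] < heap[s])
            s = l;
        if (r < heap_size && heap[r] < heap[s])
            s = r;
        if (s == i)
            break;
        heap_swap(i, s);
        i = s;
    }
    return min;
}

int
main(void)
{
    heap_insert(200);
    heap_insert(100);
    heap_insert(150);
    printf("oldest xmin: %u\n", (unsigned) heap_remove_min());  /* 100 */
    printf("next oldest: %u\n", (unsigned) heap[0]);            /* 150 */
    return 0;
}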



I think there are a couple of things we can do to make this cheaper.
Suppose we keep track of the oldest xmin of any registered snapshot
and the number of registered snapshots that have that particular xmin.
Every time we deregister a snapshot, we check whether it is one of the
ones with the minimum xmin; if it is, we decrement the count.  When
the count reaches 0, we know that a traversal of all registered
snapshots is guaranteed to find a newer value to advertise in
MyProc->xmin; that way, we can arrange to do the work only when it's
likely to pay off.  In most cases this won't happen until the last
snapshot is unregistered, because our snapshots normally form a stack,
with newer snapshots having been taken later.  But if somebody
unregisters the oldest snapshot we'll immediately notice and
recalculate.


Yeah, that's a reasonable optimization. It's a reasonable optimization 
even if you do the bookkeeping in snapmgr.c.
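
A minimal standalone sketch of that counting scheme, with illustrative names
rather than the actual snapmgr.c code:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Track the oldest xmin among registered snapshots and how many carry it. */
static TransactionId oldest_xmin = 0;
static int  oldest_xmin_count = 0;

static void
register_snapshot(TransactionId xmin)
{
    if (oldest_xmin_count == 0 || xmin < oldest_xmin)
    {
        oldest_xmin = xmin;
        oldest_xmin_count = 1;
    }
    else if (xmin == oldest_xmin)
        oldest_xmin_count++;
    /* snapshots with a newer xmin don't change the advertised minimum */
}

static void
unregister_snapshot(TransactionId xmin)
{
    if (xmin == oldest_xmin && --oldest_xmin_count == 0)
    {
        /*
         * Only now would a scan of the remaining registered snapshots be
         * needed, to find the new minimum and advance the advertised xmin
         * (the scan would also reset oldest_xmin and its count).
         */
        printf("recompute xmin: last snapshot with xmin %u is gone\n",
               (unsigned) xmin);
    }
}

int
main(void)
{
    register_snapshot(100);     /* oldest */
    register_snapshot(100);     /* another snapshot sharing that xmin */
    register_snapshot(250);
    unregister_snapshot(250);   /* not the oldest: no recompute */
    unregister_snapshot(100);   /* one snapshot with xmin 100 remains */
    unregister_snapshot(100);   /* count hits zero: recompute */
    return 0;
}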


And that optimization won't save you in the cases where it doesn't 
apply. For example, what if snapshots happen to form a queue, rather 
than a stack:


DECLARE c1 CURSOR FOR ...;
DECLARE c2 CURSOR FOR ...;
DECLARE c3 CURSOR FOR ...;
...
DECLARE c1000 CURSOR FOR ...;

CLOSE c1;
CLOSE c2;
CLOSE c3;
...
CLOSE c1000;

It's not hard to imagine an application doing that. Sure, you could rewrite
it to close the cursors in different order, but on the face of it that's not
an unreasonable thing for an application to do. I think we should avoid the
O(n^2) behaviour in that case.

[HACKERS] Casting issues with domains

2014-12-08 Thread Thomas Reiss
Hello all,


We experienced some casting issues with domains. We ran into the
problem while querying the information_schema, btw, but here is a simpler
test case:

postgres=# create table test1 (a text);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# create index idx1 on test1(a);
CREATE INDEX
postgres=# analyze test1 ;
ANALYZE;
postgres=# explain select * from test1 where a = 'toto';
  QUERY PLAN
---
 Index Only Scan using idx1 on test1  (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 rows)

Now we create a tstdom domain and cast the a column to tstdom in the
view definition:
postgres=# create domain tstdom as text;
CREATE DOMAIN
postgres=# create view test2 as select a::tstdom from test1 ;
CREATE VIEW
postgres=# explain select * from test2 where a='toto';
QUERY PLAN
--
 Seq Scan on test1  (cost=0.00..1693.00 rows=500 width=5)
   Filter: (((a)::tstdom)::text = 'toto'::text)
(2 rows)

As you can see, a is cast to tstdom and then back to text. These casts
prevent the optimizer from choosing an index scan to retrieve the data. The
casts are, however, strictly equivalent and should not prevent the
optimizer from using indexes.

Also, the same problem appears in the information_schema views, as every
object name is cast to information_schema.sql_identifier. Even if
this domain is declared as name, no index will be used because of this cast.

Shouldn't the planner simplify the casts when possible?


Regards,
Thomas




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-08 Thread Anssi Kääriäinen
On Fri, 2014-12-05 at 10:00 -0800, Josh Berkus wrote:
 I thought the point of INSERT ... ON CONFLICT update was so that you
 didn't have to care if it was a new row or not?
 
 If you do care, it seems like it makes more sense to do your own INSERTs
 and UPDATEs, as Django currently does.

Django tries to update the object if it already exists in the database.
If it doesn't, then Django does an insert. This is suboptimal from a
concurrency standpoint, and does two round trips to the database instead
of just one.

For Django, both insert and update are OK when saving an object to the
database, but Django needs to know which one was done.

I too agree that this doesn't need to be handled in the first version of
the patch.

 - Anssi





Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-08 Thread Craig Ringer
On 12/05/2014 08:03 PM, David Rowley wrote:
 On 2 December 2014 at 15:36, Craig Ringer cr...@2ndquadrant.com wrote:
 
 On 12/01/2014 09:51 PM, Marco Nenciarini wrote:
  I think this is a leftover, as you don't use elog afterwards.
 
 Good catch, fixed.
 
 I've looked over this again and tested it on a windows 8.1 machine. I
 cannot find any problems
 
 The only comments about the code I have would maybe be to use some
 constants like:
 
 #define FILETIME_PER_SEC 10000000L
 #define FILETIME_PER_USEC 10

[snip]

 I'll leave it up to the committer to decide if it's better with or
 without the attached patch.

I think it's more readable, and that's pretty much always a good thing.

Patches with your changes attached.

I used FILETIME_UNITS_PER_SEC though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 9c4e7f5539f0d518b0fe94d12bc562d95967f6a6 Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in win32
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime in the
win32 port gettimeofday function. This is not necessary and limits the reported
precision to the 1ms granularity that the SYSTEMTIME struct can represent. By
using GetSystemTimeAsFileTime we avoid unnecessary conversions and capture
timestamps at 100ns granularity, which is then rounded to 1µs granularity for
storage in a PostgreSQL timestamp.

On most Windows systems this change will actually have no significant effect on
timestamp resolution as the system timer tick is typically between 1ms and 15ms
depending on what timer resolution currently running applications have
requested. You can check this with clockres.exe from sysinternals. Despite the
platform limitation this change still permits capture of finer timestamps where
the system is capable of producing them and it gets rid of an unnecessary
syscall.

The higher resolution GetSystemTimePreciseAsFileTime call available on Windows
8 and Windows Server 2012 has the same interface as GetSystemTimeAsFileTime, so
switching to GetSystemTimeAsFileTime makes it easier to use the Precise variant
later.
---
 src/port/gettimeofday.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..ecc0b4d 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -31,10 +31,17 @@
 #include <sys/time.h>
 
 
-/* FILETIME of Jan 1 1970 00:00:00. */
+/* FILETIME of Jan 1 1970 00:00:00, the PostgreSQL epoch */
 static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * FILETIME represents the number of 100-nanosecond intervals since
+ * January 1, 1601 (UTC).
+ */
+#define FILETIME_UNITS_PER_SEC	10000000L
+#define FILETIME_UNITS_PER_USEC	10
+
+/*
  * timezone information is stored outside the kernel so tzp isn't used anymore.
  *
  * Note: this function is not for Win32 high precision timing purpose. See
@@ -44,16 +51,15 @@ int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
-	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / FILETIME_UNITS_PER_SEC);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % FILETIME_UNITS_PER_SEC)
+		/ FILETIME_UNITS_PER_USEC);
 
 	return 0;
 }
-- 
1.9.3

From 30cfd48f05bee68602c3088f49c1e30c2251a84f Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] On Windows, use GetSystemTimePreciseAsFileTime when
 available

PostgreSQL on Windows 8 or Windows Server 2012 will now obtain
high-resolution timestamps by dynamically loading the
GetSystemTimePreciseAsFileTime function. It'll fall back to
GetSystemTimeAsFileTime if the higher precision variant isn't found,
so the same binaries run without problems on older Windows releases.

No attempt is made to detect the Windows version. Only the presence or
absence of the desired function is considered.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h  |  2 ++
 src/port/gettimeofday.c | 53 +++--
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char 
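
Purely as an illustration of the runtime fallback the commit message
describes (assumed names, and not the patch itself), a minimal Windows-only
sketch:

#include <windows.h>
#include <stdio.h>

/* Both functions take a single LPFILETIME argument. */
typedef VOID (WINAPI *GetSystemTimeFn) (LPFILETIME);

static GetSystemTimeFn cur_get_system_time = NULL;

static void
init_time_source(void)
{
    HMODULE     kernel32 = GetModuleHandle(TEXT("kernel32.dll"));

    if (kernel32 != NULL)
        cur_get_system_time = (GetSystemTimeFn)
            GetProcAddress(kernel32, "GetSystemTimePreciseAsFileTime");

    if (cur_get_system_time == NULL)    /* pre-Windows 8 / Server 2012 */
        cur_get_system_time = &GetSystemTimeAsFileTime;
}

int
main(void)
{
    FILETIME    ft;

    init_time_source();
    cur_get_system_time(&ft);
    printf("FILETIME: high=%lu low=%lu\n",
           (unsigned long) ft.dwHighDateTime,
           (unsigned long) ft.dwLowDateTime);
    return 0;
}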

Re: [HACKERS] inherit support for foreign tables

2014-12-08 Thread Etsuro Fujita

(2014/12/08 15:17), Ashutosh Bapat wrote:

On Sat, Dec 6, 2014 at 9:21 PM, Noah Misch n...@leadboat.com wrote:
Does this inheritance patch add any
atomicity
problem that goes away when one breaks up the inheritance hierarchy and
UPDATEs each table separately?  If not, this limitation is okay.



If the UPDATEs crafted after breaking up the inheritance hierarchy need
to be run within the same transaction (as the UPDATE on the inheritance
hierarchy would do), then yes, there is an atomicity problem.


ISTM that your concern would basically be a known problem.  Consider the
following transaction.


BEGIN TRANSACTION;
UPDATE foo SET a = 100;  -- updates on table foo in remote server1
UPDATE bar SET a = 100;  -- updates on table bar in remote server2
COMMIT TRANSACTION;

This transaction would cause the atomicity problem if 
pgfdw_xact_callback() for XACT_EVENT_PRE_COMMIT for foo succeeded and 
then that for bar failed during CommitTransaction().


Thanks,

Best regards,
Etsuro Fujita




Re: [HACKERS] advance local xmin more aggressively

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I don't immediately see the problem either, but I have to say that
 grovelling through all the resource owners seems ugly anyway. Resource
 owners are not meant to be traversed like that. And there could be a lot of
 them, and you have to visit every one of them. That could get expensive if
 there are a lot of resource owners.

1. I don't really see why resource owners shouldn't be traversed like
that.  They are clearly intended to form a hierarchy, and there's
existing code that recurses through the hierarchy from a given level
downward.  What's ugly about that?

2. If you have a lot of resource owners, you probably have a lot of
snapshots, so walking a list will be expensive, too.  It will be
disproportionately expensive to walk the resource owner tree only if
there are lots of resource owners but very few of them have any
snapshots.  But I don't think that can really happen.  If you've got
lots of resource owners and each one has a snapshot, you'll traverse
~3 pointers per snapshot: ~1 to find the next ResourceOwner, 1 to find
the snapshot array, and 1 to reach the snapshot itself.  A non-inlined
list would traverse only 2 pointers per snapshot, but that doesn't
seem like enough of a difference to get excited about.

 BTW, you could easily detect that you haven't seen all the registered
 snapshots, after traversing the resource owner, as we keep the counter of
 them. So you could just fall back to not advancing the xmin if it happens.

Not a bad idea.  Or we could elog(FATAL) or fail an assertion if we
don't see them all, and then if it happens we call it a bug and fix
it.

 I would prefer doing separate bookkeeping in snapmgr.c. It seems like it
 wouldn't be difficult to do. It has to be cheap, but I don't see any reason
 to believe that it'd be more expensive than traversing through all resource
 owners. A binary heap or some other priority queue implementation should
 work pretty well for this.

That's optimizing for making the xmin recomputation cheap, but we
don't expect xmin recomputation to happen very often, so I'm not sure
that's the right trade-off.

 And that optimization won't save you in the cases where it doesn't apply.
 For example, what if snapshots happen to form a queue, rather than a stack:

 DECLARE c1 CURSOR FOR ...;
 DECLARE c2 CURSOR FOR ...;
 DECLARE c3 CURSOR FOR ...;
 ...
 DECLARE c1000 CURSOR FOR ...;

 CLOSE c1;
 CLOSE c2;
 CLOSE c3;
 ...
 CLOSE c1000;

 It's not hard to imagine an application doing that. Sure, you could rewrite
 it to close the cursors in different order, but on the face of it that's not
 an unreasonable thing for an application to do. I think we should avoid the
 O(n^2) behaviour in that case.

You won't actually get O(n^2) behavior here unless those DECLARE
CURSOR statements all get snapshots with different xmins; if many of
them have snapshots that share an xmin, then the optimization of
recomputing only when there are no snaps with a given xmin will save
you.  That's a bit pathological, but maybe we should try to cater to
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] pg_recvlogical description

2014-12-08 Thread Euler Taveira
Hi,

The pg_recvlogical docs were rewritten but someone forgot to tweak the
help description. It is a bit late in the 9.4 cycle, but let's be consistent.


Regards,


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
From 4c6c146e100575fbf02c6160bd919fbd92505708 Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Mon, 8 Dec 2014 11:25:29 -0300
Subject: [PATCH] Let be consistent with the docs. The new sentence describes
 better ('control' instead of 'receive') what pg_recvlogical is for.

---
 src/bin/pg_basebackup/pg_recvlogical.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 0d97638..9966cd7 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -62,7 +62,7 @@ static void disconnect_and_exit(int code);
 static void
 usage(void)
 {
-	printf(_("%s receives PostgreSQL logical change streams.\n\n"),
+	printf(_("%s controls PostgreSQL logical decoding streams.\n\n"),
 		   progname);
 	printf(_("Usage:\n"));
 	printf(_("  %s [OPTION]...\n"), progname);
-- 
2.1.3




Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-08 Thread Adam Brightwell
Michael,


  This work will certainly continue to be pursued.  If a simple move is
  possible/acceptable, then I think that would be the best option.  I can
  handle that as it would appear that I am capable of moving it, if that is
  acceptable.
 Yes please. Actually I could have done it, just found the option to do
 so. Let's see what shows up with your work.


I have moved it to commitfest 2014-12 and marked as Waiting on Author if
that is acceptable.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Casting issues with domains

2014-12-08 Thread Tom Lane
Thomas Reiss thomas.re...@dalibo.com writes:
 postgres=# explain select * from test2 where a='toto';
 QUERY PLAN
 --
  Seq Scan on test1  (cost=0.00..1693.00 rows=500 width=5)
Filter: (((a)::tstdom)::text = 'toto'::text)
 (2 rows)

 As you can see, a is cast to tstdom and then back to text. These casts
 prevent the optimizer from choosing an index scan to retrieve the data. The
 casts are, however, strictly equivalent and should not prevent the
 optimizer from using indexes.

No, they are not equivalent.  The optimizer can't simply drop the
cast-to-domain, because that cast might result in a runtime error due
to a domain CHECK constraint violation.  (This is true even if no such
constraint exists at planning time, unfortunately.  If we had a
mechanism to force replanning at ALTER DOMAIN ADD CONSTRAINT, maybe the
no-constraints case could be handled better, but we don't; and adding
one would also imply adding more locks around domain usage, so it's not
all that attractive to do it.)

The short answer is that SQL domains are not zero-cost type aliases.
Perhaps there would be value in having a feature that *is* a zero-cost
alias, but it wouldn't be a domain.

regards, tom lane




[HACKERS] Dumping database creation options and ACLs

2014-12-08 Thread Ronan Dunklau
Hello.

As of now, the only way to restore database options and ACLs is to use
pg_dumpall without the globals-only option. The often-recommended pg_dumpall -g
plus individual dumps of the target databases doesn't restore those.

Since pg_dump/pg_restore offer the ability to create the database, they should
do so with the correct owner, options and database ACLs.

There was some discussion about those issues a while ago (see 
http://www.postgresql.org/message-id/11646.1272814...@sss.pgh.pa.us for 
example). As I understand it, the best way to handle that would be to push 
these modifications in pg_dump, but it is unclear how it should be done with 
regards to restoring to a different database.

In the meantime, it would be great to add an option to pg_dumpall that dumps
this information. We could add the database creation commands to the output of
pg_dumpall -g, and add a specific --createdb-only option (similar to
--roles-only and --tablespaces-only).

Would such a patch be welcome?

PS: this email was originally sent to the pgsql-bugs mailing list


-- 
Ronan Dunklau
http://dalibo.com - http://dalibo.org



Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Robert Haas
On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, if an alias is used, and you refer to an attribute using a
 non-alias name (i.e. the original table name), then you'll already get
 an error suggesting that the alias be used instead -- of course,
 that's nothing new. It doesn't matter to the existing hinting
 mechanism if the attribute name is otherwise wrong. Once you fix the
 code to use the alias suggested, you'll then get this new
 Levenshtein-based hint.

 In that case, I think I favor giving no hint at all when the RTE name
 is specified but doesn't match exactly.

 I don't follow. The existing mechanism only concerns what to do when
 the original table name was used when an alias should have been used
 instead. What does that have to do with this patch?

Just that that's the case in which it seems useful to give a hint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] alter user set local_preload_libraries.

2014-12-08 Thread Robert Haas
On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut pete...@gmx.net wrote:
 My radical proposal therefore would have been to embrace this
 inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether,
 relying on users interpreting the parameter names to indicate that
 changing them later may or may not have an effect.  Unfortunately, the
 concerns about ignore_system_indexes prevent that.

What exactly are those concerns?  Do you have a link to previous discussion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Testing DDL deparsing support

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 10:43 PM, Bruce Momjian br...@momjian.us wrote:
 This causes creation DDL to be checked if it is used in the regression
 database, but what about ALTER and DROP?  pg_dump doesn't issue those,
 except in special cases like inheritance.

The proposed testing mechanism should cover any ALTER commands that
are in the regression tests provided that those objects are not
subsequently dropped -- because if the ALTER commands aren't replayed
properly, then the later pg_dump won't produce the same output.

There probably are some gaps in our current regression tests in this
area, but that's probably a good thing to fix regardless of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas robertmh...@gmail.com wrote:
 Just that that's the case in which it seems useful to give a hint.

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint. Besides,
considering every visible RTE (while penalizing non-exact alias names
iff the user provided an alias name) is actually going to make bad
hints less likely, by increasing the number of equidistant low quality
matches in a way that swamps the mechanism into providing no actual
match at all. That's an important additional protection against low
quality matches.

What do other people think here?
-- 
Peter Geoghegan




Re: [HACKERS] Parallel Seq Scan

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 12:13 AM, David Rowley dgrowle...@gmail.com wrote:
 It's bare-bones core support for allowing aggregate states to be merged
 together with another aggregate state. I would imagine that if a query such
 as:

 SELECT MAX(value) FROM bigtable;

 was run, then a series of parallel workers could go off and each find the
 max value from their portion of the table and then perhaps some other node
 type would then take all the intermediate results from the workers, once
 they're finished, and join all of the aggregate states into one and return
 that. Naturally, you'd need to check that all aggregates used in the
 targetlist had a merge function first.

I think this is great infrastructure and could also be useful for
pushing down aggregates in cases involving foreign data wrappers.  But
I suggest we discuss it on a separate thread because it's not related
to parallel seq scan per se.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Seq Scan

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I think we have access to this information in planner (RelOptInfo->pages),
 if we want, we can use that to eliminate the small relations from
 parallelism, but question is how big relations do we want to consider
 for parallelism, one way is to check via tests which I am planning to
 follow, do you think we have any heuristic which we can use to decide
 how big relations should be considered for parallelism?

Surely the Path machinery needs to decide this in particular cases
based on cost.  We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000.  And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC).  For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.

There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated.  I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.
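
A back-of-the-envelope standalone sketch of that trade-off; the GUC-like
values and per-tuple costs below are illustrative assumptions, not the
planner's actual cost model:

#include <stdio.h>

/*
 * Illustrative numbers only: parallel_startup_cost is a proposed GUC, not
 * an existing one, and the per-tuple costs are made up for the example.
 */
static double
parallel_cost(double ntuples, int nworkers, double startup_per_worker,
              double scan_per_tuple, double relay_per_tuple)
{
    return nworkers * startup_per_worker
        + ntuples * scan_per_tuple / nworkers   /* scan split across workers */
        + ntuples * relay_per_tuple;            /* every tuple relayed to master */
}

int
main(void)
{
    double  sizes[] = {100 * 1000.0, 100 * 1000 * 1000.0};
    int     i;

    for (i = 0; i < 2; i++)
    {
        double  ntuples = sizes[i];
        double  serial = ntuples * 0.05;    /* assumed per-tuple scan cost */
        double  parallel = parallel_cost(ntuples, 4, 100000.0, 0.05, 0.01);

        printf("%.0f tuples: serial %.0f vs parallel %.0f -> %s\n",
               ntuples, serial, parallel,
               parallel < serial ? "parallel wins" : "serial wins");
    }
    return 0;
}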

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-08 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote:
 I think it's very possible that the wrong alias may be provided by the
 user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

There is a column named \"%s\" in table \"%s\", but it cannot be
referenced from this part of the query.

I think it's pretty clear that this general class of user error is common.
-- 
Peter Geoghegan




Re: [HACKERS] Parallel Seq Scan

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost sfr...@snowman.net wrote:
 For my 2c, I'd like to see it support exactly what the SeqScan node
 supports and then also what Foreign Scan supports.  That would mean we'd
 then be able to push filtering down to the workers which would be great.
 Even better would be figuring out how to parallelize an Append node
 (perhaps only possible when the nodes underneath are all SeqScan or
 ForeignScan nodes) since that would allow us to then parallelize the
 work across multiple tables and remote servers.

I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide.  I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.

 I'm not entirely following this.  How can the worker be responsible for
 its own plan when the information passed to it (per the above
 paragraph..) is pretty minimal?  In general, I don't think we need to
 have specifics like "this worker is going to do exactly X" because we
 will eventually need some communication to happen between the worker and
 the master process where the worker can ask for more work because it's
 finished what it was tasked with and the master will need to give it
 another chunk of work to do.  I don't think we want exactly what each
 worker process will do to be fully formed at the outset because, even
 with the best information available, given concurrent load on the
 system, it's not going to be perfect and we'll end up starving workers.
 The plan, as formed by the master, should be more along the lines of
 "this is what I'm gonna have my workers do" along w/ how many workers,
 etc, and then it goes and does it.  Perhaps for an 'explain analyze' we
 return information about what workers actually *did* what, but that's a
 whole different discussion.

I agree with this.  For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit.  It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain.  That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.
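
As a minimal standalone sketch of that chunk-at-a-time allocation (made-up
names; a C11 atomic stands in for a counter in shared memory):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_BLOCKS 32

typedef uint32_t BlockNumber;

static _Atomic BlockNumber next_block = 0;      /* would live in shared memory */
static const BlockNumber total_blocks = 1000;   /* relation size in blocks */

/*
 * Claim the next chunk of blocks; returns false once the whole relation
 * has been handed out, otherwise sets *start and *nblocks.
 */
static bool
claim_chunk(BlockNumber *start, BlockNumber *nblocks)
{
    BlockNumber first = atomic_fetch_add(&next_block, CHUNK_BLOCKS);

    if (first >= total_blocks)
        return false;
    *start = first;
    *nblocks = (total_blocks - first < CHUNK_BLOCKS) ?
        total_blocks - first : CHUNK_BLOCKS;
    return true;
}

int
main(void)
{
    BlockNumber start;
    BlockNumber n;

    /*
     * A single "worker" draining the relation; real workers would loop
     * concurrently, so a stuck worker only ever holds back one chunk.
     */
    while (claim_chunk(&start, &n))
        printf("scanning blocks %u..%u\n",
               (unsigned) start, (unsigned) (start + n - 1));
    return 0;
}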

Eventually, we will want to be smarter about sharing workers between
multiple parts of the plan, but I think it is just fine to leave that
as a future enhancement for now.

 - Master backend is just responsible for coordination among workers.
 It shares the required information with the workers and then fetches the
 data processed by each worker; by using some more logic, we might
 be able to make the master backend also fetch data from the heap rather than
 doing just coordination among workers.

 I don't think this is really necessary...

I think it would be an awfully good idea to make this work.  The
master thread may be significantly faster than any of the others
because it has no IPC costs.  We don't want to leave our best resource
sitting on the bench.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb generator functions

2014-12-08 Thread Andrew Dunstan


On 12/08/2014 04:21 AM, Alvaro Herrera wrote:

Andrew Dunstan wrote:


OK, here is a new patch version that

  * uses find_coercion_path() to find the cast function if any, as
discussed elsewhere
  * removes calls to getTypeOutputInfo() except where required
  * honors a cast to json only for rendering both json and jsonb
  * adds processing for the date type that was previously missing in
datum_to_jsonb

Did this go anywhere?



Not yet. I hope to get to it this week.

cheers

andrew




Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
All,

Pardon me for jumping into this late.  In general, I like Alvaro's
approach.  However, I wanted to list the major shortcomings of the
existing partitioning system (based on complaints by PGX's users and on
IRC) and compare them to Alvaro's proposed implementation, to make sure
that enough of them are addressed, and that the ones which aren't
are being skipped as a deliberate decision.  We can't address
*all* of the limitations of the current system, but let's make sure that
we're addressing enough of them to make implementing a 2nd partitioning
system worthwhile.

Where I have a ?, it's because I'm not clear from Alvaro's proposal whether
they're addressed or not.

1. The Trigger Problem
the need to write triggers for INSERT/UPDATE/DELETE.
Addressed.

2. The Clutter Problem
cluttering up system views and dumps with hundreds of partitioned tables
Addressed.

3. Creation Problem
the need to write triggers and/or cron jobs to create new partitions
Addressed.

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?

5. Constant Problem
Since current partitioned query planning happens before the rewrite
phase, SELECTs do not use partition logic to evaluate even simple
expressions, let alone IMMUTABLE or STABLE functions.
Addressed??

6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents
the partitioned table from being FK'd.
Not Addressed
(but could be addressed in the future)

7. JOIN Problem
Two partitioned tables being JOINed need to append and materialize
before the join, causing a very slow join under some circumstances, even
if both tables are partitioned on the same ranges.
Not Addressed?
(but could be addressed in the future)

8. COPY Problem
Cannot bulk-load into the Master, just into individual partitions.
Addressed.

9. Hibernate Problem
When using the trigger method, inserts into the master partition return
0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.

10. Scaling Problem
Inheritance partitioning becomes prohibitively slow for the planner at
somewhere between 100 and 500 partitions depending on various factors.
No idea?

11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.

12. Extra Constraint Evaluation
Inheritance partitioning evaluates *all* constraints on the partitions,
whether they are part of the partitioning scheme or not.  This is way
expensive if those are, say, polygon comparisons.
Addressed.


Additionally, I believe that Alvaro's proposal will make the following
activities which are supported by partition-by-inheritance more
difficult or impossible.  Again, these are probably acceptable because
inheritance partitioning isn't going away.  However, we should
consciously decide that:

A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then attach it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.

B. Catchall Partition
Many partitioning schemes currently contain a catchall partition which
accepts rows outside of the range of the partitioning scheme, due to bad
input data.  Probably not handled on purpose; Alvaro is proposing that
we reject these instead, or create the partitions on demand, which is a
legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the
partitioned column.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I guess you could list or hash partition on multiple columns, too.

 How would you distinguish values in list partition for multiple
 columns? I mean for range partition, we are sure there will
 be either one value for each column, but for list it could
 be multiple and not fixed for each partition, so I think it will not
 be easy to support the multicolumn partition key for list
 partitions.

I don't understand.  If you want to range partition on columns (a, b),
you say that, say, tuples with (a, b) values less than (100, 200) go
here and the rest go elsewhere.  For list partitioning, you say that,
say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
rest go elsewhere.  I'm not sure how useful that is but it's not
illogical.
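
A tiny standalone C sketch of the row-wise comparison implied by a
multi-column range bound (illustrative only, not PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Row-wise comparison behind a range bound on (a, b): the pair is "less
 * than (100, 200)" if a < 100, or a == 100 and b < 200.  List partitioning
 * on (a, b) would instead test exact equality on both columns.
 */
static bool
pair_lt(int a, int b, int bound_a, int bound_b)
{
    if (a != bound_a)
        return a < bound_a;
    return b < bound_b;
}

int
main(void)
{
    printf("(99, 500)  < (100, 200)? %d\n", pair_lt(99, 500, 100, 200));   /* 1 */
    printf("(100, 199) < (100, 200)? %d\n", pair_lt(100, 199, 100, 200));  /* 1 */
    printf("(100, 200) < (100, 200)? %d\n", pair_lt(100, 200, 100, 200));  /* 0 */
    return 0;
}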

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 So just to clarify, first and last destinations are considered defined if 
 you have something like:

 ...
 PARTITION p1 VALUES LESS THAN 10
 PARTITION p2 VALUES BETWEEN 10 AND 20
 PARTITION p3 VALUES GREATER THAN 20
 ...

 And not defined if:

 ...
 PARTITION p1 VALUES BETWEEN 10 AND 20
 ...

Yes.

 For pg_dump --binary-upgrade, you need a statement like SELECT
 binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
 each pg_class entry.  So you can't easily have a single SQL statement
 creating multiple such entries.

 Hmm, do you mean pg_dump cannot emit such a SQL or there shouldn't be one 
 in the first place?

I mean that the binary upgrade script needs to set the OID for every
pg_class object being restored, and it does that by stashing away up
to one (1) pg_class OID before each CREATE statement.  If a single
CREATE statement generates multiple pg_class entries, this method
doesn't work.

 Makes sense. This would double as a way to create subpartitions too? And that 
 would have to play well with any choice we end up making about how we treat 
 subpartitioning key (one of the points discussed above)

Yes, I think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Sure, I don't feel we should not provide any way to take a dump
 of an individual partition, but not at the level of an independent table.
 Maybe something like --table table_name
 --partition partition_name.

 In general, I think we should try to avoid exposing that partitions are
 individual tables as that might hinder any future enhancement in that
 area (example if we someone finds a different and better way to
 arrange the partition data, then due to the currently exposed syntax,
 we might feel blocked).

I guess I'm in disagreement with you - and, perhaps - the majority on
this point.  I think that ship has already sailed: partitions ARE
tables.  We can try to make it less necessary for users to ever look
at those tables as separate objects, and I think that's a good idea.
But trying to go from a system where partitions are tables, which is
what we have today, to a system where they are not seems like a bad
idea to me.  If we make a major break from how things work today,
we're going to end up having to reimplement stuff that already works.

Besides, I haven't really seen anyone propose something that sounds
like a credible alternative.  If we could make partition objects
things that the storage layer needs to know about but the query
planner doesn't need to understand, that'd be maybe worth considering.
But I don't see any way that that's remotely feasible.  There are lots
of places that we assume that a heap consists of blocks numbered 0 up
through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
and pieces of the way index vacuuming is handled, which in turn bleeds
into Hot Standby.  You can't just decide that now block numbers are
going to be replaced by some more complex structure, or even that
they're now going to be nonlinear, without breaking a huge amount of
stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Robert Haas
On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 * parameter should be SUSET - it doesn't *need* to be set only at
 server start since all records are independent of each other

Why not USERSET?  There's no point in trying to prohibit users from
doing things that will cause bad performance because they can do that
anyway.

 * ideally we'd like to be able to differentiate the types of usage.
 which then allows the user to control the level of compression
 depending upon the type of action. My first cut at what those settings
 should be are ALL > LOGICAL > PHYSICAL > VACUUM.

 VACUUM - only compress while running vacuum commands
 PHYSICAL - only compress while running physical DDL commands (ALTER
 TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
 typically be used for logical decoding
 LOGICAL - compress FPIs for record types that change tables
 ALL - all user commands
 (each level includes all prior levels)

Interesting idea, but what evidence do we have that a simple on/off
switch isn't good enough?

 * name should not be wal_compression - we're not compressing all wal
 records, just fpis. There is no evidence that we even want to compress
 other record types, nor that our compression mechanism is effective at
 doing so. Simple = keep name as compress_full_page_writes

Quite right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Andres Freund
On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
  records, just fpis. There is no evidence that we even want to compress
  other record types, nor that our compression mechanism is effective at
  doing so. Simple = keep name as compress_full_page_writes
 
 Quite right.

I don't really agree with this. There's lots of records which can be
quite big where compression could help a fair bit. Most prominently
HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
chunk of WAL. And these are big and repetitive enough that compression
is very likely to be beneficial.

I still think that just compressing the whole record if it's above a
certain size is going to be better than compressing individual
parts. Michael argued that that'd be complicated because of the varying
size of the required 'scratch space'. I don't buy that argument
though. It's easy enough to simply compress all the data in some fixed
chunk size. I.e. always compress 64kb in one go. If there's more,
compress that independently.
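
A minimal standalone sketch of that fixed-chunk scheme; compress_one_chunk()
is a placeholder that just copies its input (standing in for a real
compressor such as pglz), so only the chunking logic is real:

#include <stdio.h>
#include <string.h>

#define COMPRESS_CHUNK_SIZE (64 * 1024)

/* Placeholder "compressor": copies the input so the sketch stays
 * self-contained; a real implementation would call pglz or similar here. */
static size_t
compress_one_chunk(const char *src, size_t srclen, char *dst)
{
    memcpy(dst, src, srclen);
    return srclen;
}

/* Compress a record in fixed-size chunks; the scratch space needed never
 * exceeds one chunk, no matter how large the record is. */
static size_t
compress_record(const char *rec, size_t reclen, char *out)
{
    size_t  done = 0;
    size_t  outlen = 0;

    while (done < reclen)
    {
        size_t  chunk = reclen - done;

        if (chunk > COMPRESS_CHUNK_SIZE)
            chunk = COMPRESS_CHUNK_SIZE;
        outlen += compress_one_chunk(rec + done, chunk, out + outlen);
        done += chunk;
    }
    return outlen;
}

int
main(void)
{
    static char record[200 * 1024];     /* stand-in for a large WAL record */
    static char compressed[sizeof(record)];

    memset(record, 'x', sizeof(record));
    printf("compressed %zu bytes in %d-byte chunks, output %zu bytes\n",
           sizeof(record), COMPRESS_CHUNK_SIZE,
           compress_record(record, sizeof(record), compressed));
    return 0;
}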

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 11:05 AM, Robert Haas wrote:
 I guess I'm in disagreement with you - and, perhaps - the majority on
 this point.  I think that ship has already sailed: partitions ARE
 tables.  We can try to make it less necessary for users to ever look
 at those tables as separate objects, and I think that's a good idea.
 But trying to go from a system where partitions are tables, which is
 what we have today, to a system where they are not seems like a bad
 idea to me.  If we make a major break from how things work today,
 we're going to end up having to reimplement stuff that already works.

I don't think it's feasible to drop inheritance partitioning at this
point; too many users exploit a lot of peculiarities of that system which
wouldn't be supported by any other system.  So any new partitioning
system we're talking about would be *in addition* to the existing
system.  Hence my prior email trying to make sure that a new proposed
system is sufficiently different from the existing one to be worthwhile.

 Besides, I haven't really seen anyone propose something that sounds
 like a credible alternative.  If we could make partition objects
 things that the storage layer needs to know about but the query
 planner doesn't need to understand, that'd be maybe worth considering.
 But I don't see any way that that's remotely feasible. 

On the other hand, as long as partitions exist exclusively at the
planner layer, we can't fix the existing major shortcomings of
inheritance partitioning, such as its inability to handle expressions.
Again, see previous.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
  records, just fpis. There is no evidence that we even want to compress
  other record types, nor that our compression mechanism is effective at
  doing so. Simple = keep name as compress_full_page_writes

 Quite right.

 I don't really agree with this. There's lots of records which can be
 quite big where compression could help a fair bit. Most prominently
 HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
 chunk of WAL. And these are big and repetitive enough that compression
 is very likely to be beneficial.

 I still think that just compressing the whole record if it's above a
 certain size is going to be better than compressing individual
 parts. Michael argued that that'd be complicated because of the varying
 size of the required 'scratch space'. I don't buy that argument
 though. It's easy enough to simply compress all the data in some fixed
 chunk size. I.e. always compress 64kb in one go. If there's more,
 compress that independently.

I agree that idea is worth considering.  But I think we should decide
which way is better and then do just one or the other.  I can't see
the point in adding wal_compress=full_pages now and then offering an
alternative wal_compress=big_records in 9.5.

I think it's also quite likely that there may be cases where
context-aware compression strategies can be employed.  For example,
the prefix/suffix compression of updates that Amit did last cycle
exploit the likely commonality between the old and new tuple.  We
might have cases like that where there are meaningful trade-offs to be
made between CPU and I/O, or other reasons to have user-exposed knobs.
I think we'll be much happier if those are completely separate GUCs,
so we can say things like compress_gin_wal=true and
compress_brin_effort=3.14 rather than trying to have a single
wal_compress GUC and assuming that we can shoehorn all future needs
into it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] On partitioning

2014-12-08 Thread Andres Freund
On 2014-12-08 14:05:52 -0500, Robert Haas wrote:
 On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila amit.kapil...@gmail.com wrote:
  Sure, I don't feel we should not provide any way to take a dump
  of an individual partition, but not at the level of an independent table.
  Maybe something like --table table_name
  --partition partition_name.
 
  In general, I think we should try to avoid exposing that partitions are
  individual tables as that might hinder any future enhancement in that
  area (example if we someone finds a different and better way to
  arrange the partition data, then due to the currently exposed syntax,
  we might feel blocked).
 
 I guess I'm in disagreement with you - and, perhaps - the majority on
 this point.  I think that ship has already sailed: partitions ARE
 tables.  We can try to make it less necessary for users to ever look
 at those tables as separate objects, and I think that's a good idea.
 But trying to go from a system where partitions are tables, which is
 what we have today, to a system where they are not seems like a bad
 idea to me.  If we make a major break from how things work today,
 we're going to end up having to reimplement stuff that already works.

I don't think this makes much sense. That'd severely restrict our
ability to do stuff for a long time. Unless we can absolutely rely on
the fact that partitions have the same schema and such we'll rob
ourselves of significant optimization opportunities.

 Besides, I haven't really seen anyone propose something that sounds
 like a credible alternative.  If we could make partition objects
 things that the storage layer needs to know about but the query
 planner doesn't need to understand, that'd be maybe worth considering.
 But I don't see any way that that's remotely feasible.  There are lots
 of places that we assume that a heap consists of blocks numbered 0 up
 through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
 and pieces of the way index vacuuming is handled, which in turn bleeds
 into Hot Standby.  You can't just decide that now block numbers are
 going to be replaced by some more complex structure, or even that
 they're now going to be nonlinear, without breaking a huge amount of
 stuff.

I think you're making a wrong fundamental assumption here. Just because
we define partitions to not be full relations doesn't mean we have to
treat them entirely separate. I don't see why a pg_class.relkind = 'p'
entry would be something actually problematic. That'd easily allow to
treat them differently in all the relevant places (all of ALTER TABLE,
DML et al) and still allow all of the current planner/executor
infrastructure. We can even allow direct SELECTs from individual
partitions if we want to - that's trivial to achieve.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus j...@agliodbs.com wrote:
 On 12/08/2014 11:05 AM, Robert Haas wrote:
 I guess I'm in disagreement with you - and, perhaps - the majority on
 this point.  I think that ship has already sailed: partitions ARE
 tables.  We can try to make it less necessary for users to ever look
 at those tables as separate objects, and I think that's a good idea.
 But trying to go from a system where partitions are tables, which is
 what we have today, to a system where they are not seems like a bad
 idea to me.  If we make a major break from how things work today,
 we're going to end up having to reimplement stuff that already works.

 I don't think it's feasible to drop inheritance partitioning at this
 point; too many users exploit a lot of peculiarities of that system which
 wouldn't be supported by any other system.  So any new partitioning
 system we're talking about would be *in addition* to the existing
 system.  Hence my prior email trying to make sure that a new proposed
 system is sufficiently different from the existing one to be worthwhile.

I think any new partitioning system should keep the good things about
the existing system, of which there are some, and not try to reinvent
the wheel.  The yardstick for a new system shouldn't be "is this
different enough?" but "does this solve the problems without creating
new ones?".

 Besides, I haven't really seen anyone propose something that sounds
 like a credible alternative.  If we could make partition objects
 things that the storage layer needs to know about but the query
 planner doesn't need to understand, that'd be maybe worth considering.
 But I don't see any way that that's remotely feasible.

 On the other hand, as long as partitions exist exclusively at the
 planner layer, we can't fix the existing major shortcomings of
 inheritance partitioning, such as its inability to handle expressions.
 Again, see previous.

Huh?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] compiler warnings under MinGW for 9.4

2014-12-08 Thread Jeff Janes
In the past, building under MinGW produced so many warnings that I never
bothered to read them.

Now most of them have been removed, so the ones that are left might be
worth reporting.

Using gcc.exe (GCC) 4.6.2 on REL9_4_STABLE
eadd80c08ddfc485db84b9af7cca54a0d50ebe6d I get:

mingwcompat.c:60:1: warning: 'RegisterWaitForSingleObject' redeclared
without dllimport attribute: previous dllimport ignored [-Wattributes]
input.c:382:1: warning: 'saveHistory' defined but not used
[-Wunused-function]

Does anyone have opinions on how to address these?


Cheers,

Jeff


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund and...@2ndquadrant.com wrote:
 I guess I'm in disagreement with you - and, perhaps - the majority on
 this point.  I think that ship has already sailed: partitions ARE
 tables.  We can try to make it less necessary for users to ever look
 at those tables as separate objects, and I think that's a good idea.
 But trying to go from a system where partitions are tables, which is
 what we have today, to a system where they are not seems like a bad
 idea to me.  If we make a major break from how things work today,
 we're going to end up having to reimplement stuff that already works.

 I don't think this makes much sense. That'd severely restrict our
 ability to do stuff for a long time. Unless we can absolutely rely on
 the fact that partitions have the same schema and such we'll rob
 ourselves of significant optimization opportunities.

I don't think that's mutually exclusive with the idea of
partitions-as-tables.  I mean, you can add code to the ALTER TABLE
path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
wherever you want.
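
To make that concrete, such a guard could look roughly like the sketch
below.  This is only an illustration: the relkind value and the helper
name are hypothetical (nothing like RELKIND_PARTITION exists today), and
where exactly the check would live is precisely what a real patch would
have to decide; only the ereport()/errcode()/errmsg() conventions are
existing ones.

#include "postgres.h"
#include "utils/rel.h"

/* Hypothetical marker for "this relation is a partition, not the root". */
#define RELKIND_PARTITION  'p'

/*
 * Hypothetical helper: error out when an ALTER TABLE subcommand is run
 * directly on a partition instead of on the partitioning root.
 */
static void
check_is_partitioning_root(Relation rel)
{
    if (rel->rd_rel->relkind == RELKIND_PARTITION)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("\"%s\" is a partition",
                        RelationGetRelationName(rel)),
                 errhint("Run the command on the partitioning root instead.")));
}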

 Besides, I haven't really seen anyone propose something that sounds
 like a credible alternative.  If we could make partition objects
 things that the storage layer needs to know about but the query
 planner doesn't need to understand, that'd be maybe worth considering.
 But I don't see any way that that's remotely feasible.  There are lots
 of places that we assume that a heap consists of blocks number 0 up
 through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
 and pieces of the way index vacuuming is handled, which in turn bleeds
 into Hot Standby.  You can't just decide that now block numbers are
 going to be replaced by some more complex structure, or even that
 they're now going to be nonlinear, without breaking a huge amount of
 stuff.

 I think you're making a wrong fundamental assumption here. Just because
 we define partitions to not be full relations doesn't mean we have to
 treat them entirely separate. I don't see why a pg_class.relkind = 'p'
 entry would be something actually problematic. That'd easily allow to
 treat them differently in all the relevant places (all of ALTER TABLE,
 DML et al) and still allow all of the current planner/executor
 infrastructure. We can even allow direct SELECTs from individual
 partitions if we want to - that's trivial to achieve.

We may just be using different words to talk about more-or-less the
same thing, then.  What I'm saying is that I want these things to keep
working:

- Indexes.
- Merge append and any other inheritance-aware query planning techniques.
- Direct access to individual partitions to bypass
tuple-routing/query-planning overhead.

I am not necessarily saying that I have a problem with putting other
restrictions on partitions, like requiring them to have the same tuple
descriptor or the same ACLs as their parents.  Those kinds of details
bear discussion, but I'm not intrinsically opposed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Lockless StrategyGetBuffer() clock sweep

2014-12-08 Thread Andres Freund
On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
 On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund and...@2ndquadrant.com wrote:
  But if it is, then how about
  adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
  and just checking the flag instead of checking bgwriterLatch itself?
 
  Yea, that'd be nicer. I didn't do it because it made the patch slightly
  more invasive... The complexity really is only needed because we're not
  guaranteed that 64bit reads are atomic. Although we actually can be
  sure, because there's no platform with nonatomic intptr_t reads - so
  64bit platforms actually *do* have atomic 64bit reads/writes.
 
 So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
  don't even need to take a lock to set it. Afaics the worst that can
  happen is that several processes set the latch...
 
 OK, that design is fine with me.

There's a slight problem with this, namely restarts of the bgwriter. If it
crashes, the reference to the relevant latch will currently be reset via
StrategyNotifyBgWriter(). In reality that's not a problem because
sizeof(void*) writes are always atomic, but currently we don't assume
that. We'd sometimes write to an old latch, but that's harmless because
they're never deallocated.

So I can see several solutions right now:
1) Redefine our requirements so that aligned sizeof(void*) writes are
   always atomic. That doesn't affect any currently supported platform
   and won't ever affect any future platform either. E.g. linux has had
   that requirement for a long time.
2) Use an explicitly defined latch for the bgwriter instead of using the
   PGPROC->procLatch. That way it never has to be reset and all
   SetLatch()s will eventually go to the right process.
3) Continue requiring the spinlock when setting the latch.
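
As a minimal model of the flag idea quoted above (all names and types here
are made up for illustration; the real shared buffer-strategy struct and
the Latch API are not shown), the write side just stores a plain int and
the clock sweep checks and clears it:

#include <stdio.h>

/* Stand-ins for the real shared-memory structures; illustrative only. */
typedef struct FakeLatch { int is_set; } FakeLatch;

typedef struct FakeStrategyControl
{
    FakeLatch *bgwriterLatch;    /* latest latch registered by the bgwriter */
    int        setBgwriterLatch; /* plain int flag, set without any lock */
} FakeStrategyControl;

static FakeStrategyControl ctl;

/* Backend side: cheap, lock-free request to wake the bgwriter. */
static void
request_bgwriter_wakeup(void)
{
    ctl.setBgwriterLatch = 1;    /* worst case: several backends set it */
}

/* Clock-sweep side: consume the flag and poke whatever latch is current. */
static void
maybe_set_bgwriter_latch(void)
{
    if (ctl.setBgwriterLatch)
    {
        ctl.setBgwriterLatch = 0;
        if (ctl.bgwriterLatch)
            ctl.bgwriterLatch->is_set = 1;   /* stand-in for SetLatch() */
    }
}

int
main(void)
{
    FakeLatch latch = { 0 };

    ctl.bgwriterLatch = &latch;
    request_bgwriter_wakeup();
    maybe_set_bgwriter_latch();
    printf("latch set: %d\n", latch.is_set);
    return 0;
}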

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] On partitioning

2014-12-08 Thread Andres Freund
On 2014-12-08 14:48:50 -0500, Robert Haas wrote:
 On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund and...@2ndquadrant.com wrote:
  I guess I'm in disagreement with you - and, perhaps - the majority on
  this point.  I think that ship has already sailed: partitions ARE
  tables.  We can try to make it less necessary for users to ever look
  at those tables as separate objects, and I think that's a good idea.
  But trying to go from a system where partitions are tables, which is
  what we have today, to a system where they are not seems like a bad
  idea to me.  If we make a major break from how things work today,
  we're going to end up having to reimplement stuff that already works.
 
  I don't think this makes much sense. That'd severely restrict our
  ability to do stuff for a long time. Unless we can absolutely rely on
  the fact that partitions have the same schema and such we'll rob
  ourselves of significant optimization opportunities.
 
 I don't think that's mutually exclusive with the idea of
 partitions-as-tables.  I mean, you can add code to the ALTER TABLE
 path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
 wherever you want.

That'll be a lot of places you'll need to touch. More fundamentally: Why
should we name something a table that's not one?

  Besides, I haven't really seen anyone propose something that sounds
  like a credible alternative.  If we could make partition objects
  things that the storage layer needs to know about but the query
  planner doesn't need to understand, that'd be maybe worth considering.
  But I don't see any way that that's remotely feasible.  There are lots
  of places that we assume that a heap consists of blocks number 0 up
  through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
  and pieces of the way index vacuuming is handled, which in turn bleeds
  into Hot Standby.  You can't just decide that now block numbers are
  going to be replaced by some more complex structure, or even that
  they're now going to be nonlinear, without breaking a huge amount of
  stuff.
 
  I think you're making a wrong fundamental assumption here. Just because
  we define partitions to not be full relations doesn't mean we have to
  treat them entirely separate. I don't see why a pg_class.relkind = 'p'
  entry would be something actually problematic. That'd easily allow to
  treat them differently in all the relevant places (all of ALTER TABLE,
  DML et al) and still allow all of the current planner/executor
  infrastructure. We can even allow direct SELECTs from individual
  partitions if we want to - that's trivial to achieve.
 
 We may just be using different words to talk about more-or-less the
 same thing, then.

That might be.

 What I'm saying is that I want these things to keep working:

 - Indexes.

Nobody argued against that I think.

 - Merge append and any other inheritance-aware query planning
 techniques.

Same here.

 - Direct access to individual partitions to bypass
 tuple-routing/query-planning overhead.

I think that might be ok in some cases, but in general I'd be very wary
of allowing that. I think it might be ok to allow direct read access, but
I'd be opposed to everything else. I'd much rather go the route of allowing
too few things and then gradually opening up if required than the other
way round (as the reverse pretty much will never happen, because it'll break
deployed systems).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 11:40 AM, Robert Haas wrote:
 I don't think it's feasible to drop inheritance partitioning at this
 point; too many users exploit a lot of peculiarities of that system which
 wouldn't be supported by any other system.  So any new partitioning
 system we're talking about would be *in addition* to the existing
 system.  Hence my prior email trying to make sure that a new proposed
 system is sufficiently different from the existing one to be worthwhile.
 
 I think any new partitioning system should keep the good things about
 the existing system, of which there are some, and not try to reinvent
 the wheel.  The yard stick for a new system shouldn't be is this
 different enough? but does this solve the problems without creating
 new ones?.

It's unrealistic to assume that a new system would support all of the
features of the existing inheritance partitioning without restriction.
 In fact, I'd say that such a requirement amounts to saying "don't
bother trying."

For example, inheritance allows us to have different indexes,
constraints, and even columns on partitions.  We can have overlapping
partitions, and heterogeneous multilevel partitioning (partition this
customer by month but partition that customer by week).  We can even add
triggers on individual partitions to reroute data away from a specific
partition.   A requirement to support all of these peculiar uses of
inheritance partitioning would doom any new partitioning project.

 Besides, I haven't really seen anyone propose something that sounds
 like a credible alternative.  If we could make partition objects
 things that the storage layer needs to know about but the query
 planner doesn't need to understand, that'd be maybe worth considering.
 But I don't see any way that that's remotely feasible.

 On the other hand, as long as partitions exist exclusively at the
 planner layer, we can't fix the existing major shortcomings of
 inheritance partitioning, such as its inability to handle expressions.
 Again, see previous.
 
 Huh?

Explained in the other email I posted on this thread.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't think that's mutually exclusive with the idea of
 partitions-as-tables.  I mean, you can add code to the ALTER TABLE
 path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
 wherever you want.

 That'll be a lot of places you'll need to touch. More fundamentally: Why
 should we name something a table that's not one?

Well, I'm not convinced that it isn't one.  And adding a new relkind
will involve a bunch of code churn, too.  But I don't much care to
pre-litigate this: when someone has got a patch, we can either agree
that the approach is OK or argue that it is problematic because X.  I
think we need to hammer down the design in broad strokes first, and
I'm not sure we're totally there yet.

 - Direct access to individual partitions to bypass
 tuple-routing/query-planning overhead.

 I think that might be ok in some cases, but in general I'd be very wary
 to allow that. I think it might be ok to allow direct read access, but
 everything else I'd be opposed. I'd much rather go the route of allowing
 to few things and then gradually opening up if required than the other
 way round (as that pretty much will never happen because it'll break
 deployed systems).

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus j...@agliodbs.com wrote:
 I think any new partitioning system should keep the good things about
 the existing system, of which there are some, and not try to reinvent
 the wheel.  The yard stick for a new system shouldn't be is this
 different enough? but does this solve the problems without creating
 new ones?.

 It's unrealistic to assume that a new system would support all of the
 features of the existing inheritance partitioning without restriction.
  In fact, I'd say that such a requirement amounts to saying don't
 bother trying.

 For example, inheritance allows us to have different indexes,
 constraints, and even columns on partitions.  We can have overlapping
 partitions, and heterogenous multilevel partitioning (partition this
 customer by month but partition that customer by week).  We can even add
 triggers on individual partitions to reroute data away from a specific
 partition.   A requirement to support all of these peculiar uses of
 inheritance partitioning would doom any new partitioning project.

I don't think it has to be possible to support every use case that we
can support today; clearly, a part of the goal here is to be LESS
general so that we can be more performant.  But I think the urge to
change too many things at once had better be tempered by a clear-eyed
vision of what can reasonably be accomplished in one patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Heikki Linnakangas

On 12/08/2014 09:21 PM, Andres Freund wrote:

I still think that just compressing the whole record if it's above a
certain size is going to be better than compressing individual
parts. Michael argued that that'd be complicated because of the varying
size of the required 'scratch space'. I don't buy that argument
though. It's easy enough to simply compress all the data in some fixed
chunk size. I.e. always compress 64kb in one go. If there's more
compress that independently.


Doing it in fixed-size chunks doesn't help - you have to hold onto the 
compressed data until it's written to the WAL buffers.


But you could just allocate a large enough scratch buffer, and give up 
if it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, 
it didn't compress very well, so there's probably no point in 
compressing it anyway. Now, an exception to that might be a record that 
contains something else than page data, like a commit record with 
millions of subxids, but I think we could live with not compressing 
those, even though it would be beneficial to do so.
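
A sketch of that control flow, using zlib's compress2() purely as a
stand-in for whatever compressor the WAL code would really use, and the
3 * 8 kB cap from the example above (sizes and names are illustrative):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define COMPRESS_CAP (3 * 8192)    /* give up beyond this, per the example */

/*
 * Try to compress a record-sized blob into a fixed scratch buffer.
 * Returns the compressed length, or -1 if it does not fit under the cap,
 * in which case the caller just writes the data uncompressed.
 */
static long
compress_record(const unsigned char *src, unsigned long srclen,
                unsigned char *scratch)
{
    uLongf      destlen = COMPRESS_CAP;

    if (compress2(scratch, &destlen, src, srclen,
                  Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;               /* Z_BUF_ERROR: did not compress enough */
    return (long) destlen;
}

int
main(void)
{
    static unsigned char src[64 * 1024];      /* pretend record data */
    static unsigned char scratch[COMPRESS_CAP];
    long        n;

    memset(src, 'x', sizeof(src));            /* highly compressible */
    n = compress_record(src, sizeof(src), scratch);
    if (n < 0)
        printf("stored uncompressed\n");
    else
        printf("compressed to %ld bytes\n", n);
    return 0;
}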


- Heikki





Re: [HACKERS] [v9.5] Custom Plan API

2014-12-08 Thread Jim Nasby

On 12/6/14, 5:21 PM, Kouhei Kaigai wrote:

 Yes please. We have other contrib modules that exist as tests, so this
 seems reasonable to me.


I can't improve the docs without the example code. Is that available now?


Please wait for a few days. The ctidscan module is not adjusted for the
latest interface yet.


I've made some minor edits, with an emphasis on not changing original intent. 
Each section was saved as a separate edit, so if anyone objects to something 
just revert the relevant change. Once the code is available more editing can be 
done.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-12-08 Thread Jim Nasby

On 12/7/14, 6:16 PM, Simon Riggs wrote:

On 20 October 2014 at 10:57, Jim Nasby jim.na...@bluetreble.com wrote:


Currently, a non-freeze vacuum will punt on any page it can't get a cleanup
lock on, with no retry. Presumably this should be a rare occurrence, but I
think it's bad that we just assume that and won't warn the user if something
bad is going on.


(I'm having email problems, so I can't see later mails on this thread,
so replying here.)

Logging patch looks fine, but I would rather not add a line of text
for each VACUUM, just in case this is non-zero. I think we should add
that log line only if the blocks skipped > 0.


I thought about doing that, but I'm loath to duplicate a rather large ereport 
call. Happy to make the change if that's the consensus though.


What I'm more interested in is what you plan to do with the
information once we get it?

The assumption that skipping blocks is something bad is strange. I
added it because VACUUM could and did regularly hang on busy tables,
which resulted in bloat because other blocks that needed cleaning
didn't get any attention.

Which is better, spend time obsessively trying to vacuum particular
blocks, or to spend the time on other blocks that are in need of
cleaning and are available to be cleaned?

Which is better, have autovacuum or system wide vacuum progress on to
other tables that need cleaning, or spend lots of effort retrying?

How do we know what is the best next action?

I'd really want to see some analysis of those things before we spend
even more cycles on this.


That's the entire point of logging this information. There is an underlying 
assumption that we won't actually skip many pages, but there's no data to back 
that up, nor is there currently any way to get that data.

My hope is that the logging shows that there isn't anything more that needs to 
be done here. If this is something that causes problems, at least now DBAs will 
be aware of it and hopefully we'll be able to identify specific problem 
scenarios and find a solution.



BTW, my initial proposal[1] was strictly logging. The only difference was 
raising it to a warning if a significant portion of the table was skipped. I 
only investigated retrying locks at the suggestion of others. I never intended 
this to become a big time sink.

[1]:
Currently, a non-freeze vacuum will punt on any page it can't get a cleanup 
lock on, with no retry. Presumably this should be a rare occurrence, but I think 
it's bad that we just assume that and won't warn the user if something bad is going 
on.

My thought is that if we skip any pages elog(LOG) how many we skipped. If we skip 
more than 1% of the pages we visited (not relpages) then elog(WARNING) instead.
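
A standalone sketch of that threshold rule (fprintf stands in for elog;
the 1% cutoff and the wording are just the ones proposed above):

#include <stdio.h>

/*
 * Report pages we could not get a cleanup lock on: stay silent if none
 * were skipped, LOG if some were, and upgrade to WARNING when more than
 * 1% of the visited pages were skipped.
 */
static void
report_skipped_pages(long skipped, long visited)
{
    const char *level;

    if (skipped == 0)
        return;
    level = (visited > 0 && skipped * 100 > visited) ? "WARNING" : "LOG";
    fprintf(stderr, "%s: skipped %ld of %ld visited pages waiting for cleanup lock\n",
            level, skipped, visited);
}

int
main(void)
{
    report_skipped_pages(0, 1000);     /* silent */
    report_skipped_pages(5, 1000);     /* LOG: 0.5% */
    report_skipped_pages(50, 1000);    /* WARNING: 5% */
    return 0;
}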
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Casting issues with domains

2014-12-08 Thread Jim Nasby

On 12/8/14, 9:18 AM, Tom Lane wrote:

The short answer is that SQL domains are not zero-cost type aliases.
Perhaps there would be value in having a feature that *is* a zero-cost
alias, but it wouldn't be a domain.


Note that you can actually re-use the support functions of one data type to create a new
one. So if you wanted a special type called "document" that actually behaved
the same as "text", you could do that fairly easily (though not as easily as creating a
domain).

If we were going to expend energy here, I suspect it would be more useful to look at ways
of creating new types without requiring C. C isn't an option in many (even most)
environments in today's cloud world, aside from the intimidation factor.
There are comments in the code that hypothesize about making cstring a full type; that
might be all that's needed.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Michael Paquier
On Tue, Dec 9, 2014 at 5:33 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 12/08/2014 09:21 PM, Andres Freund wrote:

 I still think that just compressing the whole record if it's above a
 certain size is going to be better than compressing individual
 parts. Michael argued that that'd be complicated because of the varying
 size of the required 'scratch space'. I don't buy that argument
 though. It's easy enough to simply compress all the data in some fixed
 chunk size. I.e. always compress 64kb in one go. If there's more
 compress that independently.


 Doing it in fixed-size chunks doesn't help - you have to hold onto the
 compressed data until it's written to the WAL buffers.

 But you could just allocate a large enough scratch buffer, and give up if
 it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, it
 didn't compress very well, so there's probably no point in compressing it
 anyway. Now, an exception to that might be a record that contains something
 else than page data, like a commit record with millions of subxids, but I
 think we could live with not compressing those, even though it would be
 beneficial to do so.
Another thing to consider is the possibility of controlling at the GUC level
the maximum size of record that we allow to compress.
-- 
Michael




Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-08 Thread Michael Paquier
On Tue, Dec 9, 2014 at 12:10 AM, Adam Brightwell
adam.brightw...@crunchydatasolutions.com wrote:
 Michael,


  This work will certainly continue to be pursued.  If a simple move is
  possible/acceptable, then I think that would be the best option.  I can
  handle that as it would appear that I am capable of moving it, if that
  is
  acceptable.
 Yes please. Actually I could have done it, just found the option to do
 so. Let's see what shows up with your work.


 I have moved it to commitfest 2014-12 and marked as Waiting on Author if
 that is acceptable.
Thanks! I guess that's fine.
-- 
Michael




Re: [HACKERS] On partitioning

2014-12-08 Thread Jim Nasby

On 12/8/14, 1:05 PM, Robert Haas wrote:

Besides, I haven't really seen anyone propose something that sounds
like a credible alternative.  If we could make partition objects
things that the storage layer needs to know about but the query
planner doesn't need to understand, that'd be maybe worth considering.
But I don't see any way that that's remotely feasible.  There are lots
of places that we assume that a heap consists of blocks number 0 up
through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
and pieces of the way index vacuuming is handled, which in turn bleeds
into Hot Standby.  You can't just decide that now block numbers are
going to be replaced by some more complex structure, or even that
they're now going to be nonlinear, without breaking a huge amount of
stuff.


Agreed, but it's possible to keep a block/CTID interface while doing something 
different on the disk.
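
(For reference, the tuple identifier that interface exposes is essentially
the pair below - an approximation of the header definition, not a verbatim
copy - and everything that consumes a CTID assumes the block number is a
dense 0..N range within one relation:)

#include <stdint.h>

/* Approximate shape of a CTID: 32-bit block number + 16-bit line pointer. */
typedef struct BlockId { uint16_t bi_hi; uint16_t bi_lo; } BlockId;

typedef struct ItemPointer
{
    BlockId  ip_blkid;    /* heap block number, 0 .. relation size - 1 */
    uint16_t ip_posid;    /* line pointer slot within that block */
} ItemPointer;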

If you think about it, partitioning is really a hack anyway. It clutters up 
your logical set implementation with a bunch of physical details. What most 
people really want when they implement partitioning is simply data locality.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] On partitioning

2014-12-08 Thread Jim Nasby

On 12/8/14, 12:26 PM, Josh Berkus wrote:

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?


Do users actually try and create new partitions during DML? That sounds doomed 
to failure in pretty much any system...


6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents
the partitioned table from being FK'd.
Not Addressed
(but could be addressed in the future)


And would be extremely useful even with simple inheritance, let alone 
partitioning...


9. Hibernate Problem
When using the trigger method, inserts into the master partition return
0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.


It would be really nice to address this with regular inheritance too...


11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.


Though, you should be able to do that in either system if you bother to define 
your own hash in a BEFORE trigger...


A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then attach it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.


How much of the desire for this is because our current row routing solutions 
are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal 
mostly eliminates it.


B. Catchall Partition
Many partitioning schemes currently contain a catchall partition which
accepts rows outside of the range of the partitioning scheme, due to bad
input data.  Probably not handled on purpose; Alvaro is proposing that
we reject these instead, or create the partitions on demand, which is a
legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the
partitioned column.


If we allowed for a catchall partition and supported normal inheritance/triggers on 
that partition then users could continue to do whatever they needed with data that didn't fit the 
normal partitioning pattern.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Simon Riggs
On 9 December 2014 at 04:09, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 * parameter should be SUSET - it doesn't *need* to be set only at
 server start since all records are independent of each other

 Why not USERSET?  There's no point in trying to prohibit users from
 doing things that will cause bad performance because they can do that
 anyway.

Yes, I think USERSET would work fine for this.

 * ideally we'd like to be able to differentiate the types of usage.
 which then allows the user to control the level of compression
 depending upon the type of action. My first cut at what those settings
 should be are ALL > LOGICAL > PHYSICAL > VACUUM.

 VACUUM - only compress while running vacuum commands
 PHYSICAL - only compress while running physical DDL commands (ALTER
 TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
 typically be used for logical decoding
 LOGICAL - compress FPIs for record types that change tables
 ALL - all user commands
 (each level includes all prior levels)

 Interesting idea, but what evidence do we have that a simple on/off
 switch isn't good enough?

Yes, I think that was overcooked. What I'm thinking is that in the
long run we might have groups of parameters attached to different
types of action, so we wouldn't need, for example, two parameters for
work_mem and maintenance_work_mem. We'd just have work_mem and then a
scheme that has different values of work_mem for different action
types.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Simon Riggs
On 9 December 2014 at 04:21, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
  records, just fpis. There is no evidence that we even want to compress
  other record types, nor that our compression mechanism is effective at
  doing so. Simple = keep name as compress_full_page_writes

 Quite right.

 I don't really agree with this. There's lots of records which can be
 quite big where compression could help a fair bit. Most prominently
 HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
 chunk of WAL. And these are big and repetitive enough that compression
 is very likely to be beneficial.

Yes, you're right there. I was forgetting those aren't FPIs. However
they are close enough that it wouldn't necessarily affect the naming
of a parameter that controls such compression.

 I still think that just compressing the whole record if it's above a
 certain size is going to be better than compressing individual
 parts.

I think it's OK to think it, but we should measure it.

For now then, I remove my objection to a commit of this patch based
upon parameter naming/rethinking. We have a fine tradition of changing
the names after the release is mostly wrapped, so let's pick a name in
a few months' time when the dust has settled on what's in.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-12-08 Thread Simon Riggs
On 9 December 2014 at 06:28, Jim Nasby jim.na...@bluetreble.com wrote:
 On 12/7/14, 6:16 PM, Simon Riggs wrote:

 What I'm more interested in is what you plan to do with the
 information once we get it?

 The assumption that skipping blocks is something bad is strange. I
 added it because VACUUM could and did regularly hang on busy tables,
 which resulted in bloat because other blocks that needed cleaning
 didn't get any attention.

 Which is better, spend time obsessively trying to vacuum particular
 blocks, or to spend the time on other blocks that are in need of
 cleaning and are available to be cleaned?

 Which is better, have autovacuum or system wide vacuum progress on to
 other tables that need cleaning, or spend lots of effort retrying?

 How do we know what is the best next action?

 I'd really want to see some analysis of those things before we spend
 even more cycles on this.


 That's the entire point of logging this information. There is an underlying
 assumption that we won't actually skip many pages, but there's no data to
 back that up, nor is there currently any way to get that data.

There is no such underlying assumption. You assumed there was one, but
there isn't one.

All I can say for certain is that waiting on a lock for long periods
was literally a waste of time. Now it no longer wastes time; it gets
on with vacuuming the pages it can.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 02:12 PM, Jim Nasby wrote:
 On 12/8/14, 12:26 PM, Josh Berkus wrote:
 4. Creation Locking Problem
 high probability of lock pile-ups whenever a new partition is created on
 demand due to multiple backends trying to create the partition at the
 same time.
 Not Addressed?
 
 Do users actually try and create new partitions during DML? That sounds
 doomed to failure in pretty much any system...

There is no question that it would be easier for users to create
partitions on demand automatically.  Particularly if you're partitioning
by something other than time.  For a particular case, consider users on
RDS, which has no cron jobs for creating new partitions; it's on demand
or manual.

It's quite possible that there is no good way to work out the locking
for on-demand partitions, but *if* we're going to have a 2nd
partitioning system, I think it's important to at least discuss the
problems with on-demand creation.

 11. Hash Partitioning
 Some users would prefer to partition into a fixed number of
 hash-allocated partitions.
 Not Addressed.
 
 Though, you should be able to do that in either system if you bother to
 define your own hash in a BEFORE trigger...

That doesn't do you any good with the SELECT query, unless you change
your middleware to add a hash(column) to every query.  Which would be
really hard to do for joins.

 A. COPY/ETL then attach
 In inheritance partitioning, you can easily build a partition outside
 the master and then attach it, allowing for minimal disturbance of
 concurrent users.  Could be addressed in the future.
 
 How much of the desire for this is because our current row routing
 solutions are very slow? I suspect that's the biggest reason, and
 hopefully Alvaro's proposal mostly eliminates it.

That doesn't always work, though.  In some cases the partition is being
built using some fairly complex logic (think of partitions which are
based on matviews) and there's no fast way to create the new data.
Again, this is an acceptable casualty of an improved design, but if it
will be so, we should consciously decide that.

 B. Catchall Partition
 Many partitioning schemes currently contain a catchall partition which
 accepts rows outside of the range of the partitioning scheme, due to bad
 input data.  Probably not handled on purpose; Alvaro is proposing that
 we reject these instead, or create the partitions on demand, which is a
 legitimate approach.

 C. Asymmetric Partitioning / NULLs in partition column
 This is the classic Active/Inactive By Month setup for partitions.
 Could be addressed via special handling for NULL/infinity in the
 partitioned column.
 
 If we allowed for a catchall partition and supported normal
 inheritance/triggers on that partition then users could continue to do
 whatever they needed with data that didn't fit the normal partitioning
 pattern.

That sounds to me like it would fall under the heading of "impossible
levels of backwards-compatibility".


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-08 Thread Michael Paquier
On Tue, Dec 2, 2014 at 3:42 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Adding on top of that a couple of things cleaned up, like docs and
 typos, and I got the patch attached. Let's have a committer have a
 look a it now, I am marking that as Ready for Committer.
For the archives, this has been committed as fe263d1. Thanks Simon for
looking and the final push. And sorry that I didn't spot the issue
with the TAP tests when reviewing; check-world passed, but my dev VM was
missing the necessary Perl packages.
Regards,
-- 
Michael




Re: [HACKERS] pg_recvlogical description

2014-12-08 Thread Michael Paquier
On Mon, Dec 8, 2014 at 11:53 PM, Euler Taveira eu...@timbira.com.br wrote:
 Hi,

 The pg_recvlogical docs were rewritten but someone forgot to tweak the
 help description. It is a bit late in the 9.4 cycle but let's be consistent.
Yeah, that makes sense. +1 for making docs consistent on master. For
9.4 this is a bit too late IMO.
-- 
Michael




Re: [HACKERS] alter user set local_preload_libraries.

2014-12-08 Thread Peter Eisentraut
On 12/8/14 12:39 PM, Robert Haas wrote:
 On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut pete...@gmx.net wrote:
 My radical proposal therefore would have been to embrace this
 inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether,
 relying on users interpreting the parameter names to indicate that
 changing them later may or may not have an effect.  Unfortunately, the
 concerns about ignore_system_indexes prevent that.
 
 What exactly are those concerns?  Do you have a link to previous discussion?

Earlier in the thread:
http://www.postgresql.org/message-id/20108.1415120...@sss.pgh.pa.us




Re: [HACKERS] Status of Commit fest 2014-10

2014-12-08 Thread Craig Ringer
On 12/08/2014 09:19 AM, Michael Paquier wrote:
 - Use faster, higher precision timer API GetSystemTimeAsFileTime on windows

This is now committed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] alter user set local_preload_libraries.

2014-12-08 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On 12/8/14 12:39 PM, Robert Haas wrote:
 On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut pete...@gmx.net wrote:
 My radical proposal therefore would have been to embrace this
 inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether,
 relying on users interpreting the parameter names to indicate that
 changing them later may or may not have an effect.  Unfortunately, the
 concerns about ignore_system_indexes prevent that.

 What exactly are those concerns?  Do you have a link to previous discussion?

 Earlier in the thread:
 http://www.postgresql.org/message-id/20108.1415120...@sss.pgh.pa.us

The core of the mentioned issues is that catalog searches done via the
systable_beginscan/systable_getnext API will ordinarily visit catalog
entries in the order of the specified index.  However, if
ignore_system_indexes is set, you get a seqscan that will return the same
tuples in heap order (effectively, random order).  There are known cases
where this results in minor planner inefficiencies, and I'm worried that
there might be outright bugs we don't know about, since that whole
operating mode can be best be described as entirely untested outside of
the bootstrap sequence.

Barring someone committing to spend the time to improve that situation
(time that would be poorly invested IMO), I don't think that we want to
open up ignore_system_indexes as USERSET, or do anything else to encourage
its use.

If we're intent on removing PGC_BACKEND then I'd be okay with
reclassifying ignore_system_indexes as SUSET; but I'm not exactly
convinced that we should be trying to get rid of PGC_BACKEND.

regards, tom lane




Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Langote

 From: Robert Haas [mailto:robertmh...@gmail.com]
 On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  I guess you could list or hash partition on multiple columns, too.
 
  How would you distinguish values in list partition for multiple
  columns? I mean for range partition, we are sure there will
  be either one value for each column, but for list it could
  be multiple and not fixed for each partition, so I think it will not
  be easy to support the multicolumn partition key for list
  partitions.
 
 I don't understand.  If you want to range partition on columns (a, b),
 you say that, say, tuples with (a, b) values less than (100, 200) go
 here and the rest go elsewhere.  For list partitioning, you say that,
 say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
 rest go elsewhere.  I'm not sure how useful that is but it's not
 illogical.
 

In the case of list partitioning, 100 and 200 would respectively be one of the
values in the lists of allowed values for a and b. I thought his concern was whether
this list of values for each column in the partkey is as convenient to store and
manipulate as range partvalues.

Thanks,
Amit






[HACKERS] moving from contrib to bin

2014-12-08 Thread Peter Eisentraut
Let's take another crack at moving stuff out of contrib.  Nobody likes
contrib.  The task is only finding something that most people like better.

Last time this was attempted, the discussion got lost in exactly which
extensions are worthy enough to be considered official or something like
that.  I want to dodge that here by starting at the opposite end:

1. move programs to src/bin/

2. move test things to src/test/  (I had that in my notes, but someone
already did that, so the stars must be aligned.)

3. deal with extensions later

Here are the contrib programs:

oid2name
pg_archivecleanup
pg_standby
pg_test_fsync
pg_test_timing
pg_upgrade
pg_xlogdump
pgbench
vacuumlo

The proposal would basically be to mv contrib/$x src/bin/$x and also
move the reference pages in the documentation.

We could consider alternative arrangements, if there is interest, such
as moving vacuumlo to scripts or moving pg_archivecleanup and pg_standby
into one directory.  It doesn't matter very much to me.

There is precedent for this: Some time ago we moved reindexdb from
contrib to scripts.

Besides moving things out of contrib, there is also a practical
motivation for this.  Putting both client and server programs into
contrib creates issues for packagers.  One would have to create a
separate -contrib-client package to package this properly.  Not to
mention packaging a bunch of unrelated extensions with these programs.
If we make these normal programs in src/bin/, packagers can put them
into the normal -client and -server packages, and everything will fall
into place.

Besides, a number of packagers have been treating pg_upgrade specially
and moved it out of contrib, so this would just be catching up with
reality a bit.

Comments?




Re: [HACKERS] moving from contrib to bin

2014-12-08 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 Last time this was attempted, the discussion got lost in exactly which
 extensions are worthy enough to be considered official or something like
 that.  I want to dodge that here by starting at the opposite end:
 1. move programs to src/bin/

 Here are the contrib programs:

 oid2name
 pg_archivecleanup
 pg_standby
 pg_test_fsync
 pg_test_timing
 pg_upgrade
 pg_xlogdump
 pgbench
 vacuumlo

 The proposal would basically be to mv contrib/$x src/bin/$x and also
 move the reference pages in the documentation.

Personally, I'm good with moving pg_archivecleanup, pg_standby,
pg_upgrade, pg_xlogdump, and pgbench this way.  (Although wasn't there
just some discussion about pg_standby being obsolete?  If so, shouldn't
we remove it instead of promoting it?)  As for the others:

I'm not exactly convinced that we want to encourage packagers to include
either pg_test_fsync or pg_test_timing in standard packages.  They are not
all that useful to ordinary users.

oid2name and vacuumlo, besides being of very dubious general utility,
are fails from a namespacing standpoint.  If we were to promote them
into standard install components I think a minimum requirement should be
to rename them to pg_something.  (oid2name is an entirely bogus name for
what it does, anyway.)  That would also be a good opportunity to revisit
their rather-ad-hoc APIs.

regards, tom lane




Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Kapila
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote langote_amit...@lab.ntt.co.jp
wrote:
  From: Robert Haas [mailto:robertmh...@gmail.com]
  On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila amit.kapil...@gmail.com
  wrote:
   I guess you could list or hash partition on multiple columns, too.
  
   How would you distinguish values in list partition for multiple
   columns? I mean for range partition, we are sure there will
   be either one value for each column, but for list it could
   be multiple and not fixed for each partition, so I think it will not
   be easy to support the multicolumn partition key for list
   partitions.
 
  I don't understand.  If you want to range partition on columns (a, b),
  you say that, say, tuples with (a, b) values less than (100, 200) go
  here and the rest go elsewhere.  For list partitioning, you say that,
  say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
  rest go elsewhere.  I'm not sure how useful that is but it's not
  illogical.
 

 In case of list partitioning, 100 and 200 would respectively be one of
the values in lists of allowed values for a and b. I thought his concern is
whether this list of values for each column in partkey is as convenient
to store and manipulate as range partvalues.


Yeah, and also how would the user specify the values?  As an example,
assume that a table is partitioned on monthly_salary, so the partition
definition would look like:

PARTITION BY LIST(monthly_salary)
(
PARTITION salary_less_than_thousand VALUES(300, 900),
PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
...
)

Now if the user wants to define a multi-column partition based on
monthly_salary and annual_salary, how do we want him to
specify the values?  Basically, how do we distinguish which values
belong to the first column key and which ones belong to the second
column key?
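
One way to make the question concrete: with a single-column list key each
partition stores a flat list of values, whereas the multi-column reading
needs a list of whole (monthly_salary, annual_salary) tuples, with its own
syntax for writing them.  A toy sketch of the matching side (the struct,
the values, and the VALUES ((...), (...)) syntax in the comment are all
invented for illustration):

#include <stdbool.h>
#include <stdio.h>

/* A multi-column list partition bound: each entry is a full (a, b) pair. */
typedef struct PairBound { int a; int b; } PairBound;

/* A tuple routes to the partition whose list contains its exact pair. */
static bool
pair_list_contains(const PairBound *bounds, int nbounds, int a, int b)
{
    int         i;

    for (i = 0; i < nbounds; i++)
        if (bounds[i].a == a && bounds[i].b == b)
            return true;
    return false;
}

int
main(void)
{
    /* e.g. PARTITION p1 VALUES ((300, 40000), (900, 100000)) -- invented */
    PairBound   p1[] = { { 300, 40000 }, { 900, 100000 } };

    printf("%d\n", pair_list_contains(p1, 2, 300, 40000));   /* 1: goes to p1 */
    printf("%d\n", pair_list_contains(p1, 2, 300, 90000));   /* 0: no match */
    return 0;
}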

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Kapila
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund and...@2ndquadrant.com
wrote:
  I don't think that's mutually exclusive with the idea of
  partitions-as-tables.  I mean, you can add code to the ALTER TABLE
  path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
  wherever you want.
 
  That'll be a lot of places you'll need to touch. More fundamentally: Why
  should we name something a table that's not one?

 Well, I'm not convinced that it isn't one.  And adding a new relkind
 will involve a bunch of code churn, too.  But I don't much care to
 pre-litigate this: when someone has got a patch, we can either agree
 that the approach is OK or argue that it is problematic because X.  I
 think we need to hammer down the design in broad strokes first, and
 I'm not sure we're totally there yet.

That's right. I think at this point defining the top-level behaviour/design
is very important in order to proceed; we can decide on the better
implementation approach afterwards (maybe once an initial patch is ready,
because it might not be major work to do it either way).  So here's where
we are on this point so far, as per my understanding: I think that direct
operations on partitions should be prohibited, you think that they should be
allowed, and Andres thinks that it might be better to allow direct operations
on partitions only for reads.


  - Direct access to individual partitions to bypass
  tuple-routing/query-planning overhead.
 
  I think that might be ok in some cases, but in general I'd be very wary
  to allow that. I think it might be ok to allow direct read access, but
  everything else I'd be opposed. I'd much rather go the route of allowing
  to few things and then gradually opening up if required than the other
  way round (as that pretty much will never happen because it'll break
  deployed systems).

 Why?


Because I think it will be difficult for users to write and maintain more of
such code, which is one of the complaints about the previous system, where the
user needs to write triggers to route the tuple to the appropriate partition.
I think as a first step we should try to improve the tuple-routing algorithm
so that it is not a pain for users, or at least so that it is on par with some of
the other competitive database systems, and if we are not able
to come up with such an implementation, then maybe we can think of
providing direct access as a special way for users to improve performance.

Another reason is that, fundamentally, partitions are managed internally
to divide the user data in a way that makes access cheaper; we
take the specifications for defining the partitions from users, and allowing
operations on internally managed objects could lead to users writing quite
some code to do what the database actually does internally.  Compare how
TOAST tables are used internally to manage large tuples: we
don't want users to directly perform DML on those tables.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] moving from contrib to bin

2014-12-08 Thread Andres Freund
On 2014-12-08 22:50:30 -0500, Tom Lane wrote:
 I'm not exactly convinced that we want to encourage packagers to include
 either pg_test_fsync or pg_test_timing in standard packages.  They are not
 all that useful to ordinary users.

I actually think both are quite useful when setting up new systems to
quickly screen for problems. There still is a fairly large number of
virtualized systems with pretty much broken timing functions; and
checking whether fsync actually takes some time is also a good thing to do
in virtualized environments - it's not an infrequent thing to see fsyncs
taking unrealistically little time.

Neither is likely to be harmful. So it doesn't seem harmful to move
them.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Compression of full-page-writes

2014-12-08 Thread Amit Kapila
On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 8 December 2014 at 11:46, Michael Paquier michael.paqu...@gmail.com
wrote:
  I don't really like those new names, but I'd prefer
  wal_compression_level if we go down that road with 'none' as default
  value. We may still decide in the future to support compression at the
  record level instead of context level, particularly if we have an API
  able to do palloc_return_null_at_oom, so the idea of WAL compression
  is not related only to FPIs IMHO.

 We may yet decide, but the pglz implementation is not effective on
 smaller record lengths. Nor has any testing been done to show that is
 even desirable.


It's even much worse for non-compressible (or less-compressible)
WAL data.  I am not clear how a simple on/off switch
could address such cases, because the benefit can sometimes
depend on which table the user is operating on (meaning the schema or
data in some tables are more amenable to compression, in which case
it can give us benefits).  I think maybe we should think along the
lines of what Robert has touched on in one of his e-mails (a context-aware
compression strategy).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] moving from contrib to bin

2014-12-08 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote:
 I actually think both are quite useful when setting up new systems to
 quickly screen for problems. There is still a fairly large number of
 virtualized systems with pretty much broken timing functions; and
 checking whether fsync actually takes some time is also a good thing to do
 in virtualized environments - it's not infrequent to see fsyncs
 taking an unrealistically short time.

 Neither is likely to be harmful, so it doesn't seem problematic to move
 them.

+1

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2014-12-08 Thread Amit Kapila
On Mon, Dec 8, 2014 at 11:21 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I think we have access to this information in the planner
  (RelOptInfo->pages); if we want, we can use that to eliminate small
  relations from parallelism, but the question is how big a relation we
  want to consider for parallelism.  One way is to check via tests, which
  I am planning to do; do you think we have any heuristic we can use to
  decide how big a relation should be considered for parallelism?

 Surely the Path machinery needs to decide this in particular cases
 based on cost.  We should assign some cost to starting a parallel
 worker via some new GUC, like parallel_startup_cost = 100,000.  And
 then we should also assign a cost to the act of relaying a tuple from
 the parallel worker to the master, maybe cpu_tuple_cost (or some new
 GUC).  For a small relation, or a query with a LIMIT clause, the
 parallel startup cost will make starting a lot of workers look
 unattractive, but for bigger relations it will make sense from a cost
 perspective, which is exactly what we want.


Sounds sensible.  cpu_tuple_cost is already used for another
purpose, so I am not sure it is the right thing to overload that parameter;
how about cpu_tuple_communication_cost or cpu_tuple_comm_cost?
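
To make sure we are talking about the same thing, here is a rough sketch
of the costing I understand you to be proposing (the two new GUCs do not
exist yet and the formula is deliberately simplified):

#include "postgres.h"
#include "optimizer/cost.h"     /* seq_page_cost, cpu_tuple_cost, Cost */

/* proposed GUCs; the defaults here are just placeholders */
double      parallel_startup_cost = 100000.0;
double      cpu_tuple_comm_cost = 0.1;

static void
cost_parallel_seqscan_sketch(Cost *startup_cost, Cost *total_cost,
                             double npages, double ntuples, int nworkers)
{
    Cost        run_cost = 0;

    /* paying to launch each worker makes small scans unattractive */
    *startup_cost = parallel_startup_cost * nworkers;

    /* I/O and qual evaluation are (optimistically) shared by the workers */
    run_cost += (seq_page_cost * npages + cpu_tuple_cost * ntuples) /
                (nworkers + 1);

    /* but every returned tuple still has to be relayed to the master */
    run_cost += cpu_tuple_comm_cost * ntuples;

    *total_cost = *startup_cost + run_cost;
}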

 There are probably other important considerations based on goals for
 overall resource utilization, and also because at a certain point
 adding more workers won't help because the disk will be saturated.  I
 don't know exactly what we should do about those issues yet, but the
 steps described in the previous paragraph seem like a good place to
 start anyway.


Agreed.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2014-12-08 Thread Amit Kapila
On Mon, Dec 8, 2014 at 11:27 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost sfr...@snowman.net wrote:
  For my 2c, I'd like to see it support exactly what the SeqScan node
  supports and then also what Foreign Scan supports.  That would mean we'd
  then be able to push filtering down to the workers which would be great.
  Even better would be figuring out how to parallelize an Append node
  (perhaps only possible when the nodes underneath are all SeqScan or
  ForeignScan nodes) since that would allow us to then parallelize the
  work across multiple tables and remote servers.

 I don't see how we can support the stuff ForeignScan does; presumably
 any parallelism there is up to the FDW to implement, using whatever
 in-core tools we provide.  I do agree that parallelizing Append nodes
 is useful; but let's get one thing done first before we start trying
 to do thing #2.

  I'm not entirely following this.  How can the worker be responsible for
  its own plan when the information passed to it (per the above
  paragraph..) is pretty minimal?  In general, I don't think we need to
  have specifics like this worker is going to do exactly X because we
  will eventually need some communication to happen between the worker and
  the master process where the worker can ask for more work because it's
  finished what it was tasked with and the master will need to give it
  another chunk of work to do.  I don't think we want exactly what each
  worker process will do to be fully formed at the outset because, even
  with the best information available, given concurrent load on the
  system, it's not going to be perfect and we'll end up starving workers.
  The plan, as formed by the master, should be more along the lines of
  this is what I'm gonna have my workers do along w/ how many workers,
  etc, and then it goes and does it.  Perhaps for an 'explain analyze' we
  return information about what workers actually *did* what, but that's a
  whole different discussion.

 I agree with this.  For a first version, I think it's OK to start a
 worker up for a particular sequential scan and have it help with that
 sequential scan until the scan is completed, and then exit.  It should
 not, as the present version of the patch does, assign a fixed block
 range to each worker; instead, workers should allocate a block or
 chunk of blocks to work on until no blocks remain.  That way, even if
 every worker but one gets stuck, the rest of the scan can still
 finish.


I will check on this point and see if it is feasible to do something
along those lines.  Currently, at the Executor initialization phase we
set the scan limits, and then during the Executor run phase we use
heap_getnext to fetch the tuples accordingly; doing it dynamically means
that at the ExecutorRun phase we need to reset the scan limit for which
page(s) to scan.  I still have to check whether there is any problem
with such an idea.  Do you have a different idea in mind?
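
To put it concretely, something like the below is what I understand you
to be suggesting (only a sketch; ParallelScanShared and claim_next_chunk
are made-up names and the real structure would live in the dynamic shared
memory segment).  Each worker would keep claiming the next chunk and reset
its scan limits to that chunk before continuing with heap_getnext:

/*
 * Sketch only: a shared counter from which workers claim chunks of blocks.
 * slock_t, SpinLockAcquire/SpinLockRelease and BlockNumber are existing
 * primitives; everything else is invented for illustration.
 */
#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

typedef struct ParallelScanShared
{
    slock_t     mutex;          /* protects next_block */
    BlockNumber next_block;     /* first block not yet handed out */
    BlockNumber nblocks;        /* total blocks in the relation */
    int         chunk_size;     /* blocks handed out per request */
} ParallelScanShared;

/* Returns false once every block has been handed out. */
static bool
claim_next_chunk(ParallelScanShared *shared,
                 BlockNumber *start, BlockNumber *end)
{
    bool        found = false;

    SpinLockAcquire(&shared->mutex);
    if (shared->next_block < shared->nblocks)
    {
        *start = shared->next_block;
        *end = Min(*start + shared->chunk_size, shared->nblocks);
        shared->next_block = *end;
        found = true;
    }
    SpinLockRelease(&shared->mutex);
    return found;
}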


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Langote

On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote langote_amit...@lab.ntt.co.jp
 wrote:
  From: Robert Haas [mailto:robertmh...@gmail.com]
  I don't understand.  If you want to range partition on columns (a, b),
  you say that, say, tuples with (a, b) values less than (100, 200) go
  here and the rest go elsewhere.  For list partitioning, you say that,
  say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
  rest go elsewhere.  I'm not sure how useful that is but it's not
  illogical.
 

 In the case of list partitioning, 100 and 200 would respectively be one of
 the values in the lists of allowed values for a and b. I thought his concern
 was whether this list of values for each column in the partkey is as
 convenient to store and manipulate as range partvalues.


 Yeah, and also how would the user specify the values?  As an example,
 assume that the table is partitioned on monthly_salary, so the partition
 definition would look like:

 PARTITION BY LIST(monthly_salary)
 (
 PARTITION salary_less_than_thousand VALUES(300, 900),
 PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
 ...
 )

 Now if the user wants to define a multi-column partition based on
 monthly_salary and annual_salary, how do we want him to
 specify the values?  Basically, how do we distinguish which values
 belong to the first column of the key and which ones belong to the
 second?


Amit, in one of my earlier replies to your question of why we may not want to
implement multi-column list partitioning (lack of user interest in the feature
or possible complexity of the code), I tried to explain how it might work if we
do choose to go that way. Basically, something we might call
PartitionColumnValue should be defined such that the above issue can be
suitably sorted out.

For example, a partition defining/bounding value would be a pg_node_tree
representation of a List of one of the following parse nodes (say), as
appropriate:

typedef struct PartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    char        partkind;
    Node       *partrangelower;
    Node       *partrangeupper;
    List       *partlistvalues;
} PartitionColumnValue;

OR separately,

typedef struct RangePartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    Node       *partrangelower;
    Node       *partrangeupper;
} RangePartitionColumnValue;

 

typedef struct ListPartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    List       *partlistvalues;
} ListPartitionColumnValue;

Where a partition definition would look like

typedef struct PartitionDef
{
    NodeTag     type;
    RangeVar    partition;
    RangeVar    parentrel;
    char       *kind;
    Node       *values;
    List       *options;
    char       *tablespacename;
} PartitionDef;

PartitionDef.values is an (ordered) List of PartitionColumnValue nodes, each of
which corresponds to one column of the partition key, in that order.

We should be able to devise a way to load the pg_node_tree representation of
PartitionDef.values (on-disk pg_partition_def.partvalues) into the relcache
using a suitable data structure, so that it becomes readily usable in the
variety of contexts in which we are interested in using this information.
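
For instance, with Amit's (monthly_salary, annual_salary) example, the
values List for one partition could be assembled roughly as below.  This
is entirely hypothetical code against the proposed nodes above
(T_ListPartitionColumnValue does not exist and the annual_salary values
are made up); the point is only that the per-column value lists stay
clearly separated:

/* hypothetical sketch against the proposed nodes above */
static PartitionDef *
make_salary_partition_def(void)
{
    ListPartitionColumnValue *msal = palloc0(sizeof(ListPartitionColumnValue));
    ListPartitionColumnValue *asal = palloc0(sizeof(ListPartitionColumnValue));
    PartitionDef *partdef = palloc0(sizeof(PartitionDef));

    msal->type = T_ListPartitionColumnValue;    /* would need a new NodeTag */
    msal->partcolname = pstrdup("monthly_salary");
    msal->partlistvalues = list_make2(makeInteger(300), makeInteger(900));

    asal->type = T_ListPartitionColumnValue;
    asal->partcolname = pstrdup("annual_salary");   /* values invented */
    asal->partlistvalues = list_make2(makeInteger(5000), makeInteger(12000));

    /* one entry per partition-key column, in partition-key order */
    partdef->values = (Node *) list_make2(msal, asal);
    return partdef;
}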

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misunderstanding on the FSM README file

2014-12-08 Thread Guillaume Lelarge
2014-12-07 15:07 GMT+01:00 Heikki Linnakangas hlinnakan...@vmware.com:

 On 12/07/2014 02:03 PM, Guillaume Lelarge wrote:

 Hi,

 I've been reading the FSM README file lately
 (src/backend/storage/freespace/README), and I'm puzzled by one of the
 graphs
 (the binary tree structure of an FSM file). Here it is:

       4
    4     2
   3 4   0 2    <- This level represents heap pages

 Shouldn't the last line be:
 4 3   2 0

 (ie, highest number of free space on the left node, lowest on the right
 one)

 Probably just nitpicking, but still, I'm wondering if I missed something
 out.


 No, that's not how it works. Each number at the bottom level corresponds
 to a particular heap page. The first number would be heap page #0 (which
 has 3 units of free space), the second heap page #1 (with 4 units of free
 space) and so forth. Each node on the upper levels stores the maximum of
 its two children.


Oh OK. Thanks Heikki, that makes perfect sense.
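
So, just to check that I've understood (a toy sketch, not the actual
freespace code): with the nodes stored level by level in an array, the
upper levels are simply rebuilt bottom-up as the max of each pair of
children, something like:

/*
 * Toy illustration only, not the real FSM code: node i has its children
 * at 2*i+1 and 2*i+2, and the leaves (one slot per heap page) sit at the
 * end of the array.
 */
static void
rebuild_upper_levels(uint8 *nodes, int nnodes)
{
    int     i;

    for (i = nnodes / 2 - 1; i >= 0; i--)
        nodes[i] = Max(nodes[2 * i + 1], nodes[2 * i + 2]);
}

For the example above that indeed gives 4 and 2 on the middle level and 4
at the root.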


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-08 Thread Michael Paquier
On Tue, Dec 9, 2014 at 10:10 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Dec 2, 2014 at 3:42 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Adding on top of that a couple of things cleaned up, like docs and
 typos, and I got the patch attached. Let's have a committer have a
 look a it now, I am marking that as Ready for Committer.
 For the archives, this has been committed as fe263d1. Thanks Simon for
 looking and the final push. And sorry that I didn't spot the issue
 with the TAP tests when reviewing; check-world passed, but my dev VM was
 missing the necessary Perl packages.
While re-looking at that, I just found that when selecting the
relations to be reindexed for a schema we ignore materialized views,
as the key scan is done using only 'r' as the relkind. The attached
patch fixes that.
Thanks,
-- 
Michael
From ae2b1b8c426698bb7142f9f02e4cf08295e9dd73 Mon Sep 17 00:00:00 2001
From: Michael Paquier michael@otacoo.com
Date: Tue, 9 Dec 2014 16:40:39 +0900
Subject: [PATCH] Fix REINDEX SCHEMA ignoring matviews

The key scan was using a filter on the relation's relkind, but that's not
actually necessary, as a filter is applied when building the list of OIDs
to reindex.
---
 src/backend/commands/indexcmds.c   |  8 ++--
 src/test/regress/expected/create_index.out | 18 +-
 src/test/regress/sql/create_index.sql  |  6 +++---
 3 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a3e8a15..9b07216 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1867,16 +1867,12 @@ ReindexObject(const char *objectName, ReindexObjectType objectKind)
 	 */
 	if (objectKind == REINDEX_OBJECT_SCHEMA)
 	{
-		scan_keys = palloc(sizeof(ScanKeyData) * 2);
+		scan_keys = palloc(sizeof(ScanKeyData));
 		ScanKeyInit(&scan_keys[0],
 	Anum_pg_class_relnamespace,
 	BTEqualStrategyNumber, F_OIDEQ,
 	ObjectIdGetDatum(objectOid));
-		ScanKeyInit(&scan_keys[1],
-	Anum_pg_class_relkind,
-	BTEqualStrategyNumber, F_CHAREQ,
-	'r');
-		num_keys = 2;
+		num_keys = 1;
 	}
 	else
 		num_keys = 0;
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ebac939..abffe65 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2837,12 +2837,12 @@ explain (costs off)
 REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
 ERROR:  schema schema_to_reindex does not exist
 CREATE SCHEMA schema_to_reindex;
-CREATE TABLE schema_to_reindex.table1(col1 SERIAL PRIMARY KEY);
-CREATE TABLE schema_to_reindex.table2(col1 SERIAL PRIMARY KEY, col2 VARCHAR(100) NOT NULL);
-CREATE INDEX ON schema_to_reindex.table2(col2);
+CREATE TABLE schema_to_reindex.table(col1 SERIAL PRIMARY KEY);
+CREATE MATERIALIZED VIEW schema_to_reindex.matview AS SELECT col1 FROM schema_to_reindex.table;
+CREATE INDEX ON schema_to_reindex.matview(col1);
 REINDEX SCHEMA schema_to_reindex;
-NOTICE:  table schema_to_reindex.table1 was reindexed
-NOTICE:  table schema_to_reindex.table2 was reindexed
+NOTICE:  table schema_to_reindex.table was reindexed
+NOTICE:  table schema_to_reindex.matview was reindexed
 BEGIN;
 REINDEX SCHEMA schema_to_reindex; -- failure, cannot run in a transaction
 ERROR:  REINDEX SCHEMA cannot run inside a transaction block
@@ -2852,13 +2852,13 @@ CREATE ROLE reindexuser login;
 SET SESSION ROLE user_reindex;
 ERROR:  role user_reindex does not exist
 REINDEX SCHEMA schema_to_reindex;
-NOTICE:  table schema_to_reindex.table1 was reindexed
-NOTICE:  table schema_to_reindex.table2 was reindexed
+NOTICE:  table schema_to_reindex.table was reindexed
+NOTICE:  table schema_to_reindex.matview was reindexed
 -- Clean up
 RESET ROLE;
 DROP ROLE user_reindex;
 ERROR:  role user_reindex does not exist
 DROP SCHEMA schema_to_reindex CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table schema_to_reindex.table1
-drop cascades to table schema_to_reindex.table2
+DETAIL:  drop cascades to table schema_to_reindex.table
+drop cascades to materialized view schema_to_reindex.matview
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 1cd57da..a5b3403 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -970,9 +970,9 @@ explain (costs off)
 --
 REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
 CREATE SCHEMA schema_to_reindex;
-CREATE TABLE schema_to_reindex.table1(col1 SERIAL PRIMARY KEY);
-CREATE TABLE schema_to_reindex.table2(col1 SERIAL PRIMARY KEY, col2 VARCHAR(100) NOT NULL);
-CREATE INDEX ON schema_to_reindex.table2(col2);
+CREATE TABLE schema_to_reindex.table(col1 SERIAL PRIMARY KEY);
+CREATE MATERIALIZED VIEW schema_to_reindex.matview AS SELECT col1 FROM schema_to_reindex.table;
+CREATE INDEX ON schema_to_reindex.matview(col1);
 REINDEX SCHEMA