Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-20 Thread Michael Paquier
On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:
> Oh, hang on, that's not necessarily true. On promotion, the standby
archives
> the last, partial WAL segment from the old timeline. That's just wrong
> (http://www.postgresql.org/message-id/52fcd37c.3070...@vmware.com), and in
> fact I somehow thought I changed that already, but apparently not. So
let's
> stop doing that.

Er, are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion? I thought from previous
discussions that the standby should do it, as the master (be it crashed,
burned, buried or dead) may not have had the occasion to do it. By
preventing its archiving you close the door on the case where the master
did not get the chance to archive it.

+/* */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
This is visibly missing a comment.

As primary_last_archived is used only by ProcessArchivalReport(), wouldn't
it be better to pass it as an argument to this function?

+   /* Check that the filename the primary reported looks valid */
+   if (strlen(primary_last_archived) < 24 ||
+   strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+   return;
Not related to this patch, but I think we had better have a macro doing
this job... this check keeps spreading around.
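
Something like this minimal sketch is what I have in mind (the macro name
is made up here, purely for illustration):

/* Hypothetical helper; requires <string.h> for strlen()/strspn(). */
#define LooksLikeXLogFileName(fname) \
	(strlen(fname) >= 24 && strspn(fname, "0123456789ABCDEF") == 24)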

People may be surprised that a base backup taken from a node that has
archive_mode = on (a very common configuration) will no longer work as-is,
as node startup will fail as follows:
FATAL:  archive_mode='on' cannot be used in archive recovery
HINT:  Use 'shared' or 'always' mode instead.
One idea would be to simply ignore archive_mode = on on nodes in recovery
instead of raising an error. Note that I like the fact that it raises an
error, as that is clear; I am just pointing out that people may be
surprised that such base backups no longer work in this case.

Are both WalSndArchivalReport() and WalSndArchivalReportIfNecessary()
really necessary? I think that for simplicity you could merge them and use
last_archival_report as a local variable.

Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG:  using stale statistics instead of current ones because stats
collector is not responding
LOG:  sending archival report:
You could scan archive_status/, but that would be costly if there are many
entries to scan, and I think that the walsender should be highly
responsive. Or you could directly store the name of the last WAL segment
marked as .done in, say, archive_status/last_archived. An entry in the
control file does not seem the right place, as a node may not have
archive_mode enabled; that's why I am not suggesting it.
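
To illustrate that last idea, a rough sketch (the file name and the
function are invented for this example, with best-effort error handling
only):

#include <stdio.h>

/*
 * Hypothetical sketch: after a segment has been archived successfully, the
 * archiver writes its name into a small status file, which a walsender
 * could then read without going through the pgstat machinery.
 */
static void
record_last_archived(const char *xlogfname)
{
	FILE	   *fp = fopen("archive_status/last_archived.tmp", "w");

	if (fp == NULL)
		return;					/* best effort only in this sketch */
	fprintf(fp, "%s\n", xlogfname);
	fclose(fp);
	/* rename so a concurrent reader never sees a partially written file */
	rename("archive_status/last_archived.tmp", "archive_status/last_archived");
}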

Regards,
-- 
Michael


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Andres Freund
On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
> Didn't you think any of the TODO threads had workable solutions?  And
> don't expect adding an additional file per relation will be zero cost
> --- added over the lifetime of 200M transactions, I question if this
> approach would be a win.

Note that normally you'd not run with a 200M transaction freeze max age
on a busy server, but rather around an order of magnitude more.

Think about this being used on a time-partitioned table. Right now all
the partitions have to be fully rescanned on a regular basis - quite
painful. With something like this normally only the newest partitions
will have to be.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-04-20 Thread Etsuro Fujita

On 2015/04/21 10:07, Kyotaro HORIGUCHI wrote:

At Mon, 20 Apr 2015 16:40:52 +0900, Etsuro Fujita  wrote 
in <5534ad84.3020...@lab.ntt.co.jp>

However, I'd
like to propose to rename "Foreign Update" ("Foreign Delete") of
ModifyTable simply to "Update" ("Delete") not only because (1) that
solves the duplication problem but also because (2) ISTM that is
consistent with the non-inherited updates in both of the
non-pushed-down-update case and the pushed-down-update case.  Here are
examples for (2).



An Update node without "Foreign" that runs "Remote SQL" looks somewhat
unusual to me...


I think that has a similarity with the existing EXPLAIN outputs for 
non-inherited non-pushed-down updates, as shown in the example below.


>> postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
>>                            QUERY PLAN
>> --------------------------------------------------------------------------
>>  Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
>>    Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
>>    ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
>>          Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ctid
>>          Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
>> (5 rows)


It seems to me that the problem is "Foreign Update"s for
ModifyTable that does nothing eventually.



I think the ForeignUpdate nodes should
be removed during planning if it is really ineffective,



If removed it looks like,

| =# explain verbose update p set b = b + 1;
|   QUERY PLAN
| --
|  Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
|Update on public.p
|->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
|  Output: p.a, (p.b + 1), p.ctid
|->  Foreign Update on public.ft1  (cost=100.00..180.04 rows=2155 width=14)
|  Remote SQL: UPDATE public.t1 SET b = (b + 1)
|->  Foreign Update on public.ft2  (cost=100.00..180.04 rows=2155 width=14)
|  Remote SQL: UPDATE public.t2 SET b = (b + 1)


On that point, I agree with Tom that that would cause the problem that 
the user has to guess at which of the child plans goes with which target 
relation of ModifyTable [1].


Thanks for the comments!

Best regards,
Etsuro Fujita

[1] http://www.postgresql.org/message-id/22505.1426986...@sss.pgh.pa.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Decoding follows timelines

2015-04-20 Thread Michael Paquier
On Fri, Feb 13, 2015 at 4:57 PM, Michael Paquier wrote:
> Moved patch to CF 2015-02 to not lose track of it, also because it does not
> seem it received a proper review.

This patch does not apply anymore, so attached is a rebased version.
The comments mentioned here have not been addressed:
http://www.postgresql.org/message-id/54a7bf61.9080...@vmware.com
Also, what kind of tests have been done? Logical decoding cannot be
used while a node is in recovery.
Regards,
-- 
Michael
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4a20569..3036ce6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -216,7 +216,8 @@ static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, Transac
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
-
+static XLogRecPtr GetLatestRequestPtr(void);
+static TimeLineID ReadSendTimeLine(TimeLineID tli);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -535,8 +536,6 @@ StartReplication(StartReplicationCmd *cmd)
 
 	if (cmd->timeline != 0)
 	{
-		XLogRecPtr	switchpoint;
-
 		sendTimeLine = cmd->timeline;
 		if (sendTimeLine == ThisTimeLineID)
 		{
@@ -545,18 +544,13 @@ StartReplication(StartReplicationCmd *cmd)
 		}
 		else
 		{
-			List	   *timeLineHistory;
-
 			sendTimeLineIsHistoric = true;
 
 			/*
 			 * Check that the timeline the client requested for exists, and
 			 * the requested start location is on that timeline.
 			 */
-			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
-			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
-		 &sendTimeLineNextTLI);
-			list_free_deep(timeLineHistory);
+			(void) ReadSendTimeLine(cmd->timeline);
 
 			/*
 			 * Found the requested timeline in the history. Check that
@@ -576,8 +570,8 @@ StartReplication(StartReplicationCmd *cmd)
 			 * that's older than the switchpoint, if it's still in the same
 			 * WAL segment.
 			 */
-			if (!XLogRecPtrIsInvalid(switchpoint) &&
-switchpoint < cmd->startpoint)
+			if (!XLogRecPtrIsInvalid(sendTimeLineValidUpto) &&
+sendTimeLineValidUpto < cmd->startpoint)
 			{
 ereport(ERROR,
 		(errmsg("requested starting point %X/%X on timeline %u is not in this server's history",
@@ -586,10 +580,9 @@ StartReplication(StartReplicationCmd *cmd)
 cmd->timeline),
 		 errdetail("This server's history forked from timeline %u at %X/%X.",
    cmd->timeline,
-   (uint32) (switchpoint >> 32),
-   (uint32) switchpoint)));
+   (uint32) (sendTimeLineValidUpto >> 32),
+   (uint32) sendTimeLineValidUpto)));
 			}
-			sendTimeLineValidUpto = switchpoint;
 		}
 	}
 	else
@@ -928,6 +921,8 @@ static void
 StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
+	XLogRecPtr	FlushPtr;
+	List	   *timeLineHistory;
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -940,6 +935,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	 * Force a disconnect, so that the decoding code doesn't need to care
 	 * about an eventual switch from running in recovery, to running in a
 	 * normal environment. Client code is expected to handle reconnects.
+	 * This covers the race condition where we are promoted half way
+	 * through starting up.
 	 */
 	if (am_cascading_walsender && !RecoveryInProgress())
 	{
@@ -948,6 +945,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		walsender_ready_to_stop = true;
 	}
 
+	if (am_cascading_walsender)
+	{
+		/* this also updates ThisTimeLineID */
+		FlushPtr = GetStandbyFlushRecPtr();
+	}
+	else
+		FlushPtr = GetFlushRecPtr();
+
 	WalSndSetState(WALSNDSTATE_CATCHUP);
 
 	/* Send a CopyBothResponse message, and start streaming */
@@ -974,6 +979,24 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	logical_startptr = MyReplicationSlot->data.restart_lsn;
 
 	/*
+	 * Find the timeline for the start location, or throw an error.
+	 *
+	 * Logical replication relies upon replication slots. Each slot has a
+	 * single timeline history baked into it, so this should be easy.
+	 */
+	timeLineHistory = readTimeLineHistory(ThisTimeLineID);
+	sendTimeLine = tliOfPointInHistory(logical_startptr, timeLineHistory);
+	if (sendTimeLine != ThisTimeLineID)
+	{
+		sendTimeLineIsHistoric = true;
+		sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, timeLineHistory,
+		 &sendTimeLineNextTLI);
+	}
+	list_free_deep(timeLineHistory);
+
+	streamingDoneSending = streamingDoneReceiving = false;
+
+	/*
 	 * Report the location after which we'll send out further commits as the
 	 * current sentPtr.
 	 */
@@ -2179,93 +2202,10 @@ XLogSendPhysical(void)
 		return;
 	}
 
-	/* Figure out how far we can safely send the WAL. */
-	if (sendTimeLineIsHistoric)
-	{
-		/*
-		 * Streaming an old timeline that's in this server's history, but is
-		 * not the one we're

[HACKERS] installcheck missing in src/bin/pg_rewind/Makefile

2015-04-20 Thread Michael Paquier
Hi all,

As mentioned in $subject, the TAP tests of pg_rewind are currently not
run by buildfarm machines as the buildfarm client uses installcheck to
run the tests in src/bin. A patch is attached to correct the problem.
Regards,
-- 
Michael
diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile
index e3400f5..98213c4 100644
--- a/src/bin/pg_rewind/Makefile
+++ b/src/bin/pg_rewind/Makefile
@@ -49,3 +49,6 @@ clean distclean maintainer-clean:
 
 check:
 	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-20 Thread Shigeru HANADA
Kaigai-san,

I reviewed the Custom/Foreign join API patch again after writing a patch of 
join push-down support for postgres_fdw.

On 2015/03/26 10:51, Kouhei Kaigai wrote:

>>> Or bottom of make_join_rel().  IMO build_join_rel() is responsible for just
>> building (or searching from a list) a RelOptInfo for given relids.  After
>> that, make_join_rel() calls add_paths_to_joinrel() with appropriate arguments
>> per join type to generate the actual Paths that implement the join.
>> make_join_rel() is called only once for a particular relid combination, and
>> SpecialJoinInfo and restrictlist (the conditions specified in JOIN ON and
>> WHERE) are available there, so it seems promising for FDW cases.
>>> 
>>> I like that idea, but I think we will have complex hook signature, it won't
>> remain as simple as hook (root, joinrel).
>> 
>> Signature of the hook (or the FDW API handler) would be like this:
>> 
>> typedef void (*GetForeignJoinPaths_function ) (PlannerInfo *root,
>>   RelOptInfo *joinrel,
>>   RelOptInfo *outerrel,
>>   RelOptInfo *innerrel,
>>   JoinType jointype,
>>   SpecialJoinInfo *sjinfo,
>>   List *restrictlist);
>> 
>> This is very similar to add_paths_to_joinrel(), but lacks semifactors and
>> extra_lateral_rels.  semifactors can be obtained with
>> compute_semi_anti_join_factors(), and extra_lateral_rels can be constructed
>> from root->placeholder_list as add_paths_to_joinrel() does.
>> 
>> From the viewpoint of postgres_fdw, jointype and restrictlist are necessary to
>> generate the SELECT statement, so it would require most of the work done in
>> make_join_rel again if the signature were hook(root, joinrel).  sjinfo will be
>> necessary for supporting SEMI/ANTI joins, but currently that is not in the
>> scope of postgres_fdw.
>> 
>> I guess that other FDWs require at least jointype and restrictlist.
>> 
> The attached patch adds a GetForeignJoinPaths call in make_join_rel() only when
> 'joinrel' is actually built and both child relations are managed by the same
> FDW driver, prior to any other built-in join paths.
> I adjusted the hook definition a little bit, because jointype can be reproduced
> using SpecialJoinInfo. Right?

Yes, it can be derived from the expression below:
jointype = sjinfo ? sjinfo->jointype : JOIN_INNER;

> 
> Probably, it will solve the original concern about multiple calls of the FDW
> handler in the case where it tries to replace an entire join subtree with a
> foreign scan on the result of a remote join query.
> 
> How about your opinion?

AFAIS it is well balanced in terms of call count and available information.

The new FDW API GetForeignJoinPaths is called only once for a particular
combination of joined relations, such as (A JOIN B JOIN C).  Before considering
all joins at a given join level (the number of relations contained in the join
tree), the possible join combinations of lower join levels are considered
recursively.  As Tom pointed out before, imagine a case like
((huge JOIN large) LEFT JOIN small): an expensive path at a lower join level
might be

Here, please let me summarize the changes in the patch as the result of my 
review.

* Add set_join_pathlist_hook_type in add_paths_to_joinrel
This hook is intended to provide a chance to add one or more CustomPaths for an
actual join combination.  If the join is reversible, the hook is called for
both A * B and B * A.  This is different from the FDW API, but it seems fine
because FDWs should have the chance to process the join at a more abstract
level than CSPs do.

The parameters are the same as hash_inner_and_outer's, so they should be enough
for hash-like or nestloop-like methods.  I'm not sure whether mergeclause_list
is necessary as a parameter or not.  It is information for merge joins,
generated when enable_mergejoin is on and the join is not FULL OUTER.  Does
some CSP need it for processing a join in its own way?  If so, it must be in
the parameter list, because select_mergejoin_clauses is static and therefore
not accessible from external modules.

The timing of the hook, after all built-in path types have been considered,
seems fine because some CSPs might want to use the built-in paths as a template
or a source.

One concern is about the documentation of the hook function.  "Implementing
Custom Paths" says:

> A custom scan provider will be also able to add paths by setting the 
> following hook, to replace built-in join paths by custom-scan that performs 
> as if a scan on preliminary joined relations, which is called after the core 
> code has generated what it believes to be the complete and correct set of 
> access paths for the join.

I think "replace" would mislead readers into thinking that a CSP can remove or
edit existing built-in paths listed in RelOptInfo#pathlist or linked from
cheapest_foo.  IIUC a CSP can just add paths for the jo

Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-04-20 Thread Kyotaro HORIGUCHI
Hi,

At Mon, 20 Apr 2015 16:40:52 +0900, Etsuro Fujita  
wrote in <5534ad84.3020...@lab.ntt.co.jp>
> On 2015/04/17 13:16, Amit Langote wrote:
> > On 17-04-2015 PM 12:35, Etsuro Fujita wrote:
> >> (2) that might cause the problem of associating subplans' update
> >> information with subplans' scan information, pointed out by Tom [1].
> 
> > Having realized how it really works now, my +1 to "Foreign Modifying Scan"
> > for cases of pushed down update as suggested by Albe Laurenz. I guess it
> > would be signaled by the proposed ForeignScan.CmdType being CMD_UPDATE /
> > CMD_DELETE (/ CMD_INSERT).
> 
> Thanks for the opinion!  I think that that is an idea.  However, I'd
> like to propose to rename "Foreign Update" ("Foreign Delete") of
> ModifyTable simply to "Update" ("Delete") not only because (1) that
> solves the duplication problem but also because (2) ISTM that is
> consistent with the non-inherited updates in both of the
> non-pushed-down-update case and the pushed-down-update case.  Here are
> examples for (2).

An Update node without "Foreign" that runs "Remote SQL" looks somewhat
unusual to me...

It seems to me that the problem is "Foreign Update"s for
ModifyTable that do nothing eventually. Even though I don't
understand this fully, especially what "Foreign Update" for
ModifyTable does when the "Foreign Update" in place of "Foreign Scan"
has finished the substantial work, I think the ForeignUpdate nodes should
be removed during planning if they are really ineffective, or such
"Foreign Update"s should be renamed or provided with some
explanation in the explain output if they do anything or cannot be
removed for some reason.

If removed it looks like,

| =# explain verbose update p set b = b + 1;
|   QUERY PLAN  
| --
|  Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
|Update on public.p
|->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
|  Output: p.a, (p.b + 1), p.ctid
|->  Foreign Update on public.ft1  (cost=100.00..180.04 rows=2155 width=14)
|  Remote SQL: UPDATE public.t1 SET b = (b + 1)
|->  Foreign Update on public.ft2  (cost=100.00..180.04 rows=2155 width=14)
|  Remote SQL: UPDATE public.t2 SET b = (b + 1)

regards,

> * Inherited and non-inherited updates for the non-pushed-down case:
> 
> postgres=# explain verbose update parent set c1 = trunc(random() * 9 +
> 1)::int;
>  QUERY PLAN
> -
>  Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
>Update on public.parent
>Update on public.ft1
>  Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
>Update on public.ft2
>  Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
>->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
>  Output: (trunc(((random() * '9'::double precision) +
> '1'::double precision)))::integer, parent.ctid
>->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
>  Output: (trunc(((random() * '9'::double precision) +
> '1'::double precision)))::integer, ft1.ctid
>  Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
>->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
>  Output: (trunc(((random() * '9'::double precision) +
> '1'::double precision)))::integer, ft2.ctid
>  Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
> (14 rows)
> 
> postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
>   QUERY PLAN
> --
>  Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
>Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
>->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
>  Output: (trunc(((random() * '9'::double precision) +
> '1'::double precision)))::integer, ctid
>  Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
> (5 rows)
> 
> * Inherited and non-inherited updates for the pushed-down case:
> 
> postgres=# explain verbose update parent set c1 = c1 + 1;
>   QUERY PLAN
> --
>  Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
>Update on public.parent
>Update on public.ft1
>Update on public.ft2
>->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
>  Output: (parent.c1 + 1), parent.ctid
>->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409
> width=10)
>  Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
>->  Foreign Update on public.ft2  (cost=100.00..18

Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Josh Berkus
On 04/20/2015 02:13 PM, Bruce Momjian wrote:
> Didn't you think any of the TODO threads had workable solutions?  And
> don't expect adding an additional file per relation will be zero cost
> --- added over the lifetime of 200M transactions, I question if this
> approach would be a win.

Well, the only real way to test that is a prototype, no?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-20 Thread Amit Langote
On 2015-04-21 AM 03:29, Robert Haas wrote:
> On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote:
>> On 08-04-2015 PM 12:46, Amit Kapila wrote:
>>> Going forward, I think we can improve the same if we decide not to shutdown
>>> parallel workers till postmaster shutdown once they are started and
>>> then just allocate them during executor-start phase.
>>
>> I wonder if it makes sense to invent the notion of a global pool of workers
>> with configurable number of workers that are created at postmaster start and
>> destroyed at shutdown and requested for use when a query uses parallelizable
>> nodes.
> 
> Short answer: Yes, but not for the first version of this feature.
> 
> Longer answer: We can't actually very reasonably have a "global" pool
> of workers so long as we retain the restriction that a backend
> connected to one database cannot subsequently disconnect from it and
> connect to some other database instead.  However, it's certainly a
> good idea to reuse the same workers for subsequent operations on the
> same database, especially if they are also by the same user.  At the
> very minimum, it would be good to reuse the same workers for
> subsequent operations within the same query, instead of destroying the
> old ones and creating new ones.  Notwithstanding the obvious value of
> all of these ideas, I don't think we should do any of them for the
> first version of this feature.  This is too big a thing to get perfect
> on the first try.
> 

Agreed.

Perhaps Amit has worked (or is working) on "reusing the same workers for
subsequent operations within the same query".

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-20 Thread Peter Geoghegan
On Sun, Apr 19, 2015 at 9:37 PM, Peter Geoghegan  wrote:
> Attached patch, V3.4, implements what I believe you and Heikki have in
> mind here.

Any plans to look at this, Svenne? You are signed up as a reviewer on
the commitfest app. A review of the documentation, and interactions
with other features like inheritance, updatable views and postgres_fdw
would be useful at this point. Obviously I've gone to a lot of effort
to document how things fit together at a high level on the UPSERT wiki
page, but these aspects have not been commented on all that much.

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Jim Nasby

On 4/20/15 4:13 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:

I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.


See the TODO list:

https://wiki.postgresql.org/wiki/Todo
o  Avoid the requirement of freezing pages that are infrequently
   modified


Right, but do you have a proposal for how that would actually happen?

Perhaps I'm mis-understanding you, but it sounded like you were
opposed to this patch because it doesn't do anything to avoid the
need to freeze. My point is that no one has any good ideas on how to
avoid freezing, and I think it's a safe bet that any ideas people do
come up with there will be a lot more invasive than a FrozenMap is.


Didn't you think any of the TODO threads had workable solutions?  And


I didn't realize there were threads there.

The first three are discussion around the idea of eliminating the need 
to freeze based on a page already being all visible. No patches.


http://www.postgresql.org/message-id/ca+tgmoaemnolzmvbb8gvy69na8zw9bwpiz9+tlz-lnabozi...@mail.gmail.com 
has a WIP patch that goes the route of using a tuple flag to indicate 
frozen, but also raises a lot of concerns about visibility, because it 
means we'd stop using FrozenXID. That impacts a large amount of code. 
There were some followup patches as well as a bunch of discussion of how 
to make it visible that a tuple was frozen or not. That thread died in 
January 2014.


The fifth thread is XID to LSN mapping. AFAICT this has a significant 
drawback in that it breaks page compatibility, meaning no pg_upgrade. It 
ends 5/14/2014 with this comment:


"Well, Heikki was saying on another thread that he had kind of gotten
cold feet about this, so I gather he's not planning to pursue it.  Not
sure if I understood that correctly.  If so, I guess it depends on
whether someone else can pick it up, but we might first want to
establish why he got cold feet and how worrying those problems seem to
other people." - 
http://www.postgresql.org/message-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=zlbxq3vofezno1db...@mail.gmail.com


So work was done on two alternative approaches, and then abandoned. Both 
of those approaches might still be valid, but seem to need more work. 
They're also higher risk because they're changing MVCC at a very 
fundamental level.


As I mentioned, I think there's a lot better stuff we could be doing 
about tuple lifetime, but there's no easy fixes to be had. This patch 
solves a problem today, using a concept that's now well proven 
(visibility map). If we had something more sophisticated being developed 
then I'd be inclined not to pursue this patch, but that's not the case.


Perhaps others can elaborate on where those two patches are at...


don't expect adding an additional file per relation will be zero cost
--- added over the lifetime of 200M transactions, I question if this
approach would be a win.


Can you elaborate on this? I don't see how the number of transactions 
would come into play, but the overhead here is not large; the FrozenMap 
would be the same size as the VM map, which is 1/64,000th as large as 
the heap. So a 64G table means a 1M FM. That doesn't seem very expensive.
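
For reference, the arithmetic behind that ratio, assuming the default 8 kB
block size and one bit per heap page (as in the visibility map):

  8192 bytes per map page * 8 bits per byte = 65536 heap pages per map page
  map size ~= heap size / 65536            (roughly the "1/64,000th" above)
  64 GB heap / 65536 ~= 1 MB frozen map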

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel mode and parallel contexts

2015-04-20 Thread Peter Geoghegan
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas  wrote:
> Here is yet another version of this patch.  In addition to the fixes
> mentioned above, this version includes some minor rebasing around
> recent commits, and also better handling of the case where we discover
> that we cannot launch workers after all.  This can happen because (1)
> dynamic_shared_memory_type=none, (2) the maximum number of DSM
> segments supported by the system configuration are already in use, or
> (3) the user creates a parallel context with nworkers=0.  In any of
> those cases, the system will now create a backend-private memory
> segment instead of a dynamic shared memory segment, and will skip
> steps that don't need to be done in that case.  This is obviously an
> undesirable scenario.  If we choose a parallel sequential scan, we
> want it to launch workers and really run in parallel.  Hopefully, in
> case (1) or case (3), we will avoid choosing a parallel plan in the
> first place, but case (2) is pretty hard to avoid completely, as we
> have no idea what other processes may or may not be doing with dynamic
> shared memory segments ... and, in any case, degrading to non-parallel
> execution beats failing outright.

I see that you're using git format-patch to generate this. But the
patch is only patch 1/4. Is that intentional? Where are the other
pieces?

I think that the parallel seqscan patch, and the assessing parallel
safety patch are intended to fit together with this patch, but I can't
find a place where there is a high level overview explaining just how
they fit together (I notice Amit's patch has an "#include
"access/parallel.h", which is here, but that wasn't trivial to figure
out). I haven't been paying too much attention to this patch series.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Alvaro Herrera
Bruce Momjian wrote:
> On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > 
> > This seems simple to implement: keep two counters, where the second one
> > is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
> > reset the first counter so that a further 5% of pages will get HOT pruned.  5%
> > seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
> > essentially +infinity.)
> 
> This would tend to dirty non-sequential heap pages --- it seems best to
> just clean as many as we are supposed to, then skip the rest, so we can
> write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it
to storage.  A page could remain dirty for a long time in the buffer
cache.  This writing of sequential pages would occur at checkpoint time
only, which seems the wrong thing to optimize.  If some other process
needs to evict pages to make room to read some other page in, surely
it's going to try one page at a time, not write "many sequential dirty
pages."

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Jim Nasby

On 4/20/15 2:45 AM, Sawada Masahiko wrote:

The current patch adds a new source file, src/backend/access/heap/frozenmap.c,
which is quite similar to visibilitymap.c. They have similar code but
are kept separate for now. I will refactor this source code, for example by
adding a common bitmap.c, if needed.


My feeling is we'd definitely want this refactored; it looks to be a 
whole lot of duplicated code. But before working on that we should get 
consensus that a FrozenMap is a good idea.


Are there any meaningful differences between the two, besides the 
obvious name changes?


I think there's also a bunch of XLOG stuff that could be refactored too...


Also, when skipping vacuum by the visibility map, we can skip at least
SKIP_PAGES_THRESHOLD consecutive pages, but such a mechanism is not in the
frozen map.


That's probably something else that can be factored out, since it's 
basically the same logic. I suspect we just need to && some of the 
checks so we're looking at both FM and VM at the same time.
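
A rough sketch of that combined test, with illustrative names that are not
taken from the patch:

#include <stdbool.h>

/*
 * Hypothetical helper: whether a heap block can be skipped, given the two
 * per-page bits and whether this is a wraparound (scan_all) vacuum.
 */
static bool
can_skip_block(bool all_visible, bool all_frozen, bool scan_all)
{
	if (scan_all)
		return all_frozen;	/* freeze vacuums may only skip all-frozen pages */
	return all_visible;		/* ordinary vacuums skip all-visible pages */
}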


Other comments...

It would be nice if we didn't need another page bit for FM; do you see 
any reasonable way that could happen?


+* If we didn't pin the visibility(and frozen) map page and the page has
+* become all visible(and frozen) while we were busy locking the buffer,
+* or during some subsequent window during which we had it unlocked,
+* we'll have to unlock and re-lock, to avoid holding the buffer lock
+* across an I/O.  That's a bit unfortunate, especially since we'll now
+* have to recheck whether the tuple has been locked or updated under us,
+* but hopefully it won't happen very often.
 */

s/(and frozen)/ or frozen/


+ * Reply XLOG_HEAP3_FROZENMAP record.
s/Reply/Replay/


+   /*
+* XLogReplayBufferExtended locked the buffer. But frozenmap_set
+* will handle locking itself.
+*/
+   LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);

Doesn't this create a race condition?


Are you sure the bit in finish_heap_swap() is safe? If so, we should add
the same for the visibility map too (it had certainly better be all
visible if it's frozen...)




+   /*
+* Current block is all-visible.
+* If frozen map represents that it's all frozen and this
+* function is called for freezing tuples, we can skip to
+* vacuum block.
+*/

I would state this as "Even if scan_all is true, we can skip blocks that 
are marked as frozen."


+   if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)

I suspect it's faster to reverse those tests (scan_all &&
frozenmap_test())... but why do we even need to look at scan_all? AFAICT
if a block is frozen we can skip it unconditionally.



+   /*
+* If the un-frozen tuple is remaining in current page and
+* current page is marked as ALL_FROZEN, we should clear it.
+*/

That needs to NEVER happen. If it does then we're going to consider 
tuples as visible/frozen that shouldn't be. We should probably throw an 
error here, because it means the heap is now corrupted. At the minimum 
it needs to be an assert().




Note that I haven't reviewed all the logic in detail at this point. If 
this ends up being refactored it'll be a lot easier to spot logic 
problems, so I'll hold off on that for now.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 09:56:20PM +0100, Simon Riggs wrote:
> On 20 April 2015 at 20:28, Jeff Janes  wrote:
>  
> 
> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
> while the user waits, which is fundamentally VACUUM's duty to do in the
> background? 
> 
> 
> Agreed. I don't see a % as giving us anything at all.
> 
> The idea is that we want to turn an O(N) problem for one query into an O(1)
> task.
>  
> 
> The use case I see for this is when there is a mixed workload.  There is
> one select which reads the entire table, and hundreds of thousands of
> selects/updates/inserts that don't, and of course vacuum comes along every
> now and then and does its thing.  Why should the one massive SELECT have
> horrible performance just because it was run right before autovacuum would
> have kicked in instead of right after it finished?
> 
> 
> +1

You can +1 all you want, but if you ignore the specific workloads I
mentioned, you are not going to get much traction.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > I think the limit has to be in terms of a percentage of the table size. 
> > For example, if we do one SELECT on a table with all non-dirty pages, it
> > would be good to know that 5% of the pages were pruned --- that tells me
> > that another 19 SELECTs will totally prune the table, assuming no future
> > writes.
> 
> This seems simple to implement: keep two counters, where the second one
> is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
> reset the first counter so that a further 5% of pages will get HOT pruned.  5%
> seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
> essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to
just clean as many as we are supposed to, then skip the rest, so we can
write sequential dirty pages to storage.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote:
> On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian  wrote:
> > Slightly improved patch applied.
> 
> Is it?

The patch has a slightly modified 'if' statement to check a constant
before calling a function, and uses else if:

< + if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
---
> + if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
47c57
< + if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
---
> + else if (!cxt.hasoids && interpretOidsOption(stmt->options, true))

I realize the change is subtle.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Peter Geoghegan
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas  wrote:
> On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan  wrote:
>> Why is that good?
>
> We did discuss this before.  I've recapped some of what I believe to
> be the most salient points below.
>
>> I think that people were all too quick to dismiss the idea of a wall
>> time interval playing some role here (at least as a defense against
>> correlated references, as a correlated reference period). I suppose
>> that that's because it doesn't fit with an intuition that says that
>> that kind of interval ought to be derived algebraically - magic delay
>> settings are considered suspect.
>
> Yep, Tom gave that reason here:
>
> http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us
>
> But there was also this point from Andres - gettimeofday is not free:
>
> http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de
>
> And this point from me - this can degrade to random eviction under
> high pressure:
>
> http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com
>
> You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying
it out properly. The LRU-K paper, from the early 1990s, recommends
this, and gettimeofday() calls were a lot more expensive back then.
I'm sure that there is a way to overcome these issues if it turns out
to be worth it (by amortizing gettimeofday() calls by driving it from
an auxiliary process like the bgwriter, for example). In fact, I'm
almost certain there is, because at least one other major database
system uses just such a reference period.
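
To sketch what I mean by amortizing (the names and the refresh mechanism
below are hypothetical, not an existing API):

#include <stdbool.h>
#include <sys/time.h>

/*
 * Hypothetical coarse clock: an auxiliary process (say, the bgwriter)
 * refreshes this every few milliseconds, and backends read it for free
 * when they touch a buffer, so buffer access never calls gettimeofday().
 */
static volatile long long coarse_now_usec;

static void
refresh_coarse_clock(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	coarse_now_usec = (long long) tv.tv_sec * 1000000 + tv.tv_usec;
}

/*
 * Cheap per-access check: is this reference within the correlated
 * reference period of the previous access to the same buffer?
 */
static bool
within_correlated_period(long long last_access_usec, long long period_usec)
{
	return coarse_now_usec - last_access_usec < period_usec;
}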

Back when ARC (or was it 2Q?) was committed before being reverted
shortly thereafter, there was a similar idea actually implemented, but
with XIDs preventing correlated references (which the LRU-K paper also
hints at). I think that an actual delay is more robust than that,
though. Even though this correlated reference delay is all I
implemented with my prototype, it's just one piece of the puzzle.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Robert Haas
On Mon, Apr 20, 2015 at 3:28 PM, Jeff Janes  wrote:
> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
> the user waits, which is fundamentally VACUUM's duty to do in the
> background?  If there are a handful of very hot pages, then it makes sense
> not to wait for vacuum to get to them.  And that is what a block-count limit
> does.

I think that's a fundamental mischaracterization of the problem.  As
soon as you define this as "vacuum's problem", then of course it makes
no sense to prune in the foreground, ever.  But if you define the
problem as "get the best overall system performance", then it clearly
DOES sometimes make sense to prune in the foreground, as benchmark
results upthread demonstrate.

The fact is that on a workload like pgbench - and it doesn't have to
be exactly pgbench, just any kind of workload where there are lots of
changes to the table - vacuum can at any given time be pruning at most
one page of the table.  That is because only one vacuum process can be
running in a given table at one time, and it can't be doing two things
at once.  But there can be many processes doing inserts, updates, or
deletes on that table, as many as whatever you have max_connections
set to.  There can easily be dozens even on a well-configured system;
on a poorly configured system, there could be hundreds.  It seems
obvious that if you can have dozens or hundreds of processes creating
garbage and at most one process cleaning it up, there will be cases
where you get further and further behind.

Now, it might well be that the right solution to that problem is to
allow multiple vacuum processes in the same database, or add
background workers to help with opportunistic HOT-pruning of pages so
it doesn't get done in the foreground.  Fine.  But as of today, on a
heavily-modified table, the ONLY way that we can possibly remove junk
from the table as fast as we're creating junk is if the backends
touching the table do some of the work.  Now, Simon is making the
argument that it should be good enough to have people *modifying* the
table help with the cleanup rather than imposing that load on the
people who are only *reading* it, and that's not a dumb argument, but
there are still cases where that strategy loses - specifically, where
the table churn has stopped or paused, but autovacuum hasn't run yet.
If you're going to do 1 sequential scan of the table and then go home
for the day, HOT-pruning is dumb even in that case.  If you're going
to do 1000 sequential scans of that table in a row, HOT-pruning may
very well be smart.  There's no guarantee that the table has met the
autovacuum threshold, but HOT-pruning it could well be a win anyway.
Or it might be a loss.  You can make any policy here look smart or
dumb by picking a particular workload, and you don't even have to
invent crazy things that will never happen in real life to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:
> >>I also think there's better ways we could handle *all* our cleanup
> >>work. Tuples have a definite lifespan, and there's potentially a lot
> >>of efficiency to be gained if we could track tuples through their
> >>stages of life... but I don't see any easy ways to do that.
> >
> >See the TODO list:
> >
> > https://wiki.postgresql.org/wiki/Todo
> > o  Avoid the requirement of freezing pages that are infrequently
> >modified
> 
> Right, but do you have a proposal for how that would actually happen?
> 
> Perhaps I'm mis-understanding you, but it sounded like you were
> opposed to this patch because it doesn't do anything to avoid the
> need to freeze. My point is that no one has any good ideas on how to
> avoid freezing, and I think it's a safe bet that any ideas people do
> come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions?  And
don't expect adding an additional file per relation will be zero cost
--- added over the lifetime of 200M transactions, I question if this
approach would be a win.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-20 Thread Robert Haas
On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian  wrote:
> Slightly improved patch applied.

Is it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Jim Nasby

On 4/20/15 2:09 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:

On 4/20/15 1:48 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:

Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.

Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.


So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE?  Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200M transactions?  I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.


How would you propose we do that?

I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.


See the TODO list:

https://wiki.postgresql.org/wiki/Todo
o  Avoid the requirement of freezing pages that are infrequently
   modified


Right, but do you have a proposal for how that would actually happen?

Perhaps I'm mis-understanding you, but it sounded like you were opposed 
to this patch because it doesn't do anything to avoid the need to 
freeze. My point is that no one has any good ideas on how to avoid 
freezing, and I think it's a safe bet that any ideas people do come up 
with there will be a lot more invasive than a FrozenMap is.


While not perfect, a FrozenMap is something we can do today, without a 
lot of effort, and it will provide definite value for any tables that 
have a "good" amount of frozen pages. Without performance testing, we 
don't know what "good" actually looks like, but we can't test without a 
patch (which we now have). Assuming performance numbers look good I 
think it would be folly to reject this patch in the hopes that 
eventually we'll have some way to avoid the need to freeze.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Simon Riggs
On 20 April 2015 at 20:28, Jeff Janes  wrote:


> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
> while the user waits, which is fundamentally VACUUM's duty to do in the
> background?
>

Agreed. I don't see a % as giving us anything at all.

The idea is that we want to turn an O(N) problem for one query into an O(1)
task.


> The use case I see for this is when there is a mixed workload.  There is
> one select which reads the entire table, and hundreds of thousands of
> selects/updates/inserts that don't, and of course vacuum comes along every
> now and then and does its thing.  Why should the one massive SELECT have
> horrible performance just because it was run right before autovacuum would
> have kicked in instead of right after it finished?
>

+1

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-20 Thread Bruce Momjian
On Thu, Apr  9, 2015 at 05:33:20PM -0400, Bruce Momjian wrote:
> On Thu, Apr  9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > 
> > > Should this be listed in the release notes as a backward-incompatibility?
> > 
> > Isn't this a backpatchable bug fix?
> 
> Uh, I don't think so.  I think users are used to the existing behavior
> and changing it on them will cause more harm than good.  Also, we have
> had zero field reports about this problem.
> 
> The updated attached patch handles cases where the default_with_oids =
> true.

Slightly improved patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 12:28:11PM -0700, Jeff Janes wrote:
> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
> the user waits, which is fundamentally VACUUM's duty to do in the background? 
> If there are a handful of very hot pages, then it makes sense not to wait for
> vacuum to get to them.  And that is what a block-count limit does.  
> 
> But if the entire table is very hot, I think that that is just another way
> of saying that autovacuum is horribly misconfigured.  I think the purpose of

Well, we have to assume there are many misconfigured configurations ---
autovacuum isn't super-easy to configure, so we can't just blame the
user if this makes things worse.  In fact, page pruning was designed
specifically for cases where autovacuum wasn't running or couldn't keep
up.

> this patch is to fix something that can't be fixed through configuration 
> alone.
>  
> 
> If there are future writes, they would dirty the pages and
> cause even more pruning, but the 5% gives me the maximum pruning number
> of SELECTs.  If there aren't another 19 SELECTs, do I care if the table
> is pruned or not? 
> 
> 
> The use case I see for this is when there is a mixed workload.  There is one
> select which reads the entire table, and hundreds of thousands of selects/
> updates/inserts that don't, and of course vacuum comes along every now and
> then and does its thing.  Why should the one massive SELECT have horrible
> performance just because it was run right before autovacuum would have kicked
> in instead of right after it finished?

I see your point, but what about the read-only workload after a big
update?  Do we leave large tables to be non-pruned for a long time? 
Also, consider cases where you did a big update, the autovacuum
threshold was not met, so autovacuum doesn't run on that table ---
again, do we keep those non-pruned rows around for millions of scans?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] optimizing vacuum truncation scans

2015-04-20 Thread Qingqing Zhou
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes  wrote:
> I did literally the simplest thing I could think of as a proof of concept 
> patch, to see if it would actually fix things.  I just jumped back a certain 
> number of blocks occasionally and prefetched them forward, then resumed the 
> regular backward scan.
IIRC, this patch introduces optional prefetching for the backward scan,
where file system prefetching may not work well. The result is promising!

> Also, what would be the best way to drill a hole through bufmgr.c into md.c 
> so that the prefetch could specify an entire range, rather than looping over 
> each individual block?
>
Unlike mdread(), mdprefetch() does not require a buffer argument, so
extending it looks natural and safe to me. This would also lay a foundation
for other scan-style optimizations.
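
As a rough illustration (the range function and its signature below are
invented; only mdprefetch() itself exists, taking one block at a time):

#include "postgres.h"
#include "storage/smgr.h"

/*
 * Hypothetical sketch of a range-based prefetch entry point: issue one
 * prefetch hint per block of a consecutive range, without pinning buffers.
 * A real implementation might pass the whole range further down so the
 * kernel can be given a single, larger hint.
 */
static void
mdprefetch_range(SMgrRelation reln, ForkNumber forknum,
				 BlockNumber firstblock, BlockNumber nblocks)
{
	BlockNumber blkno;

	for (blkno = firstblock; blkno < firstblock + nblocks; blkno++)
		mdprefetch(reln, forknum, blkno);
}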

> What would have to be done to detect people running on SSD and disable the 
> feature, if anything?
>
There are other call sites of prefetch, so this would be a topic not
just for this optimization.

Regards,
Qingqing


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [pgsql-packagers] Palle Girgensohn's ICU patch

2015-04-20 Thread Peter Eisentraut
On 4/19/15 7:46 AM, Palle Girgensohn wrote:
> Just poking this old thread again. What happened here, is anyone putting work 
> into this area at the moment?

I plan to look at it for 9.6.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > I think the limit has to be in terms of a percentage of the table size. 
> > For example, if we do one SELECT on a table with all non-dirty pages, it
> > would be good to know that 5% of the pages were pruned --- that tells me
> > that another 19 SELECTs will totally prune the table, assuming no future
> > writes.
> 
> This seems simple to implement: keep two counters, where the second one
> is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
> reset the first counter so that further 5 pages will get HOT pruned.  5%
> seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
> essentially +infinity.)

Oh, I pulled 5% out of the air.  Thinking of a SELECT-only workload,
which would be our worst case, I was thinking about how many SELECTs
running through HOT update chains it would take to be slower than
generating the WAL to prune the page.  I see the percentage as something
that we could reasonably balance, while a fixed page count couldn't be
analyzed in that way.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Jeff Janes
On Mon, Apr 20, 2015 at 10:33 AM, Bruce Momjian  wrote:

> On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote:
> > That is how we arrive at the idea of a cleanup limit, further enhanced
> by a
> > limit that applies only to dirtying clean blocks, which we have 4?
> recent votes
> > in favour of.
> >
> > I would personally be in favour of a parameter to control the limit,
> since
> > whatever we chose is right/wrong depending upon circumstances. I am
> however
> > comfortable with not having a parameter if people think it is hard to
> tune
> > that, which I agree it would be, hence no parameter in the patch.
>
> I think the limit has to be in terms of a percentage of the table size.
> For example, if we do one SELECT on a table with all non-dirty pages, it
> would be good to know that 5% of the pages were pruned --- that tells me
> that another 19 SELECTs will totally prune the table, assuming no future
> writes.


But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
while the user waits, which is fundamentally VACUUM's duty to do in the
background?  If there are a handful of very hot pages, then it makes sense
not to wait for vacuum to get to them.  And that is what a block-count
limit does.

But if the entire table is very hot, I think that that is just another of
way of saying that autovacuum is horribly misconfigured.  I think the
purpose of this patch is to fix something that can't be fixed through
configuration alone.


> If there are future writes, they would dirty the pages and
> cause even more pruning, but the 5% gives me the maximum pruning number
> of SELECTs.  If there aren't another 19 SELECTs, do I care if the table
> is pruned or not?


The use case I see for this is when there is a mixed workload.  There is
one select which reads the entire table, and hundreds of thousands of
selects/updates/insert that don't, and of course vacuum comes along every
now and then and does it thing.  Why should the one massive SELECT have
horrible performance just because it was run right before autovacuum would
have kicked in instead of right after it finished?

Cheers,

Jeff


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Alvaro Herrera
Bruce Momjian wrote:

> I think the limit has to be in terms of a percentage of the table size. 
> For example, if we do one SELECT on a table with all non-dirty pages, it
> would be good to know that 5% of the pages were pruned --- that tells me
> that another 19 SELECTs will totally prune the table, assuming no future
> writes.

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned.  5%
seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
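
A hedged sketch of the two-counter scheme described above; the counter names, the budget, and the value of SOME_MAX_VALUE are illustrative and not from any posted patch.

#include "postgres.h"

#define PRUNE_BUDGET     100    /* pages we may prune before throttling (illustrative) */
#define SOME_MAX_VALUE  2000    /* pages to skip before refilling the budget (illustrative) */

static int  pruned_pages = 0;   /* first counter: pages HOT-pruned so far */
static int  skipped_pages = 0;  /* second counter: pages where cleanup was skipped */

/* Decide whether the current page may be HOT-pruned by this scan. */
static bool
prune_allowed(void)
{
    if (pruned_pages < PRUNE_BUDGET)
    {
        pruned_pages++;
        return true;            /* still within the budget */
    }

    if (++skipped_pages >= SOME_MAX_VALUE)
    {
        /* reset the first counter so that a further PRUNE_BUDGET pages get pruned */
        skipped_pages = 0;
        pruned_pages = 0;
    }
    return false;
}

Making PRUNE_BUDGET a fraction of the relation's block count would give the percentage-based behaviour Bruce asks for upthread.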


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:
> On 4/20/15 1:48 PM, Bruce Momjian wrote:
> >On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
> >>Attached WIP patch adds Frozen Map which enables us to avoid whole
> >>table vacuuming even when full scan is required: preventing XID
> >>wraparound failures.
> >>
> >>Frozen Map is a bitmap with one bit per heap page, and quite similar
> >>to Visibility Map. A set bit means that all tuples on heap page are
> >>completely frozen, therefore we don't need to do vacuum freeze that
> >>page.
> >>A bit is set when vacuum(or autovacuum) figures out that all tuples on
> >>corresponding heap page are completely frozen, and a bit is cleared
> >>when INSERT and UPDATE(only new heap page) are executed.
> >
> >So, this patch avoids reading the all-frozen pages if it has not been
> >modified since the last VACUUM FREEZE?  Since it is already frozen, the
> >running VACUUM FREEZE will not modify the page or generate WAL, so is it
> >really worth maintaining a new per-page bitmap just to avoid the
> >sequential scan of tables every 200M transactions?  I would like to see
> >us reduce the need for VACUUM FREEZE, rather than go in this direction.
> 
> How would you propose we do that?
> 
> I also think there's better ways we could handle *all* our cleanup
> work. Tuples have a definite lifespan, and there's potentially a lot
> of efficiency to be gained if we could track tuples through their
> stages of life... but I don't see any easy ways to do that.

See the TODO list:

https://wiki.postgresql.org/wiki/Todo
o  Avoid the requirement of freezing pages that are infrequently
   modified

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-20 Thread David G. Johnston
On Sun, Apr 19, 2015 at 3:02 PM, Jim Nasby  wrote:

> Is there a fundamental reason SQL/plpgsql functions won't accept record as
> an input type? If not, can someone point me at a patch that might show how
> much work would be involved in adding support?
>
> My particular use case is a generic function that will count how many
> fields in a record are NULL. I can do it in pure SQL (below), but was
> hoping to wrap the entire thing in a function. Right now, I have to add a
> call to row_to_json() to the function call.
>
> SELECT count(*)
>   FROM json_each_text( row_to_json($1) ) a
>   WHERE value IS NULL


See also:

SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
ERROR: record type has not been registered

While it may not be necessary to solve both problems I suspect they have
the same underlying root cause - specifically the separation of concerns
between the planner and the executor.

ISTM that the planner needs to be able to create arbitrarily named
composite types and leave them "registered" in the session somewhere for
the executor to find.  Session because:

PREPARE prep_rec AS SELECT record_input_func(v) FROM ( VALUES
(ROW($1::integer,$2::boolean,$3::text)) ) src (v);
EXECUTE prep_rec USING (1, true, 'hi!');

If it requires additional smarts in the executor to make this all work I
suspect the cost-benefit equations end up supporting the somewhat more
verbose but workable status-quo.

I'm not sure how { row_to_json(record) } works but SQL (including pl/pgsql)
needs to have some source of definition for what the record type should be
in reality - and that source currently is the catalogs whose rows are
locked by the planner and injected, I think, into a session cache.  The
source query in pl/pgsql defines the type for fully embedded use of the
record placeholder while the caller's function alias provides that
information for RETURNS record.  The calling query needs to provide the
same information for "CREATE FUNCTION func( arg1 record )" since the body
of the pl/pgsql function needs to instantiate "arg1" with a known type as
soon as the function is entered.  It is theoretically possible to impute
the needed anonymous type from the query definition - the problem is how
and where to register that information for execution.

At least for pl/pgsql, something like "func(arg1 packed_record_bytes)",
with pl/pgsql understanding how to unpack those bytes into an anonymous
but structured record (like it would with SELECT ... INTO record_var),
seems plausible.  I would not expect pl/SQL to allow
anything of the sort as it doesn't seem compatible with the idea of
inline-ability.

Maybe the "C" code for "row_to_json" (or libpq in general) can provide
inspiration (particularly for the "pack/unpack bytes") but as I do not know
"C" I'm going to have to leave that to others.

David J.


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Jim Nasby

On 4/20/15 1:48 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:

Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.

Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.


So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE?  Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200M transactions?  I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.


How would you propose we do that?

I also think there's better ways we could handle *all* our cleanup work. 
Tuples have a definite lifespan, and there's potentially a lot of 
efficiency to be gained if we could track tuples through their stages of 
life... but I don't see any easy ways to do that.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Jim Nasby

On 4/20/15 11:11 AM, Robert Haas wrote:

On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark  wrote:

This is my point though (you're right that "flushed" isn't always the
same as eviction but that's not the important point here). Right now
we only demote when we consider buffers for eviction. But we promote
when we pin buffers. Those two things aren't necessarily happening at
the same rate and in fact are often orders of magnitude different.


I am absolutely, positively, violently in 100% agreement with this.  I
have made the same point before, but it sure is nice to hear someone
else thinking about it the same way.


+1


What I'm saying is that we should demote a buffer every time we
promote a buffer. So every time we pin a buffer we should advance the
clock a corresponding amount. (I know I'm being intentionally vague
about what the corresponding amount is.) The important thing is that
the two should be tied together.


Yes, absolutely.  If you tilt your head the right way, my proposal of
limiting the number of promotions per clock sweep has the effect of
tying buffer demotion and buffer promotion together much more tightly
than is the case right now.  You are limited to 2 promotions per
demotion; and practically speaking not all buffers eligible to be
promoted will actually get accessed, so the number of promotions per
demotion will in reality be somewhere between 0 and 2.  Ideally it
would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
at present.  Which is not to say there isn't some other idea that is
better still.


I think that would help, but it still leaves user backends trying to 
advance the clock, which is quite painful. Has anyone tested running the 
clock in the background? We need a wiki page with all the ideas that 
have been tested around buffer management...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
> Attached WIP patch adds Frozen Map which enables us to avoid whole
> table vacuuming even when full scan is required: preventing XID
> wraparound failures.
> 
> Frozen Map is a bitmap with one bit per heap page, and quite similar
> to Visibility Map. A set bit means that all tuples on heap page are
> completely frozen, therefore we don't need to do vacuum freeze that
> page.
> A bit is set when vacuum(or autovacuum) figures out that all tuples on
> corresponding heap page are completely frozen, and a bit is cleared
> when INSERT and UPDATE(only new heap page) are executed.

So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE?  Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200M transactions?  I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
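
To illustrate the structure being proposed (which mirrors the existing visibility map), here is a hedged sketch of a one-bit-per-heap-page map; the function names and the assumption that the whole map fits in one contiguous buffer are illustrative only.

#include "postgres.h"

/* Set or clear the "all tuples frozen" bit for one heap page. */
static void
frozen_map_set(uint8 *map, BlockNumber heapBlk, bool frozen)
{
    uint32  byteno = heapBlk / 8;
    uint8   mask = 1 << (heapBlk % 8);

    if (frozen)
        map[byteno] |= mask;    /* vacuum found every tuple on the page frozen */
    else
        map[byteno] &= ~mask;   /* an INSERT/UPDATE touched the page again */
}

static bool
frozen_map_test(const uint8 *map, BlockNumber heapBlk)
{
    return (map[heapBlk / 8] & (1 << (heapBlk % 8))) != 0;
}

An anti-wraparound vacuum could then skip any heap page whose bit is set, which is the point of the proposal being debated here.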


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-20 Thread Robert Haas
On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote
 wrote:
> On 08-04-2015 PM 12:46, Amit Kapila wrote:
>> Going forward, I think we can improve the same if we decide not to shutdown
>> parallel workers till postmaster shutdown once they are started and
>> then just allocate them during executor-start phase.
>
> I wonder if it makes sense to invent the notion of a global pool of workers
> with configurable number of workers that are created at postmaster start and
> destroyed at shutdown and requested for use when a query uses parallelizable
> nodes.

Short answer: Yes, but not for the first version of this feature.

Longer answer: We can't actually very reasonably have a "global" pool
of workers so long as we retain the restriction that a backend
connected to one database cannot subsequently disconnect from it and
connect to some other database instead.  However, it's certainly a
good idea to reuse the same workers for subsequent operations on the
same database, especially if they are also by the same user.  At the
very minimum, it would be good to reuse the same workers for
subsequent operations within the same query, instead of destroying the
old ones and creating new ones.  Notwithstanding the obvious value of
all of these ideas, I don't think we should do any of them for the
first version of this feature.  This is too big a thing to get perfect
on the first try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-20 Thread Robert Haas
On Wed, Apr 8, 2015 at 3:34 AM, David Rowley  wrote:
> On 8 April 2015 at 14:24, Robert Haas  wrote:
>> I think one of the philosophical questions that has to be answered
>> here is "what does it mean to talk about the cost of a parallel
>> plan?".  For a non-parallel plan, the cost of the plan means both "the
>> amount of effort we will spend executing the plan" and also "the
>> amount of time we think the plan will take to complete", but those two
>> things are different for parallel plans.  I'm inclined to think it's
>> right to view the cost of a parallel plan as a proxy for execution
>> time, because the fundamental principle of the planner is that we pick
>> the lowest-cost plan.  But there also clearly needs to be some way to
>> prevent the selection of a plan which runs slightly faster at the cost
>> of using vastly more resources.
>
> I'd agree with that as far as CPU costs, or maybe I'd just disagree with the
> alternative, as if we costed in <cost> * <number of workers> then we'd never choose a parallel plan, as by the time we costed
> in tuple communication costs between the processes a parallel plan would
> always cost more than the serial equivalent. I/O costs are different, I'd
> imagine these shouldn't be divided by the estimated number of workers.

It's hard to say.  If the I/O is from the OS buffer cache, then
there's no reason why several workers can't run in parallel.   And
even if it's from the actual storage, we don't know what degree of I/O
parallelism will be possible.  Maybe effective_io_concurrency should
play into the costing formula somehow, but it's not very clear to me
that captures the information we care about.  In general, I'm not sure
how common it is for the execution speed of a sequential scan to be
limited by I/O.

For example, on a pgbench database, scale factor 300, on a POWERPC
machine provided by IBM for performance testing (thanks, IBM!) a
cached read of the pgbench_accounts files took 1.122 seconds.  After
dropping the caches, it took 10.427 seconds.  "select * from
pgbench_accounts where abalance > 3" took 10.244 seconds with a
cold cache and 5.029 seconds with a warm cache.  So on this particular
hardware, on this particular test, parallelism is useless if the cache
is cold, but it could be right to use ~4-5 processes for the scan if
the cache is warm.  However, we have no way of knowing whether the
cache will be cold or warm at execution time.

This isn't a new problem.  As it is, the user has to set seq_page_cost
and random_page_cost based on either a cold-cache assumption or a
warm-cache assumption, and if they guess wrong, their costing
estimates will be off (on this platform, on this test case) by 4-5x.
That's pretty bad, and it's totally unclear to me what to do about it.
I'm guessing it's unclear to other people, too, or we would likely
have done something about it by now.

>> Some ideas for GUCs:
>>
>> max_parallel_degree = The largest number of processes we'll consider
>> using for a single query.
>> min_parallel_speedup = The minimum percentage by which a parallel path
>> must be cheaper (in terms of execution time) than a non-parallel path
>> in order to survive.  I'm imagining the default here might be
>> something like 15%.
>> min_parallel_speedup_per_worker = Like the previous one, but per
>> worker.  e.g. if this is 5%, which might be a sensible default, then a
>> plan with 4 workers must be at least 20% better to survive, but a plan
>> using only 2 workers only needs to be 10% better.
>
> max_parallel_degree feels awfully like it would have to be set
> conservatively, similar to how work_mem is today. Like with work_mem, during
> quiet periods it sure would be nice if it could magically increase.

Absolutely.  But, similar to work_mem, that's a really hard problem.
We can't know at plan time how much work memory, or how many CPUs,
will be available at execution time.  And even if we did, it need not
be constant throughout the whole of query execution.  It could be that
when execution starts, there's lots of memory available, so we do a
quicksort rather than a tape-sort.  But midway through the machine
comes under intense memory pressure and there's no way for the system
to switch strategies.

Now, having said that, I absolutely believe that it's correct for the
planner to make the initial decisions in this area.  Parallelism
changes the cost of execution nodes, and it's completely wrong to
assume that this couldn't alter planner decisions at higher levels of
the plan tree.  At the same time, it's pretty clear that it would be a
great thing for the executor to be able to adjust the strategy if the
planner's assumptions don't pan out, or if conditions have changed.

For example, if we choose a seq-scan-sort-and-filter over an
index-scan-and-filter thinking that we'll be able to do a quicksort,
and then it turns out that we're short on memory, it's too late to
switch gears and adopt the index-scan-and-filter plan after all.
That's long since been discarded. 
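
To make the GUC thresholds quoted earlier in this message concrete, here is a hedged sketch of an acceptance test for parallel paths. The way the flat and per-worker thresholds combine (taking the stricter of the two) is an assumption made for illustration, not something stated in the thread.

#include "postgres.h"

/*
 * Sketch: would a parallel path survive against the best serial path?
 * min_parallel_speedup and min_parallel_speedup_per_worker are the GUCs
 * proposed above, expressed as fractions (e.g. 0.15 and 0.05).
 */
static bool
parallel_path_survives(double serial_cost, double parallel_cost, int nworkers,
                       double min_parallel_speedup,
                       double min_parallel_speedup_per_worker)
{
    double  required = min_parallel_speedup;
    double  per_worker = nworkers * min_parallel_speedup_per_worker;

    if (per_worker > required)
        required = per_worker;  /* e.g. 4 workers at 5% each => 20% better */

    return parallel_cost <= serial_cost * (1.0 - required);
}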

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Simon Riggs
On 20 April 2015 at 18:33, Bruce Momjian  wrote:


> Also, I am also not sure we should be designing features at this stage
> in our release process.
>

I see this more as a process of gaining approval. I don't think patches at
the back of the queue should get the "it's too late" treatment just because
they are at the back of the queue. They are logically all at the same
stage. There should be a way to allow people that show patience and respect
for the process to get the same timeshare as those that push their patches
daily.

Anyway, in this case, the patch conflicts with other things going in now,
so changing things isn't really sensible for this release, in terms of my
time.

We clearly need to do a better job of piggybacking actions on a dirty
block. The discussion has been about whether to do early cleanup, but we
should also consider post-access cleanup if we set hint bits or dirtied the
block in other ways.

Since we have many votes in favour of change in this area I'll post a new
version and look for an early review/commit for next release.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Bruce Momjian
On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote:
> That is how we arrive at the idea of a cleanup limit, further enhanced by a
> limit that applies only to dirtying clean blocks, which we have 4? recent 
> votes
> in favour of.
> 
> I would personally be in favour of a parameter to control the limit, since
> whatever we chose is right/wrong depending upon circumstances. I am however
> comfortable with not having a parameter if people think it is hard to tune
> that, which I agree it would be, hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size. 
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes.  If there are future writes, they would dirty the pages and
cause even more pruning, but the 5% gives me the maximum pruning number
of SELECTs.  If there aren't another 19 SELECTs, do I care if the table
is pruned or not?  Probably not.  Measuring in page count doesn't do
that, and a large table could receive millions of selects before being
fully cleaned.

Also, I am also not sure we should be designing features at this stage
in our release process.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] optimizing vacuum truncation scans

2015-04-20 Thread Jim Nasby

On 4/20/15 1:50 AM, Jeff Janes wrote:

Shouldn't completely empty pages be set as all-visible in the VM? If
so, can't we just find the largest not-all-visible page and move
forward from there, instead of moving backwards like we currently do?


If the entire table is all-visible, we would be starting from the
beginning, even though the beginning of the table still has read only
tuples present.


Except we'd use it in conjunction with nonempty_pages. IIRC, that's set 
to the last page that vacuum saw data on. If any page after that got 
written to after vacuum visited it, the VM bit would be cleared. So 
after we acquire the exclusive lock, AFAICT it's safe to just scan the 
VM starting with nonempty_pages.



For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...


nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed for the truncation scan.  But maybe the combination of the
two?  If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.


Right.


In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.


Yeah, but I think we'd do a LOT better with the VM idea, because we 
could immediately truncate without scanning anything.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
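
A hedged sketch of the VM-driven idea discussed above: after the access exclusive lock is taken, scan forward from nonempty_pages and keep everything up to the last page whose all-visible bit has been cleared. visibilitymap_test() is the existing API; the names and the decision to gloss over the open correctness questions are illustrative only.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

static BlockNumber
vm_truncation_point(Relation rel, BlockNumber nonempty_pages, BlockNumber nblocks)
{
    BlockNumber blkno;
    BlockNumber keep_upto = nonempty_pages;     /* truncate the relation to this many blocks */
    Buffer      vmbuffer = InvalidBuffer;

    for (blkno = nonempty_pages; blkno < nblocks; blkno++)
    {
        /* a cleared bit means someone wrote to the page after vacuum saw it */
        if (!visibilitymap_test(rel, blkno, &vmbuffer))
            keep_upto = blkno + 1;
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return keep_upto;
}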


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-20 Thread Robert Haas
On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila  wrote:
> One disadvantage of retaining parallel-paths could be that it can
> increase the number of combinations planner might need to evaluate
> during planning (in particular during join path evaluation) unless we
> do some special handling to avoid evaluation of such combinations.

Yes, that's true.  But the overhead might not be very much.  In the
common case, many baserels and joinrels will have no parallel paths
because the non-parallel paths is known to be better anyway.  Also, if
parallelism does seem to be winning, we're probably planning a query
that involves accessing a fair amount of data, so a little extra
planner overhead may not be so bad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-20 Thread Robert Haas
On Thu, Apr 9, 2015 at 5:33 PM, Bruce Momjian  wrote:
> On Thu, Apr  9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote:
>> Bruce Momjian wrote:
>>
>> > Should this be listed in the release notes as a backward-incompatibility?
>>
>> Isn't this a backpatchable bug fix?
>
> Uh, I don't think so.  I think users are used to the existing behavior
> and changing it on them will cause more harm than good.  Also, we have
> had zero field reports about this problem.

I agree.  This should not be back-patched, but fixing it in master seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Spinlock Documentation

2015-04-20 Thread Robert Haas
On Sat, Apr 11, 2015 at 12:03 PM, Artem Luzyanin  wrote:
> Thank you again for your feedback. I have improved the patch with your
> suggestions. Please let me know what you think and if I can do anything
> else.
>
> Current CommitFest link for this patch is:
> https://commitfest.postgresql.org/5/208/

Some review comments:

- The first hunk in s_lock.h touches only whitespace.  Changing the
space to a tab on the "Usually" line would make sense for consistency,
but adding a trailing space to the "override them" line does not.

- As Tom basically said before, I think the "File layout" block
comment will just get out of date and be a maintenance annoyance to
future updaters of this file.   It's not really that hard to see the
structure of the file just by going through it, so I don't think this
is worthwhile.

- Similarly, adding all of the "Currently implemented" lines looks
useless to me.  Why can't somebody see that from just reading the code
itself?

Overall, I'm not seeing much point to this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2015-04-20 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> * The transformations of the arrays in get_state() and set_state() functions
> are a bit complicated. The seqam_get_state() function returns two C arrays,
> and pg_sequence_get_state() turns them into a text[] array. Why not
> construct the text[] array directly in the AM? I guess it's a bit more
> convenient to the AM, if the pg_sequence_get_state() do that, but if that's
> an issue, you could create generic helper function to turn two C string
> arrays into text[], and call that from the AM.

Um, see strlist_to_textarray() in objectaddress.c if you do that.  Maybe
we need some generic place to store that kind of helper function.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
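
For reference, a hedged sketch of the kind of generic helper being discussed: turning two C string arrays (keys and values) into a single text[] using the existing construct_array() machinery. The function name and the interleaved layout are illustrative, not what strlist_to_textarray() or the patch actually do.

#include "postgres.h"
#include "catalog/pg_type.h"
#include "utils/array.h"
#include "utils/builtins.h"

static ArrayType *
kv_cstrings_to_text_array(const char **keys, const char **values, int npairs)
{
    Datum  *datums = palloc(npairs * 2 * sizeof(Datum));
    int     i;

    for (i = 0; i < npairs; i++)
    {
        /* interleave key/value pairs into one flat text[] */
        datums[2 * i] = CStringGetTextDatum(keys[i]);
        datums[2 * i + 1] = CStringGetTextDatum(values[i]);
    }

    return construct_array(datums, npairs * 2, TEXTOID,
                           -1, false, 'i');     /* typlen, byval, align for text */
}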


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure  wrote:
>> Hmm, interesting point.  It's possible that we'd still have problems
>> with everything maxing out at 32 on some workloads, but at least it'd
>> be a little harder to max out at 32 than at 5.
>
> Do we have any reproducible test cases to evaluate these assumptions?

The particular point that you are responding to here is easy to
reproduce.  Just create any workload that fits in shared_buffers - a
small pgbench database, for example - and access stuff.  No usage
counts will go down, but every access will drive usage counts up.
Eventually everything will hit any maximum you care to install, and
actually I don't think it takes very long.  You can use pg_buffercache
to see the results.

>  I haven't looked at this stuff for a while, but my main issue with
> the clock sweep was finding sweep heavy cases that did not also have
> trouble with the buffer mapping locks (although the facts on the
> ground may have changed since then).

We increased the number of buffer mapping locks to 128, so that
problem should be considerably ameliorated now.  But it's not totally
gone, as demonstrated by Andres's experiments with my chash patch.

There was also a patch that eliminated BufFreelistLock in favor of a
spinlock held for much shorter time periods, and then Andres took that
one step further and made it use atomics.  That used to be a
*terrible* bottleneck on eviction-heavy workloads and is now much
improved.  More work may remain to be done, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark  wrote:
> This is my point though (you're right that "flushed" isn't always the
> same as eviction but that's not the important point here). Right now
> we only demote when we consider buffers for eviction. But we promote
> when we pin buffers. Those two things aren't necessarily happening at
> the same rate and in fact are often orders of magnitude different.

I am absolutely, positively, violently in 100% agreement with this.  I
have made the same point before, but it sure is nice to hear someone
else thinking about it the same way.

> So
> it makes sense that we end up with a lot of buffers promoted all the
> way to the most recently used ntile and then have to do n passes
> around the clock and have no good information about which buffer to
> evict.

Right.

> What I'm saying is that we should demote a buffer every time we
> promote a buffer. So every time we pin a buffer we should advance the
> clock a corresponding amount. (I know I'm being intentionally vague
> about what the corresponding amount is.) The important thing is that
> the two should be tied together.

Yes, absolutely.  If you tilt your head the right way, my proposal of
limiting the number of promotions per clock sweep has the effect of
tying buffer demotion and buffer promotion together much more tightly
than is the case right now.  You are limited to 2 promotions per
demotion; and practically speaking not all buffers eligible to be
promoted will actually get accessed, so the number of promotions per
demotion will in reality be somewhere between 0 and 2.  Ideally it
would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
at present.  Which is not to say there isn't some other idea that is
better still.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
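
A hedged sketch of the coupling discussed above: every pin ("promotion") also advances the clock hand and decays the buffer it lands on ("demotion"), so the two rates are tied together by construction. The data structures and the one-for-one ratio are illustrative only.

#define NBUFFERS 1024           /* illustrative pool size */

static int  usage_count[NBUFFERS];
static int  clock_hand = 0;

static void
buffer_pinned(int buf_id)
{
    if (usage_count[buf_id] < 5)
        usage_count[buf_id]++;          /* promote the pinned buffer */

    if (usage_count[clock_hand] > 0)
        usage_count[clock_hand]--;      /* demote whatever the hand points at */
    clock_hand = (clock_hand + 1) % NBUFFERS;
}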


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Merlin Moncure
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas  wrote:
> On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
>  wrote:
>> I've been following this thread from the side with interest and got
>> twigged by the point about loss of information.  If you'd like better
>> information about relative ages, you can achieve this by raising the
>> cap on the usage count and dividing (or right-shifting) each sweep.
>
> Yeah, I thought about that, too.  It might be worth experimenting with.
>
>> This would allow you to remember much more about the relative
>> worth of often used pages.  With a cap of 32 you'd have the same effect
>> as now where after 5 sweeps the buffer is evicted.  Mathematically the
>> count would converge to the number of times the block is used per
>> sweep.
>
> Hmm, interesting point.  It's possible that we'd still have problems
> with everything maxing out at 32 on some workloads, but at least it'd
> be a little harder to max out at 32 than at 5.

Do we have any reproducible test cases to evaluate these assumptions?
 I haven't looked at this stuff for a while, but my main issue with
the clock sweep was finding sweep heavy cases that did not also have
trouble with the buffer mapping locks (although the facts on the
ground may have changed since then).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
 wrote:
> I've been following this thread from the side with interest and got
> twigged by the point about loss of information.  If you'd like better
> information about relative ages, you can achieve this by raising the
> cap on the usage count and dividing (or right-shifting) each sweep.

Yeah, I thought about that, too.  It might be worth experimenting with.

> This would allow you to remember much more about the relative
> worth of often used pages.  With a cap of 32 you'd have the same effect
> as now where after 5 sweeps the buffer is evicted.  Mathematically the
> count would converge to the number of times the block is used per
> sweep.

Hmm, interesting point.  It's possible that we'd still have problems
with everything maxing out at 32 on some workloads, but at least it'd
be a little harder to max out at 32 than at 5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
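
To illustrate the quoted idea, a hedged sketch with a higher usage-count cap and a halving pass once per sweep; the cap of 32 comes from the message above, everything else is illustrative.

#define USAGE_COUNT_CAP 32

/* Called on every buffer access (pin). */
static void
usage_count_bump(int *usage_count)
{
    if (*usage_count < USAGE_COUNT_CAP)
        (*usage_count)++;
}

/* Called once per completed clock sweep: halve every counter. */
static void
usage_count_decay(int *usage_counts, int nbuffers)
{
    int     i;

    for (i = 0; i < nbuffers; i++)
        usage_counts[i] >>= 1;  /* count converges toward uses-per-sweep */
}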


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting src/test/modules in MSVC builds

2015-04-20 Thread Alvaro Herrera
Michael Paquier wrote:

> And currawong is satisfied with this patch and the new buildfarm code,
> test modules being run as testmodules-install-check-C:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=currawong&dt=2015-04-18%2014%3A51%3A14
> So this closes the loop for this thread.

Yay!  Thanks, Michael and Andrew.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-04-20 Thread David Steele
On 4/20/15 4:40 AM, Sawada Masahiko wrote:
> 
> Thank you for updating the patch.
> 
> One question regarding the v7 (or later) patch:
> What is the difference between OBJECT logging and SESSION logging?

In brief, session logging can log anything that happens in a user
session while object logging only targets DML and SELECT on selected
relations.

The pg_audit.log_relation setting is intended to mimic the prior
behavior of pg_audit before object logging was added.

In general, I would not expect pg_audit.log = 'read, write' to be used
with pg_audit.role.  In other words, session logging of DML and SELECT
should probably not be turned on at the same time as object logging is
in use.  Object logging is intended to be a fine-grained version of
pg_audit.log = 'read, write' (one or both).

> I used the v9 patch with "pg_audit.log_relation = on", and got quite
> similar but slightly different log output, as follows.
> 
> =# select * from hoge, bar where hoge.col = bar.col;
> NOTICE:  AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,"select *
> from hoge, bar where hoge.col = bar.col;"
> NOTICE:  AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,"select *
> from hoge, bar where hoge.col = bar.col;"
> NOTICE:  AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,"select * from
> hoge, bar where hoge.col = bar.col;"
> NOTICE:  AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,"select *
> from hoge, bar where hoge.col = bar.col;"
> 
> The behaviour of SESSION log is similar to OBJECT log now, and SESSION
> log per session (i.e., pg_audit.log_relation = off) is also similar to
> 'log_statement = all' (enhancing log_statement might be enough).
> So I couldn't understand the need for SESSION logging.

Session logging is quite different from 'log_statement = all'.  See:

http://www.postgresql.org/message-id/552323b2.8060...@pgmasters.net

and/or the "Why pg_audit?" section of the pg_audit documentation.

I agree that it may make sense in the future to merge session logging
into log_statements, but for now it does provide important additional
functionality for creating audit logs.

Regards,
-- 
- David Steele
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] alternative compression algorithms?

2015-04-20 Thread Tomas Vondra



On 04/20/15 05:07, Andres Freund wrote:

Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote:

The reason why I'm asking about this is the multivariate statistics patch -
while optimizing the planning overhead, I realized that considerable amount
of time is spent decompressing the statistics (serialized as bytea), and
using an algorithm with better decompression performance (lz4 comes to mind)
would help a lot. The statistics may be a few tens/hundreds kB, and in the
planner every millisecond counts.


I've a hard time believing that a different compression algorithm is
 going to be the solution for that. Decompressing, quite possibly
repeatedly, several hundreds of kb during planning isn't going to be
 fun. Even with a good, fast, compression algorithm.


Sure, it's not an ultimate solution, but it might help a bit. I do have 
other ideas for how to optimize this, but in the planner every millisecond
counts. Looking at 'perf top', I see pglz_decompress() in the top three.


regards
Tomas
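
For concreteness, a hedged sketch of what swapping in lz4 for the decompression step might look like. LZ4_decompress_safe() is the real liblz4 call, but how the compressed statistics blob would be stored and how its raw length would be tracked are purely assumptions here.

#include "postgres.h"
#include <lz4.h>

/*
 * Decompress a compressed statistics blob of src_len bytes into a buffer
 * known to need raw_len bytes.  Assumes the raw length was stored alongside
 * the compressed data, which is an illustrative convention only.
 */
static char *
decompress_stats_lz4(const char *src, int src_len, int raw_len)
{
    char   *dst = palloc(raw_len);
    int     n = LZ4_decompress_safe(src, dst, src_len, raw_len);

    if (n < 0 || n != raw_len)
        elog(ERROR, "lz4 decompression of statistics failed");

    return dst;
}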


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting TAP tests with MSVC and Windows

2015-04-20 Thread Michael Paquier
On Sun, Apr 19, 2015 at 10:01 PM, Michael Paquier
 wrote:
> Note as well that this patch uses the following patches fixing
> independent issues:
> ...

Attached is v4. I added a switch in config.pl to be consistent with
./configure and --enable-tap-tests.
-- 
Michael
From f532bbd2522b3a1784038d9863325d055fc445bf Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 20 Apr 2015 04:57:37 -0700
Subject: [PATCH 1/2] Add support for TAP tests on Windows

Nodes initialized by the TAP tests use SSPI to securely perform the
tests, and test scripts are patched in a couple of places to support
Windows grammar. In the case of MSVC, tests can be run with this
command:
vcregress tapcheck
---
 .gitignore |   3 +
 doc/src/sgml/install-windows.sgml  |   1 +
 src/Makefile.global.in |   2 +-
 src/bin/initdb/t/001_initdb.pl |  18 ++--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  79 +---
 src/bin/pg_controldata/t/001_pg_controldata.pl |   5 +-
 src/bin/pg_ctl/t/001_start_stop.pl |  14 ++-
 src/bin/pg_ctl/t/002_status.pl |  12 ++-
 src/bin/pg_rewind/RewindTest.pm| 119 -
 src/bin/scripts/t/020_createdb.pl  |   3 +
 src/test/perl/TestLib.pm   |  25 --
 src/tools/msvc/Solution.pm |   1 +
 src/tools/msvc/config_default.pl   |   1 +
 src/tools/msvc/vcregress.pl|  60 -
 14 files changed, 271 insertions(+), 72 deletions(-)

diff --git a/.gitignore b/.gitignore
index 8d3af50..3cd37fe 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,3 +36,6 @@ lib*.pc
 /pgsql.sln.cache
 /Debug/
 /Release/
+
+# Generated by tests
+/tmp_check/
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index d154b44..2047790 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -439,6 +439,7 @@ $ENV{CONFIG}="Debug";
 vcregress modulescheck
 vcregress ecpgcheck
 vcregress isolationcheck
+vcregress tapcheck
 vcregress upgradecheck
 
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 4b06fc2..b27ed78 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -324,7 +324,7 @@ endef
 define prove_check
 $(MKDIR_P) tmp_check/log
 $(MAKE) -C $(top_builddir) DESTDIR='$(CURDIR)'/tmp_check/install install >'$(CURDIR)'/tmp_check/log/install.log 2>&1
-cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+cd $(srcdir) && TESTDIR='$(CURDIR)' TESTREGRESS='$(top_builddir)/src/test/regress/pg_regress' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
 else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index d12be84..0865107 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -1,6 +1,8 @@
 use strict;
 use warnings;
+use Config;
 use TestLib;
+use File::Path qw(rmtree);
 use Test::More tests => 19;
 
 my $tempdir = TestLib::tempdir;
@@ -18,27 +20,31 @@ command_fails([ 'initdb', '-S', "$tempdir/data3" ],
 mkdir "$tempdir/data4" or BAIL_OUT($!);
 command_ok([ 'initdb', "$tempdir/data4" ], 'existing empty data directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
-
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'separate xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_fails(
 	[ 'initdb', "$tempdir/data", '-X', 'pgxlog' ],
 	'relative xlog directory not allowed');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing empty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 mkdir "$tempdir/pgxlog/lost+found";
 command_fails([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing nonempty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-T', 'german', "$tempdir/data" ],
 	'select default dictionary');
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 7e9a776..4cb1b5a 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -1,8 +1,9 @@
 use strict;
 use warnings;
 use Cwd;
+use Config;
 use TestLib;
-use Test::More tests => 35;
+u

Re: [HACKERS] Sequence Access Method WIP

2015-04-20 Thread Petr Jelinek

On 20/04/15 12:05, Andres Freund wrote:

On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote:

With the patch, the pg_class.relam column references the pg_seqam table for
sequences, but pg_indexam for indexes. I believe it's the first instance
where we reuse a "foreign key" column like that. It's not a real foreign
key, of course - that wouldn't work with a real foreign key at all - but
it's a bit strange. That makes me a bit uncomfortable. How do others feel
about that?


Hm. I'd modeled it more as an extension of the 'relkind' column
mentally. I.e. it further specifies how exactly the relation is
behaving. The fact that the field has been added to pg_class and not
pg_index, combined with it not having "index" in its name, makes me think
that it was actually intended to be used the way it's done in the patch.



That's how I think about it too. It's also short for access method, 
nothing really suggests to me that it should be index specific by design.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2015-04-20 Thread Andres Freund
On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote:
> With the patch, the pg_class.relam column references the pg_seqam table for
> sequences, but pg_indexam for indexes. I believe it's the first instance
> where we reuse a "foreign key" column like that. It's not a real foreign
> key, of course - that wouldn't work with a real foreign key at all - but
> it's a bit strange. That makes me a bit uncomfortable. How do others feel
> about that?

Hm. I'd modeled it more as an extension of the 'relkind' column
mentally. I.e. it further specifies how exactly the relation is
behaving. The fact that the field has been added to pg_class and not
pg_index, combined with it not having "index" in its name, makes me think
that it was actually intended to be used the way it's done in the patch.

It's not the first column that behaves that way btw, at least pg_depend
comes to mind.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2015-04-20 Thread Heikki Linnakangas

On 03/15/2015 09:07 PM, Petr Jelinek wrote:

Slightly updated version of the patch.

Mainly rebased against current master (there were several conflicts) and
fixed some typos, no real functional change.

I also attached initial version of the API sgml doc.


Thanks!

With the patch, the pg_class.relam column references the pg_seqam table 
for sequences, but pg_indexam for indexes. I believe it's the first 
instance where we reuse a "foreign key" column like that. It's not a 
real foreign key, of course - that wouldn't work with a real foreign key 
at all - but it's a bit strange. That makes me a bit uncomfortable. How 
do others feel about that?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-04-20 Thread Sawada Masahiko
On Thu, Apr 16, 2015 at 2:34 AM, David Steele  wrote:
> On 4/15/15 11:30 AM, Sawada Masahiko wrote:
>> On Wed, Apr 15, 2015 at 10:52 AM, Sawada Masahiko  
>> wrote:
>>> I tested v8 patch with CURSOR case I mentioned before, and got
>>> segmentation fault again.
>>> Here are log messages in my environment,
>>>
>>> =# select test();
>>>  LOG:  server process (PID 29730) was terminated by signal 11:
>>> Segmentation fault
>>> DETAIL:  Failed process was running: select test();
>>> LOG:  terminating any other active server processes
>>> WARNING:  terminating connection because of crash of another server process
>>> DETAIL:  The postmaster has commanded this server process to roll back
>>> the current transaction and exit, because another server process
>>> exited abnormally and possibly corrupted shared memory.
>>> HINT:  In a moment you should be able to reconnect to the database and
>>> repeat your command.
>>> FATAL:  the database system is in recovery mode
>>
>> I investigated this problem and report my findings here.
>>
>> When we execute the test() function I mentioned, three stack items in
>> total are stored in auditEventStack.
>> Two of them are freed by stack_pop() -> stack_free() (i.e.,
>> stack_free() is called by stack_pop()).
>> One of them is freed by PortalDrop() -> .. ->
>> MemoryContextDeleteChildren() -> ... -> stack_free().
>> And it is freed at the same time as deleting pg_audit memory context,
>> and stack will be completely empty.
>>
>> But after freeing all the items, finish_xact_command() can call
>> PortalDrop(), so ExecutorEnd() can be called again.
>> pg_audit uses ExecutorEnd_hook, so postgres tries to free that item... SEGV.
>>
>> In my environment, the following change fixes it.
>>
>> --- pg_audit.c.org	2015-04-15 14:21:07.541866525 +0900
>> +++ pg_audit.c	2015-04-15 11:36:53.758877339 +0900
>> @@ -1291,7 +1291,7 @@
>>  standard_ExecutorEnd(queryDesc);
>>
>>  /* Pop the audit event off the stack */
>> -if (!internalStatement)
>> +if (!internalStatement && auditEventStack != NULL)
>>  {
>>  stack_pop(auditEventStack->stackId);
>>  }
>>
>> It might be good idea to add Assert() at before calling stack_pop().
>>
>> I'm not sure this change is exactly correct, but I hope this
>> information helps you.
>
> I appreciate you taking the time to look - this is the same conclusion I
> came to.  I also hardened another area that I thought might be vulnerable.
>
> I've seen this problem with explicit cursors before (and fixed it in
> another place a while ago).  The memory context that is current in
> ExecutorStart is freed before ExecutorEnd is called only in this case as
> far as I can tell.  I'm not sure this is very consistent behavior.
>
> I have attached patch v9 which fixes this issue as you suggested, but
> I'm not completely satisfied with it.  It seems like there could be an
> unintentional pop from the stack in a case of deeper nesting.  This
> might not be possible but it's hard to disprove.
>
> I'll think about it some more, but meanwhile this patch addresses the
> present issue.

Thank you for updating the patch.

One question regarding the v7 (or later) patch:
What is the difference between OBJECT logging and SESSION logging?

I used the v9 patch with "pg_audit.log_relation = on", and got quite
similar but slightly different log output, as follows.

=# select * from hoge, bar where hoge.col = bar.col;
NOTICE:  AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,"select *
from hoge, bar where hoge.col = bar.col;"
NOTICE:  AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,"select *
from hoge, bar where hoge.col = bar.col;"
NOTICE:  AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,"select * from
hoge, bar where hoge.col = bar.col;"
NOTICE:  AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,"select *
from hoge, bar where hoge.col = bar.col;"

The behaviour of SESSION log is similar to OBJECT log now, and SESSION
log per session (i.e., pg_audit.log_relation = off) is also similar to
'log_statement = all' (enhancing log_statement might be enough).
So I couldn't understand the need for SESSION logging.

Regards,

---
Sawada Masahiko
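
For what it's worth, a hedged sketch of the Assert-before-pop idea mentioned above, using pg_audit's own names from the quoted diff; whether to keep the run-time NULL guard as well as the Assert is a judgment call, not something decided in this thread.

    /* Pop the audit event off the stack */
    if (!internalStatement)
    {
        Assert(auditEventStack != NULL);    /* should never be empty here */
        if (auditEventStack != NULL)        /* belt-and-suspenders for production builds */
            stack_pop(auditEventStack->stackId);
    }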


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2015-04-20 Thread Svenne Krap
Oh, and I build it on top of f92fc4c95ddcc25978354a8248d3df22269201bc

On 20-04-2015 10:36, Svenne Krap wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:tested, passed
>
> Hi, 
>
> I have (finally) found time to review this. 
>
> The syntax is as per spec as I can see, and the queries I have tested have 
> all produced the correct output. 
>
> The documentation looks good and is clear.
>
> I think it is spec compliant, but I am not used enough to the spec to be 
> sure. Also I have not understood the function of the <set quantifier> 
> (DISTINCT, ALL) part in the GROUP BY clause (and hence not tested it). Hence I 
> haven't marked the spec compliant part.
>
> The installcheck-world fails, but in src/pl/tcl/results/pltcl_queries.out (a 
> sorting problem when looking at the diff) which should be unrelated to GSP. I 
> don't know enough of the check to know if it has already run the GSP tests..
>
> I have also been running a few tests on some real data. This is run on my 
> laptop with 32 GB of memory and a fast SSD. 
>
> The first dataset is a join between a data table of 472 MB (4.3 Mrows) and a 
> tiny multi-column lookup table. I am returning a count(*).
> Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and 
> ROLLUP both work fine, and if work_mem is large enough it slightly beats 
> the handwritten "union all" equivalent (runtimes of 7.6 to 7.7 
> seconds). If work_mem is at the default 4MB, the union-all equivalent (UAE) 
> beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2 (UAE) 
> seconds). 
>
> The other query is on the same data table as before, but with three "columns" 
> (two calculated and one natural) for a cube. I am returning a count(*). 
> The first column is "extract year from date column".
> The second column is "divide a value by something and truncate" (i.e. make 
> buckets).
> The third column is a literal integer column. 
> Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2). 
> Nothing obvious about why in the explain (analyze, buffers, costs, timing).
>
> I have the explains, but as the dataset is semi-private and I don't have any 
> easy way to edit out names in it, I will send it on request (non-disclosure 
> from the recipient is of course a must) and not post it on the list.
>
> I think the feature is ready to be committed, but am unsure whether I am 
> qualified to gauge that :)
>
> /Svenne
>
> The new status of this patch is: Ready for Committer
>
>



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2015-04-20 Thread Svenne Krap
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Hi, 

I have (finally) found time to review this. 

The syntax is as per spec as I can see, and the queries I have tested have all 
produced the correct output. 

The documentation looks good and is clear.

I think it is spec compliant, but I am not used enough to the spec to be sure. 
Also I have not understood the function of the <set quantifier> (DISTINCT, ALL) part 
in the GROUP BY clause (and hence not tested it). Hence I haven't marked the 
spec compliant part.

make installcheck-world fails, but only in src/pl/tcl/results/pltcl_queries.out
(a sorting problem, judging from the diff), which should be unrelated to GSP.
I don't know the check well enough to tell whether it had already run the GSP
tests at that point.

I have also been running a few tests on some real data. This is run on my 
laptop with 32 GB of memory and a fast SSD. 

The first dataset is a join between a data table of 472 MB (4.3 M rows) and
a tiny multi-column lookup table. I am returning a count(*).
Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and
ROLLUP both work fine, and if work_mem is large enough the GS query even
slightly beats the handwritten "union all" equivalent (runtimes of 7.6 vs.
7.7 seconds). With work_mem at the default 4MB, the union-all equivalent
(UAE) beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2
(UAE) seconds).
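
To make the comparison concrete (with a made-up table, not the actual
dataset), the two query shapes being compared look roughly like this:

CREATE TABLE sales (region text, country text, city text, amount numeric);

-- GROUPING SETS / ROLLUP form
SELECT region, country, city, count(*)
FROM sales
GROUP BY ROLLUP (region, country, city);

-- handwritten "union all" equivalent (UAE)
SELECT region, country, city, count(*) FROM sales GROUP BY region, country, city
UNION ALL
SELECT region, country, NULL, count(*) FROM sales GROUP BY region, country
UNION ALL
SELECT region, NULL, NULL, count(*) FROM sales GROUP BY region
UNION ALL
SELECT NULL, NULL, NULL, count(*) FROM sales;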

The other query is on the same data table as before, but with three
"columns" (two calculated and one natural) for a cube. I am returning a
count(*).
The first column is "extract year from date column",
the second is "divide a value by something and truncate" (i.e. make buckets),
and the third is a literal integer column.
Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2
seconds). Nothing obvious about why in the explain
(analyze, buffers, costs, timing).
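
The shape of that cube query, with invented table and column names standing
in for the real ones, is roughly:

SELECT extract(year FROM created) AS yr,
       trunc(amount / 1000)       AS bucket,
       category,
       count(*)
FROM facts
GROUP BY CUBE (extract(year FROM created), trunc(amount / 1000), category);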

I have the explains, but as the dataset is semi-private and I don't have any 
easy way to edit out names in it, I will send it on request (non-disclosure 
from the recipient is of course a must) and not post it on the list.

I think the feature is ready to be committed, but am unsure whether I am
qualified to gauge that :)

/Svenne

The new status of this patch is: Ready for Committer




Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Andres Freund
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> I just realized that it talks about "replication identifier" as the new
> fundamental concept. The system table is called "pg_replication_identifier".
> But that's like talking about "index identifiers", instead of just indexes,
> and calling the system table pg_index_oid.
> 
> The important concept this patch actually adds is the *origin* of each
> transaction. That term is already used in some parts of the patch. I think
> we should roughly do a search-replace of "replication identifier" ->
> "replication origin" to the patch. Or even "transaction origin".

Sounds good to me.

Greetings,

Andres Freund




Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:54 AM, Andres Freund wrote:

> I've attached a rebased patch, that adds decision about origin logging
> to the relevant XLogInsert() callsites for "external" 2 byte identifiers
> and removes the pad-reusing version in the interest of moving forward.


Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

I just realized that it talks about "replication identifier" as the new 
fundamental concept. The system table is called 
"pg_replication_identifier". But that's like talking about "index 
identifiers", instead of just indexes, and calling the system table 
pg_index_oid.


The important concept this patch actually adds is the *origin* of each 
transaction. That term is already used in some parts of the patch. I 
think we should roughly do a search-replace of "replication identifier" 
-> "replication origin" to the patch. Or even "transaction origin".


- Heikki





Re: [HACKERS] inherit support for foreign tables

2015-04-20 Thread Etsuro Fujita
On 2015/04/16 16:05, Etsuro Fujita wrote:
> On 2015/03/23 2:57, Tom Lane wrote:
>> Etsuro Fujita  writes:
>>> [ fdw-inh-8.patch ]
>>
>> I've committed this with some substantial rearrangements, notably:
> 
>> * As I mentioned earlier, I got rid of a few unnecessary restrictions on
>> foreign tables so as to avoid introducing warts into inheritance behavior.
>> In particular, we now allow NOT VALID CHECK constraints (and hence ALTER
>> ... VALIDATE CONSTRAINT), ALTER SET STORAGE, and ALTER SET WITH/WITHOUT
>> OIDS.  These are probably no-ops anyway for foreign tables, though
>> conceivably an FDW might choose to implement some behavior for STORAGE
>> or OIDs.
> 
> I agree with you on this point.  However, ISTM there is a bug in
> handling OIDs on foreign tables; while we now allow for ALTER SET
> WITH/WITHOUT OIDS, we still don't allow the default_with_oids parameter
> for foreign tables.  I think that since CREATE FOREIGN TABLE should be
> consistent with ALTER FOREIGN TABLE, we should also allow the parameter
> for foreign tables.  Attached is a patch for that.

I also updated docs.  Attached is an updated version of the patch.
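
With the patch, the intended behaviour is roughly as follows (the foreign
server name is just a placeholder, and this is a sketch rather than output
from the patched build):

-- assuming a foreign server "myserver" has already been defined
SET default_with_oids = on;
CREATE FOREIGN TABLE ft (a int) SERVER myserver;
-- ft should now have an OID column, i.e.
SELECT relhasoids FROM pg_class WHERE relname = 'ft';
-- should return true, just as if WITH OIDS had been specified.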

Best regards,
Etsuro Fujita
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 6745,6752  dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'


 
! This controls whether CREATE TABLE and
! CREATE TABLE AS include an OID column in
  newly-created tables, if neither WITH OIDS
  nor WITHOUT OIDS is specified. It also
  determines whether OIDs will be included in tables created by
--- 6745,6753 


 
! This controls whether CREATE TABLE,
! CREATE TABLE AS and
! CREATE FOREIGN TABLE include an OID column in
  newly-created tables, if neither WITH OIDS
  nor WITHOUT OIDS is specified. It also
  determines whether OIDs will be included in tables created by
*** a/doc/src/sgml/ref/create_foreign_table.sgml
--- b/doc/src/sgml/ref/create_foreign_table.sgml
***
*** 293,298  CHECK ( expression )
--- 293,304 
  responsibility to ensure that the constraint definition matches
  reality.
 
+ 
+   
+To add OIDs to the table created by CREATE FOREIGN TABLE,
+enable the  configuration
+variable.
+   
   
  
   
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 580,586  DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
  	descriptor = BuildDescForRelation(schema);
  
  	localHasOids = interpretOidsOption(stmt->options,
! 	   (relkind == RELKIND_RELATION));
  	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
  
  	/*
--- 580,587 
  	descriptor = BuildDescForRelation(schema);
  
  	localHasOids = interpretOidsOption(stmt->options,
! 	   (relkind == RELKIND_RELATION ||
! 		relkind == RELKIND_FOREIGN_TABLE));
  	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
  
  	/*



Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Andres Freund
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote:
> Can you name some of the bigger problems you'd have?

Several parts of the system are O(#max_replication_slots). Having 100k
outgoing logical replication slots is going to be expensive.

Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be
the biggest problem.

Greetings,

Andres Freund




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Andres Freund
On 2015-04-20 01:04:18 -0700, Jeff Janes wrote:
> Was this reproducible?

Yes, at least with an old version of the patch.

I don't think you could see a difference using exactly that with the
newer versions which have the 5 page limit. After all it'll pretty much
never reach it.

> > That's not an unrealistic testcase.
> >
> > I'm pretty sure this could be made quite a bit more pronounced by not
> > using a uniform distribution in the pgbench runs. And selecting a test
> > that's more vulnerable to the change (e.g. using a wider distribution
> > for the read-only statements than the modifying ones) would make the
> > CPU overhead of the additional heap_hot_search_buffer() calls
> > heavier.
>
> Sorry I don't understand this description.  Why would queries selecting
> data that is not changing have any extra overhead?

The idea, I think, was that by having a uniform (or just wider)
distribution of the reads they'd be more likely to land on values that
have been updated at some point, but not been pruned since (because at
that point the patch IIRC didn't prune during reads at all), i.e. ones
where reads still pay the cost of following the update chains.

> Is the idea that the hot part of the table for updates would move around
> over time, but the hot part for selects would be even throughout?

Pretty much.

> I'm not sure how to put that to the test.

That pretty much was what I'd tried to model, yea. I guess it'd be
possible to model this by inserting NOW()/updating values NOW() - 5 and
selecting values up to NOW() - 60. That'd roughly model some realistic
insert/update/select patterns I've seen.

To possibly see any difference with the new patch this would have to be
done in a way that regularly a couple of pages would be touched, with
not that many selected tuples on each.
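
A bare-bones sketch of that access pattern (the table and the concrete time
units here are only for illustration, not from an actual benchmark script):

CREATE TABLE events (created timestamptz, counter int);

-- inserts happen at now()
INSERT INTO events VALUES (now(), 0);

-- updates touch only recent rows, e.g. the last 5 seconds
UPDATE events SET counter = counter + 1
 WHERE created >= now() - interval '5 seconds';

-- reads span a much wider window, e.g. the last 60 seconds
SELECT count(*) FROM events
 WHERE created >= now() - interval '60 seconds';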

Greetings,

Andres Freund




Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:45 PM, Petr Jelinek wrote:

> > The argument to move to 4 bytes is a poor one. If it was reasonable in
> > terms of code or cosmetic value then all values used in the backend
> > would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
> > don't do that.
> >
> > The change does nothing useful, since I doubt anyone will ever need
> > >32768 nodes in their cluster.
>
> And if they did there would be other much bigger problems than
> replication identifier being 16bit (it's actually >65534 as it's
> unsigned btw).


Can you name some of the bigger problems you'd have?

Obviously, if you have 10 high-volume OLTP nodes connected to a 
single server, feeding transactions as a continuous stream, you're going
to choke the system. But you might have 10 tiny satellite databases 
that sync up with the master every few hours, and each of them do only a 
few updates per day.


- Heikki





Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:36 PM, Simon Riggs wrote:

> On 17 April 2015 at 19:18, Heikki Linnakangas  wrote:
> > To be honest, I'm not entirely sure what we're arguing over.
>
> When arguing over something you consider small, it is customary to allow
> the author precedence. We can't do things our own way all the time.

Sure, I'm not going to throw a tantrum if Andres commits this as it is.

> > I said that IMO the difference in WAL size is so small that we should just
> > use 4-byte OIDs for the replication identifiers, instead of trying to make
> > do with 2 bytes. Not because I find it too likely that you'll run out of
> > IDs (although it could happen), but more to make replication IDs more like
> > all other system objects we have. Andres did some pgbench benchmarking to
> > show that the difference in WAL size is about 10%. The WAL records
> > generated by pgbench happen to have just the right sizes so that the 2-3
> > extra bytes bump them over to the next alignment boundary. That's why there
> > is such a big difference - on average it'll be less. I think that's
> > acceptable, Andres seems to think otherwise. But if the WAL size really is
> > so precious, we could remove the two padding bytes from XLogRecord, instead
> > of dedicating them for the replication ids. That would be an even better
> > use for them.
>
> The argument to move to 4 bytes is a poor one. If it was reasonable in
> terms of code or cosmetic value then all values used in the backend would
> be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
> that.


That's a straw man argument. I'm not saying we should never use 2 byte 
values anywhere. OID is usually used as the primary key in system 
tables. There are exceptions, but that is nevertheless the norm. I'm 
saying that saving in WAL size is not worth making an exception here, 
and we should go with the simplest option of using OIDs.


- Heikki





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-20 Thread Jeff Janes
On Mon, Sep 29, 2014 at 2:13 AM, Andres Freund  wrote:
>
> On 2014-09-28 19:51:36 +0100, Simon Riggs wrote:
> > On 27 September 2014 09:29, Andres Freund  wrote:
> > > On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:
> > >> This patch has gotten a fair amount of review, and has been rewritten
> > >> once during the commitfest. I think it's pretty close to being
> > >> committable, the only remaining question seems to be what to do with
> > >> system catalogs. I'm marking this as "Returned with feedback", I take
> > >> it that Simon can proceed from here, outside the commitfest.
> > >
> > > FWIW, I don't think it is, even with that. As is it seems very likely
> > > that it's going to regress a fair share of workloads. At the very least
> > > it needs a fair amount of benchmarking beforehand.
> >
> > There is some doubt there. We've not seen a workload that does
> > actually exhibit a negative behaviour.
>
> Neither is there much data about the magnitude of positive effect the
> patch has...
>
> > I'm not saying one doesn't exist, but it does matter how common/likely
> > it is. If anyone can present a performance test case that demonstrates
> > a regression, I think it will make it easier to discuss how wide that
> > case is and what we should do about it. Discussing whether to do
> > various kinds of limited pruning are moot until that is clear.
>
> I doubt it'll be hard to construct a case where it'll show. My first try
> of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
> 1 write and 5 read transaction yielded the following on my laptop:
>
> Baseline:
>  relname| pgbench_tellers
>  pg_total_relation_size | 458752
>  relname| pgbench_accounts
>  pg_total_relation_size | 1590337536
>  relname| pgbench_branches
>  pg_total_relation_size | 286720
>  relname| pgbench_history
>  pg_total_relation_size | 49979392
> Patched:
>  relname| pgbench_tellers
>  pg_total_relation_size | 516096
>  relname| pgbench_accounts
>  pg_total_relation_size | 1590337536
>  relname| pgbench_branches
>  pg_total_relation_size | 360448
>  relname| pgbench_history
>  pg_total_relation_size | 49528832
>
> So, there's a noticeable increase in size. Mostly on the smaller tables,
> so probably HOT cleanup was sometimes skipped during UPDATEs due to
> locks.
>
> Baseline was:
> tps = 9655.486532 (excluding connections establishing)
> Patched was:
> tps = 9466.158701 (including connections establishing)

Was this reproducible?  I've run your custom sql file with 4 clients (that
is how many CPUs I have) on a machine
with a BBU.  I had wal_level = hot_standby, but the archive_command just
returned true without archiving anything. And using the latest patch.

The size of the pgbench_tellers and pgbench_branches relations were
surprisingly variable in both patched and unpatched, but there was no
reliable difference between them, just within them.

On the TPS front, there was a hint that patched one was slightly slower but
the within sample variation was also high, and the p-val for difference was
only 0.214 on n of 66.

test case attached.
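
For reference, one possible shape of such a custom file (one small write
transaction plus five point reads, using the standard pgbench_accounts
schema and pgbench's \setrandom; this is only a sketch, not the file that
was actually attached):

\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
END;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;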

> That's not an unrealistic testcase.
>
> I'm pretty sure this could be made quite a bit more pronounced by not
> using a uniform distribution in the pgbench runs. And selecting a test
> that's more vulnerable to the change (e.g. using a wider distribution
> for the read-only statements than the modifying ones) would make the
> CPU overhead of the additional heap_hot_search_buffer() calls
> heavier.

Sorry I don't understand this description.  Why would queries selecting
data that is not changing have any extra overhead?

Is the idea that the hot part of the table for updates would move around
over time, but the hot part for selects would be even throughout?  I'm not
sure how to put that to the test.

Cheers,

Jeff


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-20 Thread Sawada Masahiko
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko  wrote:
> On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby  wrote:
>> On 4/6/15 5:18 PM, Greg Stark wrote:
>>>
>>> Only I would suggest thinking of it in terms of two orthogonal boolean
>>> flags rather than three states. It's easier to reason about whether a
>>> table has a specific property than trying to control a state machine in
>>> a predefined pathway.
>>>
>>> So I would say the two flags are:
>>> READONLY: guarantees nothing can be dirtied
>>> ALLFROZEN: guarantees no unfrozen tuples are present
>>>
> >>> In practice you can't have the latter without the former since vacuum
>>> can't know everything is frozen unless it knows nobody is inserting. But
>>> perhaps there will be cases in the future where that's not true.
>>
>>
>> I'm not so sure about that. There's a logical state progression here (see
>> below). ISTM it's easier to just enforce that in one place instead of a
>> bunch of places having to check multiple conditions. But, I'm not wed to a
>> single field.
>>
> >>> Incidentally there are a number of other optimisations that I've had in
>>> mind that are only possible on frozen read-only tables:
>>>
>>> 1) Compression: compress the pages and pack them one after the other.
>>> Build a new fork with offsets for each page.
>>>
>>> 2) Automatic partition elimination where the statistics track the
>>> minimum and maximum value per partition (and number of tuples) and treat
> >>> them as implicit constraints. In particular it would magically make read
>>> only empty parent partitions be excluded regardless of the where clause.
>>
>>
>> AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
>> uncompact and re-compact for #1 when you actually freeze (which maybe isn't
>> worth it), but freezing isn't absolutely required. #2 would only require
>> that everything in the relation is visible; not frozen.
>>
>> I think there's value here to having an ALLVISIBLE state as well as
>> ALLFROZEN.
>>
>
> Based on many suggestions, I'm going to deal with the FM first as one
> patch. In that first patch it will be a simple mechanism, similar to the VM:
> - Each bit of the FM represents a single page
> - A bit is set only by vacuum
> - A bit is un-set by inserts, updates and deletes
>
> Second, I'll deal with a simple read-only table with two states,
> Read/Write (default) and ReadOnly, as one patch. ISTM that having the
> Frozen state needs more discussion. A read-only table just allows us
> to disable any updates to the table, controlled by a read-only flag in
> pg_class. The DDL command which changes this status would be something
> like ALTER TABLE SET READ ONLY, or READ WRITE.
> Also, as Alvaro suggested, a read-only table affects not only table
> freezing but also performance optimizations. I'll consider including
> them when I deal with the read-only table.
>

The attached WIP patch adds a Frozen Map (FM), which enables us to avoid
scanning the whole table even when a full scan would otherwise be required
to prevent XID wraparound failures.
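
(For context, the full scans in question are the anti-wraparound ones driven
by relfrozenxid age; the tables closest to needing one can be listed with
something like the query below, and the same freezing work can be forced
manually with VACUUM (FREEZE).)

SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;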

The Frozen Map is a bitmap with one bit per heap page, quite similar to
the Visibility Map. A set bit means that all tuples on the heap page are
completely frozen, so vacuum does not need to freeze that page again.
A bit is set when vacuum (or autovacuum) finds that all tuples on the
corresponding heap page are completely frozen, and a bit is cleared when
an INSERT or UPDATE (only for the new heap page) is executed.

The current patch adds a new source file, src/backend/access/heap/frozenmap.c,
which is quite similar to visibilitymap.c. They have similar code but are
kept separate for now; I can refactor the common code into something like a
shared bitmap.c if needed.
Also, when skipping pages with the visibility map, vacuum only skips runs of
at least SKIP_PAGES_THRESHOLD consecutive pages, but there is no such
mechanism for the frozen map yet.

Please give me feedback.

Regards,

---
Sawada Masahiko
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..53f07fd 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o frozenmap.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/frozenmap.c b/src/backend/access/heap/frozenmap.c
new file mode 100644
index 000..6e64cb8
--- /dev/null
+++ b/src/backend/access/heap/frozenmap.c
@@ -0,0 +1,567 @@
+/*-
+ *
+ * frozenmap.c
+ *	  bitmap for tracking frozen heap tuples
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/heap/frozenmap.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 

Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-04-20 Thread Etsuro Fujita
On 2015/04/17 13:16, Amit Langote wrote:
> On 17-04-2015 PM 12:35, Etsuro Fujita wrote:
>> (2) that might cause the problem of associating subplans' update
>> information with subplans' scan information, pointed out by Tom [1].

> Having realized how it really works now, my +1 to "Foreign Modifying Scan" for
> cases of pushed down update as suggested by Albe Laurenz. I guess it would be
> signaled by the proposed ForeignScan.CmdType being CMD_UPDATE / CMD_DELETE
> (/CMD_INSERT).

Thanks for the opinion!  I think that is one idea.  However, I'd like to
propose renaming "Foreign Update" ("Foreign Delete") of ModifyTable simply
to "Update" ("Delete"), not only because (1) that solves the duplication
problem, but also because (2) ISTM it is consistent with non-inherited
updates in both the non-pushed-down-update and the pushed-down-update
cases.  Here are examples for (2).

* Inherited and non-inherited updates for the non-pushed-down case:

postgres=# explain verbose update parent set c1 = trunc(random() * 9 +
1)::int;
 QUERY PLAN
-----------------------------------------------------------------------
 Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
   Update on public.parent
   Update on public.ft1
 Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   Update on public.ft2
 Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
   ->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
 Output: (trunc(((random() * '9'::double precision) +
'1'::double precision)))::integer, parent.ctid
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
 Output: (trunc(((random() * '9'::double precision) +
'1'::double precision)))::integer, ft1.ctid
 Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
   ->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
 Output: (trunc(((random() * '9'::double precision) +
'1'::double precision)))::integer, ft2.ctid
 Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
(14 rows)

postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
  QUERY PLAN
-----------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
   Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
 Output: (trunc(((random() * '9'::double precision) +
'1'::double precision)))::integer, ctid
 Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
(5 rows)

* Inherited and non-inherited updates for the pushed-down case:

postgres=# explain verbose update parent set c1 = c1 + 1;
  QUERY PLAN
-----------------------------------------------------------------------
 Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
   Update on public.parent
   Update on public.ft1
   Update on public.ft2
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
 Output: (parent.c1 + 1), parent.ctid
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409
width=10)
 Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
   ->  Foreign Update on public.ft2  (cost=100.00..188.29 rows=2409
width=10)
 Remote SQL: UPDATE public.t2 SET c1 = (c1 + 1)
(10 rows)

postgres=# explain verbose update ft1 set c1 = c1 + 1;
  QUERY PLAN
-----------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409
width=10)
 Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
(3 rows)

Comments are welcome.

Best regards,
Etsuro Fujita

